Regularized Posteriors in Linear Ill-posed Inverse Problems∗

Jean-Pierre Florens†, Toulouse School of Economics (GREMAQ and IDEI)
Anna Simoni‡, Toulouse School of Economics (GREMAQ)

July 26, 2008

Abstract. This paper studies a Bayesian solution to a signal-noise problem. The infinite dimensional parameter of interest is characterized as the solution of a functional equation which is ill-posed because of the compactness of the operator appearing in it. We restate the problem as a parameter estimation problem in which inference is performed in a Bayesian way. The solution of this problem is the posterior distribution of the parameter, but the infinite dimension of the spaces involved makes this solution discontinuous. Our contribution is to propose a new method to deal with this problem: we adopt a Tikhonov regularization scheme to construct a new posterior distribution that is continuous and that we call the regularized posterior distribution. We prove that this regularized posterior distribution is consistent in a "frequentist" sense. Our results agree with the previous literature on infinite-dimensional Bayesian experiments, see for instance Diaconis and Freedman (1986).

AMS codes: 62C10, 65R32, 62G20. Keywords: Functional data, Inverse Problems, Tikhonov and Hilbert Scale regularization, Posterior Consistency, Bayesian estimation of density and regression.

∗ We thank Jan Johannes, Renauld Lestringand, Sebastien Van Bellegem and Anna Vanhems for helpful discussions. We thank the participants at conferences and seminars at the Isaac Newton Institute for Mathematical Sciences (Cambridge), Aussois - SFdS, Rencontre de Statistiques Mathématique - Marseille, Toulouse - LSP, and CREST - Paris for helpful comments. † [email protected] ‡ [email protected]


1 Introduction

In this paper we deal with solving a functional equation describing an implicit relation between the object of interest x and the observed one Y:

$$Y = Kx, \qquad x \in \mathcal{X},\ Y \in \mathcal{Y}, \qquad\qquad (1)$$

where K is an operator that uniquely maps elements of the set X onto elements of Y. Such a problem is known in the literature as an inverse problem and is encountered in very different fields. In physics, for instance, in the time-resolved fluorescence problem x represents the intensity of the light pulse and Y is the observed fluorescence intensity. In signal and image processing, the deblurring problem amounts to recovering the true image x from the blurred image Y. Numerous examples of inverse problems can also be found in statistics, for example the classical problem of density estimation, where x stands for the density and Y for the cumulative distribution function, see Vapnik (1998) [41]. We describe this and other applications more deeply in the remainder of the paper. Further applications of inverse problems in economics are provided by the recent literature, see Carrasco et al. (2007) [5], which proposes to interpret many problems in structural econometrics as inverse problems where the implicit relation links some functional parameter x of interest and the cumulative distribution function F determining the data generating process (DGP).

In most applications, X and Y are infinite dimensional spaces and K is a compact integral operator, so that equation (1) is known as a Fredholm integral equation of the first kind. Otherwise, if K = I − A, where I is the identity and A an integral operator, equation (1) is known as a Fredholm integral equation of the second kind. Given the huge number of problems that are linear in x, we only deal with linear inverse problems in this paper. However, it is important to point out that many inverse problems are nonlinear. The resolution technique that we propose cannot be extended to nonlinear models, since the property of linearity is directly exploited in order to compute the posterior distribution of x.

Following tradition, the analysis in this paper is formulated in terms of Hilbert spaces, so that X and Y are two separable real or complex Hilbert spaces and K is supposed to be a Hilbert-Schmidt operator. Equation (1) must be solved in x. If a unique solution exists and depends continuously on the data, namely Hadamard's conditions are satisfied, the problem is said to be a well-posed inverse problem. On the contrary, the inverse problem is ill-posed if at least one of Hadamard's conditions is not fulfilled, see Engl et al. (2000) [12]. Actually, these are conditions on the operator K, so that well-posedness is obtained if the operator is bijective and has a continuous inverse. In finite dimension, linear operators are continuous, but this property no longer holds in infinite dimension; this is for instance the case of compact integral operators. Moreover, if K is continuous and one-to-one, the range of K, R(K), can be equal to Y only if Y is finite dimensional.

Ill-posedness of an inverse problem is also related to which data are considered admissible. In general we do not observe the exact transformation Y of x but a noisy transformation Ŷ. This implies, first of all, that we cannot be sure that Ŷ ∈ R(K), where R(·) denotes the range of an operator, so the assumption of surjectivity is violated. Moreover, if K⁻¹ is not continuous then small observation errors in Y cause large deviations in the solution x̂. This can be understood by considering the inequality ||x̂ − x|| ≤ ||K⁻¹|| ||Ŷ − Y||, in which ||K⁻¹|| denotes the norm of K⁻¹.
It is clear that, if ||K⁻¹|| is very large, errors may be strongly amplified by the action of K⁻¹. The general idea, in order to solve an ill-posed inverse problem, is to restate the problem in such a way that Hadamard's conditions are satisfied. Restating the problem always implies reducing one's ambition: it is not possible to restore all the information that an ideal solution would carry. The important thing is then to find the right trade-off between the quantity of information to be retrieved and the level of accuracy.

Two different approaches to an ill-posed inverse problem have been proposed in the literature: the classical approach and the Bayesian one. The classical approach solves the lack of bijectivity of K by looking for the best approximate solution. In other words, equation (1) is transformed into a generalized inverse problem,

$$\hat x = \arg\min_{x\in\mathcal{X}} \|Y - Kx\|^2,$$

and we keep the solution with minimal norm. This is equivalent to computing the Moore-Penrose generalized inverse of K. However, the generalized inverse is generally not continuous and it has to be regularized. Regularization techniques consist in replacing the sequence of explosive inverse singular values 1/µj associated with the operator K by a sequence of bounded inverse singular values q(α, µj)/µj that are asymptotically unbiased (i.e. limα→0 q(α, µj) = 1, ∀j), see Kress (1999, Theorem 15.21) [28]. The most common regularization schemes are spectral cut-off, the Landweber-Fridman scheme, Tikhonov regularization and iterated Tikhonov regularization.

A completely different approach is the Bayesian approach, which interprets every quantity in the problem as a random variable or a random function, so that both the parameter of interest x and the observed quantity Y induce measures on the Hilbert spaces to which they belong. In Bayesian analysis, a functional equation of the type of problem (1) is called a Bayesian statistical inverse problem. From the point of view of Bayesian statistical inversion theory, the solution to an inverse problem is the posterior distribution of the quantity of interest. In other words, Bayesian analysis considers an inverse problem differently from classical analysis since it restates functional equation (1) in a larger space of probability distributions. This reformulation of an inverse problem is due to Franklin (1970) [21]. The object of interest also changes: we are no longer interested in a point estimate of x, but in a distribution that incorporates both the prior information and the information in the sample. This distribution can then be exploited to obtain a point estimate of x. In this paper we follow a Bayesian approach and we propose a solution to (1) for the Gaussian case by working directly on functional spaces.

Before performing the measurement of Y, we suppose we have some information about x that we summarize by a prior distribution on the parameter space, and for a given realization of x we know the sampling probability of Y, namely the conditional distribution of Y given the σ-field of subsets of the parameter space. More precisely, to properly describe the noisy data Ŷ, an error term U has to be explicitly incorporated in (1), and the sampling probability is determined by the measure induced by this observation error. As usual in Bayesian analysis, we incorporate the error term additively, so that the analysis can be simplified by using a conjugate prior distribution.

Previous works in the Bayesian ill-posed problems literature have considered equations of type (1) in finite dimensional spaces, see for instance the examples given in Kaipio et al. (2004) [27]. In finite dimension, ill-posedness is principally due to a problem of multicollinearity, and the most commonly used method to solve it is Tikhonov regularization, also known as ridge regression. In this framework, the classical and Bayesian approaches are strongly related, since the Tikhonov regularization method can be justified from a Bayesian point of view. We can assume that both the prior and the sampling measures are Gaussian with spherical variances. Under these assumptions, the Tikhonov regularized solution coincides with the posterior mean for a regularization parameter equal to the ratio of the noise and prior variances. Therefore, in finite dimension we can remove ill-posedness by incorporating the available prior information.
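To fix ideas, the following minimal sketch (ours, not part of the paper; the operator, grid and regularization parameter are arbitrary illustrative choices) shows the classical Tikhonov filter q(α, µj)/µj = µj/(α + µj²) applied to the SVD of a discretized compact operator.

```python
import numpy as np

# Illustrative sketch: Tikhonov regularization of a discretized compact operator
# via its SVD.  The exploding inverse singular values 1/mu_j are replaced by the
# bounded filter q(alpha, mu_j)/mu_j = mu_j/(alpha + mu_j**2), and
# q(alpha, mu_j) -> 1 for every j as alpha -> 0.
def tikhonov_solve(K, Y, alpha):
    U, mu, Vt = np.linalg.svd(K, full_matrices=False)
    filt = mu / (alpha + mu**2)              # bounded for every alpha > 0
    return Vt.T @ (filt * (U.T @ Y))         # x_alpha = sum_j filt_j <Y, u_j> v_j

# Toy example: K is a discretized integration operator (compact, ill-conditioned).
m = 200
t = np.linspace(0.0, 1.0, m)
K = np.tril(np.ones((m, m))) / m             # (Kx)(t) ~ integral of x over [0, t]
x_true = np.sin(2 * np.pi * t)
Y = K @ x_true + 1e-3 * np.random.default_rng(0).standard_normal(m)

x_alpha = tikhonov_solve(K, Y, alpha=1e-4)
print(np.linalg.norm(x_alpha - x_true) / np.linalg.norm(x_true))
```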
On the contrary, in infinite dimension the Bayesian approach does not solve the ill-posedness of the problem, since covariance operators do not have the regularization properties that they have in the finite dimensional case. In particular, since it is impossible to continuously invert the covariance operators, we still need some regularization technique, and the Bayesian approach only changes the nature of the problem. In the previous literature on functional equations, two main directions have been taken to address this problem. In the first one, a white noise model is adopted, so that the covariance operator of the disturbance has a suitable structure. Moreover, this model is projected on an orthonormal basis of the functional space Y and only projections of the noisy curve Ŷ are observed, see for instance Zhao (2000) [44] and Belitser et al. (2003) [4]. Another part of the literature makes inference directly on infinite dimensional spaces without projecting the model on any basis. Franklin (1970) [21] makes the covariance operator invertible by assuming that it is positive definite and not merely positive semidefinite. Prenter et al. (1985) [35] solve the problem of non-invertibility by working with a white noise model. Moreover, they show that a correspondence between the Bayesian solution and the Tikhonov regularized solution exists also in infinite dimensional spaces, but they do not solve the problem of non-continuity for models other than the white noise model. Mandelbaum (1984) [32] handles the unboundedness of the operator representing the solution of linear inverse problems by considering a class of measurable operators which are linear on a subspace of measure one. He solves the problem only


for Gaussian Hilbert-space-valued variables. Lehtinen et al. (1989) [30] extend this approach to the linear Gaussian case with random variables taking values in spaces of generalised functions.

This paper follows this second direction, but it proposes a different correction for the unboundedness of the posterior mean. We find the operator characterizing the posterior mean as the solution of a further inverse problem, which again turns out to be ill-posed and which we solve by applying a Tikhonov regularization scheme. This characterizes a new object that we call the regularized posterior distribution. We take this object as the solution of inverse problem (1). Our model can be applied to the case in which we observe the whole curve (i.e. we have continuous observations) or to the case in which we have discrete observations and a way to transform them into an infinite dimensional object is available.

The existence of a regular version of the posterior distribution, even in infinite dimensional spaces, is guaranteed by assuming that we are working in Polish spaces. Actually, Jirina's theorem states that this is a sufficient condition for the existence of a transition probability inducing the conditional probability, see Neveu (1965) [33].

We also analyze consistency of the regularized posterior distribution in a "frequentist" sense. This means that we suppose that there exists a true value of the parameter, as in classical statistics, and we check whether the regularized posterior distribution degenerates to a probability mass in a neighborhood of the true value as the number of observations grows indefinitely. Under the assumption that the true value of the parameter of interest belongs to the Reproducing Kernel Hilbert Space associated with the prior covariance operator, we prove posterior consistency. However, the supposed prior distribution is not able to generate a trajectory satisfying this regularity condition. This result is in line with the previous literature, see Diaconis & Freedman (1986) [9], and is due to the infinite dimension of the parameter of interest and to the impossibility for the support of the prior to cover the whole domain of the true parameter. The support of the prior distribution is the closure of the Reproducing Kernel Hilbert Space associated with its covariance operator, so it is possible for the prior distribution to generate a trajectory as close as we want to the true value of the parameter.

The paper is developed as follows. In Section 2 we present the model and the associated Bayesian experiment. Some examples of possible applications are given; in particular, we apply Bayesian inversion analysis to density and regression function estimation. In Section 3 we study the existence, continuity and computability of the posterior distribution. To overcome problems of continuity and computability, a regularized version of the posterior distribution is defined and it is taken to be the solution of the inverse problem. Frequentist asymptotic properties of this solution are studied in Section 4. After stating conditions under which posterior consistency holds, we prove the inconsistency of the prior distribution, namely the impossibility of the prior distribution to have generated the data. Finally, we give in Appendix A the proofs of all the results in the paper, and in Appendix B we provide the results of numerical simulations of the basic model and of some examples (density and regression function estimation).
This paper concentrates on the basic case where both the operator K and the covariance operator of the sampling measure are known. This situation is not always satisfied in applications. In order to keep this paper reasonably short, all extensions to an unknown operator, repeated operators (i.e. different operators for each observation) and an unknown covariance operator are treated in a companion paper, see Florens & Simoni (2008) [19].

2 The Model

2.1 Sampling Probability Measure

We consider a simple linear inverse problem. Let X, Y be infinite-dimensional Hilbert spaces over R that are supposed to be Polish, with inner product < ·, · >X (resp. < ·, · >Y) and norm || · ||X (resp. || · ||Y). In the following, the notation is lightened by dropping the indices on the inner product and the norm, the meaning being clear from the context. Our purpose is to recover the infinite dimensional parameter x from the functional equation:


$$Y = Kx \qquad\qquad (2)$$

where Y ∈ Y, x ∈ X and K : X → Y is a Hilbert-Schmidt operator that is supposed to be known. K∗ will denote the adjoint of K. As an example of spaces, we could take X = L²π and Y = L²ρ, where π and ρ are two measures on R,

$$L^2_\pi = \Big\{x : \mathbb{R}\to\mathbb{R};\ \int x^2(t)\pi(t)\,dt < \infty\Big\} \quad\text{and}\quad L^2_\rho = \Big\{y : \mathbb{R}\to\mathbb{R};\ \int y^2(t)\rho(t)\,dt < \infty\Big\},$$

endowed with the L² norm. Such spaces are used in the examples given below. It should be noted that an L² space, endowed with a Gaussian measure defined on it, is a Polish space, see Hiroshi et al. (1975) [23].

Compactness of the operator K and the infinite dimension of the range of K (i.e. dim R(K) = ∞) make equation (2) ill-posed. More precisely, instability of the solution is due to the spectrum of K, σ(K), consisting of a countable set of eigenvalues that accumulate only at 0. To recover x we exploit the information contained in the whole curve {Y(t)}, which represents the observed quantity. What makes instability a relevant problem is the fact that typically Y(t) is measured with error, so that only the approximate equation Y ≈ Kx holds and, due to the ill-posed nature of the problem, these small errors may cause errors of arbitrary size in the recovered x. We will denote the noisy observation of Y by Ŷ and the measurement error by U, U ∈ Y. Therefore, the corresponding statistical model is

$$\hat Y = Kx + U. \qquad\qquad (3)$$
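As a hedged numerical illustration of this instability (our toy computation, not part of the paper; the kernel and grid are arbitrary), one can discretize a smoothing integral operator and inspect its singular values, which accumulate at zero:

```python
import numpy as np

# Illustrative sketch: the singular values of a discretized compact integral
# operator (here a Gaussian convolution kernel, an arbitrary choice) decay
# towards 0, which is exactly the source of ill-posedness of Y = Kx above.
m = 300
t = np.linspace(0.0, 1.0, m)
dt = t[1] - t[0]
h = 0.05                                              # bandwidth of the toy kernel
K = np.exp(-(t[:, None] - t[None, :])**2 / (2 * h**2)) * dt

mu = np.linalg.svd(K, compute_uv=False)
print(mu[:3])          # a few leading singular values ...
print(mu[-3:])         # ... and a tail accumulating at (numerically) zero
print(mu[0] / mu[-1])  # huge condition number: inverting K amplifies small errors
```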

A particular interpretation of the observation noise is found in statistical inverse theory, which substitutes the true Y with an estimate obtained from some sample, so that U is the estimation error. The quantities Ŷ, x and U in equation (3) have to be understood as Hilbert-space-valued random variables, namely as measurable maps from some probability space into a Hilbert space endowed with its Borel σ-field. This functional equation is exploited to characterize the conditional distribution P(Ŷ|x) of Ŷ given x. Let F denote the σ-field of subsets of the sample space Y; we endow the measurable space (Y, F) with this conditional distribution. In order for this conditional probability to be properly defined as a measure (in the sense that it represents a regular version of the conditional probability), it is assumed that there exists a transition probability that associates to each x a probability measure P^x on (Y, F) such that P(A|x) = P^x(A) for every A ∈ F. We call P^x the sampling measure and we suppose that it is Gaussian, with mean function and covariance operator determined by (3). Assumption 1 below characterizes in a rigorous way the measure P^x induced by Ŷ.

Assumption 1 Let P^x be a conditional probability measure on (Y, F) given x such that E(||Ŷ||²) < ∞. P^x is a Gaussian measure that defines a mean element Kx ∈ Y and a covariance operator Σ : Y → Y.

P^x is Gaussian if the probability distribution on the Borel sets of R induced from P^x by every bounded linear functional on Y is Gaussian. More clearly, P^x Gaussian means that, for every ψ ∈ Y, the measure B ↦ P^x{Ŷ; < Ŷ, ψ > ∈ B}, B ∈ B(R), is Gaussian, see Baker (1973) [3]. The mean element Kx in Y is defined by

$$\langle Kx, \psi_1\rangle = \int_{\mathcal{Y}} \langle \hat Y, \psi_1\rangle \, dP^x(\hat Y)$$


and the operator Σ by

$$\langle \Sigma\psi_1, \psi_2\rangle = \int_{\mathcal{Y}} \langle \hat Y - Kx, \psi_1\rangle\,\langle \hat Y - Kx, \psi_2\rangle\, dP^x(\hat Y)$$

for every ψ1, ψ2 ∈ Y. On the basis of this definition, Σ is correctly specified as a covariance operator in the sense that it is linear, bounded, nonnegative, selfadjoint and trace-class. A covariance operator needs to be trace-class in order for the associated measure to generate trajectories in the appropriate space. Indeed, by Kolmogorov's inequality a realization of the random function Ŷ is in Y if E(||Ŷ||²Y|x) is finite¹. Since E(||Ŷ||²|x) = Σj λj + ||Kx||² and Kx ∈ Y, this is guaranteed if Σ is trace-class, that is if Σj λj < ∞, with {λj} the eigenvalues associated with Σ and E(·|x) the expectation taken with respect to P^x.

Since the eigenvalues of Σ^{1/2} are the square roots of the eigenvalues of Σ, being trace-class entails that Σ^{1/2} is Hilbert-Schmidt. Hilbert-Schmidt operators are compact and the adjoint is still Hilbert-Schmidt. Compactness of Σ^{1/2} implies compactness of Σ. Compact operators are particularly attractive since they can be approximated by sequences of finite dimensional operators, and this is really useful when we do not know such an operator and we need to estimate it.

In general, we can link the covariance operator Σ to some parameter n in such a way that this operator goes to 0 with n: Σn → 0. For instance, this will be the case if Ŷ is a consistent estimator of the transformed signal Kx and consequently U is an estimation error that disappears as the sample size increases. Otherwise, in order to make inference, we may want to consider the mean of the first n observations of an i.i.d. sample Ŷ1, Ŷ2, . . ., each of them being a realization of a conditional Gaussian stochastic process, conditional on x, with conditional mean Kx and conditional variance Σ. Thus, Ŷ = (1/n) Σ_{i=1}^{n} Ŷi and model (3) can be written as

$$\hat Y = Kx + \frac{1}{n}\sum_{i=1}^{n} U_i = Kx + U, \qquad U_i \sim \text{i.i.d. } \mathcal{GP}(0, \Sigma), \qquad U \sim \mathcal{GP}\Big(0, \frac{1}{n}\Sigma\Big). \qquad (4)$$
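A minimal simulation of model (4) on a discretization grid (our sketch; the grid, the kernel of K and the covariance Σ are illustrative choices only) is the following:

```python
import numpy as np

# Illustrative sketch of model (4) on a grid: n noisy curves Y_i = Kx + U_i with
# U_i ~ GP(0, Sigma), and their sample mean Ybar = Kx + Ubar, Ubar ~ GP(0, Sigma/n).
rng = np.random.default_rng(0)
m, n = 100, 50
t = np.linspace(0.0, 1.0, m)
K = np.tril(np.ones((m, m))) / m                         # toy integral operator
x = np.exp(-t)                                           # "true" signal
Sigma = np.exp(-np.abs(t[:, None] - t[None, :]) / 0.2)   # toy covariance operator
L = np.linalg.cholesky(Sigma + 1e-10 * np.eye(m))

Y_i = (K @ x)[:, None] + L @ rng.standard_normal((m, n)) # n observed curves
Y_bar = Y_i.mean(axis=1)                                 # sample mean curve

# Monte Carlo check that the mean error Ubar has pointwise variance Sigma(t,t)/n.
R = 2000
Ubar = np.stack([(L @ rng.standard_normal((m, n))).mean(axis=1) for _ in range(R)])
print(Ubar[:, 0].var(), Sigma[0, 0] / n)                 # both close to 1/n = 0.02
```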

Since the sample mean is a sufficient statistic, this formulation and Yi = Kx + Ui, i = 1, . . . , n, are statistically equivalent. For a proof of sufficiency of the sample mean, see Appendix 6.1. Henceforth we will denote the covariance operator by Σn, where the index n will usually be interpreted, in the most intuitive way, as the sample size.

It should be clear, after this remark, that we can have observational schemes of different types. In the most immediate situation, we are concerned with a sample of n curves Ŷi, i = 1, . . . , n, namely we observe an object of infinite dimension. This is different from the usual observational scheme in the functional data literature, where the curve is supposed to be observed only at certain points. In our setting the whole curve is supposed to be really observed, i.e. at every point, which could appear unrealistic since it seems difficult to observe this kind of object. However, we can easily obtain high-frequency data, like financial data, that can be seen as an approximation to such a curve. Otherwise, the observations could be discrete and the curve Ŷ a function obtained by transforming these data, like for instance a nonparametric estimator. This is the classical situation in statistical inverse problems that we are going to consider: a sample of observations of a discrete object is available and from it we can obtain an estimate of Y in the problem, for instance the empirical characteristic function, the empirical cumulative distribution function or the estimated integrated hazard function, see Examples 2 and 3 below. With this more realistic observational scheme, starting from discrete observations we get objects of infinite dimension.

Assumption 1 and the functional form (3) imply that the error term U is a Gaussian process with zero mean and variance Σn. Moreover, they imply that U is independent of x. The error term U is usually interpreted in statistical inverse theory as a measurement error. In most real situations the hypothesis of normality of an estimation error is justified only asymptotically, hence our model might seem not to rest on a valid foundation. Actually, this is not the case, since the normality hypothesis only matters for determining our estimator; it is not used at all in the study of asymptotic properties, and our estimator will be consistent even if the normality hypothesis is not verified. In other words, the justification of our estimator is given by posterior consistency, the proof of which does not rely on the fact that the sampling and prior measures are assumed to be Gaussian.

¹ Namely, following Kolmogorov's inequality, P(||Ŷ||²Y > εn) ∼ Op(1) if and only if E(||Ŷ||²Y) is finite.

2.2 Prior Specification and Identification

The next step in order to well define the Bayesian inverse problem is to define a probability measure µ on the parameter space X; this measure is induced by x and is called the prior probability. We specify a conjugate prior for x, which means that we consider a Gaussian measure on the Hilbert space X endowed with the σ-field E.

Assumption 2 Let µ be a probability measure on (X, E) such that E(||x||²) < ∞. µ is a Gaussian measure that defines a mean element x0 ∈ X and a covariance operator Ω0 : X → X.

The definitions of the mean function x0 and of the covariance operator Ω0 mirror those given above. Then Ω0φ := E(< φ, (x − x0) > (x − x0)) is a linear, bounded, self-adjoint, positive semi-definite and trace-class operator. Moreover, Ω0^{1/2} is Hilbert-Schmidt and, by the same arguments as for Σn, Ω0 is compact.

We introduce a new space, denoted by H(Ω0) and called the Reproducing Kernel Hilbert Space associated with the covariance operator Ω0 (R.K.H.S.(Ω0)). Let {λj^{Ω0}, φj^{Ω0}}j be the eigensystem of the compact self-adjoint operator Ω0, see Kress (1999) [28] for a definition of eigensystem and singular value decomposition. We define the R.K.H.S.(Ω0), embedded in X, as

$$\mathcal{H}(\Omega_0) = \Big\{\varphi \in \mathcal{X} \ :\ \sum_{j=1}^{\infty} \frac{|\langle \varphi, \varphi_j^{\Omega_0}\rangle|^2}{\lambda_j^{\Omega_0}} < \infty \Big\} \qquad (5)$$

and, following Proposition 3.6 in Carrasco et al. [5], we have the relation H(Ω0) = R(Ω0^{1/2}). Let x∗ denote the true value of the parameter having generated the data Ŷ; we assume that

Assumption 3 (x∗ − x0) ∈ H(Ω0), i.e. there exists δ∗ ∈ X such that x∗ − x0 = Ω0^{1/2} δ∗.

This assumption is only a regularity condition and it will be exploited for proving asymptotic results. It is well known, see for instance Van der Vaart et al. (2000) [40], that the support of a centered Gaussian process taking its values in a Hilbert space X is equal to the closure in X of the Reproducing Kernel Hilbert Space associated with the covariance operator of this process, i.e. µ{x; x belongs to the closure of H(Ω0)} = 1. It is now clear that the covariance operator has the role of determining the support of a Gaussian measure; in particular, if Ω0 is injective then the support of µ is the whole space X (in the same way, injectivity of Σn implies that the support of P^x coincides with Y). On the contrary, if the covariance operator is not one-to-one the support is a proper subset of X; hence, a particular choice of the covariance operator allows one to incorporate in the prior distribution constraints on the parameter of interest.

From a Bayesian point of view we say that a model is identified if the posterior distribution completely revises the prior distribution, for which we do not need to introduce strong assumptions; see Florens et al. (1990) [16], Section 4.6, for an exhaustive explanation of this concept. Anyway, the interest of this paper is not only the posterior distribution but mainly its consistency. We will give in the following the definition of posterior consistency, i.e. of frequentist consistency. Therefore, asymptotic motivations make the following assumption necessary for identification.

Assumption 4 The operator KΩ0^{1/2} : X → Y is one-to-one on X.

This assumption guarantees continuity of the posterior mean and in particular of the regularized posterior mean that we shall define below. The classical hypothesis for identification of x in model (3) requires that K be one-to-one, which is a stronger condition. Actually, if Ω0^{1/2} is one-to-one, K one-to-one implies KΩ0^{1/2} one-to-one, but the reverse is not true. Therefore, frequentist consistency in a Bayesian model requires a weaker identification condition than in a classical model.

2.3 Prior inconsistency

We have stressed that the support of a centered Gaussian process taking its values in a Hilbert space X is equal to the closure in X of the Reproducing Kernel Hilbert Space associated with the covariance operator of this process. We might then think that (x − x0) can be seen as a distribution over H(Ω0), but this is not correct. In fact, for any version of the process (x − x0) generated by the prior distribution µ we have that (x − x0) belongs to the closure of H(Ω0), but with µ-probability 1 its trajectories are not in H(Ω0). This implies that the prior distribution is not able to generate a trajectory x that satisfies Assumption 3 or, in other words, that the true value x∗ having generated Ŷ cannot have been drawn from µ. The following lemma and its proof state this concept in a more detailed way.

Lemma 1 Let (x − x0) be a zero-mean Gaussian process with covariance operator Ω0. Let H(Ω0) be the R.K.H.S. associated with Ω0, embedded in X. Then µ{x | (x − x0) ∈ H(Ω0)} = 0.

Even if µ puts zero probability on H(Ω0), this space is dense in the support of µ and therefore µ can generate trajectories as close as desired to the true value x∗. We find a similar result for the Dirichlet process in nonparametric estimation of probabilities, in the sense that it puts zero probability mass on absolutely continuous probability measures but it is possible to draw from it probability functions arbitrarily close to them. This kind of problem is due to the fact that, because of the infinite dimensionality of the parameter space, the support of the prior can cover only a very "small" part of it. Anyway, the prior becomes less and less important as n → ∞ because the data swamp the prior. This is exactly what happens here: the posterior distribution converges to the true value of the parameter even if the true value is not in the support of the prior.
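A hedged numerical illustration of Lemma 1 (our own toy computation, with eigenvalues λj = j⁻² chosen arbitrarily) is the following truncated Karhunen-Loève check:

```python
import numpy as np

# Illustrative sketch of Lemma 1 via a truncated Karhunen-Loeve expansion.
# A prior draw has coefficients <x - x0, phi_j> = sqrt(lam_j) * xi_j with
# xi_j i.i.d. N(0,1), so its squared H(Omega_0)-norm
#   sum_j <x - x0, phi_j>^2 / lam_j = sum_j xi_j^2
# diverges with the truncation level, whereas for x* satisfying Assumption 3
# (coefficients sqrt(lam_j) * delta_j with sum_j delta_j^2 < infinity) it converges.
rng = np.random.default_rng(1)
for trunc in (10, 100, 1000, 10000):
    j = np.arange(1, trunc + 1)
    lam = j**-2.0                           # toy eigenvalues of Omega_0
    prior_coef = np.sqrt(lam) * rng.standard_normal(trunc)   # a prior draw
    star_coef = np.sqrt(lam) * (j**-1.0)    # some delta* with sum delta_j^2 < inf
    print(trunc,
          np.sum(prior_coef**2 / lam),      # ~ trunc: diverges
          np.sum(star_coef**2 / lam))       # converges (to pi^2/6 here)
```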

2.4 Construction of the Bayesian Experiment

We construct in this section the Bayesian experiment associated with inverse problem (2), based on the prior and sampling distributions specified in the preceding sections. This is accomplished by defining the relevant probability space, that is, the probability space associated with (2), with the joint measure determined by recomposing the prior and sampling distributions into a joint probability measure.

The relevant space we work in is the real linear product space X × Y, defined as the set X × Y := {(x, y); x ∈ X, y ∈ Y} with addition and scalar multiplication defined by (x1, y1) + (x2, y2) = (x1 + x2, y1 + y2) and h(x1, y1) = (hx1, hy1), h ∈ R. X × Y is a separable Hilbert space under the norm induced by the scalar product defined as < (x1, y1), (x2, y2) >X×Y := < x1, x2 >X + < y1, y2 >Y, for all (xi, yi) ∈ X × Y, i = 1, 2. As already stressed, in the following we drop the indices on the scalar product and on the norm.

We have already introduced E and F for the σ-fields of subsets of the parameter space X and the sample space Y, respectively. We now introduce the product σ-field of E and F, denoted by E ⊗ F and defined as the σ-field generated by the algebra of measurable rectangles R1 × R2, with R1 ∈ E, R2 ∈ F. The probability measure Π on the product space (X × Y, E ⊗ F) is constructed by first endowing X and Y with the two probability measures µ and P^x, respectively (see Assumptions 1 and 2), and by then recomposing them in the following way:

$$\Pi(B \times C) = \int_B P^x(C)\,\mu(dx), \qquad B \in \mathcal{E},\ C \in \mathcal{F}. \qquad (6)$$

After that, the function Π is extended to E ⊗ F. Let Υ denote the covariance operator associated with Π, defined as Υ(φ, ψ) = (Ω0φ + Ω0K∗ψ, (Σn + KΩ0K∗)ψ + KΩ0φ), for all (φ, ψ) in X × Y. We have already pointed out that a covariance operator needs to be trace-class in order to have a Hilbert-space-valued random function. The following lemma shows that Υ and the covariance operator of the predictive distribution, Υyy = (Σn + KΩ0K∗), are trace-class.

Lemma 2 The covariance operators Υ and (Σn + KΩ0K∗) are trace-class. In particular, (Σn + KΩ0K∗) being trace-class is a necessary condition for Υ being trace-class.

We then state that the joint probability Π is Gaussian.

Theorem 1 Let X × Y be a separable Hilbert space endowed with the σ-field E ⊗ F. Under Assumptions 1 and 2, the joint measure Π on (X × Y, E ⊗ F) is Gaussian with mean function mxy = (x0, Kx0) ∈ X × Y and covariance operator Υ such that Υ(φ, ψ) = (Ω0φ + Ω0K∗ψ, (Σn + KΩ0K∗)ψ + KΩ0φ), for all (φ, ψ) in X × Y.

A similar argument is used to prove that the marginal distribution P of Ŷ on (Y, F) is Gaussian.

Theorem 2 Let P be a Gaussian measure on (Y, F) with mean function my = Kx0 in Y and covariance operator Υyy = Σn + KΩ0K∗. Then P is the marginal distribution on the Hilbert measurable space (Y, F) of the joint Gaussian measure Π defined in Theorem 1.

The distribution P is called the predictive probability. Summarizing, we started from the inverse problem (2) and we characterized the Bayesian inverse problem associated with it by making two distributional assumptions. In the rest of the paper we will denote the Bayesian experiment as

$$\Xi = (\mathcal{X}\times\mathcal{Y},\ \mathcal{E}\otimes\mathcal{F},\ \Pi = P^x \otimes \mu). \qquad (7)$$

This Bayesian experiment constitutes the object of our study, and it is now evident how the object analyzed in the Bayesian approach differs from the one analyzed in the classical approach. The aim will be to determine the inverse decomposition of Π into the marginal P and the posterior distribution µ^F = P(x|Ŷ). The determination of this distribution raises some problems; in particular, we have to check for the regularity of the Bayesian experiment, namely for the existence of a transition characterizing the posterior distribution.
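The block structure of Υ in Theorem 1 can be checked in a finite-dimensional sketch (illustrative matrices only, not taken from the paper):

```python
import numpy as np

# Illustrative sketch of the joint covariance Upsilon of (x, Yhat) from Theorem 1:
# blocks [[Omega0, Omega0 K'], [K Omega0, Sigma_n + K Omega0 K']].
rng = np.random.default_rng(2)
p, q, n = 30, 40, 100
t = np.linspace(0, 1, p); s = np.linspace(0, 1, q)
Omega0 = np.exp(-np.abs(t[:, None] - t[None, :]) / 0.3)       # toy prior covariance
Sigma_n = np.exp(-np.abs(s[:, None] - s[None, :]) / 0.1) / n  # toy sampling covariance
K = np.exp(-(s[:, None] - t[None, :])**2 / 0.02) / p          # toy smoothing operator

Upsilon = np.block([[Omega0,      Omega0 @ K.T],
                    [K @ Omega0,  Sigma_n + K @ Omega0 @ K.T]])

# Upsilon is symmetric and positive semi-definite, as a covariance operator must be.
print(np.allclose(Upsilon, Upsilon.T))
print(np.linalg.eigvalsh(Upsilon).min() > -1e-10)
```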

2.5 Examples

Functional equation (2) with a noisy observation of Y can be encountered in different fields, as we have already discussed in the introduction. Just to cite a few examples, in digital image analysis, see Chalmond (2003) [7], we have image-type data from which we wish to extract information that is of interest to the user. For instance, if the image is perturbed by an illumination fault, this fault must be removed to provide a more readable image. For a pair of similar images, we wish to estimate the deformation field that transforms one image into the other. Another example is given by tomography, where, from radiographic projections of an object, we wish to create a three-dimensional reconstruction of its contents. We deal with inverse scattering problems when we wish to determine the support of an inhomogeneous medium from knowledge of the far field pattern of the scattered field. Applications are given by the problem of detecting leukemia in the human body or by the problem of reconstructing object boundaries in shallow water using sonar data. In cancer therapy, linear inverse problem theory is used to recover functionals which can be used for computing hyperthermia treatment plans. Solving inverse problems is also required with spectral data. Take for instance the example of a two dimensional membrane. The goal is to find properties of the membrane or properties of a force on the membrane. The data are natural frequencies or mode shape measurements. Several other examples of applications of linear inverse problems could be provided, but a complete review is beyond the scope of this paper. In this subsection, we develop some examples of applications of our model encountered in statistics and econometrics.

Example 1: Density estimation


We consider the classical statistical problem of density estimation. A review of Bayesian nonparametric and semiparametric density estimates is provided by Hjort (1996) [25]. Bayesian density estimation based on mixtures of Dirichlet processes has been proposed for instance by Escobar et al. (1995) [13] and by Florens et al. (1992) [17]. Density estimation using Bernstein polynomials as priors was proposed by Petrone (1999) [36]. Alternatively, Polya tree priors have been used for instance by Ferguson (1974) [14] and Lavine (1992) [29]. We consider a different approach.

Let X = L²π(R) and Y = L²ρ(R), with π and ρ two measures on R different from the Lebesgue measure. We consider a real-valued random variable ξ with a distribution characterized by the c.d.f. F, F(ξ̄) = P(ξ ≤ ξ̄). If this distribution admits a density, then there exists a nonnegative function f(ξ), known as the probability density function of ξ, such that f(ξ) = dF/dξ. The function f(ξ) belongs to X and it is characterized as the solution of an inverse problem. To obtain an estimate of the probability density we should solve an integral equation of the first kind, given that F is unknown and an i.i.d. sample ξ1, . . . , ξn from F is available. Using the random sample we approximate F, for instance with the empirical distribution function

$$\hat F_n(\bar\xi) = \frac{1}{n}\sum_{i=1}^{n} 1\{\xi_i \le \bar\xi\}.$$

The inverse problem we have to solve becomes

$$\hat F_n(\bar\xi) = \int_{-\infty}^{\bar\xi} f(u)\,du + U_n,$$

with K : L²π(R) → L²ρ(R) the integral operator with kernel 1{u ≤ ξ̄}/π(u) and Un the estimation error. If 1{u ≤ ξ̄}/π(u) is square integrable with respect to the product of measures π(u)ρ(ξ̄), K is a Hilbert-Schmidt operator and hence compact. The choice of the spaces of reference thus entails that K is a compact operator. The solution of this inverse problem can also be computed in the classical way by applying, for instance, a Tikhonov regularization scheme. The regularized solution would be fαn = (αnI + K∗K)⁻¹K∗F̂n, where αn denotes a regularization parameter that declines to zero with n and K∗ : L²ρ(R) → L²π(R) is the adjoint of K, with kernel 1{ξ̄ ≥ u}/ρ(ξ̄). The solution fαn is continuous in F̂n; then, by the Glivenko-Cantelli theorem, it converges toward the true density function, see Vapnik (1998), Theorem 7.1 [41] (a discretized sketch of this classical Tikhonov solution is given at the end of this subsection).

To get a Bayesian estimate of f exploiting our approach, we infer the sampling probability P^f from the asymptotic properties of the empirical distribution function. It is a well known result that √n(F̂n − F) weakly converges toward a zero mean Gaussian process GF with covariance kernel E(GF(tj), GF(tl)) = F(tj ∧ tl) − F(tj)F(tl) and such that GF(±∞) = 0 and GF is tight. Therefore, P^f is asymptotically a Gaussian measure with mean F and covariance operator Σn = (1/n) ∫_R̄ E(GF(tj), GF(tl)) dtj. The prior distribution is specified as a Gaussian measure centered at a density function. It should be remarked that a Gaussian prior distribution does not necessarily generate density functions (i.e. non-negative functions integrating to one). A prior distribution generating sample paths that are densities is obtained by Lenk (1988) [31] through a logistic transform of a Gaussian process. However, this more complicated prior makes computation of the posterior quite hard since we are no longer in the linear case. The fact that part of the available prior information is not exploited is the price to pay for having a model that is tractable and a posterior distribution that is easily computable.

Example 2: Regression estimation

Another example of an inverse problem is given by regression function estimation. This problem is of particular interest in statistics, econometrics and machine learning, see Rasmussen et al. (2006) [37]. Different approaches can be found in the Bayesian literature, see for instance Hanson et al. (2002) [22], who use a mixture of Polya trees, or Smith et al. (1996) [39], who use spline models. Let (ξ, w) be an R^{1+p}-valued random vector with distribution characterized by a c.d.f. F and let L²F(w) be the space of square integrable functions of w; we define the regression function of ξ given w as a function m(w) ∈ L²F(w) such that


$$\xi = m(w) + \varepsilon, \qquad E(\varepsilon|w) = 0, \qquad E(\varepsilon^2|w) = \sigma^2.$$

In other words, m(w) = E(ξ|w), and estimating the regression function is equivalent to estimating the conditional density f(ξ|w), as proposed in Vapnik (1998) [41], Section 1.9, and plugging it into the integral defining m(w). We follow here another approach and characterize m(w) directly as the solution of an inverse problem. Let g(w, t) be a known function from R^p × R into R defining a Hilbert-Schmidt integral operator with respect to w; then E(g(w, t)ξ) = E(g(w, t)m(w)), where the expectation is taken with respect to F. The fact that K is Hilbert-Schmidt ensures that Km ∈ L²π(R), with π a measure on R; moreover, the fact that ξ has finite second moment also ensures that E(g(w, t)ξ) ∈ L²π(R). We suppose F(ξ|w) is unknown while F(·, w) is known; this implies that the LHS of the functional equation must be estimated but the operator K = ∫ g(w, t)dF(·, w) is known. If we dispose of a random sample (ξi, wi) we can replace the LHS of the functional equation with the consistent estimator

$$\hat E(g(w,t)\xi) := \frac{1}{n}\sum_{i=1}^{n} g(w_i, t)\,\xi_i.$$

The statistical inverse problem with estimated LHS becomes

$$\hat E(g(w,t)\xi) = Km(t) + U_n(t),$$

with K : L²F(w) → L²π(R) the integral operator with kernel g(w, t) and π a measure on R. The empirical process √n(Ê(g(w,t)ξ) − E(g(w,t)ξ)) weakly converges toward a zero mean Gaussian process with covariance operator

$$\Lambda = \int_{\mathbb{R}}\Big(\sigma^2\int_{\mathbb{R}^p} g(w,t)g(w,s)f(w)\,dw - E(g(w,t)\xi)E(g(w,s)\xi)\Big)\pi(s)\,ds.$$

So, the sampling measure P^m is approximately Gaussian with mean E(g(w,t)ξ) and variance (1/n)Λ. In most cases the c.d.f. F is completely unknown and the operator K must also be estimated in order to compute the posterior distribution. However, under some regularity assumptions this does not affect the speed of convergence of our estimator to the true solution. We refer to Section ?? for a deeper analysis of this extension.

Example 3: Hazard rate function estimation with right-censored survival data

Let X1, . . . , Xn be i.i.d. survival times with absolutely continuous distribution function, characterized by the c.d.f. F, hazard rate function h = F′/(1 − F) and integrated hazard function A(t) = ∫₀ᵗ h(u)du. We consider a sequence of survival times X1n, X2n, . . . , Xnn. The observational scheme is particular in the sense that we do not observe X1n, . . . , Xnn but only the right-censored sample (X̃in, Din), i = 1, . . . , n, where X̃in = Xin ∧ Uin and Din = 1(X̃in = Xin) for some sequence of censoring times U1n, . . . , Unn from a distribution function Gin. We suppose that the survival times X1n, . . . , Xnn and the censoring times U1n, . . . , Unn are mutually independent for each n. The aim is to get an estimate of the hazard rate function h, given an estimate of A(t), by solving the functional equation

$$\hat A_n(t) = \int_0^t h(u)\,du + U_n(t),$$

where Un(t) is introduced to account for the estimation error. We propose to estimate A(t) with the Nelson-Aalen estimator, which takes the form

$$\hat A_n(t) = \sum_{i:\, \tilde X_{in} \le t} \frac{D_{in}}{Y_n(\tilde X_{in})},$$

with Yn(t) = Σ_{i=1}^{n} 1(X̃in ≥ t), see Andersen et al. (1993) [2]. An approximation of the sampling distribution can be inferred from the asymptotic properties of the Nelson-Aalen estimator: Ân is uniformly consistent on a compact interval and

$$\sqrt{n}(\hat A_n - A) \to_d W \quad\text{as } n \to \infty.$$

Here, W is a Gaussian martingale with W(0) = 0 and Cov(W(s1), W(s2)) = σ²(s1 ∧ s2), where σ²(s) = ∫₀ˢ {h(u)/y(u)}du with y(s) = (1 − F(s))(1 − G(s−)). The structure of the Bayesian inverse problem (3) is thus recovered, with the operator K = ∫₀ᵗ · du, and the sampling distribution given by a zero mean Gaussian measure with covariance operator Σnψ = ∫ σ²(s1 ∧ s2)ψ(s1)ρ(s1)ds1, with ψ ∈ Y and ρ a measure on the domain of the functions in Y.

The method that we have just proposed to deal with inferential problems in survival analysis with right-censored data is really new with respect to the previous Bayesian literature. Alternative approaches can be found in Hjort (1990) [24], who proposes to use a beta process as a prior for the cumulative hazard process, in Ferguson et al. (1979) [15], who use processes neutral to the right as priors for nonparametric estimation of the c.d.f. F, or in Walker et al. (1997) [42], where the c.d.f. F is supposed to be drawn from a Beta-Stacy process. Bayesian models for hazard rates have been proposed by Dykstra et al. (1981) [11] and by Ishwaran et al. (2004) [26]. A semiparametric estimation of proportional hazards models, where the parameter of interest is the one involved in the relation between a duration and explanatory variables and the data distribution is treated as a nuisance parameter, has been proposed by Ruggiero (1994) [38].

Example 4: Deconvolution

Let (X, Y, Z) be a random vector in R³ such that Y = X + Z, with X independent of Z, and let φ(·), f(·), g(·) be the marginal density functions of X, Y and Z respectively. The density f(y) is defined to be the convolution of φ(·) and g(·),

$$f(y) = \varphi * g := \int \varphi(x)\,g(y - x)\,dx.$$

We assume that φ(·), f(·), g(·) are elements of L²π(R), where π is a symmetric measure assigning a weight decreasing to zero to points far from the median. Our interest is to recover the density φ(x). We suppose g(·) is known and x is not observable, but a sample of realizations of the random variable Y is available. The density f(y) is estimated nonparametrically, for instance with a kernel smoothing. The corresponding statistical model is

$$\hat f(y) = K\varphi(y) + U,$$

where K = ∫ g(y − x)dx is the known operator of our model in a certain L² space and U is the estimation error. The distribution of the process U should be inferred from the asymptotic properties of the nonparametric estimator f̂(y). This is not possible for a nonparametric estimator, since a nonparametric estimator defines an empirical process with trajectories that are discontinuous and independent at each point. To solve this problem, we propose to transform the model. Let A be a known operator with the property of smoothing the nonparametric estimate. For instance, it could be an integral operator A = ∫ a(y, t)dy between Hilbert spaces, characterized by the kernel function a(y, t). The transformed deconvolution model becomes Ey(a(y, t)) = AKφ(t), where Ey denotes the expectation taken with respect to y. If we substitute f(y) with the kernel estimator we get the error term V defined as V = ∫ a(y, t)f̂(y)dy − AKφ = (1/n) Σi (a(yi, t) − E(a(y, t))). The process √n V weakly converges toward a Gaussian process with zero mean and covariance operator with kernel E[(a(yi, t) − E(a(y, t)))(a(yi, τ) − E(a(y, τ)))].

Example 5: Instrumental Regression Model


Let (Y, Z, W) be a random vector whose components satisfy Y ∈ R, Z ∈ R^p and W ∈ R^q. Let F denote its cumulative distribution function and L²F be the Hilbert space of square integrable functions of (Y, Z, W). We consider the econometric model

$$Y = \varphi(Z) + \varepsilon, \qquad E(\varepsilon|W) = 0, \qquad (8)$$

which defines the instrumental regression function φ(Z) ∈ L²F(Z), where L²F(Z) ⊆ L²F is the space of square integrable functions depending on Z. ε is a homoskedastic error term with variance σ². φ(Z) is the parameter of interest and, by exploiting the condition in (8) and under some regularity assumptions, it can be seen as the solution of an integral equation of the first kind: E(Y|W) = E(φ(Z)|W). If we want to stay completely nonparametric, the estimator of the LHS gives an empirical process with discontinuous trajectories. We have the same kind of problem as in deconvolution in determining the (asymptotic) distribution of the estimation error. Hence, we need to transform the model by re-projecting it on L²F(Z). The instrumental regression is now characterized as the solution of E(E(Y|W)|Z) = Kφ with K = E(E(·|W)|Z). By substituting the LHS with a nonparametric estimator, we get a model like (3):

$$\hat E(\hat E(Y|W)|Z) = K\varphi + U.$$

We can make different assumptions concerning the degree of knowledge of F. If F is partially known, namely F(Z, W) is known, then only the conditional expectation E(Y|W) has to be estimated. In the alternative case with F completely unknown, both the whole LHS and the operator need to be estimated. The extension of the basic model to the case of an unknown operator is treated in Section ?? of this paper. In both cases, the (approximate) distribution of the estimation error will be a centered Gaussian with covariance operator (1/n)σ²K∗K, where K∗ denotes the adjoint of K. We refer to Florens and Simoni (2007) [18] for a complete treatment of this model.
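Returning to Example 1, the following discretized sketch (ours; the sample, grid and α are arbitrary) computes the classical Tikhonov solution fα = (αI + K∗K)⁻¹K∗F̂n mentioned there, using plain grid inner products as a simplification of the weighted L²π/L²ρ spaces.

```python
import numpy as np

# Illustrative sketch of Example 1: density estimation from the empirical c.d.f.
# via a discretized Tikhonov-regularized inverse of the integration operator K.
rng = np.random.default_rng(3)
xi = rng.normal(0.0, 1.0, size=500)                  # i.i.d. sample from N(0,1)
u = np.linspace(-4.0, 4.0, 400)                      # grid for the density f
du = u[1] - u[0]

K = (u[None, :] <= u[:, None]).astype(float) * du    # (Kf)(xbar) ~ int 1{u<=xbar} f(u) du
F_hat = (xi[None, :] <= u[:, None]).mean(axis=1)     # empirical c.d.f. on the grid

alpha = 1e-3
f_alpha = np.linalg.solve(alpha * np.eye(len(u)) + K.T @ K, K.T @ F_hat)

true_density = np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)
print(np.sum(np.abs(f_alpha - true_density)) * du)   # rough L1 error of the estimate
```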

3 Solution of the Ill-Posed Inverse Problem

In this section we compute the solution of the statistical inverse problem (3). We recall that the solution to inverse problem (2), restated in a larger space of probability measures, is the posterior distribution of the quantity of interest x(t). We will use µ^F to denote this posterior distribution. Due to the infinite dimension of the Bayesian experiment, the application of Bayes' theorem is not straightforward and we have to be careful in defining and computing the posterior distribution. Three points require discussion: the existence of a regular version of the conditional probability on E given F, the fact that it is a Gaussian measure, and its continuity.

3.1 Regularity of the Bayesian Experiment: existence of a regular version of the posterior probability

In constructing the Bayesian experiment we have defined Π as the product of µ and P^x. To make this experiment operational, the existence of an inverse decomposition of Π, Π = P ⊗ µ^F, is required; this is ensured if a regular version of the posterior probability exists. Consider the two probability spaces (X, E, µ) and (Y, F, P) defining the Bayesian experiment. First of all, we prove that the conditional probability on E given F exists. Then, we find that it can be characterized by a transition, so that P(x|Ŷ) is well defined and, if recomposed with P, gives the joint measure Π.

Let L²(X × Y) be the Hilbert space of square integrable random variables defined on X × Y with the inner product

$$\langle \varphi(x,\hat Y), \psi(x,\hat Y)\rangle_{L^2} = E\big(\varphi(x,\hat Y)\,\psi(x,\hat Y)\big), \qquad \forall \varphi, \psi \in L^2(\mathcal{X}\times\mathcal{Y}),$$

where the expectation is taken with respect to Π. Consider the subset L²(Y) ⊂ L²(X × Y) of square integrable functions defined on Y. The conditional probability of a measurable set A ∈ E given F, µ^F(A), is the projection of 1(x ∈ A) on L²(Y). L²(Y) is a closed convex subset of L²(X × Y) (closed with respect to the scalar product on L²(X × Y)), which implies that for every φ ∈ L²(X × Y) the conditional expectation E(φ|F) exists and is unique, and so the conditional probability exists.

A conditional probability is called regular if a transition probability characterizing it exists. In general, a conditional probability need not be a transition, since the relations characterizing a probability hold only outside a negligible set; in particular, countable additivity is difficult to satisfy. Nevertheless, the existence of such a transition inducing the conditional probability is guaranteed by an important theorem, known as Jirina's theorem (see Neveu (1965) [33]):

Proposition 1 Let (X × Y, E ⊗ F, Π) be a probability space that is Polish. Then for every σ-subalgebras G1 ⊂ E ⊗ F and G2 ⊂ E ⊗ F, there exists at least one "regular conditional probability" on G2 given G1.

This result implies that if we pick as σ-subalgebras E and F, then P(A|F) ≡ µ^F(A), ∀A ∈ E, is a regular conditional probability, which entails the existence of the posterior distribution and of the inverse decomposition Π = P ⊗ µ^F. The crucial hypothesis for having a regular Bayesian experiment is that the space we are working in is Polish (a complete separable metric space). This is not an unrealistic assumption since many spaces satisfy this property. As we have remarked before, the frequently used L² space, with a Gaussian measure defined on it, is a Polish space.

A last remark needs to be stressed. Given the existence of the solution of the Bayesian statistical inverse problem (3), we are interested in checking whether it is continuous in the sample Ŷ. Continuity is crucial for the asymptotic properties of the estimator. In particular we are interested in posterior consistency, whose definition will be given in Section 4. Problems of inconsistency are frequent in nonparametric Bayesian experiments, see Diaconis & Freedman (1986) [9]. To circumvent them, we have to guarantee that the posterior mean, which characterizes the measure µ^F, is continuous in Ŷ.

3.2 Gaussian Posterior Probability

In this section we briefly recover results that are well known in the literature, see for instance Mandelbaum (1984) [32], but that are very important for our study. Given two jointly distributed Gaussian processes x and Ŷ, the conditional expectation E(x|Ŷ), determined by the relation

$$E(\langle x, h\rangle \mid \hat Y) = \langle h, E(x|\hat Y)\rangle \quad a.s., \qquad \forall h \in \mathcal{X}, \qquad (9)$$

exists and is an affine transformation of Ŷ. Existence is shown in Neveu (1975) [34], together with the convergence, as k → ∞,

$$E(x \mid \langle\hat Y, \psi_1\rangle, \ldots, \langle\hat Y, \psi_k\rangle) \to E(x|\hat Y) \quad a.s., \qquad (10)$$

valid for any orthonormal basis {ψj}. Let {λj, ψj}j be the eigensystem of Υyy; it follows that the < Ŷ, ψj >, j = 1, 2, . . ., are independent, with < Ŷ, ψj > ∼ N(< Kx0, ψj >, λj), and that, for all h ∈ X, E(< x, h > | < Ŷ, ψ1 >, . . . , < Ŷ, ψk >) is equal to

$$\langle x_0, h\rangle + \sum_{j=1}^{k}\frac{\langle \Omega_0 K^{*}\psi_j, h\rangle}{\lambda_j}\,\langle \hat Y - Kx_0, \psi_j\rangle,$$

which in turn is equal to < h, AkŶ + bk >, with AkŶ = Σ_{j=1}^{k} Ω0K∗Υyy⁻¹ψj < Ŷ, ψj > and bk = x0 − Σ_{j=1}^{k} Ω0K∗Υyy⁻¹ψj < Kx0, ψj >. Then, by (9), E(x| < Ŷ, ψ1 >, . . . , < Ŷ, ψk >) = AkŶ + bk and, by (10), E(x|Ŷ) = AŶ + b, where A = lim Ak and b = lim bk.

Furthermore, it is easy to show that the conditional distribution of x given Ŷ, µ^F, is Gaussian. Let ε = x − E(x|Ŷ); then, by definition of conditional expectation, ε is independent of Ŷ and < ε, h > ∼ N(0, < (Ω0 − AKΩ0)h, h >), for each h ∈ X. Consider the characteristic function of < x, h >:

$$E\big(e^{i\langle x,h\rangle}\mid \hat Y\big) = E\big(e^{i\langle \varepsilon,h\rangle} e^{i\langle E(x|\hat Y),h\rangle}\mid \hat Y\big) = E\big(e^{i\langle \varepsilon,h\rangle}\big)\, e^{i\langle E(x|\hat Y),h\rangle} = e^{i\langle E(x|\hat Y),h\rangle - \frac{1}{2}\langle(\Omega_0 - AK\Omega_0)h,\,h\rangle},$$

which is the characteristic function of a Gaussian random variable. Moreover, we will show in the next section that (Ω0 − AKΩ0) = Var(x|Ŷ). Given the one-to-one correspondence between distributions and characteristic functions, we can say that < x, h > |Ŷ is Gaussian for every h ∈ X and, from the definition of a Gaussian process, we conclude that µ^F is Gaussian with mean AŶ + b.
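In finite dimension these conditional moments can be written down explicitly and checked by simulation. The sketch below (toy matrices, our own check, anticipating the explicit expression A = Ω0K∗(Σn + KΩ0K∗)⁻¹ derived in the next subsection) verifies that the residual x − E(x|Ŷ) has covariance Ω0 − AKΩ0 and is uncorrelated with Ŷ.

```python
import numpy as np

# Illustrative finite-dimensional check of the posterior formulas: with
# x ~ N(x0, Omega0) and Yhat | x ~ N(Kx, Sigma_n), the conditional mean is
# A Yhat + b with A = Omega0 K' (Sigma_n + K Omega0 K')^{-1}, b = (I - A K) x0,
# and the conditional covariance is V = Omega0 - A K Omega0.
rng = np.random.default_rng(4)
p, q = 5, 7
M = rng.standard_normal((p, p)); Omega0 = M @ M.T + np.eye(p)
N = rng.standard_normal((q, q)); Sigma_n = N @ N.T + np.eye(q)
K = rng.standard_normal((q, p))
x0 = rng.standard_normal(p)

A = Omega0 @ K.T @ np.linalg.inv(Sigma_n + K @ Omega0 @ K.T)
b = x0 - A @ K @ x0
V = Omega0 - A @ K @ Omega0

# Monte Carlo check: eps = x - (A Yhat + b) has covariance V and is
# uncorrelated with Yhat.
R = 200_000
x = x0 + rng.multivariate_normal(np.zeros(p), Omega0, size=R)
Yhat = x @ K.T + rng.multivariate_normal(np.zeros(q), Sigma_n, size=R)
eps = x - (Yhat @ A.T + b)
print(np.abs(np.cov(eps.T) - V).max())              # ~ 0 up to Monte Carlo error
print(np.abs(np.cov(eps.T, Yhat.T)[:p, p:]).max())  # cross-covariance ~ 0
```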

3.3 Computation of the Posterior Probability

We have proved that the posterior distribution of the parameter x exists, is a transition probability and is a Gaussian measure: x|Ŷ ∼ GP(AŶ + b, V), with A : Y → X and V : X → X two operators and b an element of X, all to be determined. The computation of these quantities is trivial in finite dimensional spaces, but it may raise problems of continuity when the spaces are infinite dimensional, causing the operator A not to be well-defined. Consider the covariance operator of the stochastic process (x, Ŷ) ∈ X × Y and in particular the covariance between the two components of this process:

$$\begin{aligned}
Cov(\langle x,\varphi\rangle,\langle\hat Y,\psi\rangle) &= Cov\big(E(\langle x,\varphi\rangle\mid\hat Y),\langle\hat Y,\psi\rangle\big)\\
&= Cov\big(\langle E(x|\hat Y),\varphi\rangle,\langle\hat Y,\psi\rangle\big)\\
&= Cov\big(\langle A\hat Y+b,\varphi\rangle,\langle\hat Y,\psi\rangle\big)\\
&= Cov\big(\langle A\hat Y,\varphi\rangle,\langle\hat Y,\psi\rangle\big)\\
&= Cov\big(\langle\hat Y,A^{*}\varphi\rangle,\langle\hat Y,\psi\rangle\big)\\
&= \langle(\Sigma_n+K\Omega_0K^{*})A^{*}\varphi,\psi\rangle \qquad (11)
\end{aligned}$$

for any φ ∈ X and ψ ∈ Y. By definition of the covariance operator, Cov(< x, φ >, < Ŷ, ψ >) = < Υ12φ, ψ >, where Υ12 is a component of the operator Υ determined in Theorem 1. Exploiting this equality allows one to obtain a functional equation defining the operator A∗:

$$(\Sigma_n + K\Omega_0K^{*})A^{*}\varphi = K\Omega_0\varphi, \qquad (12)$$

for any φ ∈ X or, equivalently, A(Σn + KΩ0K∗)ψ = Ω0K∗ψ, where ψ ∈ Y. Without paying too much attention to equation (12), we could define A, in an erroneous way, as A = Ω0K∗(Σn + KΩ0K∗)⁻¹. This solution is clearly not well-defined since (Σn + KΩ0K∗) is compact² and its inverse is not continuous. Moreover, dim R(Σn + KΩ0K∗) = ∞, so the eigensystem associated with the operator (Σn + KΩ0K∗) is such that there is a countable set of eigenvalues having (only) zero as accumulation point. Therefore, we are dealing with an ill-posed inverse problem. In other words, restating the inverse problem in a larger space of probability distributions does not remove the ill-posedness, since it is not possible to compute the posterior distribution of x and, even if it were possible, the posterior mean would not be continuous in Ŷ, causing problems of posterior inconsistency.

The past literature concerning Bayesian inverse problems, see Mandelbaum (1984) [32] and Lehtinen et al. (1989) [30], proposed, as a strategy to solve this problem of non-continuity, to restrict the space of the observable Ŷ. It was implicitly assumed that Ŷ ∈ R(Σn + KΩ0K∗) or that Ŷ belongs to a subspace of it. We do not wish to make this kind of restriction, since we want to admit any trajectory Ŷ, not only those in R(Σn + KΩ0K∗). Thus, a different strategy, based on Tikhonov regularization, is proposed in the next paragraph.

² As we have shown in Section 2, the operators Σn and Ω0 are compact. The operator K is compact by assumption. It follows that KΩ0K∗ is compact and that (Σn + KΩ0K∗) is compact, since it is a linear combination of compact operators (Kress, 1999, [28], Theorems 2.15, 2.16).
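A small numerical illustration of this ill-posedness (our sketch, with arbitrary kernels and grids): discretizing Σn + KΩ0K∗ on finer and finer grids, its smallest eigenvalues approach zero and the matrix to be inverted becomes increasingly ill-conditioned.

```python
import numpy as np

# Illustrative sketch of why A = Omega0 K* (Sigma_n + K Omega0 K*)^{-1} is not
# well defined in infinite dimension: as the discretization of the compact
# operator Sigma_n + K Omega0 K* is refined, its eigenvalues pile up near 0
# and its condition number explodes.
def discretized_operator(m, n=100):
    s = np.linspace(0.0, 1.0, m); ds = s[1] - s[0]
    Sigma_n = np.exp(-np.abs(s[:, None] - s[None, :]) / 0.2) * ds / n
    Omega0 = np.exp(-np.abs(s[:, None] - s[None, :]) / 0.3) * ds
    K = np.exp(-(s[:, None] - s[None, :])**2 / 0.01) * ds
    return Sigma_n + K @ Omega0 @ K.T

for m in (50, 100, 200, 400):
    eig = np.linalg.eigvalsh(discretized_operator(m))
    print(m, eig.max() / eig.min())   # condition number grows with the grid size
```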


3.4 Regularized Posterior distribution of Gaussian Processes

The unboundedness of A prevents the posterior distribution from being continuous in Ŷ and therefore from being consistent. While in the past inverse problem literature this problem has not been explicitly considered, we propose to solve it by applying a Tikhonov regularization scheme to equation (12). We define the regularized operator Aα as

$$A_\alpha = \Omega_0K^{*}(\alpha_n I + \Sigma_n + K\Omega_0K^{*})^{-1}, \qquad (13)$$

where αn > 0 is a regularization parameter appropriately chosen such that αn → 0 with n. We could interpret the Tikhonov-regularized Aα as the operator that we would obtain if we considered a new Bayesian experiment Ŷ = Kx + U + η, with η a further error term with variance αnI. In this case the sampling measure would define a covariance operator αnI + Σn. This covariance operator is not nuclear, so the trajectories generated by this distribution would not be in the Hilbert space Y. This interpretation is useful since it provides us with a new Bayesian method for selecting the regularization parameter. In particular, we could specify a prior distribution on α and take some function of its posterior distribution as an estimator for it. We do not develop this point here, but it will be the object of future research.

In the following, we recover expressions for b and V; since they depend on A, we can compute them only if we replace A with its regularized version Aα. To identify the function b of the posterior mean we use an iterated expectation argument:

$$E(x) = E\big(E(x|\hat Y)\big) = A\,E(\hat Y) + b, \qquad x_0 = AKx_0 + b.$$

Then b = (I − AK)x0 and the corresponding regularized version of b is obtained by substituting A with Aα:

$$b_\alpha = (I - A_\alpha K)x_0. \qquad (14)$$

To identify the operator V = Var(x|Ŷ) we assume that V is homoskedastic and use the decomposition of the unconditional variance:

\[
\mathrm{Var}(x) = \Omega_0 = E\bigl(\mathrm{Var}(x|\hat Y)\bigr) + \mathrm{Var}\bigl(E(x|\hat Y)\bigr) = V + A(\Sigma_n+K\Omega_0K^{*})A^{*} = V + AK\Omega_0,
\]

where the last equality uses (12) and gives the same expression for V as in Section 3.2. Using the regularized version of A we obtain the regularized V:

\[
V_\alpha = \Omega_0 - \Omega_0K^{*}(\alpha_nI+\Sigma_n+K\Omega_0K^{*})^{-1}K\Omega_0. \tag{15}
\]

These regularized objects characterize a new distribution that is normal with mean (AαŶ + bα) and covariance operator Vα. This distribution is called the regularized posterior distribution and is denoted by µFα. Note that µFα differs from the posterior distribution µF: it is a new object that we define to be the solution of the signal-noise problem and that, as we will show in Section 4, converges toward the true posterior distribution. Moreover, we keep as point estimator of x the regularized posterior mean

\[
E_\alpha(x|\hat Y) = \Omega_0K^{*}(\alpha_nI+\Sigma_n+K\Omega_0K^{*})^{-1}\hat Y
+ \bigl(I - \Omega_0K^{*}(\alpha_nI+\Sigma_n+K\Omega_0K^{*})^{-1}K\bigr)x_0. \tag{16}
\]

From this expression we clearly see that, without a regularization technique, the posterior mean would not be continuous in Ŷ. Therefore, when the measurement error goes to zero, Ŷ would converge to the exact transformation Y, but the solution of the inverse problem (i.e. the posterior mean) would not converge toward the true solution.
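For concreteness, the following sketch implements formula (16) on a discretized grid. It is only an illustration, not part of the theory: the grid, the kernels chosen for K, Ω0 and Σn, the prior mean and the value of αn are all hypothetical choices of ours.

```python
# Minimal numerical sketch of the regularized posterior mean (16).
# Everything below (grid, kernels, alpha_n) is an illustrative assumption.
import numpy as np

m = 200                                      # discretization of [0, 1]
s = np.linspace(0.0, 1.0, m)
w = 1.0 / m                                  # quadrature weight (uniform measure)

K = np.minimum.outer(s, s) * w               # operator with kernel s ∧ t
Omega0 = 2.0 * np.exp(-np.subtract.outer(s, s) ** 2) * w          # prior covariance
Sigma_n = (1.0 / 1000) * np.exp(-np.subtract.outer(s, s) ** 2) * w  # sampling covariance

x_true = -3 * s ** 2 + 3 * s                 # "true" x used to simulate the data
x0 = -2.8 * s ** 2 + 2.8 * s                 # prior mean
rng = np.random.default_rng(0)
lam, V = np.linalg.eigh(Sigma_n)
U = V @ (np.sqrt(np.clip(lam, 0, None)) * rng.standard_normal(m))  # U ~ GP(0, Sigma_n)
Y_hat = K @ x_true + U

alpha_n = 2e-3
C = alpha_n * np.eye(m) + Sigma_n + K @ Omega0 @ K.T      # alpha_n I + Sigma_n + K Omega0 K*
A_alpha = Omega0 @ K.T @ np.linalg.inv(C)                 # regularized operator (13)
x_hat = A_alpha @ Y_hat + (np.eye(m) - A_alpha @ K) @ x0  # regularized posterior mean (16)
# Without alpha_n the matrix Sigma_n + K Omega0 K* is numerically near-singular,
# which is the discrete counterpart of the ill-posedness discussed above.
```

In this discretization the condition number of Σn + KΩ0K* grows quickly with the grid size, while adding αnI keeps the inversion stable; this is exactly the role of the regularization.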

3.4.1 Alternative regularized solutions

The definition of the operator A required solving an ill-posed inverse problem: (Σn + KΩ0K*)A*ϕ = KΩ0ϕ. There are actually two different ways to obtain a regularized solution of this functional equation and, consequently, to define A. The first option is to regularize the inverse of the operator (Σn + KΩ0K*); this is what we did in the previous section. Alternatively, we could regularize the Moore–Penrose generalized inverse of (Σn + KΩ0K*), defined as the unique minimal-norm solution of the normal equation (Σn + KΩ0K*)²A*ϕ = (Σn + KΩ0K*)KΩ0ϕ. In this case the regularized operator A* is

\[
A_\alpha^{*} = \bigl(\alpha_nI + (\Sigma_n+K\Omega_0K^{*})^{2}\bigr)^{-1}(\Sigma_n+K\Omega_0K^{*})K\Omega_0.
\]

Consideration of asymptotic properties leads us to prefer the first regularization. In fact, to obtain the same speed of convergence of the regularized posterior mean that we will determine in the next section, regularization of the Moore–Penrose inverse requires the stronger condition that δ∗ ∈ R((Ω0^{1/2}K*KΩ0^{1/2})^{β}) be satisfied.

3.4.2 Tikhonov regularization in the Prior Variance Hilbert scale

We propose in this subsection an alternative regularization scheme for recovering A, based on Tikhonov regularization in the Hilbert scale induced by the inverse of the prior covariance operator; see Engl et al. (2000) [12] for the general theory. Let L = Ω0^{-1/2} be a densely defined, unbounded, self-adjoint and strictly positive operator in the Hilbert space X³. The norm ||·||s is defined as ||x||s := ||L^{s}x||. We define the Hilbert scale Xs induced by L as the completion of the domain of L^{s}, D(L^{s}), with respect to the norm ||·||s; moreover, Xs ⊆ X_{s'} if s' ≤ s, ∀s ∈ R. Usually, when a regularization scheme in a Hilbert scale is adopted, the operator L, and consequently the Hilbert scale, is created ad hoc. In our case, on the contrary, the Hilbert scale is not created ad hoc but is suggested by the prior information we have, and this represents a substantial difference and advantage with respect to standard methods. For the theory to work it is necessary that the first of the following assumptions be satisfied.

Assumption 5
(i) ||KΩ0^{1/2}x|| ∼ ||Ω0^{a/2}x||, ∀x ∈ X;
(ii) (x∗ − x0) ∈ X_{β+1}, i.e. ∃ρ∗ ∈ X such that (x∗ − x0) = Ω0^{(β+1)/2}ρ∗;
(iii) a ≤ s ≤ β + 1 ≤ 2s + a.

Some remarks about this assumption are in order. Assumption (i) says that in specifying the prior distribution we take the sampling model into account: the prior variance is linked to the sampling model and, in particular, to the operator K. This kind of prior specification is not new in the Bayesian literature, since it is similar to Zellner's g-prior, see Zellner (1986) [43] or Agliari et al. (1988) [1]⁴. The link between the prior covariance Ω0^{1/2} and the operator K is governed by the parameter a, which can be interpreted as a degree of ill-posedness. Therefore the prior is specified by taking into account not only the sampling model but also the degree of ill-posedness of the problem. Assumption (ii) is known as a source condition and is formulated in order to reach a certain speed of convergence of the regularized solution. Given Assumption 3, it says that δ∗ ∈ R(Ω0^{β/2}), hence X_{β+1} ≡ R(Ω0^{(β+1)/2}) ≡ D(L^{β+1}). The meaning of this assumption is that the prior distribution contains information about the regularity of the true value of x; the parameter β is interpreted as the regularity parameter. These two remarks stress the fact that we are not taking an arbitrary Hilbert scale, but the Hilbert scale linked to the prior: either we first choose the Hilbert scale and then use the information contained in it to specify the prior distribution, or we use the information contained in the prior distribution to specify the Hilbert scale.

³ More clearly, L = Ω0^{-1/2} is a closed operator in X satisfying: D(L) = D(L*) is dense in X, ⟨Lx,y⟩ = ⟨x,Ly⟩ for all x, y ∈ D(L), and there exists γ > 0 such that ⟨Lx,x⟩ ≥ γ||x||² for all x ∈ D(L).
⁴ A g-prior for the parameter β in the normal multiple regression model y = Xβ + u with u ∼ N(0, σ²In) consists in choosing a prior variance matrix of the type τ²(X'X)^{-1}, for some τ ∈ R.


The restriction β + 1 ≥ s means that the centered true value (x∗ − x0) has to be at least an element of Xs, and it guarantees that the norm ||L^{s}x|| exists ∀x ∈ X_{β+1}. Under these assumptions the regularized solution in Xs of equation (12) is

\[
A_s = \Omega_0K^{*}(\alpha_nL^{2s}+\Sigma_n+K\Omega_0K^{*})^{-1}. \tag{17}
\]

The regularized posterior distribution is then defined as in Section 3.4, with Aα replaced by As, and is denoted by µFs. The regularized posterior mean and variance are

\[
E_s(x|\hat Y) = A_s\hat Y + (I - A_sK)x_0, \qquad V_s = \Omega_0 - A_sK\Omega_0.
\]

This regularization method has the advantage of better exploiting the regularity of the true function x∗. A classical Tikhonov regularization yields a rate of convergence to zero of the regularization bias that is at most of order 2; on the contrary, with a Tikhonov scheme in a Hilbert scale, the smoother the function x∗ is, the faster the regularization bias converges to zero. Moreover, we will show in Section 4.1 that this scheme reaches a faster speed of convergence of the regularized posterior distribution toward the true solution.
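As an illustration of (17), one can approximate L^{2s} = Ω0^{-s} through the spectral decomposition of the discretized prior covariance. The sketch below is our own construction under that assumption; the function name and the truncation of numerically zero eigenvalues are ours, not part of the paper.

```python
# Sketch of the Hilbert-scale regularized operator A_s in (17), with L = Omega0^{-1/2}
# approximated through the eigendecomposition of the discretized Omega0 (our assumption).
import numpy as np

def hilbert_scale_operator(K, Omega0, Sigma_n, alpha_n, s, eps=1e-12):
    """Return A_s = Omega0 K* (alpha_n L^{2s} + Sigma_n + K Omega0 K*)^{-1}, L^{2s} = Omega0^{-s}."""
    lam, V = np.linalg.eigh(Omega0)            # Omega0 = V diag(lam) V'
    lam = np.clip(lam, eps, None)              # guard numerically zero eigenvalues
    L2s = V @ np.diag(lam ** (-s)) @ V.T       # Omega0^{-s}, unbounded as eps -> 0
    C = alpha_n * L2s + Sigma_n + K @ Omega0 @ K.T
    return Omega0 @ K.T @ np.linalg.inv(C)

# Usage sketch: A_s = hilbert_scale_operator(K, Omega0, Sigma_n, alpha_n, s)
#               x_hat_s = A_s @ Y_hat + (np.eye(len(Y_hat)) - A_s @ K) @ x0
```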

4 Asymptotic Analysis

We study here the consistency of the regularized posterior mean (which can be taken as a point estimator of the parameter of interest x) and of the regularized posterior distribution. For the moment we consider the regularized posterior distribution µFα defined in paragraph 3.4, obtained through a classical Tikhonov regularization scheme in Hilbert spaces; the consistency of the Hilbert-scale regularized posterior distribution µFs defined in 3.4.2 will be analyzed in subsection 4.1. A very important result, due to Doob (1949) [10], see Florens et al. (1990) [16], states that for any prior the posterior distribution is consistent, in the sense that it converges to a point mass at the unknown parameter, for all parameter values outside a set of prior mass zero. In practice, however, one cannot be so certain about the prior, and the true parameter may well fall in the set where consistency fails. To get around this problem it is customary to use a frequentist notion of consistency: the data are regarded as generated from a distribution characterized by the true value of the parameter, and one checks the accumulation of the posterior distribution in neighborhoods of this true value. The aim of this section is to analyze "frequentist" consistency of the recovered posterior distribution. If P^x denotes the sampling probability, this means that we analyze convergence P^x-a.s., or convergence in probability with respect to the measure P^x, of the regularized version of the posterior distribution that we have defined. Following Diaconis & Freedman (1986) we give the following definition of posterior consistency:

Definition 1 The pair (x, µF) is consistent if µF converges weakly to δx as n → ∞ under P^x-probability or P^x-a.s., where δx is the Dirac measure in x. The posterior probability µF is consistent if (x, µF) is consistent for all x.

If (x, µF) is consistent in the previous sense, the Bayes estimate of x under quadratic loss (i.e. the posterior mean) is consistent too. The meaning of this definition is that, for any neighborhood U of the true parameter x, the posterior probability of the complement of U converges toward zero as n → ∞: µF(U^c) → 0 in P^x-probability, or P^x-a.s. Therefore, since the posterior distribution expresses one's knowledge about the parameter, consistency stands for convergence of knowledge towards perfect knowledge as the amount of data increases. In general, in an identified i.i.d. model with finite-dimensional parameter space we have posterior consistency, or consistency in the sampling sense, if the true value of the parameter is in the support of the prior distribution. On the contrary, when the parameter space is infinite dimensional these conditions are no longer sufficient to guarantee consistency of the posterior, as remarked by Diaconis et al. (1986) [9]. Besides the infinite dimension of the parameter space, we also face the difficulty that we are dealing with the regularized posterior distribution µFα. We therefore extend the concept of posterior consistency so that it applies to the regularized posterior distribution, and it makes sense to speak of regularized posterior consistency. The principal result that we find is the consistency of the posterior distribution if the true parameter satisfies Assumption 3. However, as already noted in subsection 2.3, the prior is not able to generate a trajectory of x that satisfies the necessary condition given in this assumption; this finding is in line with the previous literature on Bayesian nonparametrics. To prove posterior consistency in the case of a Gaussian posterior measure, it is sufficient to prove consistency of the posterior mean and convergence to zero of the posterior variance. In fact, let x∗ be the true value of the parameter characterizing the DGP of Ŷ; by Chebyshev's inequality, for any sequence Mn → ∞,

\[
\mu_\alpha^F\{x : \|x-x_*\|_{\mathcal X} \ge M_n\varepsilon_n\}
\;\le\; \frac{E_\alpha(\|x-x_*\|_{\mathcal X}^2\,|\,\hat Y)}{(M_n\varepsilon_n)^2}
\;=\; \frac{\langle V_\alpha(x(t)|\hat Y),1\rangle_{\mathcal X} + \|E_\alpha(x|\hat Y)-x_*\|_{\mathcal X}^2}{(M_n\varepsilon_n)^2}
\;\le\; \frac{\|V_\alpha(x|\hat Y)\|_{\mathcal X} + \|E_\alpha(x|\hat Y)-x_*\|_{\mathcal X}^2}{(M_n\varepsilon_n)^2}, \tag{18}
\]

with π a measure on R. The RHS of (18) goes to zero if both terms in the numerator converge to zero. First, we consider consistency of the regularized posterior mean: ||Eα(x|Ŷ) − x∗||_X → 0 P^{x∗}-a.s. as n → ∞. For any true value x∗ ∈ X, the Bayes estimation error is

\[
E_\alpha(x|\hat Y) - x_* = \Omega_0K^{*}(\alpha_nI+\Sigma_n+K\Omega_0K^{*})^{-1}K(x_*-x_0)
+ \Omega_0K^{*}(\alpha_nI+\Sigma_n+K\Omega_0K^{*})^{-1}U - (x_*-x_0),
\]

and it converges to 0 under the conditions given in the theorem below. Let Φβ denote the β-regularity space of the operator KΩ0^{1/2}, i.e. Φβ ≡ R((Ω0^{1/2}K*KΩ0^{1/2})^{β/2}) for some β > 0.

Theorem 3 Under Assumptions 3 and 4, if αn → 0, (1/αn)trΣn → 0 and (1/αn³)||Σn||² ∼ Op(1), then:

(i) Eα(x|Ŷ) − x∗ → 0 in X norm, in P^{x∗}-probability;

(ii) moreover, if δ∗ ∈ Φβ for some β > 0, the bias is of order

\[
\|E_\alpha(x|\hat Y) - x_*\|^2 = O_p\Bigl(\alpha_n^{\beta\wedge2} + \frac{1}{\alpha_n^4}\|\Sigma_n\|^2\alpha_n^{(\beta+1)\wedge2} + \frac{1}{\alpha_n}\mathrm{tr}\Sigma_n\Bigr).
\]

The larger β is, the smoother the function δ∗ ∈ Φβ and the faster the regularization bias converges to zero. However, for a Tikhonov regularization scheme β cannot be greater than 2; this is why we bound it by 2 in αn^β. As already remarked in 3.4.2, with a classical Tikhonov regularization scheme it is useless for the function x∗ to have a degree of smoothness larger than or equal to 2. In the remainder of this section, to simplify the notation, we will not explicitly write β ∧ 2; it will be implicit that we are assuming β ≤ 2, and if β > 2 it must be set equal to 2.

The condition (1/αn³)||Σn||² ∼ Op(1) is sufficient to guarantee that (1/αn⁴)||Σn||²αn^{(β+1)∧2} → 0, since for every β we have (β+1) ∧ 2 > 1, and hence αn^{(β+1)∧2} converges to zero even after being simplified against the αn in the denominator. Furthermore, if we assume that trΣn is of the same order as ||Σn||, for instance trΣn ∼ Op(1/n) and ||Σn|| ∼ Op(1/n), convergence to zero of the second and third rates in the bias requires the conditions αn → 0 and αn^{3/2}n → ∞. Classical conditions for convergence of the solution of stochastic ill-posed problems are αn → 0 and αn²n → ∞ (see Vapnik (1998) [41]); therefore we require weaker conditions to get the optimal speed of convergence.

If trΣn is of the same order as ||Σn||, the fastest global rate of convergence is obtained when αn^β = (1/αn)||Σn||, that is, when the optimal regularization parameter αn∗ is proportional to

\[
\alpha_n^{*} \propto \|\Sigma_n\|^{\frac{1}{\beta+1}}.
\]

With the optimal value αn∗ the condition (1/αn³)||Σn||² ∼ Op(1) is ensured if β ≥ 1/2. Hence the speed of convergence of the regularized posterior mean is proportional to ||Σn||^{β/(β+1)}. Assuming that the trace and the norm of the covariance operator are of the same order is not really stringent: for instance, we may assume, as above, that they are both of order 1/n, which is true in almost all real examples.

Let us now proceed to the study of the regularized posterior variance. We want to check that ||Vαϕ|| → 0 for all ϕ ∈ X. We present below a result on the convergence to zero of the regularized posterior variance.
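As a quick worked example (ours, using only the rates stated above): suppose trΣn ∼ ||Σn|| ∼ n^{-1} and β = 1. Then

\[
\alpha_n^{*} \propto \|\Sigma_n\|^{\frac{1}{\beta+1}} = n^{-1/2}, \qquad
\frac{1}{(\alpha_n^{*})^{3}}\|\Sigma_n\|^{2} \propto n^{3/2}\,n^{-2} = n^{-1/2} = O_p(1), \qquad
\|E_\alpha(x|\hat Y)-x_*\|^{2} = O_p\bigl(\|\Sigma_n\|^{\frac{\beta}{\beta+1}}\bigr) = O_p\bigl(n^{-1/2}\bigr),
\]

consistently with the requirement β ≥ 1/2 discussed above.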

Theorem 4 Under Assumption 4, if αn → 0 and (1/αn³)||Σn||² ∼ Op(1), then:

(i) Vα(x|Ŷ)ϕ → 0 in X norm, in P^{x∗}-probability;

(ii) moreover, if the posterior variance is applied to ϕ ∈ X such that Ω0^{1/2}ϕ ∈ R((Ω0^{1/2}K*KΩ0^{1/2})^{β/2}), for some β > 0, it is of order

\[
\|V_\alpha(x|\hat Y)\varphi\|^2 = O_p\Bigl(\alpha_n^{\beta} + \frac{1}{\alpha_n^4}\|\Sigma_n\|^2\alpha_n^{(\beta+1)\wedge2}\Bigr).
\]

With the optimal αn∗, under the conditions of the above theorem and if β ≥ 1/2, the squared norm of the regularized posterior variance converges to zero at the speed ||Σn||^{β/(β+1)}; its norm is slower and is of order ||Σn||^{β/(2(β+1))}. Finally, from inequality (18) it follows that, when the regularized posterior variance converges to zero in P^{x∗}-probability and the posterior mean is consistent for the true value x∗, the regularized posterior distribution of x degenerates to the Dirac measure in x∗. Thus, under the fundamental assumption (x∗ − x0) ∈ H(Ω0), the regularized posterior probability of the complement of any neighborhood of the true parameter x∗, µFα{x : ||x − x∗||_X ≥ Mnεn}, goes to zero and, if trΣn ∼ Op(||Σn||), it is of the optimal order ||Σn||^{β/(2(β+1))}. We have in this way proved the posterior consistency of µFα. The term that matters in determining the speed of convergence of the regularized posterior distribution is the norm of the variance, since it converges at a slower rate than the squared norm of the bias.

Lastly, we wish to compare the speed of convergence obtained with our Bayesian method with the rate obtained by applying a classical Tikhonov resolution method to equation (2), as suggested by the classical literature on inverse problems. In the following we call these two methods the Bayesian method and the classical method, respectively; we refer to Engl et al. (2000) [12] or Carrasco et al. (2007) [5] for a review of the classical method. To simplify, we set x0 = 0. To make the comparison possible we have to consider a particular case for the prior covariance operator: Ω0 = c1(K*K)^γ, with c1 a constant of proportionality. In this particular case the fastest rate of convergence of the regularized posterior distribution is slower than the rate that would be obtained with the classical method. The regularity condition required in the classical method is x∗ ∈ R((K*K)^{γ/2}) and the optimal speed of convergence is (trΣn)^{γ/(γ+1)}, with γ ≤ 2, or γ set equal to 2 if γ ≥ 2. Therefore, if we choose β so as to have the same regularity condition, i.e. R((K*K)^{(γ+1)β/2}) = R((K*K)^{γ/2}) and hence β = γ/(γ+1), the fastest rate of convergence of the Bayesian method is proportional to (trΣn)^{γ/(2γ+1)}, which is slower than the classical one. This result is due to the fact that the Bayesian method increases the degree of ill-posedness. However, no comparison can be made outside of this particular form of Ω0. In the following subsection we show that if, for the computation of the regularized posterior distribution, we use a Tikhonov regularization in a Hilbert scale instead of a classical Tikhonov regularization, we can improve the speed of convergence and, more precisely, obtain the same speed of convergence as with the classical method.

4.1 Speed of convergence with Tikhonov regularization in the Prior Variance Hilbert Scale

We compute in this subsection the speed of convergence of the regularized posterior distribution with Tikhonov regularization in the Hilbert scale, under Assumption 5. The speed obtained in this case is faster than with a simple Tikhonov regularization scheme, and it is the same speed we would have obtained by solving the functional equation directly in a Hilbert scale without applying the Bayesian method. Throughout this subsection we suppose Assumption 5 holds; the attainable speed of convergence is given in the following theorem, whose proof is provided in Appendix 6.8.

Theorem 5 Let Es(x|Ŷ) and Vs be the regularized posterior mean and variance defined in Section 3.4.2. Under Assumptions 3, 4 and 5,

\[
\|E_s(x|\hat Y) - x_*\|^2 \sim O_p\Bigl(\alpha_n^{\frac{\beta+1}{a+s}} + \alpha_n^{\frac{1-a}{a+s}}\mathrm{tr}\Sigma_n\Bigr).
\]

Moreover, if the covariance operator Vs is applied to elements ϕ ∈ X such that Ω0^{1/2}ϕ ∈ R(Ω0^{β/2}), then

\[
\|V_s\varphi\|^2 \sim O_p\Bigl(\alpha_n^{\frac{\beta+1}{a+s}} + \frac{1}{\alpha_n^2}\|\Sigma_n\|^2\alpha_n^{\frac{\beta-a}{a+s}}\Bigr).
\]

The optimal αn is obtained by equating the two rates of convergence of the posterior mean:

\[
\alpha_n^{*} \propto (\mathrm{tr}\Sigma_n)^{\frac{a+s}{a+\beta}},
\]

and the corresponding optimal speed is proportional to (trΣn)^{(β+1)/(a+β)}. With this choice of the regularization parameter the remaining rates go to zero if β > (a+2s)/3. This constraint is binding with respect to the constraint in Assumption 5 (iii), i.e. (a+2s)/3 ≥ s − 1, if the ill-posedness parameter satisfies a ≥ s − 3. It should be noted that the parameter s characterizing the norm of the Hilbert scale plays no role in the speed of convergence. An advantage of Tikhonov regularization in a Hilbert scale is that we can also obtain a rate of convergence in other norms, namely ||·||r for −a ≤ r ≤ β + 1 ≤ a + 2s; the speed of convergence in these norms gives the speed of convergence of the estimate of the r-th derivative of the parameter of interest x.

If we directly solved the functional equation (3) with a Tikhonov regularization in a Hilbert scale we would obtain a solution xs = (αnL^{2s} + K*K)^{-1}K*Ŷ and a speed of convergence of order (trΣn)^{u/(ā+u)}, under the hypotheses ||Kx|| ∼ ||L^{-ā}x|| and x ∈ Xu, with ā the degree of ill-posedness. Comparing these assumptions with the Bayesian ones, it follows that ||KΩ0^{1/2}x|| ∼ ||L^{-ā}Ω0^{1/2}x|| and, substituting Ω0^{-1/2} for L, this norm is equivalent to ||Ω0^{(ā+1)/2}x||, which implies that the degree of ill-posedness in the Bayesian problem is greater than in the classical problem: a = ā + 1. Moreover, if we take the same regularity condition in the two problems, i.e. β + 1 = u, the rates of convergence of the regularized posterior and of the Tikhonov regularized solution in the Hilbert scale are the same. This confirms the improvement, in terms of speed of convergence, of the Tikhonov regularization in a Hilbert scale with respect to the classical Tikhonov regularization.

Take for instance the particular case Ω0 = (K*K) and impose the same regularity condition in X and in the Hilbert scale Xs. The regularity condition in Theorem 3 requires that δ∗ ∈ R((Ω0^{1/2}K*KΩ0^{1/2})^{γ/2}) ≡ R((K*K)^{γ}) for a certain γ > 0⁵, which implies (x∗ − x0) ∈ R((K*K)^{γ+1/2}). The regularity condition for the Hilbert-scale regularization is (x∗ − x0) ∈ R(Ω0^{(β+1)/2}) ≡ R((K*K)^{(β+1)/2}); hence the two conditions coincide if 2γ = β. Taking this value for β, the rate of convergence in the Hilbert scale Xs is proportional to (trΣn)^{(2γ+1)/(2γ+2)}, which is faster than the speed of convergence in X (proportional to (trΣn)^{γ/(γ+1)}). Even without restricting to this particular form of Ω0 it is possible to show the improvement, in terms of speed of convergence, obtained with a Hilbert scale. To this end it is sufficient that Assumption 5 (i) hold, since it implies the equivalence ||(Ω0^{1/2}K*KΩ0^{1/2})^{γ/2}v|| ∼ ||Ω0^{aγ/2}v||, for some v ∈ X. Then ||Ω0^{β/2}v|| ∼ ||Ω0^{aγ/2}v|| if and only if β = aγ (or β = (ā+1)γ). The optimal Bayesian speed of convergence with a Hilbert scale is (trΣn)^{(aγ+1)/(a+aγ)}, which is faster than the Bayesian speed of convergence with a classical Tikhonov scheme, (trΣn)^{γ/(γ+1)}, ∀γ > 0.

⁵ To distinguish it from the regularity parameter of the Hilbert scale, we use the letter γ, instead of β as in Theorem 3, for the regularity on X.

5 Conclusions

This paper analyzes the Bayesian solution of a functional equation in Hilbert spaces. We study Bayes theorem and the properties of the posterior distribution in spaces of infinite dimension. When the parameter of interest is infinite dimensional, its posterior distribution, and in particular the posterior mean, is not continuous. What is new in this paper is the construction of a new kind of posterior distribution, which we call the regularized posterior distribution, and which has the important property of being continuous in the observed quantity. With such an object it is possible to construct consistent estimators of the quantity of interest; for instance, we propose to take the regularized posterior mean as an estimator. We have computed the regularized posterior distribution in two ways: through a classical Tikhonov regularization scheme and through a Tikhonov regularization in the prior variance Hilbert scale. The second regularization is particularly interesting since the Hilbert scale that we use is naturally suggested by the prior distribution and is not chosen ad hoc, as usually happens in the inverse problems literature. In this paper we have considered only the basic case where both K and Σn are known. Florens & Simoni (2008) [19] extend the basic model and consider the case where K is unknown, the case where the operator K is specific to every observation, and the case of unknown Σn.

Before concluding, we wish to point out some remarks. First, all the regularization techniques that we have used require a regularization parameter αn that has to satisfy certain properties. In practice this parameter is unknown, so an estimation method for it should be envisaged. An efficient method is the data-driven method discussed in Engl et al. (2000) [12], Ch. 4, and implemented, among others, in Florens & Simoni (2007) [18]; it selects the value of αn that minimizes the residuals of the estimation, previously scaled by a certain power of αn. Alternatively, it would be possible to bring the regularization parameter into the Bayesian analysis, specify a prior distribution on it and obtain an estimator from its posterior distribution. This method is new and we are currently developing it.

The second remark concerns the variance Σn of the sampling measure. It may be the case that this variance is unknown or partially unknown. Therefore, following classical Bayesian practice, it would seem natural to specify a prior on it. There are different ways in which this could be implemented. We could for instance suppose that Σn is of the form σ²Λn, with σ² ∈ R₊, and only σ² unknown, so that a classical inverse-gamma prior on σ² can be used; this case is developed in Florens & Simoni (2007) [18]. However, if Λn is known and has a spectrum consisting of a countable number of nonzero eigenvalues, then σ² is actually known, since it can be recovered from the projections of Ŷ on the basis formed by the eigenfunctions of Λn. More clearly, let {λj, ϕj} be the eigensystem associated with Λn; then Ŷ | x, σ² ∼ N(Kx, σ²Λn) and ⟨(Ŷ − Kx), ϕj⟩²/λj | x, σ² ∼ σ²χ²₁. Therefore

\[
\frac{1}{J}\sum_{j=1}^{J}\frac{\langle(\hat Y - Kx),\varphi_j\rangle^2}{\lambda_j} \to \sigma^2 \quad\text{as } J \to \infty.
\]

This case is carefully analyzed in Florens & Simoni (2008) [19]; moreover, this argument is typically used in the literature on the estimation of diffusion processes, see Florens-Zmirou (1989) [20]. Alternatively, we could suppose that the entire operator is unknown. This would require specifying a prior measure on compact operators, but to the best of our knowledge a complete and easily manageable theory for this does not exist; the study by Cox (2007) [8] represents an example of work in this direction.
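The following sketch illustrates the projection argument above numerically. It is our own illustration: the covariance kernel chosen for Λn, the operator K, the value of σ² and the number of projections J are hypothetical.

```python
# Sketch: recovering sigma^2 from projections of (Y^ - Kx) on the eigenfunctions of a known Lambda_n.
# All modelling choices below (kernels, sigma2_true, J) are illustrative assumptions.
import numpy as np

m = 300
s = np.linspace(0.0, 1.0, m)
w = 1.0 / m
Lambda_n = (np.minimum.outer(s, s) - np.outer(s, s)) * w   # assumed known covariance (Brownian-bridge kernel)
K = np.minimum.outer(s, s) * w                             # assumed operator K
x = -3 * s ** 2 + 3 * s                                    # a fixed x
sigma2_true = 0.5

lam, phi = np.linalg.eigh(Lambda_n)                        # eigensystem {lambda_j, phi_j}
lam, phi = lam[::-1], phi[:, ::-1]                         # decreasing eigenvalues
lam = np.clip(lam, 0.0, None)

rng = np.random.default_rng(2)
U = phi @ (np.sqrt(sigma2_true * lam) * rng.standard_normal(m))   # U ~ N(0, sigma^2 Lambda_n)
Y_hat = K @ x + U

J = 25                                                     # leading J projections
proj = phi[:, :J].T @ (Y_hat - K @ x)                      # <(Y^ - Kx), phi_j>
sigma2_hat = np.mean(proj ** 2 / lam[:J])                  # each term is sigma^2 * chi^2_1; mean -> sigma^2
print(sigma2_hat)
```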


6 Appendix A

6.1 Sufficiency of the sample mean

Suppose we observe n trajectories of a Gaussian process Ŷi, i = 1, ..., n, satisfying the statistical model (3): Ŷ = Kx + U, with all the usual assumptions of Section 2. Let E = σ(x) and F = σ({Ŷi}_{i=1,...,n}) be the σ-fields of the parameter space and of the sample space, respectively, and let T = σ(n^{-1}Σ_{i=1}^{n}Ŷi) be the σ-field generated by the sample mean. We prove sufficiency of the sample mean, i.e. E ⊥ F | T, by considering the projected model. Let {λj; ψj}j be the singular system of the covariance operator Σ of the error term; the projected model is therefore

\[
\underbrace{\langle \hat Y_i,\psi_j\rangle}_{\equiv y_{ij}} = \underbrace{\langle Kx,\psi_j\rangle}_{\equiv x_j} + \langle U_i,\psi_j\rangle, \qquad i = 1,\dots,n,\quad j = 1,2,\dots,
\]

with ⟨Ui, ψj⟩ ∼ N(0, λj), ∀i, j, and Cov(⟨Ui, ψj⟩, ⟨Ui, ψ_{j'}⟩) = 0, ∀j ≠ j'. We only consider a finite number k of projections and, from normality of the error term, we get the log-likelihood

\[
\ln L(\{y_{ij}\}_{i\le n,\,j\le k}\,|\,\{x_j\}_{j\le k}) \;\propto\; \sum_{ij}\frac{1}{\lambda_j}\bigl(y_{ij}^2 + x_j^2 - 2y_{ij}x_j\bigr)
= \sum_{ij}\frac{1}{\lambda_j}y_{ij}^2 + n\sum_j\frac{1}{\lambda_j}\bigl(\langle x,K^{*}\psi_j\rangle\bigr)^2
- 2\sum_j\frac{1}{\lambda_j}\sum_i\langle \hat Y_i,\psi_j\rangle\langle x,K^{*}\psi_j\rangle
\]
\[
= \sum_{ij}\frac{1}{\lambda_j}y_{ij}^2 + n\sum_j\frac{1}{\lambda_j}x_j^2 - 2n\sum_j\frac{1}{\lambda_j}\langle \bar Y,\psi_j\rangle x_j,
\]

where Ȳ = n^{-1}Σ_i Ŷi denotes the sample mean. We have thus obtained that ln L({y_{ij}}_{ij} | {x_j}_{j≤k}) equals ln L({y_{ij}}_{ij} | {⟨Ȳ, ψj⟩}_{j≤k}) + ln L({⟨Ȳ, ψj⟩}_{j≤k} | {x_j}_{j≤k}), which, by the factorization principle, shows Ek ⊥ Fk | Tk, where Ek = σ({⟨x, K*ψj⟩}_{j≤k}), Fk = σ({⟨Ŷi, ψj⟩}_{i=1,...,n, j≤k}) and Tk = σ({⟨Ȳ, ψj⟩}_{j≤k}) are three filtrations. Theorem ??, with the tail σ-field of (Tk) given by T∞ = σ(Ȳ), allows us to conclude.

6.2 Proof of Lemma 1

Let (λj^{Ω0}, ϕj^{Ω0}) be the eigensystem associated with Ω0 and let ||·||_{Ω0} denote the norm in H(Ω0). Then (x − x0) ∈ H(Ω0) if

\[
\|x-x_0\|_{\Omega_0}^2 := \sum_{j=1}^{\infty}\frac{\langle x-x_0,\varphi_j^{\Omega_0}\rangle^2}{\lambda_j^{\Omega_0}} \sim O_p(1).
\]

By Kolmogorov's theorem this condition is satisfied if E||x − x0||²_{Ω0} < ∞. The following development shows that this is not the case:

\[
E\|x-x_0\|_{\Omega_0}^2
= \sum_{j=1}^{\infty}\frac{1}{\lambda_j^{\Omega_0}}E\bigl(\langle x-x_0,\varphi_j^{\Omega_0}\rangle^2\bigr)
= \sum_{j=1}^{\infty}\frac{1}{\lambda_j^{\Omega_0}}\langle \Omega_0\varphi_j^{\Omega_0},\varphi_j^{\Omega_0}\rangle
= \sum_{j=1}^{\infty}1 = \infty.
\]


6.3 Proof of Lemma 2

We begin by proving that (Σn + KΩ0K*) is trace class. Note that tr(Σn + KΩ0K*) = trΣn + tr(KΩ0K*) by linearity of the trace. Since Σn is trace class, we only have to prove that KΩ0K* is trace class, or, equivalently, that Ω0^{1/2}K* is a Hilbert–Schmidt operator⁶. Let Ω0^{1/2} = ∫_R a(z,t)·g(t)dt and K* = ∫_R b(s,t)·f(s)ds, with g(·) and f(·) two measures on R; then

\[
\Omega_0^{1/2}K^{*} = \int_{\mathbb{R}\times\mathbb{R}} a(z,t)b(s,t)\,g(t)f(s)\,ds\,dt. \tag{19}
\]

We prove that Ω0^{1/2}K* is Hilbert–Schmidt:

\[
\int_{\mathbb{R}\times\mathbb{R}}\Bigl|\int_{\mathbb{R}} a(z,t)b(s,t)g(t)\,dt\Bigr|^2 f(s)h(z)\,ds\,dz
\;\le\; \int_{\mathbb{R}\times\mathbb{R}}\Bigl(\int_{\mathbb{R}} |a(z,t)b(s,t)|g(t)\,dt\Bigr)^2 f(s)h(z)\,ds\,dz
\]
\[
\le\; \int_{\mathbb{R}\times\mathbb{R}}\Bigl(\Bigl(\int_{\mathbb{R}} a^2(z,t)g(t)\,dt\Bigr)^{\frac12}\Bigl(\int_{\mathbb{R}} b^2(s,t)g(t)\,dt\Bigr)^{\frac12}\Bigr)^2 f(s)h(z)\,ds\,dz
= \int_{\mathbb{R}\times\mathbb{R}}\int_{\mathbb{R}} a^2(z,t)g(t)\,dt\int_{\mathbb{R}} b^2(s,t)g(t)\,dt\, f(s)h(z)\,ds\,dz
\]
\[
= \int_{\mathbb{R}}\int_{\mathbb{R}} a^2(z,t)g(t)h(z)\,dt\,dz\;\int_{\mathbb{R}}\int_{\mathbb{R}} b^2(s,t)g(t)f(s)\,ds\,dt \;<\; \infty,
\]

since both Ω0^{1/2} and K* are Hilbert–Schmidt operators. This proves that Ω0^{1/2}K* is Hilbert–Schmidt and the result follows.

Let us now consider Υ:

\[
\Upsilon = \begin{bmatrix} \Omega_0 & \Omega_0K^{*} \\ K\Omega_0 & \Sigma_n + K\Omega_0K^{*} \end{bmatrix}.
\]

Let e_j = (e_{1j}, e_{2j}) be a basis in X × Y; the trace of Υ is

\[
\mathrm{tr}(\Upsilon) = \sum_j\langle \Upsilon e_j, e_j\rangle
= \sum_j\bigl(\langle \Omega_0 e_{1j}, e_{1j}\rangle + \langle \Omega_0K^{*}e_{2j}, e_{1j}\rangle + \langle K\Omega_0 e_{1j}, e_{2j}\rangle + \langle(\Sigma_n+K\Omega_0K^{*})e_{2j}, e_{2j}\rangle\bigr).
\]

By the first part of this proof, and since Ω0 is trace class, the infinite sums of the first and last terms are finite. We only have to consider the two terms in the middle:

\[
\sum_j\bigl(\langle \Omega_0K^{*}e_{2j}, e_{1j}\rangle + \langle K\Omega_0 e_{1j}, e_{2j}\rangle\bigr)
= 2\sum_j\langle \Omega_0^{1/2}K^{*}e_{2j}, \Omega_0^{1/2}e_{1j}\rangle
\le 2\sum_j\|\Omega_0^{1/2}K^{*}e_{2j}\|\,\|\Omega_0^{1/2}e_{1j}\|
\]
\[
\le 2\sup_j\|\Omega_0^{1/2}e_{1j}\|\sum_j\|\Omega_0^{1/2}K^{*}e_{2j}\|
\le 2\|\Omega_0^{1/2}\|\sum_j\|\Omega_0^{1/2}\|\,\|K^{*}e_{2j}\|,
\]

which is finite since Ω0^{1/2} is bounded and K* is Hilbert–Schmidt. The necessity of Υyy being trace class for Υ to be trace class is evident, and this completes the proof.

⁶ The equivalence is due to the fact that tr(KΩ0K*) = ⟨Ω0^{1/2}K*, Ω0^{1/2}K*⟩_{HS} = ||Ω0^{1/2}K*||²_{HS}.

6.4 Proof of Theorem 1

Let (x̃, ỹ) be an element of X × Y. Assumptions 1 and 2 define a conditional probability on X given E and a marginal probability on Y, and imply that ỹ can be decomposed as ỹ1 + ỹ2, with ỹ1 ∈ R(K) and ỹ2 ∈ R.K.H.S.(Σn). More precisely, ỹ1 is a linear transformation, through the operator K, of a realization x̃ of µ; similarly, ỹ2 is a realization of a centered Gaussian measure with covariance operator Σn. Therefore ỹ1 and ỹ2 are independent, and for all (ϕ, ψ) ∈ X × Y

\[
\langle(\tilde x,\tilde y),(\varphi,\psi)\rangle
= \langle \tilde x,\varphi\rangle + \langle \tilde y_1+\tilde y_2,\psi\rangle
= \langle \tilde x,\varphi\rangle + \langle K\tilde x,\psi\rangle + \langle \tilde y_2,\psi\rangle
= \langle \tilde x,\varphi + K^{*}\psi\rangle + \langle \tilde y_2,\psi\rangle,
\]

and ⟨x̃, ϕ + K*ψ⟩ + ⟨ỹ2, ψ⟩ is distributed as

\[
N\bigl(\langle x_0,\varphi+K^{*}\psi\rangle,\;\langle \Omega_0(\varphi+K^{*}\psi),(\varphi+K^{*}\psi)\rangle\bigr) + N\bigl(0,\langle\Sigma_n\psi,\psi\rangle\bigr)
= N\bigl(\langle x_0,\varphi+K^{*}\psi\rangle,\;\langle \Omega_0(\varphi+K^{*}\psi),(\varphi+K^{*}\psi)\rangle + \langle\Sigma_n\psi,\psi\rangle\bigr).
\]

We have proved that the joint measure Π on X × Y is Gaussian. The mean mxy is defined through ⟨mxy, (ϕ, ψ)⟩ = EΠ⟨(x̃, ỹ), (ϕ, ψ)⟩ and, since ⟨x0, ϕ + K*ψ⟩ = ⟨(x0, Kx0), (ϕ, ψ)⟩, we get mxy = (x0, Kx0). In the same way, the covariance operator Υ is defined by

\[
\langle \Upsilon(\varphi,\psi),(\varphi,\psi)\rangle
= \mathrm{Var}\bigl(\langle(\tilde x,\tilde y),(\varphi,\psi)\rangle\bigr)
= \langle\Omega_0\varphi,\varphi\rangle + \langle(\Sigma_n+K\Omega_0K^{*})\psi,\psi\rangle + \langle K\Omega_0\varphi,\psi\rangle + \langle\Omega_0K^{*}\psi,\varphi\rangle
\]
\[
= \bigl\langle\bigl(\Omega_0\varphi+\Omega_0K^{*}\psi,\;(\Sigma_n+K\Omega_0K^{*})\psi+K\Omega_0\varphi\bigr),(\varphi,\psi)\bigr\rangle,
\]

which concludes the proof.

6.5 Proof of Theorem 2

Let Q be the projection of Π on (Y, F), with mean function mQ and covariance operator RQ. Since Π is Gaussian, the projection must be Gaussian. Moreover, ∀ψ ∈ Y,

\[
\langle m_Q,\psi\rangle = \langle m_{xy},(0,\psi)\rangle = \langle(x_0,Kx_0),(0,\psi)\rangle = \langle Kx_0,\psi\rangle,
\]

and

\[
\langle R_Q\psi,\psi\rangle = \langle \Upsilon(0,\psi),(0,\psi)\rangle
= \bigl\langle\bigl(\Omega_0\cdot 0+\Omega_0K^{*}\psi,\;(\Sigma_n+K\Omega_0K^{*})\psi+K\Omega_0\cdot 0\bigr),(0,\psi)\bigr\rangle
= \langle(\Sigma_n+K\Omega_0K^{*})\psi,\psi\rangle.
\]

Hence mQ = my and RQ = Υyy. This implies Q ≡ P, since there is a unique correspondence between a Gaussian measure and its mean element and covariance operator.


6.6 Proof of Theorem 3

The Bayes estimation error can be decomposed in the following way:

\[
E_\alpha(x|\hat Y) - x_* = \underbrace{-\,[I - \Omega_0K^{*}(\alpha_nI+K\Omega_0K^{*})^{-1}K](x_*-x_0)}_{I}
+ \underbrace{[\Omega_0K^{*}(\alpha_nI+\Sigma_n+K\Omega_0K^{*})^{-1}K - \Omega_0K^{*}(\alpha_nI+K\Omega_0K^{*})^{-1}K](x_*-x_0)}_{II}
+ \underbrace{\Omega_0K^{*}(\alpha_nI+\Sigma_n+K\Omega_0K^{*})^{-1}U}_{III}. \tag{20}
\]

The first term looks very similar to the regularization bias of the solution of a functional equation. More precisely, to obtain such an object we use Assumption 3, by which (x∗ − x0) ∈ H(Ω0) and there exists δ∗ ∈ X such that (x∗ − x0) = Ω0^{1/2}δ∗. Therefore,

\[
I = [I - \Omega_0K^{*}(\alpha_nI+K\Omega_0K^{*})^{-1}K]\Omega_0^{1/2}\delta_*
= [\Omega_0^{1/2} - \Omega_0K^{*}(\alpha_nI+K\Omega_0K^{*})^{-1}K\Omega_0^{1/2}]\delta_*
= \Omega_0^{1/2}[I - \Omega_0^{1/2}K^{*}(\alpha_nI+K\Omega_0K^{*})^{-1}K\Omega_0^{1/2}]\delta_*,
\]

where in the last equality we have used the fact that, since Ω0 is positive definite and self-adjoint, it can be rewritten as Ω0 = Ω0^{1/2}Ω0^{1/2}. Taking the norm in X of I,

\[
\|I\|^2 \le \|\Omega_0^{1/2}\|^2\,\|I - \Omega_0^{1/2}K^{*}(\alpha_nI+K\Omega_0K^{*})^{-1}K\Omega_0^{1/2}\|^2\,\|\delta_*\|^2.
\]

Note that the operator (I − Ω0^{1/2}K*(αnI + KΩ0K*)^{-1}KΩ0^{1/2}) has the same eigenvalues as

\[
[I - (\alpha_nI + \Omega_0^{1/2}K^{*}K\Omega_0^{1/2})^{-1}\Omega_0^{1/2}K^{*}K\Omega_0^{1/2}], \tag{21}
\]

which appears as the regularization bias associated with the regularized solution of the ill-posed inverse problem KΩ0^{1/2}δ∗ = r computed with a Tikhonov regularization scheme. It converges to zero as the regularization parameter αn goes to zero, and therefore the second norm in ||I||² is bounded. This way of rewriting the above operator justifies the identification condition in Assumption 4: injectivity of KΩ0^{1/2} ensures that the solution of KΩ0^{1/2}δ = r is identified and therefore, if Ω0^{1/2} is injective, that (x∗ − x0) is identified and that the regularized posterior mean converges towards the right true value.

The speed of convergence to zero of ||(I − Ω0^{1/2}K*(αnI + KΩ0K*)^{-1}KΩ0^{1/2})||² depends on the regularity of δ∗, and consequently of (x∗ − x0). If the true solution δ∗ lies in the β-regularity space Φβ of the operator KΩ0^{1/2}, i.e. δ∗ ∈ R((Ω0^{1/2}K*KΩ0^{1/2})^{β/2}), the squared regularization bias is at most of order αn^β. We admit without proof the following lemma; a sketch of the proof can be found in Carrasco, Florens and Renault (2005) and in Kress (1999).

Lemma 3 If (x∗ − x0) ∈ H(Ω0) and δ∗ ∈ R((Ω0^{1/2}K*KΩ0^{1/2})^{β/2}), then ||I||² = Op(αn^β).

The larger β is, the smoother the function δ∗ ∈ Φβ and the faster the regularization bias converges to zero. However, since for a Tikhonov regularization scheme β cannot be greater than 2, we implicitly assume that δ∗ ∈ Φβ for β ≤ 2.

Now let us consider the terms II and III. The squared norm ||II||² is equal to

\[
\|\Omega_0K^{*}(\alpha_nI+\Sigma_n+K\Omega_0K^{*})^{-1}(-\Sigma_n)(\alpha_nI+K\Omega_0K^{*})^{-1}K(x_*-x_0)\|^2
\le \|\Omega_0K^{*}\|^2\,\|(\alpha_nI+\Sigma_n+K\Omega_0K^{*})^{-1}\|^2\,\|\Sigma_n\|^2\,\|(\alpha_nI+K\Omega_0K^{*})^{-1}K(x_*-x_0)\|^2,
\]

where the first norm is bounded and the second and third are Op(1/αn²) and Op(||Σn||²), respectively. The last norm can be written as ||(αnI + KΩ0K*)^{-1}KΩ0^{1/2}δ∗||². Moreover, to find the speed of convergence, we use the hypothesis that δ∗ ∈ Φβ, so that

\[
\|(\alpha_nI+K\Omega_0K^{*})^{-1}K\Omega_0^{1/2}\delta_*\|^2
= \frac{1}{\alpha_n^2}\,\|\alpha_n(\alpha_nI+K\Omega_0K^{*})^{-1}K\Omega_0^{1/2}(\Omega_0^{1/2}K^{*}K\Omega_0^{1/2})^{\beta/2}\rho\|^2
\]

for some ρ ∈ X, and this is of order (1/αn²)αn^{β+1}. As a consequence of the fact that, with a Tikhonov regularization, a degree of smoothness greater than or equal to 2 may be useless, we get ||(αnI + KΩ0K*)^{-1}K(x∗ − x0)||² ∼ Op((1/αn²)αn^{(β+1)∧2}).

To find the speed of convergence of term III we write it in the following equivalent way:

\[
III = \underbrace{\Omega_0K^{*}[(\alpha_nI+\Sigma_n+K\Omega_0K^{*})^{-1} - (\alpha_nI+K\Omega_0K^{*})^{-1}]U}_{A}
+ \underbrace{\Omega_0K^{*}(\alpha_nI+K\Omega_0K^{*})^{-1}U}_{B}.
\]

By standard computations and by Kolmogorov's theorem it is straightforward to determine that ||A||² ∼ Op((1/αn³)||Σn||²trΣn) and ||B||² ∼ Op((1/αn)trΣn). Kolmogorov's theorem entails that if E||U||² < ∞ then ||U||² is bounded in probability; moreover, E||U||² = trΣn. We summarize the convergence results for the terms II and III in X in the following lemma.

Lemma 4 If (x∗ − x0) ∈ H(Ω0) and δ∗ ∈ R((Ω0^{1/2}K*KΩ0^{1/2})^{β/2}), then

(i) ||II||² = Op((1/αn⁴)||Σn||²αn^{(β+1)∧2}) and

(ii) ||III||² = Op((1/αn³)||Σn||²trΣn + (1/αn)trΣn).

The first term of ||III||² is negligible with respect to the other terms in ||II||² and ||III||².

6.7 Proof of Theorem 4

Recalling expression (15), we can rewrite the regularized posterior variance as

\[
V_\alpha = \underbrace{\Omega_0 - \Omega_0K^{*}(\alpha_nI+K\Omega_0K^{*})^{-1}K\Omega_0}_{IV}
+ \underbrace{\Omega_0K^{*}(\alpha_nI+K\Omega_0K^{*})^{-1}K\Omega_0 - \Omega_0K^{*}(\alpha_nI+\Sigma_n+K\Omega_0K^{*})^{-1}K\Omega_0}_{V}. \tag{22}
\]

Since Ω0 is a positive definite self-adjoint operator, it can be decomposed as Ω0 = Ω0^{1/2}Ω0^{1/2}, and the operator IV in (22) applied to an element ϕ ∈ X can be rewritten as

\[
\Omega_0^{1/2}\bigl(I - \Omega_0^{1/2}K^{*}(\alpha_nI+K\Omega_0K^{*})^{-1}K\Omega_0^{1/2}\bigr)\Omega_0^{1/2}\varphi.
\]

Following the same reasoning as for term I in (20), we conclude that, if Ω0^{1/2}ϕ ∈ R((Ω0^{1/2}K*KΩ0^{1/2})^{β/2}), then ||IVϕ||² = Op(αn^β). The operator V in (22) applied to ϕ ∈ X can be equivalently rewritten as

\[
\Omega_0K^{*}(\alpha_nI+\Sigma_n+K\Omega_0K^{*})^{-1}\Sigma_n(\alpha_nI+K\Omega_0K^{*})^{-1}K\Omega_0^{1/2}\Omega_0^{1/2}\varphi,
\]

and its squared norm is bounded and of order ||V||² = Op((1/αn⁴)||Σn||²αn^{(β+1)∧2}), by the same argument used for term II in (20).

6.8 Proof of Theorem 5

We admit the following lemma, which will be useful for the proof of the theorem.

Lemma 5 Let Xs, s ∈ R, be a Hilbert scale induced by L and let T : X → Y be a bounded operator satisfying ||x||_{−a} ∼ ||Tx|| on X for some a > 0. Then, for B := TL^{−s}, s ≥ 0 and |ν| ≤ 1,

\[
\|x\|_{-\nu(a+s)} \sim \|(B^{*}B)^{\frac{\nu}{2}}x\|.
\]

Moreover, R((B*B)^{ν/2}) = X_{ν(a+s)}. Proof: see the proof of Corollary 8.22 in Engl et al. (2000) [12].

We begin the proof of Theorem 5 by establishing the rate of convergence of the regularized bias. We use the following decomposition:

\[
E_s(x|\hat Y) - x_* \simeq \underbrace{[I - \Omega_0K^{*}(\alpha_nL^{2s}+K\Omega_0K^{*})^{-1}K](x_*-x_0)}_{I}
+ \underbrace{\Omega_0K^{*}[(\alpha_nL^{2s}+\Sigma_n+K\Omega_0K^{*})^{-1}K - (\alpha_nL^{2s}+K\Omega_0K^{*})^{-1}K](x_*-x_0)}_{II}
+ \underbrace{\Omega_0K^{*}(\alpha_nL^{2s}+\Sigma_n+K\Omega_0K^{*})^{-1}U}_{III}.
\]

Let us start with term I; note that

\[
\|I\| = \|[I - \Omega_0K^{*}(\alpha_nL^{2s}+K\Omega_0K^{*})^{-1}K](x_*-x_0)\|
= \|[I - (\alpha_n\Omega_0^{-s}+\Omega_0^{1/2}K^{*}K\Omega_0^{1/2})^{-1}\Omega_0^{1/2}K^{*}K\Omega_0^{1/2}]\Omega_0^{\frac{\beta+1}{2}}\rho_*\|,
\]

if Ω0 is such that Ω0^{1/2}K*(αnL^{2s} + KΩ0K*)^{-1} = (αnΩ0^{-s} + Ω0^{1/2}K*KΩ0^{1/2})^{-1}Ω0^{1/2}K*, i.e. Ω0^{-s+1/2}K* = Ω0^{1/2}K*L^{2s}. Using Assumption 5 (ii), i.e. the fact that (x∗ − x0) = Ω0^{(β+1)/2}ρ∗, and the notation B = KΩ0^{(s+1)/2}, the norm ||I|| can be rewritten as

\[
\|I\| = \|\Omega_0^{\frac{s+1}{2}}\bigl(I-(\alpha_nI+B^{*}B)^{-1}B^{*}B\bigr)\Omega_0^{\frac{\beta-s}{2}}\rho_*\|
= \|\Omega_0^{\frac{s+1}{2}}\bigl(I-(\alpha_nI+B^{*}B)^{-1}B^{*}B\bigr)(B^{*}B)^{\frac{\beta-s}{2(a+s)}}v\|
\sim \|(B^{*}B)^{\frac{\beta+1}{2(a+s)}}\alpha_n(\alpha_nI+B^{*}B)^{-1}v\|
\sim O_p\bigl(\alpha_n^{\frac{\beta+1}{2(a+s)}}\bigr),
\]

where the second equality follows from the fact that R(Ω0^{(β−s)/2}) ≡ X_{β−s} ≡ R((B*B)^{(β−s)/(2(a+s))}), so that Ω0^{(β−s)/2}ρ∗ = (B*B)^{(β−s)/(2(a+s))}v for some v ∈ X, and the third equivalence is a consequence of Lemma 5. It follows that ||I||² ∼ Op(αn^{(β+1)/(a+s)}). We use similar steps to obtain the convergence of the other terms, and we omit redundant comments.

\[
\|II\| \le \|\Omega_0K^{*}(\alpha_nL^{2s}+\Sigma_n+K\Omega_0K^{*})^{-1}\|\,\|\Sigma_n\|\,\|(\alpha_nL^{2s}+K\Omega_0K^{*})^{-1}K\Omega_0^{1/2}\delta_*\|,
\]

and the norm in the last term can be developed as

\[
\|(\alpha_nL^{2s}+K\Omega_0K^{*})^{-1}K\Omega_0^{1/2}\delta_*\|
= \|K\Omega_0^{1/2}(\alpha_n\Omega_0^{-s}+\Omega_0^{1/2}K^{*}K\Omega_0^{1/2})^{-1}\delta_*\|
= \|B(\alpha_nI+B^{*}B)^{-1}\Omega_0^{\frac{s+\beta}{2}}v\|
= \|(B^{*}B)^{\frac{2s+\beta+a}{2(a+s)}}(\alpha_nI+B^{*}B)^{-1}v\|
\sim O_p\Bigl(\frac{1}{\alpha_n}\alpha_n^{\frac{2s+\beta+a}{2(a+s)}}\Bigr).
\]

Thus, ||II||² ∼ Op((1/αn⁴)||Σn||²αn^{(2s+β+a)/(2(a+s))}). We proceed with term III, which can be decomposed as

\[
III = \underbrace{\Omega_0K^{*}[(\alpha_nL^{2s}+\Sigma_n+K\Omega_0K^{*})^{-1} - (\alpha_nL^{2s}+K\Omega_0K^{*})^{-1}]U}_{IIIA}
+ \underbrace{\Omega_0K^{*}(\alpha_nL^{2s}+K\Omega_0K^{*})^{-1}U}_{IIIB},
\]

where the squared norm ||IIIA||² of the first term is less than or equal to

\[
\|\Omega_0K^{*}(\alpha_nL^{2s}+K\Omega_0K^{*})^{-1}\|^2\|\Sigma_n\|^2\|(\alpha_nL^{2s}+\Sigma_n+K\Omega_0K^{*})^{-1}\|^2\|U\|^2
\le \|\Omega_0^{\frac{s+1}{2}}(\alpha_nI+\Omega_0^{\frac{s+1}{2}}K^{*}K\Omega_0^{\frac{s+1}{2}})^{-1}\Omega_0^{\frac{s+1}{2}}K^{*}\|^2\|\Sigma_n\|^2\|(\alpha_nL^{2s}+\Sigma_n+K\Omega_0K^{*})^{-1}\|^2\|U\|^2
\]
\[
\sim \|(B^{*}B)^{\frac{a+2s+1}{2(a+s)}}(\alpha_nI+B^{*}B)^{-1}\|^2\|\Sigma_n\|^2\|(\alpha_nL^{2s}+\Sigma_n+K\Omega_0K^{*})^{-1}\|^2\|U\|^2
\sim O_p\Bigl(\frac{1}{\alpha_n^4}\|\Sigma_n\|^2\,\mathrm{tr}\Sigma_n\,\alpha_n^{\frac{a+2s+1}{a+s}}\Bigr).
\]

The norm of the term IIIB is

\[
\|IIIB\| = \|\Omega_0K^{*}(\alpha_nL^{2s}+K\Omega_0K^{*})^{-1}U\|
= \|\Omega_0^{1/2}(\alpha_n\Omega_0^{-s}+\Omega_0^{1/2}K^{*}K\Omega_0^{1/2})^{-1}\Omega_0^{1/2}K^{*}U\|
= \|\Omega_0^{\frac{s+1}{2}}(\alpha_nI+B^{*}B)^{-1}B^{*}U\|
\]
\[
\sim \|(B^{*}B)^{\frac{s+1}{2(a+s)}}(\alpha_nI+B^{*}B)^{-1}B^{*}U\|
\le \|(B^{*}B)^{\frac{2s+a+1}{2(a+s)}}(\alpha_nI+B^{*}B)^{-1}\|\,\|U\|
\sim O_p\bigl(\alpha_n^{\frac{1-a}{2(a+s)}}\|U\|\bigr).
\]

Thus ||IIIB||² ∼ Op(αn^{(1−a)/(a+s)}trΣn).

The variance Vs is applied to an element ϕ ∈ X such that Ω0^{1/2}ϕ ∈ R(Ω0^{β/2}) and Ω0^{(s+1)/2}ϕ ∈ R(Ω0^{(β−s)/2}). Then the variance can be decomposed as

\[
V_s\varphi = \underbrace{[\Omega_0 - \Omega_0K^{*}(\alpha_nL^{2s}+K\Omega_0K^{*})^{-1}K\Omega_0]\varphi}_{IV}
+ \underbrace{\Omega_0K^{*}[(\alpha_nL^{2s}+K\Omega_0K^{*})^{-1} - (\alpha_nL^{2s}+\Sigma_n+K\Omega_0K^{*})^{-1}]K\Omega_0\varphi}_{V}.
\]

The computation of ||IV|| parallels that of term ||I|| above, and the computation of ||V|| that of term ||II||; therefore we only give the result: ||IV||² ∼ Op(αn^{(β+1)/(a+s)}) and ||V||² ∼ Op((1/αn²)αn^{(β+a)/(a+s)}). The result follows.

Appendix B Monte Carlo Simulations

We develop in this appendix some numerical simulation for testing the performance of the Bayesian method we have proposed for solving functional equations. We take the regularized posterior mean as estimator for the solution of ill-posed inverse problem (2). In the first numerical simulation we implement an arbitrary data generating process where the parameter of interest take the form of a parabola. The second and third simulations are implementation of the density and regression estimation examples given in Section 2.5.

7.2

Functional equation with a parabola as solution

We take L2 spaces as spaces of reference: X = L2π and Y = L2ρ , with measures π and ρ taken equal to the uniform measure on [0, 1]. The data generating process we have considered is Z Yˆ

1

=

x(s)(s ∧ t)ds + U, 0

Σn

Z

1

= n−1

U ∼ GP(0, Σn )

(22)

exp{−(s − t)2 }ds

0

x x0 Ω0 ϕ(t)

∼ GP(x0 , Ω0 ) = −2.8s2 + 2.8s Z 1 = ω0 exp(−(s − t)2 )ϕ(s)ds. 0

The true value of the parameter of interest x has been taken to be a parabola: x = −3s2 + 3s; this choice is not completely arbitrary but it is consistent with the specification of the prior mean and variance. The covariance operators are correctly chosen in the sense that they are self-adjoint, positive semi-definite and nuclear. In particular, their eigenvalues are O(e−j ). The regularization parameter α has been set to 2.e−03, the discretization step is 0.01 and the sample size is n = 1000. We show in Figure 1a the true function x (continuous line) and the regularized posterior mean estimation (dotted line) for the prior given above with ω0 = 2. Moreover, we propose, in Figure 1b a comparison between our estimator and the estimator obtained by solving equation (2) with a classical Tikhonov regularization method (small dotted line). To get good results for the Tikhonov estimator we have chosen α = 2.e − 04. The choice of the prior distribution is deeply affecting the estimation. To analyze its role we have replicated the same simulation experiment for different specifications of prior distribution. It should be noted that the far the prior mean is from the true parameter the bigger should be the prior covariance operator. The way in which the prior covariance is specified allows to easily increase or decrease it by only changing parameter ω0 . In the first variation of the prior, we consider a prior mean x0 = −2s2 + 2s, a scale covariance parameter ω0 = 40 and a covariance kernel equal R1 to the brownian bridge covariance: Ω0 ϕ(t) = 40 0 ((s ∧ t) − st)ϕ(s)ds. This covariance operator has eigenvalues of order O(j −2 ), i.e. λj = 1/(π 2 j 2 ), j ∈ N. The estimated curve, pictured in Figure 1c, shows that, despite a prior mean (dashed-dotted line) far from the true curve, we still get a pretty good estimation. This is caused by having chosen an high variance so that the prior information get less importance. The second variation of the prior specification is the following: we set the prior mean at x0 = −2.22s2 + 2.67s − 0.05Rand the kernel of covariance operator is now 1 a parabola with scale parameter ω0 = 100, Ω0 = 100 0 (0.9(s − t)2 − 1.9|s − t| + 1)ds. The results are shown in Figure 1d. Here we need a large variance to compensate the bad prior information contained in the prior mean. The choice of the regularization parameter αn is particular troublesome. We have adopted here an ad-hoc choice of it, the object of the simulation being to verify the good performance of our estimator, that it the aim of this section is not the optimal choice of αn but only analysis of the fit of our estimator for a given αn . Determination of this parameter is a problem concerning all the existing methods for solving inverse problems. The evolution of the regularized posterior mean estimator for different values of the regularization parameter α is shown in Figure 2a. Figure 2b 30

is a particular of the previous one. Finally, in Figure 3 results of a Monte Carlo experiment with 100 iterations are shown. Panels (3a), (3c) and (3d) are Monte Carlo experiment conducted for the three different priors distribution considered. The dotted line represents the mean of the regularized posterior means obtained for each iteration. Panel (3b) shows the Monte Carlo mean of the regularized posterior means for the first specification of the prior distribution (dotted line) and of the classical Tikhonov solutions (small dotted line). 0.8

0.8 True Curve x Posterior Mean Prior Mean x0

0.7

0.7

0.6

0.5

0.5

x(s)

0.6

0.4

0.4

0.3

0.3

0.2

0.2

True curve x Posterior Mean Tikhonov Solution

0.1

0.1 0

0

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

0

0.1

0.2

0.3

0.4

1

2

0.5 s

0.6

0.7

0.8

0.9

1

0.7

0.8

0.9

1

(b)

(a) x0R = −2.8s + 2.8s, Ω0 ϕ(t) = 2 01 exp(−(s − t)2 )ϕ(s)ds 0.8

0.8

0.7

0.7

0.6

0.6

0.5

0.5 0.4

0.4 0.3

0.3 0.2

0.2 0.1

True Curve x Posterior Mean Prior Mean x0

0.1

0

0

0.1

0.2

0.3

0.4

0.5

True Curve x Posterior Mean Prior Mean x0

0

0.6

0.7

0.8

0.9

−0.1

1

0

2

Ω0 ϕ(t) = 40

0

0.1

0.2

0.3

0.4

0.5

0.6

(d) xR0 = −2.22s2 + 2.67s − 0.05,

(c) x0R = −2s + 2s, 1

((s ∧ t) − st)ϕ(s)ds

Ω0 = 100

1 (0.9(s 0

− t)2 − 1.9|s − t| + 1)ds

Figure 1: Panels (1a), (1c) and (1d) show the regularized posterior mean estimation and the true solution. Panel (1b) show the comparison between the performance of our estimator and the classical Tikhonov regularized solution.

Figure 2: Prior specification: x0 = −2.8s² + 2.8s, Ω0ϕ(t) = 2∫₀¹ exp(−(s−t)²)ϕ(s)ds. Panel (2a) shows the regularized posterior mean estimator for different values of α (α = 10⁻⁷, 10⁻⁵, 10⁻³, 5·10⁻¹); panel (2b) is a zoom of the previous panel.

Figure 3: Monte Carlo experiment. Panel specifications: (a)–(b) x0 = −2.8s² + 2.8s, Ω0ϕ(t) = 2∫₀¹ exp(−(s−t)²)ϕ(s)ds; (c) x0 = −2s² + 2s, Ω0ϕ(t) = 40∫₀¹((s∧t) − st)ϕ(s)ds; (d) x0 = −2.22s² + 2.67s − 0.05, Ω0ϕ(t) = 100∫₀¹(0.9(s−t)² − 1.9|s−t| + 1)ϕ(s)ds.

7.3 Density Estimation

Let ξ1, ..., ξn be an observed random sample generated from an unknown distribution F, assumed to be absolutely continuous with respect to the Lebesgue measure. The aim is to estimate the associated density function f(ξ) by solving the functional equation given in Example 1 of subsection 2.5; the notation and technical details are the same as in Example 1 and are not repeated. The true parameter of interest f is chosen to be the density of a standard Gaussian measure on R, and the measures π and ρ defining the L² spaces are set equal to the uniform measure on [−3, 3] (i.e. U[−3, 3]). We use the sample ξ1, ..., ξn to estimate F and the sampling variance Σn; the operator K is known and does not need to be estimated. The prior distribution is specified as follows:

\[
f_0(\xi) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\Bigl\{-\frac{1}{2\sigma^2}(\xi-\theta)^2\Bigr\}, \qquad
\Omega_0\varphi(t) = \omega_0\,\frac{1}{6}\int_{-3}^{3}\exp(-(s-t)^2)\varphi(s)\,ds.
\]

The parameters (σ, θ, ω0) have been set to different values in order to see the effect of prior changes on the estimated solution. The regularization parameter αn has been set equal to 0.05 and the sample size is n = 1000. Figure 4 shows the regularized posterior mean estimator for different specifications of the parameters: in panels (a) and (c) the true density (continuous line), the prior mean (dotted line) and the regularized posterior mean estimator (dashed-dotted line) are drawn; panels (b) and (d) compare our estimator with the classical Tikhonov solution (dotted line). We can get an idea of the behavior of the prior distribution by drawing a sample from it: Figure 5 represents a sample of curves drawn from the prior distribution, together with the prior mean (continuous line) and the true density (dotted line). Lastly, Figure 6 shows the results of a Monte Carlo experiment: the dashed-dotted line is the mean of the regularized posterior means obtained in each replication, the dashed line is the mean of the Tikhonov solutions over the Monte Carlo iterations and the solid line is the true density function.
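The sketch below is our own reconstruction of this exercise. It assumes, as in the classical density example, that (Kf)(t) = ∫ 1{s ≤ t} f(s) ds on [−3, 3], that Ŷ is the empirical cdf and that Σn is its estimated covariance (F(s∧t) − F(s)F(t))/n; these operator and covariance choices are ours, not taken from Example 1 verbatim.

```python
# Sketch of the density-estimation simulation (our reconstruction of the setup, with assumed
# operator K = integration and Sigma_n = covariance of the empirical cdf).
import numpy as np

rng = np.random.default_rng(1)
xi = rng.standard_normal(1000)                 # observed sample, true density N(0, 1)
t = np.linspace(-3.0, 3.0, 151)
ds = 6.0 / t.size                              # Lebesgue quadrature weight on [-3, 3]

F_hat = (xi[None, :] <= t[:, None]).mean(axis=1)          # empirical cdf (noisy Y^)
K = (t[None, :] <= t[:, None]).astype(float) * ds         # integration operator
Sigma_n = (np.minimum.outer(F_hat, F_hat) - np.outer(F_hat, F_hat)) / xi.size

sigma0, theta0, omega0 = 1.0, 0.5, 10.0        # prior parameters as in Figure 4(a)
f0 = np.exp(-0.5 * ((t - theta0) / sigma0) ** 2) / (np.sqrt(2 * np.pi) * sigma0)
Omega0 = omega0 * np.exp(-np.subtract.outer(t, t) ** 2) / 6.0 * ds

alpha_n = 0.05
C = alpha_n * np.eye(t.size) + Sigma_n + K @ Omega0 @ K.T
f_hat = f0 + Omega0 @ K.T @ np.linalg.solve(C, F_hat - K @ f0)   # regularized posterior mean
```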

Figure 4: Regularized posterior mean and Tikhonov estimators of a density function. Panel specifications: (a)–(b) σ = 1, θ = 0.5, ω0 = 10; (c)–(d) σ = 1.5, θ = 0.5, ω0 = 10 (panel (d) also shows a kernel density estimator).

Figure 5: Draws from the prior distribution, together with the prior mean and the true density. Panel specifications: (a) σ = 1, θ = 0.5, ω0 = 10; (b) σ = 1.5, θ = 0.5, ω0 = 10.

7.4 Regression Estimation

Consider a real random variable w ∈ R with a known normal distribution F with mean 2 and variance 1, and a Gaussian white noise ε ∼ N(0, 2) drawn independently. The regression function m(w) is defined as an element of L²_F(w) such that ξ = m(w) + ε and E(ε|w) = 0. We follow the method outlined in Example 2 of Section 2.5, so that, for a given function g(w,t), we can equivalently define m(w) as the solution of the functional equation

\[
E\bigl(g(w,t)\xi\bigr) = E\bigl(g(w,t)m(w)\bigr).
\]

We specify a function g(w,t) defining a Hilbert–Schmidt operator K : L²_F(w) → L²_π, with π ∼ N(2,1); g(w,t) is alternatively specified as an exponential function, g(w,t) = exp(−(w−t)²), or as an indicator function, g(w,t) = 1{w ≤ t}. The true regression function is m∗(w) = cos(w)sin(w) and we specify the prior distribution as a Gaussian process, m(w) ∼ GP(m0(w), Ω0ϕ(w)), for any ϕ ∈ L²_F(w). After drawing a sample of (ξ, w) we can estimate E(g(w,t)ξ) for any t by the sample mean. The regularization parameter α is set equal to 0.05; the sample size is n = 1000 for a single estimation and n = 500 for the Monte Carlo simulations, which use 50 replications.

CASE I: g(w,t) = 1{w ≤ t}. The prior covariance operator is defined through an exponential kernel, Ω0ϕ(w1) = ω0∫exp(−(w1−w2)²)ϕ(w2)f(w2)dw2, with ω0 = 2 or ω0 = 10, and we have considered three different prior mean specifications: m0(w) = m∗(w), m0(w) = 0.067w − 0.2, or m0 = 0. Figure 7 shows the results for these prior specifications; despite a very bad prior mean, we still obtain a noteworthy estimation. Panels (a), (c) and (e) show the estimation for a single replication; panels (b), (d) and (f) show the estimation for each Monte Carlo replication and the mean over all replications (dashed-dotted line).

CASE II: g(w,t) = exp(−(w−t)²). We have conducted single estimations and Monte Carlo experiments for the same prior specifications as in Case I, but the parameter ω0 now takes the values ω0 = 2 and ω0 = 20. Figure 8 illustrates the results.
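The sketch below is our own illustration of Case I. It estimates E(g(w,t)ξ) by the sample mean, discretizes (Km)(t) = ∫g(w,t)m(w)f(w)dw on an evaluation grid of our choosing, and estimates Σn by the empirical covariance of the summands divided by n; the weighted adjoints implied by the L²_F and L²_π inner products are glossed over for brevity.

```python
# Sketch of the regression example, Case I: g(w, t) = 1{w <= t} (our reconstruction).
import numpy as np

rng = np.random.default_rng(3)
n = 1000
w = rng.normal(2.0, 1.0, n)                       # regressor, known distribution F = N(2, 1)
xi = np.cos(w) * np.sin(w) + rng.normal(0.0, np.sqrt(2.0), n)   # xi = m*(w) + eps

grid = np.linspace(-1.0, 5.0, 121)                # evaluation grid (our choice)
dgrid = grid[1] - grid[0]
f_w = np.exp(-0.5 * (grid - 2.0) ** 2) / np.sqrt(2 * np.pi)     # density of F on the grid

G = (w[None, :] <= grid[:, None]).astype(float)   # g(w_i, t) = 1{w_i <= t}
Y_hat = (G * xi).mean(axis=1)                     # sample-mean estimate of E[g(w, t) xi]
Sigma_n = np.cov(G * xi) / n                      # estimated covariance of Y^

K = (grid[None, :] <= grid[:, None]).astype(float) * f_w[None, :] * dgrid   # (Km)(t)
omega0 = 2.0
Omega0 = omega0 * np.exp(-np.subtract.outer(grid, grid) ** 2) * f_w[None, :] * dgrid
m0 = np.cos(grid) * np.sin(grid)                  # prior mean as in Figure 7(a)

alpha_n = 0.05
C = alpha_n * np.eye(grid.size) + Sigma_n + K @ Omega0 @ K.T
m_hat = m0 + Omega0 @ K.T @ np.linalg.solve(C, Y_hat - K @ m0)   # regularized posterior mean
```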


References [1] Agliari, A. and C.C. Parisetti (1988), A g-reference Informative Prior: a Note on Zellner’s g-prior, The Statistician, Vol.37, 3, 271 - 275. [2] Andersen, P.K., Borgan, O., Gill, R.D. and N. Keiding (1993), Statistical Models Based on Counting Processes, Springer-Verlag [3] Baker, C.R. (1973), Joint Measures and Cross-Covariance Operators, Transactions of the American Mathematical Society, 186, 273-289. [4] Belitser, E., and S., Ghosal (2003), Adaptive Bayesian Inference on the Mean of an Infinite Dimensional Normal Distribution, Annals of Statistics, 31, 536-559. [5] Carrasco, M., Florens, J.P., and E., Renault (2005), Estimation based on Spectral Decomposition and Regularization, forthcoming in Handbook of Econometrics, J.J. Heckman and E. Leamer, eds., 6, Elsevier, North Holland. [6] Cavalier, L., Golubev, G.K., Picard, D., and A.B., Tsybakov (2002), Oracle Inequalities for Inverse Problems, Annals of Statistics, 30, 843-874. [7] Chalmond, B. (2003), Modeling and Inverse Problems in Image Analysis, Springer. [8] Cox, D.D. (2007), Useful Priors for Covariance Operators, working paper presented at Workshop for Construction and Properties of Bayesian Nonparametric Regression Models, Isaac Newton Institute for Mathematical Sciences, Camridge UK. [9] Diaconis, F., and D., Freedman (1986), On the Consistency of Bayes Estimates, Annals of Statistics, 14, 1-26. [10] Doob, J.L. (1949), Application of the Theory of Martingales, in Le Calcul des Probabilit´es et ses Applications, Colloques Internationaux du Centre National de la Recherche Scientifique, 13, 23-27. [11] Dykstra, R.L. and P.W., Laud (1981), A Bayesian Nonparametric Approach to Reliability, The Annals of Statistics, 9, 356-367. [12] Engl, H.W., Hanke, M. and A., Neubauer (2000), Regularization of Inverse Problems, Kluwer Academic, Dordrecht. [13] Escobar, M.D. and M., West (1995), Bayesian Density Estimation and Inference Using Mixtures, Journal of the American Statistical Association, Vol. 90, 430, 577-588. [14] Ferguson, T.S. (1974), Prior Distributions on Spaces of Probability Measures, The Annals of Statistics, Vol.2, 4, 615-629. [15] Ferguson, T.S. and E.G. Phadia (1979), Bayesian Nonparametric Estimation Based on Censored Data, Annals of Statistics, 7, 163-186. [16] Florens, J.P., Mouchart, M., and J.M., Rolin (1990), Elements of Bayesian Statistics, Dekker, New York. [17] Florens, J.P., Mouchart, M. and J.M., Rolin (1992) Bayesian Analysis of Mixture: Some Results on Exact Estimability and Identification, in Bayesian Statistics IV, edited by J.Bernardo, J.Burger, D. Lindley and A.Smith, North Holland, 127-145. [18] Florens, J.P., and A., Simoni (2007), Nonparametric Estimation of Instrumental Regression: a Bayesian Approach Based on Regularized Posterior, mimeo. [19] Florens, J.P., and A., Simoni (2008), Regularized Posteriors in Linear Ill-Posed Inverse Problems: Extensions, mimeo. [20] Florens-Zmirou, D. (1989), Estimation de la Variance d’une Diffusion ` a Partir d’une Observation Discr´eetis´ee, C.R. Acad. Sci. Paris, 309, 195 - 200. 35

[21] Franklin, J.N. (1970), Well-posed stochastic extension of ill-posed linear problems, Journal of Mathematical Analysis and Applications, 31, 682 - 716. [22] Hanson, T. and W.O., Johnson (2002), Modeling Regression Error With a Mixture of Polya Trees, Journal of the American Statistical Association, 97, 1020-1033. [23] Hiroshi, S. and O., Yoshiaki (1975), Separabilities of a Gaussian Measure, Annales de l’I.H.P., section B, tome 11, 3, 287 - 298. [24] Hjort, N.L. (1990), Nonparametric Bayes Estimators Based on Beta Processes in Models for Life History Data, The Annals of Statistics, Vol.18, 3, 1259-1294. [25] Hjort, N.L. (1996), Bayesian Aprroaches to Non- and Semiparametric Density Estimation, Bayesian Statistics 5 (J.M. Bernardo et al., eds.), 223-253. [26] Ishwaran, H. and L., James (2004), Computational Methods for Multiplicative Intensity Models Using Weighted Gamma Processes: Proportional Hazards, Marked Point Processes, and Panel Count Data, Journal of the American Statistical Association, vol.99, 465, 175-190. [27] Kaipio, J., and E., Somersalo (2004), Statistical and Computational Inverse Problems, Applied Mathematical Series, vol.160, Springer, Berlin. [28] Kress, R. (1999), Linear Integral Equation, Springer. [29] Lvine, M. (1992) Some Aspects of Polya Tree Distributions for Statistical Modelling, The Annals of Statistics, Vol.20, 3, 1222-1235. [30] Lehtinen, M.S., P¨aiv¨arinta, L. and E., Somersalo (1989), Linear Inverse Problems for Generalised Random Variables, Inverse Problems, 5, 599-612. [31] Lenk, P.J. (1988), The Logistic Normal Distribution for Bayesian, Nonparametric, Predictive Densities, Journal of the American Statistical Association, Vol.83, 402, 509-516. [32] Mandelbaum, A. (1984), Linear Estimators and Measurable Linear Transformations on a Hilbert Space, Z. Wahrcheinlichkeitstheorie, 3, 385-98. [33] Neveu, J. (1965), Mathematical Fundations of the Calculus of Probability, San Francisco: Holden-Day. [34] Neveu, J. (1975), Discrete Parameter Martingales, Amsterdam: North-Holland. [35] Prenter, P.M. and C.R., Vogel (1985), Stochastic Inversion of Linear First Kind Integral Equations. I. Continuous Theory and the Stochastic Generalized Inverse, Journal of Mathematical Analysis and Applications, 106, 202 - 212. [36] Petrone S. (1999) Bayesian Density Estimation Using Bernstein Polynomials, The Canadian Journal of Statistics, Vol.27, 1, 15-126. [37] Rasmussen, C.E. and C.K.I., Williams (2006), Gaussian Processes for Machine Learning, The MIT Press. [38] Ruggiero, M. (1994), Bayesian Semiparametric Estimation of Proportional Hazards Models, Journal of Econometrics, 62, 277 - 300. [39] Smith, M. and R., Kohn (1996), Nonparametric Regression Using Bayesian Variable Selection, Journal of Econometrics, 75, 317-343. [40] Van der Vaart, A.W., and J.H., Van Zanten (2000), Rates of Contraction of Posterior Distributions Based on Gaussian Process Priors, Working paper. [41] Vapnik, V.N. (1998), Statistical Learning Theory, John Wiley & Sons, Inc. [42] Walker, S. and P., Muliere (1997),Beta-Stacy Processes and a Generalization of the Polya-Urn Scheme, The Annals of Statistics, Vol. 25, 4, 1762-1780. 36

[43] Zellner, A. (1986), On Assessing Prior Distributions and Bayesian Regression Analysis with g-prior Distribution, in: Goel, P.K. and Zellner, A. (eds.), Bayesian Inference and Decision Techniques: Essays in Honour of Bruno de Finetti, 233-243, North Holland, Amsterdam.
[44] Zhao, L.H. (2000), Bayesian Aspects of Some Nonparametric Problems, Annals of Statistics, 28, 532-552.


[Figure 6: Monte Carlo simulation. Panel (a): σ = 1, θ = 0.5, ω0 = 10; panel (b): σ = 1.5, θ = 0.5, ω0 = 10. Each panel plots the Monte Carlo mean, the prior mean and the true regression.]


[Figure 7: Panels (7a), (7c) and (7e): estimation for different prior means; panels (7b), (7d) and (7f): Monte Carlo experiment with N = 100, α = 0.05, 50 iterations. Prior means: m0(w) = cos(w)sin(w) in panels (7a)-(7b), m0(w) = 0.067w − 0.2 in panels (7c)-(7d), m0(w) = 0 in panels (7e)-(7f). Prior covariance operator: Ω0ϕ(w1) = c ∫ exp(−(w1 − w2)²)ϕ(w2)f(w2)dw2, with c = 2 in panels (7a)-(7b) and c = 10 in panels (7c)-(7f). Each panel plots the regularized posterior mean (the Monte Carlo mean of it in the right-hand panels), the true regression and the prior mean.]
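As a purely illustrative aid, the action of the prior covariance operator appearing in the panel titles of Figures 7 and 8, Ω0ϕ(w1) = c ∫ exp(−(w1 − w2)²)ϕ(w2)f(w2)dw2, can be approximated by simple quadrature on a grid. The Python sketch below is not the code used to produce the figures: the grid, the weighting density f (taken here to be standard normal) and the test function are assumptions made only for this example.

# Minimal sketch (assumptions: uniform grid, standard normal f, illustrative test function).
# Approximates (Omega_0 phi)(w1) = c * integral of exp(-(w1 - w2)^2) phi(w2) f(w2) dw2
# by a Riemann sum over the grid points w2.
import numpy as np

def omega0(phi_vals, w, f_vals, c=2.0):
    """Discretized prior covariance operator applied to phi on the grid w."""
    dw = w[1] - w[0]                                   # uniform grid spacing
    K = np.exp(-(w[:, None] - w[None, :]) ** 2)        # kernel exp(-(w1 - w2)^2) on the grid
    return c * K @ (phi_vals * f_vals) * dw            # Riemann sum for every evaluation point w1

# Example use: prior mean of panels (7a)-(7b), m0(w) = cos(w)sin(w), as test function.
w = np.linspace(-4.0, 4.0, 201)
f = np.exp(-0.5 * w ** 2) / np.sqrt(2.0 * np.pi)       # assumed weighting density
m0 = np.cos(w) * np.sin(w)
Omega0_m0 = omega0(m0, w, f, c=2.0)                    # values of (Omega_0 m0)(w1) on the grid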


[Figure 8: Panels (8a), (8c) and (8e): estimation for different prior means; panels (8b), (8d) and (8f): Monte Carlo experiment with N = 100, α = 0.05, 50 iterations. Prior means: m0(w) = cos(w)sin(w) in panels (8a)-(8b), m0(w) = 0.067w − 0.2 in panels (8c)-(8d), m0(w) = 0 in panels (8e)-(8f). Prior covariance operator: Ω0ϕ(w1) = c ∫ exp(−(w1 − w2)²)ϕ(w2)f(w2)dw2, with c = 2 in panels (8a)-(8b) and c = 20 in panels (8c)-(8f). Each panel plots the regularized posterior mean (the Monte Carlo mean of it in the right-hand panels), the true regression and the prior mean.]

