Lecture notes on empirical process theory∗ Kengo Kato† October 30, 2017



First version: September 2012. These notes are only lightly proofread and there could be a lot of (hopefully small) mistakes. † Graduate School of Economics, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan. E-mail: [email protected]

1

Contents 1 Symmetrization 6 1.1 Symmetrization inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.2 The contraction principle . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 1.3 L´evy and Hoffmann-Jørgensen inequalities . . . . . . . . . . . . . . . . . . 10 2 Maximal inequalities 14 2.1 Young moduli . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.2 Maximal inequalities based on covering numbers . . . . . . . . . . . . . . 17 2.3 Applications to empirical processes . . . . . . . . . . . . . . . . . . . . . . 22 3 Limit theorems 3.1 Weak convergence of sample-bounded stochastic processes 3.2 Uniform law of large numbers . . . . . . . . . . . . . . . . 3.3 Uniform central limit theorem . . . . . . . . . . . . . . . . 3.4 Application: CLT in C-space . . . . . . . . . . . . . . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

29 29 41 43 48

4 Covering numbers 51 4.1 VC subgraph classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 4.2 VC type classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 5 Gaussian processes 5.1 Gaussian concentration inequality . . . . . . . . . . 5.2 Second proof of Borell-Sudakov-Tsirel’son theorem 5.3 Proof of Gross’s log-Sobolev inequality . . . . . . . 5.4 Size of expected suprema . . . . . . . . . . . . . . 5.5 Absolute continuity of Gaussian suprema . . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

6 Rademacher processes

59 59 64 65 68 74 80

7 Talagrand’s concentration inequality 7.1 Two technical theorems . . . . . . . . . . . . . 7.2 Proof of Talagrand’s concentration inequality . 7.3 A “statistical version” of Talagrand’s inequality 7.4 A Fuk-Nagaev type inequality . . . . . . . . . . 8 Rudelson’s inequality

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

83 84 90 93 96 98

2

Preface These lecture notes are written for an introduction of modern empirical process theory to students majoring in statistics and econometrics who are familiar with measure theoretic probability. The materials of these notes are mostly gathered from Gin´e’s lecture notes (Gin´e, 2007) and the textbook by van der Vaart and Wellner (1996). Billingsley (1968), Ledoux and Talagrand (1991), Dudley (1999), and the recent monograph by Gin´e and Nickl (2016) are also indispensable references on this topic. I will also review some basic results on Gaussian processes (cf. Adler, 1990; Davydov et al., 1998; Li and Shao, 2001), measure concentration (cf. Ledoux, 2001; Boucheron et al., 2013), and non-asymptotic analysis of random matrices (cf. Vershynin, 2010; Tropp, 2012b). The notes consist of eight sections. The main materials covered are: • the maximal inequalities due essentially to Dudley (1967) and Pisier (1986), with applications to empirical processes (Section 2); • the characterization of weak convergence of sample-bounded stochastic processes, due essentially to Hoffmann-Jørgensen (1991) and Andersen and Dobri´c (1987) (Section 3); • the Dudley-Koltchinskii-Pollard (Dudley, 1978; Koltchinskii, 1981; Pollard, 1982) uniform central limit theorem for empirical processes (Section 3); • the Gaussian concentration inequality due to Borell (1975) and Sudakov and Tsirel’son (1978), with proofs due to Pisier (1989) and Ledoux (1996) (Section 5); • Talagrand’s (1996) concentration (deviation) inequality for general empirical processes, with a proof due to Ledoux (1996), Massart (2000) and Boucheron et al. (2003) (Section 7); • Rudelson’s inequality, with a proof due to Oliveira (2010) and Tropp (2012b) (Section 8). I did not try to make these notes complete nor self-contained, and so state some intermediate results without proofs. But otherwise I tried to provide as elementary proofs as possible, and I believe that potential readers (if exist) can follow the notes without much effort. In these notes, in order to keep the exposition simple, I exclusively assume that classes of functions are (essentially) countable (more precisely, pointwise measurable). This allows us to help going into measurability problems. However, I must clarify that in most cases (except for Talagrand’s inequality, for which pointwise measurability is essential) this condition can be totally dispensed or replaced by other (mild) conditions. See Section 2.3 of van der Vaart and Wellner (1996) and Chapter 5 of Dudley (1999). In any case, in these notes, we are bit loose about measurability.

3

Notation and setting • Let (Ω, A, P) be an underlying (complete, if necessary) probability space that should be understood in the context. • For any non-empty set T , let `∞ (T ) denote the space of all bounded functions T → R, equipped with the uniform norm kf kT := supt∈T |f (t)|. For a given nonempty set T , a non-negative function d : T × T → R+ is call a semi-metric (or pseudo-metric) if it satisfies the following three properties for all s, t, u ∈ T : (i) d(t, t) = 0; (ii) (symmetry) d(s, t) = d(t, s); (iii) (triangle inequality) d(s, u) ≤ d(s, t) + d(t, u). If in addition d(s, t) = 0 ⇒ s = t, then d is a metric. Equipped with a semi-metric d, (T, d) is called a semi-metric space. For any semi-metric space (T, d), let Cu (T, d) denote the space of all bounded uniformly d-continuous functions f : T → R, equipped with the uniform norm k · kT . If (T, d) is totally bounded, then any uniformly continuous function of T is bounded, and so Cu (T, d) is just the set of all uniformly continuous functions on T . • For any probability measure Q on a measurable space (S, S) andRany measurable function f : S → R = [−∞, ∞], we use the notation Qf := f dQ whenever R f dQ exists. Further, for 1 ≤ p < ∞, let Lp (Q) denote the space of all measurable functions f : S → R such that kf kQ,p := (Q|f |p )1/p < ∞. We also use the notation kf k∞ := supx∈S |f (x)|. • The standard norm on aqEuclidean space is denoted by | · |; that is, for a = Pn 2 (a1 , . . . , an )> ∈ Rn , |a| = i=1 ai . For a matrix A, let kAkop denote the operator norm of A, that is, when A has d columns, kAkop := supx∈Rd ,|x|=1 |Ax|. • For a, b ∈ R, let a ∨ b = max{a, b} and a ∧ b = max{a, b}. Further, let a+ = a ∨ 0 and a− = (−a) ∨ 0, so that a = a+ − a− . w

P

• Let → denote weak convergence, and let → denote convergence in probability. Unless otherwise stated, we shall obey the following setting. • Let (S, S, P ) be a probability space. Let X1 , X2 , . . . be i.i.d. S-valued random variables with common distribution P . We think of X1 , X2 , . . . , when they appear, as the coordinates of the infinite product probability space (S N , S N , P N ), which may be embedded in a larger probability space (e.g. when the symmetrization is used). For n ∈ N, the empirical probability measure is defined by n

1X Pn := δXi . n i=1

For example,

n

Z Pn f =

f dPn =

1X f (Xi ). n i=1

4

• Let F be a non-empty collection of measurable functions S → R, to which a measurable envelope F : S → R+ = [0, ∞) is attached. An envelope F of F is a function S → R+ such that F (x) ≥ supf ∈F |f (x)| for all x ∈ S. Unless otherwise stated, we at least assume that F ⊂ L1 (P ). Further, to avoid measurability problems, we assume that F is pointwise measurable, that is, it contains a countable subset G such that for every f ∈ F there exists a sequence gm ∈ G with gm (x) → f (x) for all x ∈ S. We note here that if F ∈ L1 (P ), then by the dominated convergence theorem, {f − P f : f ∈ F } is also pointwise measurable. See Section 2.3 of van der Vaart and Wellner (1996) for the discussion on pointwise measurability. The existence of a measurable envelope is indeed an assumption. Under pointwise measurability, a measurable envelope exists if and only if F is pointwise bounded (that is, supf ∈F |f (x)| < ∞ for each x ∈ S); indeed the necessity is obvious, and for the sufficiency take F = supf ∈F |f | = supf ∈G |f |. The function F = supf ∈F |f | is the minimal envelope but we allow for other choices.

5

1

Symmetrization

The main object of these lecture notes is to study probability estimates of the random quantity kPn − P kF := sup |Pn f − P f |, f ∈F

and limit theorems for the empirical process (Pn −P )f, f ∈ F. To do so, the symmetrization technique plays an essential role. The symmetrization replaces Pn (or randomization)P n (f (X ) − P f ) by i i=1 i=1 εi f (Xi ) with independent Rademacher random variables ε1 , . . . , εn independent of X1 , . . . , Xn . A Rademacher random variable ε is a random variable taking ±1 with equal probability, that is, 1 P(ε = 1) = P(ε = −1) = . 2 The advantage of the symmetrization lies in the fact that the symmetrized process is typically easier to control than Pn the original process, as we will find out in several Pn places. For example, even though i=1 (f (Xi ) − P f ) has only low order moments, i=1 εi f (Xi ) is sub-Gaussian conditionally on X1 , . . . , Xn . In what follows, Eε denotes the expectation with respect to ε1 , ε2 , . . . only; likewise, EX denotes the expectation with respect to X1 , X2 , . . . only.

1.1

Symmetrization inequalities

The following is a simplest symmetrization inequality. Theorem 1. Suppose that P f = 0 for all f ∈ F. Let ε1 , . . . , εn be independent Rademacher random variables independent of X1 , . . . , Xn . Let Φ : R+ → R+ be a non-decreasing convex function, and let µ : F → R be a bounded functional such that {f + µ(f ) : f ∈ F} is pointwise measurable. Then

!#

!# " " n n

X 1

X

E Φ ≤E Φ εi f (Xi ) f (Xi )



2 i=1 i=1 F

!# " F n

X

≤ E Φ 2 εi (f (Xi ) + µ(f )) . (1)

i=1

Proof. We begin with proving the left inequality. sets A, B ⊂ {1, . . . , n},

!# " "

X

f (Xi ) ≤E Φ E Φ

i∈A

F

F

We claim that for any disjoint index

!#

X

. f (Xi )

i∈A∪B

(2)

F

Indeed, because of pointwise measurability, there exists a countable subset G ⊂ F such that for any f ∈ F there exists a sequence gm ∈ G with gm → f pointwise. Then



" #

X

X

X X



f (X ) = f (X ) + E f (X ) = f (X )

i i i i



i∈A

F

i∈A

i∈A

G

6

i∈B

G

because P f = 0 for each f ∈ F. Fixing any xi ∈ S, i ∈ A, we have that 

 " #

X

X X X



 f (xi ) + E f (Xi ) ≤ E f (xi ) + f (Xi )  ,



i∈A

i∈B

i∈A

G

i∈B

G

and since Φ is non-decreasing and convex,   

 " # 

X

X

X X



Φ  f (xi ) + E f (Xi )  ≤ Φ E  f (xi ) + f (Xi ) 



i∈A i∈B i∈A i∈B G G    

X

X

≤ E Φ  f (xi ) + f (Xi )  ,

i∈A

i∈B

G

where the second inequality follows from Jensen’s inequality (formally if the expectation inside Φ is infinite, apply Jensen’s inequality after truncation, P and then take the P limit). Applying Fubini’s theorem and using the fact that k i∈A∪B f (Xi )kG = k i∈A∪B f (Xi )kF , we obtain the inequality (2). From this, we have

!# " n

X

EX Φ εi f (Xi )

i=1 F  



X

X

= EX Φ  f (Xi ) − f (Xi ) 

εi =1 εi =−1 F      





X X 1 1



≤ EX Φ 2 f (Xi )  + EX Φ 2 f (Xi ) 



2 2 εi =1 εi =−1 F

!#F " n

X

≤ E Φ 2 . f (Xi )

i=1

F

An application of Fubini’s theorem leads to the left inequality in (1). For the opposite inequality, using the argument used to prove the inequality (2), we have that

n

!#

n

!# " "

X

X



E Φ f (Xi ) = E Φ (f (Xi ) − E[f (Xn+i )])



i=1 i=1 F F

!# " n

X

. (3) ≤ E Φ (f (Xi ) − f (Xn+i ))

i=1

d

F

Because (Xi , Xn+i ) = (Xn+i , Xi ) for each 1 ≤ i ≤ n, and (Xi , Xn+i ), 1 ≤ i ≤ n are

7

independent, the last expression in (3) is equal to

!# " n

X

E Φ εi (f (Xi ) − f (Xn+i ))

i=1 F

n " !#

X 1

≤ E Φ 2 εi (f (Xi ) + µ(f ))

2 i=1 F

!# " n

X 1

εi (f (Xn+i ) + µ(f )) + E Φ 2

2 i=1 F

!# " n

X

= E Φ 2 εi (f (Xi ) + µ(f )) .

i=1

F

This completes the proof. We will often use the symmetrization inequality with Φ(x) = xp for some p ≥ 1 and µ(f ) = P f when F is not P -centered. In that case, we have

p #

p #

p # " n " n " n

X

X

X 1



p E ε (f (X ) − P f ) ≤ E (f (X ) − P f ) ≤ 2 E ε f (X ) .



i i i i i



2p i=1

i=1

F

i=1

F

F

There is an analogous symmetrization inequality for probabilities. Theorem 2. Let ε1 , . . . , εn be independent Rademacher random variables independent of X1 , . . . , Xn . Let µ : F → R be a bounded functional such that {f + µ(f ) : f ∈ F} is pointwise measurable. Then for every x > 0,

( n ) ( n )

X

X



βn (x)P f (Xi ) > x ≤ 2P 4 εi (f (Xi ) + µ(f )) > x ,



i=1

i=1

F

F

Pn

where βn (x) is any constant such that βn (x) ≤ inf f ∈F P {| i=1 f (Xi )| < x/2}. In particular, when P f = 0 for all f ∈ F, we may take βn (x) = 1 − (4n/x2 ) supf ∈F P f 2 . Proof. The second Pn assertion follows from Markov’s inequality. ˜We shall prove the first assertion. If k i=1 f (Xn+i )kFP> x, then there is a function f ∈ F (that may depend on Xn+1 , . . . , X2n ) such that | ni=1 f˜(Xn+i )| > x. Fix Xn+1 , . . . , X2n . For such f˜, we have ( n ) x X f˜(Xi ) < | Xn+1 , . . . , X2n βn (x) ≤ P 2 i=1 ) ( n X x ≤ P (f˜(Xi ) − f˜(Xn+i )) > | Xn+1 , . . . , X2n 2 i=1

( n )

X x

≤ P (f (Xi ) − f (Xn+i )) > | Xn+1 , . . . , X2n .

2 i=1

F

8

˜ The far left and right hand Pn sides do not depend on f , and the inequality between them is valid on the event {k i=1 f (Xn+i )kF > x}. Hence we have

) ( n ) ( n

X

X

x



. βn (x)P f (Xn+i ) > x ≤ P (f (Xi ) − f (Xn+i )) >



2 i=1

i=1

F

F

d

Because (Xi , Xn+i ) = (Xn+i , Xi ) for each 1 ≤ i ≤ n, and (Xi , Xn+i ), 1 ≤ i ≤ n are independent, the last expression is equal to

( n )

X

x

P εi (f (Xi ) − f (Xn+i )) >

2 i=1

F ) ( n

X

x

≤P εi (f (Xi ) + µ(f )) >

4 i=1 F ( n )

X

x

+P εi (f (Xn+i ) + µ(f )) >

4 i=1

F ) ( n

X

x

. = 2P εi (f (Xi ) + µ(f )) >

4 i=1

F

This completes the proof.

1.2

The contraction principle

A function ϕ : R → R is called a contraction if |ϕ(y) − ϕ(x)| ≤ |y − x| for all x, y ∈ R. Theorem 3. Let Φ : R+ → R+ be a non-decreasing convex function. Let T be a non-empty bounded subset of Rn , and let ε1 , . . . , εn be independent Rademacher random variables. Let ϕi : R → R, 1 ≤ i ≤ n be contractions with ϕi (0) = 0. Then n !# n !# " " X X 1 E Φ sup ϕi (ti )εi ≤ E Φ sup εi t i . 2 t∈T t∈T i=1

i=1

Proof. See Ledoux and Talagrand (1991), Theorem 4.12. The following corollary is a simple but important consequence of the contraction principle. Corollary 1. Let σ 2 > 0 be any positive constant such that σ 2 ≥ supf ∈F P f 2 . Let ε1 , . . . , εn be independent Rademacher random variables independent of X1 , . . . , Xn . Then

#

n

# " " n

X

X



εi f (Xi ) . f 2 (Xi ) ≤ nσ 2 + 8E max F (Xi ) E



1≤i≤n i=1

i=1

F

9

F

Proof. By the triangle inequality, n n X X f 2 (Xi ) ≤ nP f 2 + (f 2 (Xi ) − P f 2 ) , i=1

i=1

from which, together with the symmetrization inequality (Theorem 1), we have

#

# " n " n

X

X



E f 2 (Xi ) ≤ nσ 2 + 2E εi f 2 (Xi ) .



i=1

i=1

F

F

Fix X1 , . . . , Xn . Let M = max1≤i≤n F (Xi ). Define the function ϕ : R → R by  2 if x > M  M ϕ(x) = x2 if −M ≤ x ≤ M .   2 M if x < −M Then ϕ is Lipschitz continuous with Lipschitz constant bounded by 2M , that is |ϕ(x) − ϕ(y)| ≤ 2M |x − y|, x, y ∈ R. Hence by the contraction principle (Theorem 3) applied to ϕ(·)/(2M ), we have

#

# " n " n

X

X



2 Eε εi f (Xi ) ≤ 4M Eε εi f (Xi ) .



i=1

i=1

F

F

This completes the proof.

1.3

L´ evy and Hoffmann-Jørgensen inequalities

In this subsection, let ε1 , . . . , εn be independent Rademacher random variables independent of X1 , . . . , Xn , and let Sk (f ) =

k X

εi f (Xi ), k = 1, . . . , n.

i=1

Further, we take F = supf ∈F |f |. For the notational convenience, write kSk k = kSk kF = sup |Sk (f )|. f ∈F

Let Ak denote the σ-field generated by (ε1 , X1 ), . . . , (εk , Xk ). Proposition 1 (L´evy). For every t > 0,   P max kSk k > t ≤ 2P{kSn k > t}, 1≤k≤n   P max F (Xi ) > t ≤ 2P{kSn k > t}. 1≤i≤n

10

(4) (5)

Therefore, for every 0 < p < ∞,     p p p E max kSk k ≤ 2E [kSn k ] , E max F (Xi ) ≤ 2E [kSn kp ] . 1≤i≤n

1≤k≤n

(6)

Proof. Define τ = inf{1 ≤ k ≤ n : kSk k > t} with the convention that inf ∅ = ∞, and P P (k) Sn (f ) = ki=1 εi f (Xi ) − ni=k+1 εi f (Xi ). Note that for each k = 1, . . . , n, {τ = k} = {kSj k ≤ t (∀j = 1, . . . , k − 1) & kSk k > t} ∈ Ak . (k)

Further, the conditional distribution of kSn k given Ak is identical to that of kSn k. Hence, on the one hand, we have P{kSn k > t} =

n X

P{kSn k > t, τ = k} =

k=1

n X

P{kSn(k) k > t, τ = k}.

k=1 (k)

On the other hand, since 2kSk k ≤ kSn k + kSn k, we have the inclusion relation {τ = k} = {τ = k, kSk k > t} ⊂ {τ = k, kSn k > t} ∪ {τ = k, kSn(k) k > t}. Hence  P

 X n P(τ = k) max kSk k > t =

1≤k≤n



n X

k=1

n n o X (k) P{τ = k, kSn k > t} + P τ = k, kSn k > t = 2P{kSn k > t},

k=1

k=1

which leads to inequality (4). For the second inequality (5), redefine τ = inf{1 ≤ k ≤ n : F (Xk ) > t} and P (k) Sn (f ) = − i6=k εi f (Xi ) + εk f (Xk ). Then P{kSn k > t} =

n X

P{kSn k > t, τ = k} =

k=1

k=1 (k) kSn k,

Using the inequality 2F (Xk ) ≤ kSn k +   X n P max F (Xk ) > t = P(τ = k) 1≤k≤n



n X

we have

k=1

P{kSn k > t, τ = k} +

k=1

n n o X P kSn(k) k > t, τ = k .

n n o X P kSn(k) k > t, τ = k = 2P{kSn k > t}, k=1

which gives inequality (5). The last two inequalities in (6) follow from (4) and (5), and the formula Z ∞ p E[|ξ| ] = ptp−1 P(|ξ| > t)dt. 0

This completes the proof. 11

Example 1. As a simple application of the symmetrization technique and L´evy’s inequality, we shall prove the (weak) classical Glivenko-Cantelli theorem that states for i.i.d. random variables X1 , X2 , . . . in R with common distribution function F , n 1 X P sup 1(−∞,x] (Xi ) − F (x) → 0, n → ∞. n x∈R i=1

Indeed, we shall prove a stronger assertion: n # " 1 X 4 E sup 1(−∞,x] (Xi ) − F (x) ≤ √ . n n x∈R i=1

Proof. The first step is to use the symmetrization inequality (Theorem 1), by which we can bound the left hand side by # " n 1 X 2E sup εi 1(−∞,x] (Xi ) . x∈R n i=1

Fix X1 , . . . , Xn . Let σ be a permutation of {1, . . . , n} such that Xσ(1) ≤ · · · ≤ Xσ(n) . Then n k 1 X 1 X εi 1(−∞,x] (Xi ) = max εσ(i) . sup 1≤k≤n n n x∈R i=1

i=1

Conditionally on {X1 , . . . , Xn }, εσ(1) , . . . , εσ(n) are still independent Rademacher random variables, so that L´evy’s inequality (6) implies that # # " " n " n # k 1 X 1 X 1 X 2 Eε max εσ(i) ≤ 2Eε εσ(i) = 2Eε εi ≤ √ , n n 1≤k≤n n n i=1

i=1

i=1

which leads to the desired claim. The following results are due to Hoffmann-Jørgensen. Proposition 2. For every s > 0 and t > 0, 2



P {kSn k > 2t + s} ≤ 4 (P {kSn k > t}) + P

 max F (Xi ) > s .

(7)

1≤i≤n

Proof. Define τ = inf{1 ≤ k ≤ n : kSk k > t}. Then {τ = k} ∈ Ak and P{max1≤k≤n kSk k > t}. For k = 1, . . . , n,

Pn

k=1 P(τ

= k) =

kSn k ≤ kSk−1 k + F (Xk ) + kSn − Sk k, so that P{τ = k, kSn k > 2t + s} ≤ P{τ = k, F (Xk ) > s} + P{τ = k, kSn − Sk k > t} = P{τ = k, F (Xk ) > s} + P(τ = k)P{kSn − Sk k > t}     ≤ P τ = k, max F (Xi ) > s + P(τ = k)P max kSj k > t . 1≤i≤n

1≤j≤n

12

Summing over k gives    2 P{kSn k > 2t + s} ≤ P max F (Xi ) > s + P max kSj k > t 1≤i≤n 1≤j≤n   ≤ P max F (Xi ) > s + 4(P{kSn k > t})2 , 

1≤i≤n

where the second inequality is due to L´evy’s inequality (4). Proposition 3. Let 0 < p < ∞, and let t0 := inf{t > 0 : P{kSn k > t} ≤ (8 · 3p )−1 }. Then   E[kSn kp ] ≤ 2 · 3p E max F p (Xi ) + 2(3t0 )p . (8) 1≤i≤n

Proof. Let u > t0 . Then by the previous inequality, Z ∞ p p ptp−1 P{kSn k > 3t}dt E[kSn k ] = 3 0 Z u Z ∞ p =3 ptp−1 P{kSn k > 3t}dt + u 0 Z ∞ ≤ (3u)p + 3p ptp−1 P{kSn k > 3t}dt u   Z ∞ Z ∞ 2 p−1 p p p−1 p pt P max F (Xi ) > t dt ≤ (3u) + 4 · 3 pt (P{kSn k > t}) dt + 3 1≤i≤n u u   Z ∞ p p p−1 p p ≤ (3u) + 4 · 3 P {kSn k > u} pt P {kSn k > t} dt + 3 E max F (Xi ) 1≤i≤n u   ≤ (3u)p + (1/2)E[kSn kp ] + 3p E max F p (Xi ) . 1≤i≤n

Letting u ↓ t0 , we obtain the desired inequality. In Proposition 3, we have P{kSn k > 8 · 3p E[kSn k]} ≤ (8 · 3p )−1 by Markov’s inequality, so that t0 ≤ 8 · 3p E[kSn k]. Combining the symmetrization inequality (Theorem 1), we have proved the following theorem on comparison between the Lp and L1 norms for the supremum of the empirical process. Theorem 4. For every 1 < p < ∞, there exists a constant Cp > 0 depending only on p such that (   1/p ) (E[kPn − P kpF ])1/p ≤ Cp E[kPn − P kF ] + n−1 E max F p (Xi ) . 1≤i≤n

An inspection of the proof gives an explicit dependence of Cp on p, which is however too crude (indeed, Cp would be exponential in p). The best possible rate of Cp is known as Cp ∼ p/ log p as p → ∞. The proof of this fact is lengthly and is not pursued here. We refer the interested reader to Ledoux and Talagrand (1991), Chapter 6. 13

2

Maximal inequalities

This section is concerned with bounding moments of the supremum of the empirical process:

p # " n

X

E (f (Xi ) − P f ) , 1 ≤ p < ∞.

i=1

F

Generally, maximum inequalities refer to inequalities “that bound probabilities involving suprema of random variables” (van der Vaart and Wellner, 1996, p.90). We first consider a more general situation in which we are interested in the supremum of a generic stochastic process indexed by a semi-metric space. These general results will be applied to empirical processes, in which the symmetrization technique plays a key role.

2.1

Young moduli

This subsection is a preliminary section and studies the properties of Young moduli. These properties will be used in the following subsections. Definition 1. A strictly increasing convex function ψ : [0, ∞) → [0, ∞) with ψ(0) = 0 is called a Young modulus. The associated Orlicz norm of a random variable ξ is defined by kξkψ := inf{c > 0 : E[ψ(|ξ|/c)] ≤ 1}. We will verify that the Orlicz norm is indeed a norm on the space of all random variables ξ (modulo a.s. equivalence) such that kξkψ < ∞. p

Example 2. Typical examples of Young moduli are ψ(x) = xp or ex − 1 for 1 ≤ p < ∞. For ψ(x) = xp , the Orlicz norm reduces to the Lp -norm: kξkψ = (E[|ξ|p ])1/p . Let p 2 2 ψp (x) := ex − 1. Of particular importance is ψ2 (x) = ex − 1. Since ex − 1 = x2 + x4 /2 + · · · ≥ x2p /p!, we have (E[|ξ|2p ])1/(2p) ≤ (p!)1/(2p) kξkψ2 , p = 1, 2, . . . . This shows that for every (real) 1 ≤ p < ∞, (E[|ξ|p ])1/p ≤ Cp kξkψ2 ,

(9)

where Cp > 0 is a constant that depends only on p. A Young modulus ψ is an isomorphism from [0, ∞) onto itself. Indeed, as the 0˜ ˜ extension of ψ to R (i,e., ψ(x) := ψ(x) for x ∈ [0, ∞) and ψ(x) := 0 for x ∈ (−∞, 0)) remains convex and a convex function is continuous on the interior of its domain, ψ is continuous. Furthermore, because a non-constant, non-decreasing convex function from [0, ∞) to itself diverges as x → ∞, ψ is one-to-one from [0, ∞) onto itself (for convenience, we give a proof for the assertion: let ϕ : [0, ∞) → [0, ∞) be a non-constant, non-decreasing convex function. Suppose on the contrary ϕ is bounded and hence a finite 14

limit ϕ(∞) := limx→∞ ϕ(x) exists. Since ϕ is non-constant, there exists a point x0 ∈ [0, ∞) such that ϕ(x0 ) < ϕ(∞). The convexity of ϕ now implies that ϕ((x0 + x)/2) ≤ ϕ(x0 )/2 + ϕ(x)/2, and as x → ∞, ϕ(∞) ≤ ϕ(x0 )/2 + ϕ(∞)/2, that is, ϕ(∞) ≤ ϕ(x0 ), a contradiction). Since ψ is continuous and strictly increasing, the inverse function ψ −1 is also continuous. In what follows, let ψ be a Young modulus. Lemma 1. Let ξ be a random variable such that 0 < c := kξkψ < ∞. Then we have E[ψ(|ξ|/c)] = 1. Proof. Let cm be a sequence of positive constants such that E[ψ(|ξ|/cm )] ≤ 1 and cm ↓ c. By the monotone convergence theorem, we have E[ψ(|ξ|/c)] = lim E[ψ(|ξ|/cm )] ≤ 1, m→∞

which leads to the desired conclusion. Proposition 4. The Orlicz norm k · kψ is a norm on the space of all random variables ξ (modulo a.s. equivalence) such that kξkψ < ∞. Proof. It is not difficult to see that kaξkψ = |a|kξkψ for a ∈ R. Suppose that kξkψ = 0. By Jensen’s inequality, ψ(E[|ξ|]/c) ≤ E[ψ(|ξ|/c)] ≤ 1 for all c > 0, by which we conclude that E[|ξ|] = 0 and so ξ = 0 almost surely. It remains to prove the triangle inequality. Let ξi , i = 1, 2 be two random variables such that ci := kξi kψ < ∞, i = 1, 2. Without loss of generality, we may assume that ci > 0, i = 1, 2. Define λ := c1 /(c1 + c2 ). By the monotonicity and convexity of ψ, E[ψ(|ξ1 + ξ2 |/(c1 + c2 ))] ≤ E[ψ((|ξ1 | + |ξ2 |)/(c1 + c2 ))] = E[ψ(λ|ξ1 |/c1 + (1 − λ)|ξ2 |/c2 )] ≤ λE[ψ(|ξ1 |/c1 )] + (1 − λ)E[ψ(|ξ2 |/c2 )] = 1. (Lemma 1) This shows that kξ1 + ξ2 kψ ≤ c1 + c2 = kξ1 kψ + kξ2 kψ , completing the proof. Lemma 2. Let ξm be a sequence of random variables such that |ξm | ↑ |ξ| almost surely for some random variable ξ. Then we have kξm kψ ↑ kξkψ . Proof. Suppose first that kξkψ < ∞. Since kξm kψ ≤ kξkψ , there exists a finite constant c such that kξm kψ ↑ c. It c = 0, then ξm = 0 almost surely for all m ≥ 1 and so ξ = 0 almost surely. Otherwise, by the monotone convergence theorem, 1 ≥ limm→∞ E[ψ(|ξm |/c)] = E[ψ(|ξ|/c)], and so kξkψ ≤ c. Since kξkψ ≥ c, we conclude that kξkψ = c. It is immediate to see that kξm kψ ↑ ∞ if kξkψ = ∞ since otherwise kξm kψ is bounded and kξkψ is finite, a contradiction. Lemma 3. Convergence in k · kψ implies convergence in probability. 15

Proof. For δ > 0, kξkψ ≤ δ is equivalent to E[ψ(|ξ|/δ)] ≤ 1 by Lemma 1. Since ψ is onto [0, ∞), for every ε > 0, there exists a constant M > 0 such that ψ(M ) ≥ 1/ε. Therefore we have Z kξkψ ≤ δ ⇒ 1 ≥ ψ(|ξ|/δ)dP Z ψ(|ξ|/δ)dP ≥ |ξ|>M δ

≥ P(|ξ| > M δ)/ε.

This implies the desired conclusion. Theorem 5. Let ψ be a Young modulus such that lim sup x∧y→∞

ψ −1 (x2 ) ψ −1 (xy) < ∞, lim sup < ∞. −1 ψ −1 (x)ψ −1 (y) x→∞ ψ (x)

(10)

Then there exists a constant Cψ > 0 depending only on ψ such that for every sequence {ξk } of (not necessarily independent) random variables,



sup |ξk | ≤ Cψ sup kξk kψ .

−1 k k ψ (k) ψ p 2 Proof. We only prove the theorem for ψ2 (x) = ex − 1, for which ψ2−1 (x) = log(1 + x). It is not difficult to see that condition (10) is satisfied for ψ2 . By homogeneity, we may 2 assume that kξk kψ2 ≤ 1, that is, E[eξk ] ≤ 2. Let t ≥ 3/2. For k ≥ 9, (log k)−1 + (log t)−1 ≤ (log 9)−1 + (log(3/2))−1 ≤ 3. which implies that 3(log k)(log t) ≥ log k + log t = log(kt). Therefore (

(

2 )

)

(

) p |ξk | P exp sup > t = P sup √ > log t 6 log k k≥9 k≥9 ) ( ∞ X p |ξk | >1 ≤ P{|ξk | > 6(log k)(log t)} = P sup p 6(log k)(log t) k≥9 k=9 =

∞ X



|ξ | √ k 6 log k

2

P{e|ξk | > e6(log k)(log t) } ≤

k=9



∞ X k=9

∞ X k=9

2 e2 log k+2 log t

2 e6(log k)(log t)

∞ X 2 1 = ≤ 2, 2 2 k t 4t k=9

16

by which we conclude "

(



E exp sup k≥9

|ξ | √ k 6 log k

2 )# ≤

3 + 2

Z



3/2

1 dt < 2. 4t2

This completes the proof. From this theorem, we have



max |ξk | ≤ Cψ ψ −1 (N ) max kξkψ .

1≤k≤N

1≤k≤N

ψ

Furthermore, when ψ = ψ2 , we have ψ2−1 (N ) =

p log(1 + N ), and because of (9),

1/p   p p E max |ξk | ≤ Cp0 log(1 + N ) max kξk kψ2 , 1≤k≤N

1≤k≤N

for every 1 ≤ p < ∞, where Cp0 > 0 is a constant that depends only on p.

2.2

Maximal inequalities based on covering numbers

Let T be a non-empty set. A stochastic process X(t), t ∈ T is a collection of real-valued random variables, that is, for every t ∈ T , X(t) is a measurable real-valued function on Ω. Let (T, d) be a semi-metric space. A stochastic process X(t), t ∈ T is said to be separable if there exist a null set N and a countable subset T0 ⊂ T such that for every ω ∈ / N and t ∈ T , there exists a sequence tm in T0 with d(tm , t) → 0 and X(tm , ω) → X(t, ω). Note that the existence of a separable stochastic process forces T to be separable. Clearly, if T is separable and X has sample paths almost surely continuous, then X is separable.1 For a separable stochastic process X, supt∈T |Xt | is measurable because the supremum over T reduces to the supremum over a countable subset of T . Definition 2. Let (T, d) be a semi-metric space. For ε > 0, an ε-net of T is a subset Tε of T such that for every t ∈ T there exists a tε ∈ Tε with d(t, tε ) ≤ ε. The εcovering number N (T, d, ε) of T is the infimum of the cardinality of ε-nets of T , that is, N (T, d, ε) := inf{Card(Tε ) : Tε is an ε-net of T } where inf ∅ = +∞ by convention. Note that the map ε 7→ N (T, d, ε) is non-increasing, and T is totally bounded if and only if N (T, d, ε) < ∞ for all ε > 0. The covering number N (T, d, ε) is not monotonic in T in the sense that S ⊂ T does not necessarily imply that N (S, d, ε) ≤ N (T, d, ε). This is due to the fact that a net of T may not be a net of S since a member in the net of T may be outside S. Nevertheless we have the following lemma. Lemma 4. Let (T, d) be a semi-metric space. Then for every S ⊂ T , N (S, d, 2ε) ≤ N (T, d, ε), ∀ε > 0. 1

It is known that when (T, d) is separable, every stochastic process X(t), t ∈ T has a separable modification possibly taking values in the extended real line. See Gikhman and Skorohod (1974), p.167.

17

Proof. The lemma follows from the fact that an ε-ball centered at a point in T that intersects S is contained in a 2ε-ball centered at a point in S (draw a picture). The following is the main theorem of this subsection. Theorem 6 (Dudley (1967); Pisier (1986) etc.). Let (T, d) be a semi-metric space with diameter D, let X(t), t ∈ T be a stochastic process indexed by T , and let ψ be a Young modulus satisfying condition (10), such that kX(t) − X(s)kψ ≤ d(s, t), ∀s, t ∈ T.

(11)

Then there exists a constant Cψ > 0 depending only on ψ such that for every finite subset S of T ,

Z D

max |X(t)| ≤ kX(t0 )kψ + Cψ ψ −1 (N (T, d, ε))dε, ∀t0 ∈ T, (12)

t∈S

0

ψ

Z



max |X(t) − X(s)| ≤ Cψ

d(s,t)<δ;s,t∈S ψ

δ

ψ −1 (N (T, d, ε))dε, 0 < ∀δ ≤ D.

(13)

0

Furthermore, if X is separable, then S in inequalities (12) and (13) can be replace by T , with max replaced by sup. Proof. The last statement follows from the monotone convergence theorem (use Lemma 2). In what follows, let Cψ > 0 denote a generic constant that depends only on ψ whose value may change from place to place. We first prove (12). Without loss of generality, we may assume that t0 ∈ S (otherwise replace S by S ∪ {t0 }) and X(t0 ) = 0 (otherwise replace X(t) by X(t)−X(t0 )). In addition, we may assume that the integral on the right hand side of (12) is finite since otherwise there is nothing to prove. In this proof, we assume that D = 1. The proof for the general case follows from a simple modification. For each k = 0, 1, . . . , let Sk := {sk1 , . . . , skNk } be a minimal 2−k -net of S with Nk := N (S, d, 2−k ). Note that S0 consists of a single point, and without loss of generality we may take S0 = {t0 }. For each k, let πk : S → Sk be a map such that d(s, πk (s)) ≤ 2−k for all s ∈ S (by construction of Sk such πk must exist). Further, because S is finite, there exists a positive integer kS such that d(s, πk (s)) = 0 for all s ∈ S and all k ≥ kS .2 Because of (11), this means that X(s) = X(πk (s)) almost surely for all s ∈ S and all k ≥ kS . Hence we have the following decomposition for each s ∈ S: X(s) =

kS X {X(πk (s)) − X(πk−1 (s))}

a.s.

k=1 2

Since (T, d) is a semi-metric space, d(t, s) = 0 does not necessarily imply s = t.

18

Now since d(πk (s), πk−1 (s)) ≤ d(πk (s), s) + d(s, πk−1 (s)) ≤ 3 · 2−k , we have

kS X



max |X(s)| ≤

max |X(πk (s)) − X(πk−1 (s))|



s∈S

ψ



k=1 kS X k=1

s∈S

ψ



s∈S

max

−k k−1 ,t∈Sk ;d(s,t)≤3·2

|X(t) − X(s)|

. ψ

P S −1 By Theorem 5, the last line is bounded by Cψ kk=1 ψ (Nk Nk−1 )2−k , which is furPkS −1 ther bounded by Cψ k=1 ψ (N (S, d, 2−k ))2−k because of Nk−1 ≤ Nk and ψ −1 (x2 ) ≤ Cψ ψ −1 (x). Together with Lemma 4, k maxs∈S |X(s)|kψ is bounded by Cψ

kS X

ψ

−1

−(k+1)

(N (T, d, 2

−k

))2

≤ Cψ

k=0

∞ X

ψ −1 (N (T, d, 2−(k+1) ))2−(k+2)

k=1 1/4

Z ≤ Cψ

ψ −1 (N (T, d, ε))dε.

0

This completes the proof for the first inequality (12). For the second inequality, let 0 < δ ≤ D. Define U = {(s, t) : s, t ∈ S, d(s, t) < δ}, and Y (u) := X(tu ) − X(su ) for u = (su , tu ) ∈ U . On the set U , define the semi-metric ρ(u, v) := kY (v) − Y (u)kψ . The ρ-diameter of U is bounded by 2 supu∈U kY (u)kψ ≤ 2δ, and we also have kY (v) − Y (u)kψ ≤ kX(tv ) − X(tu )kψ + kX(su ) − X(sv )kψ ≤ d(tv , tu ) + d(sv , su ). Hence if {t1 , . . . , tN } is an ε-net of S, then {(ti , tj ) : 1 ≤ i, j ≤ N } is a 2ε-net of U . Some of (ti , tj ) may not be in U , but still we have N (U, ρ, 4ε) ≤ N 2 (S, d, ε) by Lemma 4. Therefore, applying the first inequality (12) to Y (u), u ∈ U , we have



Z 2δ





max |X(t) − X(s)| = max |Y (u)| ≤ Cψ ψ −1 (N (U, ρ, ε))dε

d(s,t)<δ;s,t∈S

Z ≤ Cψ



ψ

u∈U

ψ −1 (N 2 (S, d, ε/4))dε ≤ Cψ

0

ψ

Z

δ/2

0

ψ −1 (N (S, d, ε))dε.

0

This completes the proof. Historically, Theorem 6 was developed in investigating conditions under which X admits a continuous version. Recall that a version of a stochastic process X(t), t ∈ T is another stochastic process Y (t), t ∈ T such that for every t1 , . . . , tm ∈ T and m ∈ N, d

(X(t1 ), . . . , X(tm )) = (Y (t1 ), . . . , Y (tm )).

19

Suppose in Theorem 6 that Z

D

ψ −1 (N (T, d, ε))dε < ∞,

0

under which (T, d) is totally bounded and thus separable. Let T0 be a countable dense subset of T . By the monotone convergence theorem, S in inequalities (12) and (13) can be P

replaced by T0 , with max replaced by sup. By Lemma 3, supd(s,t)<δ,s,t∈T0 |X(t)−X(s)| → 0 as δ ↓ 0. Hence there exists a sequence δm ↓ 0 such that sup

|X(t) − X(s)| → 0

a.s.

d(s,t)<δm ,s,t∈T0

However since supd(s,t)<δ,s,t∈T0 |X(t) − X(s)| is non-decreasing in δ, it goes to 0 almost surely as δ ↓ 0. This discussion shows that there exists an event Ω0 ⊂ Ω with P(Ω0 ) = 1 such that supd(s,t)<δ,s,t∈T0 |X(t, ω) − X(s, ω)| → 0 as δ ↓ 0 for all ω ∈ Ω0 . In other words, the restriction of X to T0 has sample paths almost surely uniformly continuous. We shall verify the following lemma. Lemma 5. Let (T, d) a semi-metric space to which a dense subset T0 is attached, and let f : T0 → R be a uniformly continuous function. Then there exists a unique uniformly continuous function f˜ : T → R such that f˜ = f on T0 . Proof. The uniqueness trivially follows. Pick any t ∈ T , and let tm be a sequence in T0 such that d(tm , t) → 0. Then because f is uniformly continuous on T0 , {f (tm )}∞ m=1 is a Cauchy sequence in R, and so a finite limit f˜(t) := limm f (tm ) exists. To verify that f˜ is well-defined, take another sequence t0m in T0 with d(t0m , t) → 0. Then since d(t0m , tm ) → 0 and f is uniformly continuous on T0 , |f (t0m ) − f (tm )| → 0, which implies that limm f (t0m ) = limm f (tm ). The uniform continuity is shown as follows. By construction, f˜ is uniformly continuous on T0 . Pick any ε > 0. Choose δ > 0 in such a way that d(s, t) < δ & s, t ∈ T0 ⇒ |f (s) − f (t)| < ε. Now take any s, t ∈ T such that d(s, t) < δ/2. Then since T0 is dense in T , there exist two sequences sn and tn in T0 with d(sn , s) → 0 and d(tn , t) → 0. For large n, d(sn , tn ) ≤ d(sn , s) + d(s, t) + d(t, tn ) < δ, so that |f (sn ) − f (tn )| < ε. Thus |f (s) − f (t)| ≤ |f (s) − f (sn )| + |f (sn ) − f (tn )| + |f (tn ) − f (t)| < ε + |f (s) − f (sn )| + |f (tn ) − f (t)|. Taking n → ∞, we have |f (s) − f (t)| ≤ ε. This completes the proof. Lemma 5 ensures that a finite limit lims→t,s∈T0 X(s, ω) exists for every t ∈ T and ˜ ω ∈ Ω0 . Define the stochastic process X(t), t ∈ T by ( ˜ ω) = lims→t,s∈T0 X(s, ω) if t ∈ T, ω ∈ Ω0 . X(t, 0 otherwise ˜ is a version of X with almost all sample paths uniformly continuous. Hence Then X Theorem 6 leads to the following corollary. 20

Corollary 2. Let X(t), t ∈ T be a stochastic process indexed by a semi-metric space (T, d), and let ψ be a Young modulus satisfying condition (10), and such that kX(t) − X(s)kψ ≤ d(s, t) for all s, t ∈ T . Suppose that Z D ψ −1 (N (T, d, ε))dε < ∞, 0

˜ that has sample paths almost where D is the diameter of T . Then X admits a version X ˜ (and in fact any separable surely uniformly continuous. Furthermore, that version X version of X) verifies the inequalities

Z D

˜ ˜ 0 )kψ + Cψ

sup |X(t)|

≤ kX(t ψ −1 (N (T, d, ε))dε, ∀t0 ∈ T, (14)

t∈T

0

ψ

Z δ



˜ ˜ sup | X(t) − X(s)| ≤ C ψ −1 (N (T, d, ε))dε, 0 < ∀δ ≤ D,

ψ

d(s,t)<δ;s,t∈T

0

(15)

ψ

where Cψ > 0 is a constant that depends only on ψ. Example 3 (Gaussian processes). As a first example, we consider Gaussian processes. A stochastic process X(t), t ∈ T indexed by a non-empty set T is said to be Gaussian if for every t1 , . . . , tm ∈ T and m ∈ N, the joint distribution of X(t1 ), . . . , X(tm ) is normal. Let p X(t), t ∈ T be a centered Gaussian process. A direct calculation shows that kZkψ2 = 8/3 for Z ∼ N (0, 1), by which we have p kX(t) − X(s)kψ2 = 8/3(E[(X(t) − X(s))2 ])1/2 . p √ Since ψ2−1 (N ) = log(1 + N ) ≤ 2 log N for N ≥ 2, we obtain the following corollary (see also the proof of Lemma 7 ahead). Corollary 3 (Dudley (1967)). Let X(t), t ∈ T be a centered Gaussian process indexed by a non-empty set T . Consider the semi-metric ρ2 on T defined by ρ2 (s, t) := (E[(X(t) − X(s))2 )1/2 for s, t, ∈ T . Suppose that Z 1p log N (T, ρ2 , ε)dε < ∞. 0

˜ that has sample paths almost surely uniformly ρ2 -continuous. Then X admits a version X ˜ (and in fact any separable version of X) verifies the inFurthermore, that version X equalities

Z σp

˜

sup |X(t)|

≤C 1 + log N (T, ρ2 , ε)dε,

t∈T

ψ

0

2



˜ ˜ sup | X(t) − X(s)|

ρ2 (s,t)<δ;s,t∈T

Z ≤C

δ

p log N (T, ρ2 , ε)dε, ∀δ > 0,

0

ψ2

where σ 2 := supt∈T E[X 2 (t)] and C > 0 is a universal constant. 21

2.3

Applications to empirical processes

In this subsection we study applications of the general maximal inequalities established in Theorem 6 to empirical processes. We begin with proving the following version of Hoeffding’s inequality. Let ε1 , . . . , εn be independent Rademacher random variables. Lemma 6 (Hoeffding’s inequality). Let a1 , . . . , an ∈ R be constants such that at least one of a1 , . . . , an is non-zero. Then for every x > 0, ( n )   X x2 P P ai εi > x ≤ 2 exp − . 2 ni=1 a2i i=1

Proof. For λ > 0, E[eλε ] = and thus P

( n X

λ2 λ4 eλ + e−λ 2 =1+ + + · · · ≤ eλ /2 , 2 2 4! )

a i εi > x

h Pn i Pn 2 2 ≤ e−λx E eλ i=1 ai εi ≤ eλ i=1 ai /2−λx .

i=1

The first inequality is due to Markov’s inequality. Minimizing the right hand side with respect to λ gives ( n )   X x2 P . P ai εi > x ≤ exp − 2 ni=1 a2i i=1 Likewise, we have ( P −

n X i=1

) ai εi > x

x2 ≤ exp − Pn 2 i=1 a2i 

 .

The desired inequality follows from combining these inequalities. P Hoeffding’s inequality leads to a bound on the ψ2 -norm of the sum ni=1 ai εi . Recall 2 that ψ2 (x) = ex − 1. Observe that for λ > 0, nP o o oi Z ∞ n h nP 2 n P exp ( ni=1 ai εi )2 /λ) > x dx E exp ( i=1 ai εi ) /λ) = 0 ) Z ∞ ( X n p =1+ P ai εi > λ log x dx 1 i=1 Z ∞ P ≤1+2 e−b log x dx (set b = λ/(2 ni=1 a2i )) Z1 ∞ =1+2 t−b dt. 1

Take λ = 6

Pn

2 i=1 ai ,

so that b = 3. Then we have Z h nP Pn 2 oi 2 n E exp ( i=1 ai εi ) / 6 i=1 ai ≤1+2

1

22



t−3 dt = 2.

Hence we have

n

X

ai εi

i=1





6|a|,

(16)

ψ2

where a = (a1 , . . . , an )> . Lemma 7. Let T be a non-empty bounded subset of Rn with norm |t|n,2 := (n−1 Then

Z σn q n

1 X

log N (T ∪ {0}, | · |n,2 , ε)dε, εi ti ≤ C

sup √

t∈T n 0 i=1

Pn

2 1/2 . i=1 ti )

ψ2

where σn := supt∈T |t|n,2 and C > 0 is a universal constant. Proof. Let T˜ = T ∪ {0}. Define the stochastic process n

1 X X(t) := √ εi ti , t = (t1 , . . . , tn )> ∈ T˜. n i=1

p We shall apply Theorem 6 to this process with ψ = ψ2 . Note that ψ2−1 (N ) = log(1 + N ). By inequality (16), √ kX(t) − X(s)kψ2 ≤ 6|t − s|n,2 , ∀s, t ∈ T˜, √ so that X satisfies condition (11) with d(s, t) := 6|t − s|n,2 . Clearly all sample paths of X are d-continuous. Therefore, an application of Theorem 6 (with t0 = 0) gives

Z Dq n

1 X

log(1 + N (T˜, d, ε))dε, εi ti ≤ C

sup √

t∈T n 0 i=1

ψ2

where D is the d-diameter of T˜. Note that N (T˜, d, ε) ≥ 2 for 0 < ε < D/2. Since log(1 + N ) ≤ 2 log N for N ≥ 2, we have Z

D

Z q log(1 + N (T˜, d, ε))dε ≤ 2

0

D/2 q

log(1 + N (T˜, d, ε))dε

0

√ Z ≤2 2

D/2 q

log N (T˜, d, ε)dε.

0

The desired conclusion follows from a change of variables. Remark 1. When T is finite, then N (T ∪ {0}, | · |n,2 , ε) ≤ 1 + Card(T ) for any ε > 0, and so

n 1 X

p

√ εi ti ≤ Cσn log(1 + Card(T )). (17)

sup

t∈T n i=1

ψ2

23

Lemma 7 is powerful enough to provide various useful maximal inequalities for empirical processes. We present two such examples. Define Z J(δ, F, F ) = 0

δ

q sup 1 + log N (F, k · kQ,2 , εkF kQ,2 )dε, Q

where the supremum is taken over all finitely discrete distributions. Theorem 7. Let 1 ≤ p < ∞. Suppose that F ∈ Lp∨2 (P ). Then there exists a constant Cp > 0 depending only on p such that

p #!1/p " n

1 X

E √ ≤ Cp J(1, F, F )kF kP,p∨2 . (f (Xi ) − P f )

n

i=1

F

Proof. Let ε1 , . . . , εn be independent Rademacher random variables independent of X1 , . . . , Xn . Then by the symmetrization inequality (Theorem 1),

p #

p # " " n n

1 X

1 X



p E √ (f (Xi ) − P f ) εi f (Xi ) . ≤ 2 E √

n

n

i=1

i=1

F

F

Fix X1 , . . . , Xn . By inequality (9), there exists a constant Cp > 0 depending only on p such that

p #

p " n n





1 X X



p

1 Eε √ εi f (Xi ) ≤ Cp

√ εi f (Xi ) ,

n



n i=1

i=1

F

F ψ2 |X1 ,...,Xn

where k·kψ2 |X1 ,...,Xn denotes the k·kψ2 norm evaluated conditionally on X1 , . . . , Xn . Conditionally on X1 , . . . , Xn , apply Lemma 7 to the right hand side with T = {(f (X1 ), . . . , f (Xn )) : f ∈ F}. Then using the simple inequality supf ∈F (Pn f 2 )1/2 ≤ kF kPn ,2 , we may deduce from Lemma 7 that

Z kF kP ,2 q n

1 X

n

√ 1 + log N (F, k · kPn ,2 , ε)dε εi f (Xi ) ≤C



n

0 i=1 F ψ2 |X1 ,...,Xn Z 1q = CkF kPn ,2 1 + log N (F, k · kPn ,2 , εkF kPn ,2 )dε ≤ CkF kPn ,2 J(1, F, F ). 0

The conclusion follows from Fubini’s theorem and E[kF kpPn ,2 ] ≤ kF kpP,p∨2 , which follows from Jensen’s inequality. Hence when p = 1 (say), we have

# " n

1 X

E √ (f (Xi ) − P f ) ≤ C1 J(1, F, F )kF kP,2 .

n

i=1

F

24

The right hand side depends on the L2 (P )-norm of the envelope function F , which may be large compared with the maximum L2 (P )-norm of functions in F, namely, σ = sup kf kP,2 . f ∈F

In such a case, the following theorem will be more useful. Theorem 8 (van der Vaart and Wellner (2011); Chernozhukov et al. (2014)). Suppose that 0 < kF kP,2 < ∞, and let σ 2 > 0 be any positive constant such that supf ∈F P f 2 ≤ p σ 2 ≤ kF k2P,2 . Let δ = σ/kF kP,2 . Define B = E[max1≤i≤n F 2 (Xi )]. Then

# "   n

1 X

BJ 2 (δ, F, F )

√ , E √ (f (Xi ) − P f ) ≤ C J(δ, F, F )kF kP,2 +

n

δ2 n i=1

F

where C > 0 is a universal constant. A version of Theorem 8 is proved in van der Vaart and Wellner (2011) under the additional assumption that the envelope F is bounded; the current version is due to Chernozhukov et al. (2014). We first prove the following preliminary lemma. We will assume that J(δ, F, F ) < ∞ for some (and thus all) δ > 0 since otherwise the bound in Theorem 8 is trivial. Lemma 8. Write J(δ) for J(δ, F, F ). Then (i) the map δ 7→ J(δ) is concave; (ii) J(cδ) ≤ cJ(δ) for all c ≥ 1; (iii) the p map√(0, ∞) 3 δ 7→ J(δ)/δ is non-increasing; (iv) the map R+ × (0, ∞) 3 (x, y) 7→ J( x/y) y is concave. p Proof. Let λ(ε) = supQ 1 + log N (F, k · kQ,2 , εkF kQ,2 ). Part (i) follows from the fact that the map ε 7→ λ(ε) is non-increasing. Part (ii) follows from the inequality Z cδ Z δ Z δ λ(ε)dε = c λ(cε)dε ≤ c λ(ε)dε. 0

0

0

Part (iii) follows from the identity J(δ) = δ

Z

1

λ(δε)dε. 0

The proof of part (iv) uses some facts in convex analysis. Lemma 9. Let D be a convex subset of Rn , and let f : D → R be a concave function. Then the perspective (x, t) 7→ tf (x/t), {(x, t) ∈ Rn+1 : x/t ∈ D, t > 0} → R, is also concave. Proof of Lemma 9. We first verify that the set S = {(x, t) ∈ Rn+1 : x/t ∈ D, t > 0} is convex. Pick any (x, t), (y, u) ∈ S and λ ∈ [0, 1]. Then λt + (1 − λ)u > 0 and λx + (1 − λ)y λt x (1 − λ)u y = + ∈ D, λt + (1 − λ)u λt + (1 − λ)u t λt + (1 − λ)u u | {z } | {z } =:θ

=1−θ

25

so that λ(x, t) + (1 − λ)(y, u) ∈ S. Next, let g(x, t) = tf (x/t) for (x, t) ∈ D. Then using the above notation, we have g(λx + (1 − λ)y, λt + (1 − λ)u) = (λt + (1 − λ)u)f (θx/t + (1 − θ)y/u) ≥ (λt + (1 − λ)u){θf (x/t) + (1 − θ)f (y/u)} = λg(x, t) + (1 − λ)g(y, u). Hence g is concave. Lemma 10. Let D1 be a convex subset of Rn , and let gi : D1 → R, 1 ≤ i ≤ k be concave functions. Let D2 denote the convex hull of the set {(g1 (x), . . . , gk (x)) : x ∈ D}. Let h : D2 → R be concave and non-decreasing in each coordinate. Then f (x) = h(g1 (x), . . . , gk (x)), D1 → R, is concave. Proof of Lemma 10. The proof is straightforward and thus omitted. √ Going back to the proof of Lemma 8, let h(s, t) = J(s/t)t, g1 (x, y) = x, and √ g2 (x, y) = y. Then h is concave and non-decreasing in each coordinate, and gi , i = 1, 2 p √ are concave. Hence J( x/y) y = h(g1 (x, y), g(x, y)) is concave. Proof of Theorem 8. Throughout the proof, let C > 0 denote a universal constant of which the value may change from place to place. Without loss of generality, we may assume that F is everywhere positive. Let σn2 = supf ∈F Pn f 2 . For independent Rademacher random variables ε1 , . . . , εn independent of X1 , . . . , Xn , the symmetrization inequality (Theorem 1) implies that

#

# " " n n

1 X

1 X



E √ (f (Xi ) − P f ) ≤ 2E √ εi f (Xi ) .

n

n i=1

i=1

F

F

Further by Lemma 7, we have

# " Z σn q n

1 X

Eε √ εi f (Xi ) ≤C 1 + log N (F, k · kPn ,2 , ε)dε

n 0 i=1 F Z σn /kF kP ,2 q n = CkF kPn ,2 1 + log N (F, k · kPn ,2 , εkF kPn ,2 )dε 0

≤ CkF kPn ,2 J(σn /kF kPn ,2 ). Hence by Lemma 8 (iv) and Jensen’s inequality (Dudley, 1999, Theorem 10.2.6),

# " n

1 X p

√ εi f (Xi ) ≤ CkF kP,2 J( E[σn2 ]/kF kP,2 ). Z := E

n i=1

F

26

By the contraction principle (Corollary 1) and the Cauchy-Schwarz inequality,

n

# "

1 X

E[σn2 ] ≤ σ 2 + 8E max F (Xi ) εi f (Xi )

n

1≤i≤n i=1 F v  u s 

2  u X n

1 u

≤ σ 2 + 8 E max F 2 (Xi ) tE  εi f (Xi ) .

n

1≤i≤n i=1

F

By the Hoffmann-Jørgensen inequality (Theorem 4), v  u

2 

# ) ( " n n u X r

1 X 1 u  1



tE εi f (Xi ) εi f (Xi ) + E[ max F 2 (Xi )] , ≤C E

n

n 1≤i≤n n i=1

i=1

F

F

so that we have

p √ E[σn2 ] ≤ CkF kP,2 (∆ ∨ DZ), √ where ∆2 := max{σ 2 , n−1 B 2 }/kF k2P,2 ≥ δ 2 and D := B/( nkF k2P,2 ). Therefore, using Lemma 8 (ii), we have √ Z ≤ CkF kP,2 J(∆ ∨ DZ). We consider the following two cases: √ √ (i) DZ ≤ ∆. In this case, J(∆ ∨ DZ) ≤ J(∆), so that Z ≤ CkF kP,2 J(∆). Since the map δ 7→ J(δ)/δ is non-increasing (Lemma 8 (iii)),   J(∆) J(δ) BJ(δ) √ . J(∆) = ∆ ≤∆ = max J(δ), ∆ δ nδkF kP,2 Further, since J(δ)/δ ≥ J(1) ≥ 1, the last expression is bounded by   BJ 2 (δ) √ . max J(δ), nδ 2 kF kP,2 √ √ √ (ii) DZ ≥ ∆. In this case, J(∆∨ DZ) ≤ J( DZ), and since the map δ 7→ J(δ)/δ is non-increasing (Lemma 8 (iii)), √ √ √ J( DZ) √ J(∆) √ J(δ) ≤ DZ J( DZ) = DZ √ ≤ DZ . ∆ δ DZ Hence

√ J(δ) Z ≤ CkF kP,2 DZ , δ

that is Z ≤ CkF k2P,2 D

J 2 (δ) CBJ 2 (δ) √ 2 . = δ2 nδ

This completes the proof. 27

As a corollary to Theorem 8, we obtain an extension of Proposition 2.1 of Gin´e and Guillou (2001) to not necessarily uniformly bounded classes of functions. Corollary 4. Consider the same setting as in Theorem 8. Suppose that there exist constants A ≥ e and V ≥ 1 such that  V A sup N (F, k · kQ,2 , εkF kQ,2 ) ≤ , 0 < ∀ε ≤ 1. ε Q Then

# " "s   #  n

1 X

AkF k AkF k V B

P,2 P,2 V σ 2 log + √ log , E √ (f (Xi ) − P f ) ≤C

n

σ σ n i=1

F

where C > 0 is a universal constant. Proof. Observe that Z J(δ) ≤

δ

p √ Z 1 + V log(A/ε)dε ≤ A V

0



A/δ



1 + log ε dε. ε2

An integration by parts yields that, for c ≥ e,  √ ∞ Z ∞√ Z 1 + log ε 1 + log ε 1 ∞ 1 √ dε = − + dε 2 2 1 + log ε ε ε 2 ε c c c √ √ Z 1 + log c 1 ∞ 1 + log ε + dε, ≤ c 2 c ε2 by which we have Z c





√ √ √ 1 + log ε 2 1 + log c 2 2 log c dε ≤ ≤ . ε2 c c

Since A/δ ≥ A ≥ e, we have p √ J(δ) ≤ 2 2V δ log(A/δ). Applying Theorem 8, we obtain the desired conclusion.

28

3

Limit theorems

The multivariate central limit theorem implies that for a finite number of functions f1 , . . . , fm ∈ F, as long as P fj2 < ∞ for 1 ≤ j ≤ m, the sequence of random vectors √ √ ( n(Pn − P )f1 , . . . , n(Pn − P )fm ) converges weakly to a multivariate normal distribution. This section studies a “uniform” version of this result. Recall that we are (implicitly) assuming that supx∈F |f (x)| ≤ F (x) < ∞ for all x ∈ S; so if in addition √ supf ∈F |P f | < ∞, then the (scaled) empirical process n(Pn −P )f, f ∈ F can be viewed as a map from Ω into `∞ (F). Thus it would be natural to study conditions under which √ n(Pn − P )f, f ∈ F converges weakly in `∞ (F) to a Gaussian process. The question is, however, not that straightforward because generally the space `∞ (F) is non-separable and subtle measurability problems inevitably occur (see Remark 5 ahead). Hence we begin with discussing weak convergence of sample-bounded stochastic processes.

3.1

Weak convergence of sample-bounded stochastic processes

A random element in a metric space U is a Borel measurable map from Ω into U . For a metric space U , we always equip the Borel σ-field B(U ) induced by the metric topology of U . We say that a map X : Ω → U is Borel measurable if it is A/B(U ) measurable, that is, X −1 (B) ∈ A for all B ∈ B(U ). A random element X in U is said to be tight if for every ε > 0, there exists a compact set K ⊂ U such that P(X ∈ / K) ≤ ε. There are two other related notions to tightness, namely, pre-tightness and separability. A random element X in U is said to be pre-tight if for every ε > 0, there exists a totally bound, Borel measurable subset K of U such that P(X ∈ / K) ≤ ε; X is said to be separable if there exists a separable Borel subset C of U such that P(X ∈ C) = 1. Lemma 11. Let X be a random element in U , and consider the following three statements: (i) X is tight; (ii) X is pre-tight; (iii) X is separable. Then in general we have: (i) ⇒ (ii) ⇔ (iii). If U is complete, then: (i) ⇔ (ii) ⇔ (iii). Proof. (i) ⇒ (ii): Trivial. (ii) ⇒ (iii): Let X be a pre-tight random element in U . For each i ∈ N, pick a totally bounded Borel measurable subset Ki of U such that P(X ∈ Ki ) ≥ 1 − 1/i. Then S K , X concentrates on C := ∞ i=1 i that is, P(X ∈ C) = 1. Now, because each Ki is separable (as it is totally bounded), the countable union C is also separable. (ii) ⇐ (iii): Let C be a separable Borel measurable subset of U such that P(X ∈ C) = 1. Let {xm } be a countable dense subset in C. For every δ > 0, the balls Bm (δ) with center xm and radius δ cover C, and X concentrates on the union of these balls. Thus for every ε > 0, there exist finitely many such balls Bm1 (δ), . . . , Bmk (δ) S with P(X ∈ kl=1 Bml (δ)) ≥ 1 − ε. Now, for every i ∈ N, there exist finitely many balls with radius 1/i whose union,Tdenoted by Gi , is such that P(X ∈ Gi ) ≥ 1 − ε/2i . The intersection of these sets G = ∞ i=1 Gi is totally bounded and such that P(X ∈ G) ≥ 1−ε. If U is complete, then the closure of a totally bounded subset of U is compact (Billingsley, 1968, p.217); hence we have: (ii) ⇒ (i).

29

Remark 2. In particular, if U is complete and separable (that is, if U is Polish), then any random element in U is tight, which is known as Ulam’s theorem. A stochastic process X(t), t ∈ T indexed by a non-empty set T is said to be samplebounded if supt∈T |X(t, ω)| < ∞ for all ω ∈ Ω. In this case, X can be viewed as a map from Ω into `∞ (T ). The question here is how to properly define weak convergence of a sequence Xn of sample-bounded stochastic processes. There is no difficulty in doing so as soon as we assume that Xn are random elements in `∞ (T ). However, in many applications, it is too stringent to assume that Xn are random elements in `∞ (T ); in particular, the classical empirical process is in general not Borel measurable as a map into `∞ ([0, 1])! Example 4. Let X1 , . . . , Xn be independent uniform random variables on [0, 1]. The empirical distribution function Fn is defined by n

Fn (t, ω) =

1X 1[0,t] (Xi (ω)), 0 ≤ t ≤ 1. n i=1

Then the map ω 7→ Fn (t, ω), Ω → `∞ ([0, 1]) is in general not Borel measurable. To see this, consider the simple case where n = 1, and let Y (t, ω) = 1[0,t] (X1 (ω)) = 1[X1 (ω),1] (t), t ∈ [0, 1]. Let Bs be the open ball in `∞ ([0, 1]) with center 1[s,1] and radius 1/2. Then Y (·, ω) ∈ Bs if and only if X1 (ω) = s, so that, for every subset A of [0, 1], {ω : Y (·, ω) ∈ ∪s∈A Bs } = {X1 ∈ A}. Meanwhile, since ∪s∈A Bs is open in `∞ ([0, 1]), the map ω 7→ Y (·, ω) can not be Borel measurable when Ω = [0, 1], A is the collection of Borel subsets of [0, 1] (or Lebesgue measurable sets in [0, 1]), and X1 (ω) = ω (take A to be any Lebesgue non-measurable subset in [0, 1]). Even if we allow (Ω, F, P) to be a more general probability space, if the map ω 7→ X1 (ω) were Borel measurable, then λ = P ◦ X1−1 would be a measure defined for every subset of [0, 1] with the property that λ((a, b]) = b − a; but assuming the continuum hypothesis, no such λ exists (Dudley (2002), Appendix C). A “traditional” remedy is to think of Fn as a map into D([0, 1]), the space of all right continuous functions with left limits (c`adl`ag functions), and equip D([0, 1]) with the Skorohod topology, with which D([0, 1]) is Polish; then every stochastic process with c`adl`ag sample paths is automatically a tight random element in D([0, 1]) (see Billingsley, 1968, Chapter 3). For general empirical processes, however, such a remedy does not always apply. Remark 3. A sample-bounded stochastic process defines a probability measure on the cylinder σ-field on `∞ (T ), which is the smallest σ-field for which every coordinate projection f 7→ f (t), t ∈ T , is measurable. Since every coordinate projection f 7→ f (t) is continuous from `∞ (T ) into R, on `∞ (T ), the cylinder σ-field is included in the Borel σ-field. The example above shows that the inclusion is strict when T = [0, 1]. Henceforth we allow Xn not to be random elements in `∞ (T ). To define weak convergence of not necessarily Borel measurable maps from Ω into `∞ (T ), we will use the 30

outer expectation. For an arbitrary map Y : Ω → [−∞, ∞] = R, the outer expectation of Y with respect to P is defined by E∗ [Y ] = inf{E[W ] : W ≥ Y, W : Ω → R measurable and E[W ] exists}. The outer expectation is defined for all maps Y : Ω → R (as we may take W = +∞). Further, let P∗ and P∗ denote the outer and inner probabilities for P, respectively, that is, for any A ⊂ Ω, P∗ (A) := inf{P(B) : B ⊃ A, B ∈ A}, P∗ (A) := 1 − P∗ (Ac ). We will prove a few properties of the outer expectation at the end of this subsection (see Lemma 15). We refer to van der Vaart and Wellner (1996, Section 1.2) for further details about the outer expectation and the outer/inner probabilities. Definition 3 (Hoffmann-Jørgensen (1991)). Let T be a non-empty set, and let Xn (t), t ∈ T be a sequence of sample-bounded stochastic processes. We say that Xn , viewed as maps from Ω into `∞ (T ), converge weakly to a tight random element X in `∞ (T ), denoted w by Xn → X in `∞ (T ), if lim E∗ [H(Xn )] = E[H(X)] n→∞

for every bounded continuous function H from `∞ (T ) into R. In this definition, the limit process X must be a tight random element in `∞ (T ), which ensures that the expectation of H(X) is well-defined. On the other hand, the outer expectation is needed to properly define the “expectation” of H(Xn ) since Xn may not be Borel measurable as maps from Ω into `∞ (T ). Remark 4. The tightness of the limit process may be dropped from the definition. However, there are no known examples of nonseparable Borel measures (van der Vaart and Wellner, 1996, p.24), and so this restriction does not lose great generality. We note that the weak convergence in the sense of Definition 3 is no longer tied with convergence of probability measures, simply because Xn need not induce Borel probability measures on `∞ (T ). Further, this weak convergence depends on the underlying probability measure P, because the outer expectation depends on P. However, in many statistical applications, we are interested in approximating the distribution of H(Xn ) for some continuous functional H on `∞ (T ), and even when Xn is not Borel measurable as w a map into `∞ (T ), it may happen that H(Xn ) is measurable. In that case, if Xn → X w ∞ ∞ in ` (T ) for some tight random element X in ` (T ), then H(Xn ) → H(X) in the usual sense. In what follows, we shall study characterizations of weak convergence of samplebounded stochastic processes. We first prove the following theorem, which gives a characterization of a sample-bounded stochastic process indexed by a non-empty set T to be a tight random element in `∞ (T ).

31

Theorem 9 (Andersen and Dobri´c (1987)). Let X(t), t ∈ T be a sample-bounded stochastic process indexed by a non-empty set T . Then X is a tight random element in `∞ (T ) if and only if there exists a semi-metric d for which (T, d) is totally bounded and such that almost all sample paths of X are uniformly d-continuous. Proof. Suppose first that X is a tight random element in `∞ (T ). Let µ be the distribution of X, which is a Borel probability measure on `∞ (T ). Since X is tight,Sthere exists a non-decreasing sequence Kn of compact subsets of `∞ (T ) such that µ( ∞ n=1 Kn ) = 1. Define dn (s, t) := sup{|f (t) − f (s)| : f ∈ Kn }. We shall prove that the semi-metric d on T defined by ∞ X d(s, t) = 2−n (1 ∧ dn (s, t)) n=1

P −n ≤ makes (T, d) totally bounded. Let ε > 0. Choose m in such a way that ∞ n=m+1 2 ε/4, and let {f1 , . . . , fr } be an ε/4-net of Km for the sup norm, that is, for every f ∈ Km , there exists a function fi such that kf −fi kT ≤ ε/4 (such a net exists by the compactness of Km ). The subset A of Rr defined by A = {(f1 (t), . . . , fr (t)) : t ∈ T } is bounded, and so totally bounded. Hence there exists a set Tε := {tj : 1 ≤ j ≤ N } such that for every t ∈ T there exists a point tj with max1≤i≤r |fi (t) − fi (tj )| ≤ ε/4. The set Tε is an ε-net of T for d. Indeed, for t ∈ T, tj ∈ Tε as above, we have dm (t, tj ) = sup |f (t) − f (tj )| ≤ max |fi (t) − fi (tj )| + ε/2 ≤ 3ε/4, f ∈Km

1≤i≤r

Pm −n dn (t, tj ) + ε/4 which implies that d(t, tj ) ≤ n=1 2 S ≤ ε. This shows that (T, d) is totally bounded. Further, every function f ∈ K := ∞ n=1 Kn is uniformly d-continuous since f ∈ K means that f ∈ Kn for some n, and |f (t) − f (s)| ≤ dn (s, t) ≤ 2n d(s, t) whenever d(s, t) ≤ 1. Since P(X ∈ K) = µ(K) = 1, we conclude that almost all sample paths of X are uniformly d-continuous. Conversely, suppose that (T, d) is totally bounded and X has sample paths almost surely uniformly d-continuous. There exists an event Ω0 ⊂ Ω with P(Ω0 ) = 1 such that ˜ ω) = the map t 7→ X(t, ω) is uniformly d-continuous for every ω ∈ Ω0 . Define X(·, c ˜ X(·, ω) for ω ∈ Ω0 and X(·, ω) = 0 for ω ∈ Ω0 . Then X is a tight random element ˜ is so (we implicitly assume here that (Ω, A, P) is complete), in `∞ (T ) if and only if X ˜ ˜ is a tight and X can be viewed as a map from Ω into Cu (T, d). We shall prove that X random element in Cu (T, d). To this end, we shall prove the following lemma. Lemma 12. Let (T, d) be a totally bounded semi-metric space. Then (i) Cu (T, d) is complete and separable (that is, Cu (T, d) is a separable Banach space with respect to the uniform norm k · kT ); and (ii) on Cu (T, d), the Borel σ-field coincides with the the cylinder σ-field (the cylinder σ-field is the smallest σ-field for which every coordinate projection f 7→ f (t), t ∈ T , is measurable). Proof of Lemma 12. For the notational convenience, write Cu (T ) instead of Cu (T, d). Part (i): It is not difficult to verify that Cu (T ) is complete. To prove separability of Cu (T ), we first show that may assume that d is a metric. To see this, define the 32

equivalence relation ∼ on T by s ∼ t ⇔ d(s, t) = 0, and let [t] = {s ∈ T : s ∼ t} for the equivalence class of t ∈ T . Define the metric on the quotient space T / ∼= {[t] : t ∈ T } ˆ by d([s], [t]) = d(s, t) for s, t ∈ T . Then dˆ is indeed a well-defined metric on T / ∼ and the map T 3 t 7→ [t] ∈ T / ∼ is isometrically isomorphic; in particular T / ∼ is totally bounded. For each f ∈ Cu (T ), define fˆ([t]) = f (t) for t ∈ T ; then fˆ is well-defined since f (s) = f (t) whenever s ∼ t by continuity of f , and fˆ is uniformly continuous. Further, the restriction map Cu (T ) 3 f 7→ fˆ 3 Cu (T / ∼) is isometrically isomorphic, and so separability of Cu (T ) follows from that of Cu (T / ∼). Hence we may assume that d is a metric. Suppose for a while that T is compact. In this case a continuous function on T is automatically (bounded and) uniformly continuous, and thus we drop the subscript “u” in Cu (T ) and write C(T ) for the space of all continuous functions on T equipped with the uniform norm k · kT . Let {tm } be a countable dense subset of T , and consider the map h : T → [0, 1]N defined by   d(t, tm ) h(t) = . 1 + d(t, tm ) m∈N Then h is a homeomorphism from T onto h(T ) where [0, 1]N is equipped with the product topology ([0, 1]N is actually a compact metric space). So it suffices to show that C(T ) is separable in the case where T is a closed subset of [0, 1]N . In this case, the set of finite linear combinations of monomials of the form (x1 , x2 , . . . ) 7→ xα1 1 · · · xαnn with non-negative integers α1 , . . . , αn is dense in C(T ) by the Stone-Weierstrass theorem (Dudley, 2002, Theorem 2.4.11), and so the set of those linear combinations with rational coefficients is dense in C(T ), which shows that C(T ) is separable when T is compact. In the case where T is totally bounded (but not compact), let T¯ be a completion3 of T ; then T¯ is compact by total boundedness of T . Further, C(T¯) is separable and the map C(T¯) 3 f 7→ f |T ∈ Cu (T ) is isometrically isomorphic (cf. Lemma 5); thus Cu (T ) is separable. Part (ii): Finally we shall show that on Cu (T ), the Borel σ-field coincides with the cylinder σ-field. We begin with noting that since every coordinate projection f 7→ f (t) is continuous from Cu (T ) into R, the latter is included in the former. So we have to show the converse. Let T0 be a countable dense subset of T . Then for any given g ∈ Cu (T ), we have \ {f ∈ Cu (T ) : kf − gkT ≤ ε} = {f ∈ Cu (T ) : |f (t) − g(t)| ≤ ε}, t∈T0

so that every closed ball and thus every open ball in Cu (T ) belong to the cylinder σ-field. Since Cu (T ) is separable, every open set in Cu (T ) can be written as a countable union of open balls, and so is a member of the cylinder σ-field. This completes the proof. ˜ is a random element in Cu (T, d). In addition, since Cu (T, d) This lemma shows that X ˜ is complete and separable, X is tight (by Lemma 11). Since the inclusion map Cu (T, d) → 3 ¯ and an isometry ¯ , d) For any (semi-)metric space (U, d), there exist a compete (semi-)metric space (U ¯ ¯ such that ϕ(U ) is dense in U ¯ . Then U ¯ is called a completion of U . Since (U, d) and (ϕ(U ), d) ϕ:U →U ¯. are isometrically isomorphic, we may identify U with ϕ(U ) and think of U as a subset of U

33

˜ is also a tight random element in `∞ (T ). This completes the `∞ (T ) is continuous, X whole proof. An immediate corollary of this theorem (actually of its proof) is the following. Corollary 5. The distribution of a tight random element in `∞ (T ), which is defined on the Borel σ-field of `∞ (T ), is uniquely determined by its finite dimensional distributions. Remark 5. The proof of Theorem 9 reveals that a major difficulty in dealing with `∞ (T ) is the fact that the space `∞ (T ) is generally not separable. In fact, `∞ (T ) is not separable whenever T is infinite. To see this, it is enough to consider the case where T = N. Denote by x(i) , i ∈ I the set of all 0-1 sequences, such as (0, 1, 1, 0, . . . ). The set 0 I is uncountable and kx(i) − x(i ) kN = 1 whenever i 6= i0 . Denote by Bi the open ball in `∞ (N) with center x(i) and radius 1/2. Then the balls Bi , i ∈ I are disjoint, so that `∞ (T ) can not be separable. In our applications, the limit process X is Gaussian, that is, for every t1 , . . . , tm ∈ T and m ∈ N, the joint distribution of X(t1 ), . . . , X(tm ) is normal. In such a case, the semi-metric d in Theorem 9 can be taken to be ρp (s, t) := (E[|X(t) − X(s)|p ])1/p for any 1 ≤ p < ∞ (a typical choice is p = 2). Theorem 10 (Andersen and Dobri´c (1987)). Let X(t), t ∈ T be a Gaussian process indexed by a set T . For 1 ≤ p < ∞, define the semi-metric ρp on T by ρp (s, t) := (E[|X(t) − X(s)|p ])1/p , s, t ∈ T . Then X is a tight random element in `∞ (T ) if and only if (T, ρp ) is totally bounded and X has sample paths almost surely uniformly ρp -continuos. Remark 6. Precisely speaking, if (T, ρp ) is totally bounded and X has sample paths almost surely uniformly ρp -continuous, then X is almost surely bounded. Choose an event Ω ⊂ Ω0 with P(Ω0 ) = 1 such that supt∈T |X(t, ω)| < ∞ for all ω ∈ Ω0 . Define ˜ ω) = X(·, ω) for ω ∈ Ω0 and X(·, ˜ ω) = 0 for ω ∈ Ωc . Then X ˜ is a version of X that X(·, 0 ∞ ˜ is a tight random element in ` (T ). We do not distinguish between X and X. Proof of Theorem 10. The “only if” part follows from Theorem 9. Conversely, suppose that X is a tight random element in `∞ (T ), and choose a semi-metric d for which (T, d) is totally bounded and such that almost all sample paths of X are uniformly d-continuous. Let sn and tn be two sequences in T such that d(sn , tn ) → 0. Then X(tn ) − X(sn ) → 0 almost surely and so in distribution. Since X(tn ) − X(sn ) are Gaussian, this means that E[|X(tn ) − X(sn )|p ] → 0, that is, ρp (sn , tn ) → 0, which implies that d(s, t) → 0 ⇒ ρp (s, t) → 0.

(18)

The total boundedness of (T, d) thus implies the total boundedness of (T, ρp ). It remains to show that almost all sample paths of X are uniformly ρp -continuous. Without loss of generality, we may assume that all sample paths of X are uniformly d¯ be a completion of (T, d), which is compact by total boundedness continuous. Let (T¯, d) of T . Then the process X has the unique continuous extension to T¯ (by Lemma 5), and we denote this extension by the same symbol X. 34

Let ω ∈ Ω. If the path t 7→ X(t, ω) is not uniformly ρp -continuous, then there exist ε > 0 and sequences sn and tn such that ρp (sn , tn ) → 0 and |X(sn , ω) − X(tn , ω)| ≥ ε for ¯ is compact, sn and tn have subsequences sn0 and tn0 that converge in all n. Since (T¯, d) ¯ d to limits s and t (say) in T¯, respectively. Then since the path t 7→ X(t, ω) is uniformly ¯ d-continuous, we have that |X(s, ω) − X(t, ω)| = limn0 |X(sn0 , ω) − X(tn0 , ω)| ≥ ε. In addition, because of (18), we have ρp (sn0 , s) → 0 and ρp (tn0 , t) → 0, and so ρp (s, t) = 0 by the triangle inequality. The discussion so far shows that {ω ∈ Ω : t 7→ X(t, ω) is not uniformly ρp -continuous} ⊂ {ω ∈ Ω : ∃s, t ∈ T¯ s.t. ρp (s, t) = 0 but X(s, ω) 6= X(t, ω)} =: N. ¯ It suffices to show that N is a nullset. Take a countable d-dense subset A of {(s, t) ∈ ¯ T¯ × T¯ : ρp (s, t) = 0}. Then since all sample paths of X are d-continuous, N reduces to N = {ω ∈ Ω : ∃(s, t) ∈ A s.t. X(s, ω) 6= X(t, ω)}. By the definition of ρp , for every fixed (s, t) ∈ A, we have X(s, ω) = X(t, ω) for almost all ω, by which we conclude that N is a nullset. This completes the proof. For a tight random element X in a Banach space B, we say that X is Gaussian if F (X) is Gaussian for every F ∈ B ∗ , where B ∗ is the dual of B, that is, B ∗ is the set of all continuous linear functionals on B. Recall that (`∞ (T ), k · kT ) is a real Banach space. If X = (X(t))t∈T is a Gaussian process and at the same time a tight random element in `∞ (T ), it would be natural to ask whether X is Gaussian as a random element in `∞ (T ). In fact, it turns out that the concepts of Gaussianity as a random element in `∞ (T ) and as a stochastic process are equivalent for tight random elements in `∞ (T ). Hence, we will not distinguish these two. Lemma 13. Let X = (X(t))t∈T be a tight random element in `∞ (T ). Then X is Gaussian as a random element in `∞ (T ) if and only if X is a Gaussian process. ∞ Proof. The “only if” part is trivial. If X is a Gaussian random element Pnin ` (T ), then ∞ for every a1 , . . . , an ∈ R and t1 , . . . , tnP∈ T , the map `P(T ) 3 f 7→ i=1 ai πti (f ) is a continuous linear functional, so that ni=1 ai πti (X) = ni=1 ai X(ti ) is Gaussian. This implies that (X(t1 ), . . . , X(tn )) is jointly Gaussian. Now, we turn to the “‘if” part. Pick any semimetric d on T such that (T, d) is totally bounded and X has sample paths almost surely uniformly d-continuous. Since P{X ∈ Cu (T )} = 1 and the restriction of each element in `∞ (T )∗ to Cu (T ) is an element ¯ of Cu (T )∗ , it suffices to show that F (X) is Gaussian for every F ∈ Cu (T )∗ . Let (T¯, d) be a completion of (T, d). Each f ∈ Cu (T ) extends uniquely to a continuous function f¯ on T¯, and the extension f 7→ f¯ is a linear isometric isomorphism from Cu (T ) onto ¯ Hence, each F ∈ Cu (T )∗ corresponds to a unique element F¯ in C(T¯)∗ C(T¯) = Cu (T¯, d). by F (f ) = F¯ (f¯) for all f ∈ Cu (T ). Now, since T¯ is compact, the Riesz representation ∗ theorem shows that for every R F ∈ Cu (T ) , there exists a finite signed Borel measure ¯ ¯ µ on T such that F (f ) = T¯ f dµ for all f ∈ Cu (T ). By compactness of T¯, there exist

35

S m m disjoint sets Bjm , j = 1, . . . , Nm in T¯ with radius at most 1/m such that T¯ = N j=1 Bj , m m m ¯ and pick any tj ∈ Bj . Since T is dense in T , we may assume that tj ∈ T . Define P m m m m m µm = N j=1 aj δtj with aj = µ(Bj ), where δt denotes the Dirac measure at t. Observe that Z Z Nm X m ¯ ¯ F (f ) = f dµ = lim f dµm = lim am j f (tj ) T¯

m



m

j=1

P m m m for every f ∈ Cu (T ). Therefore, we have that F (X) = limm N j=1 aj X(tj ), and since the right hand side is a limit of Gaussian random variables, F (X) is Gaussian. We are now in position to state the following theorem, which leads to a necessary and sufficient condition for weak convergence of sample-bounded stochastic processes. Theorem 11 (Andersen and Dobri´c (1987)). Let Xn (t), t ∈ T be a sequence of samplebounded stochastic processes. Then the following statements are equivalent: (i) Xn converges weakly to a tight random element X in `∞ (T ). (ii) For every t1 , . . . , tm ∈ T and every m ∈ N, (Xn (t1 ), . . . , Xn (tm )) converges in distribution, and there exists a semi-metric d for which (T, d) is totally bounded and such that for every ε > 0, ( ) lim lim sup P∗ δ↓0

n→∞

sup

|Xn (t) − Xn (s)| > ε

= 0.

(19)

d(s,t)<δ;s,t∈T

If (ii) holds, then the limit process X in (i) has sample paths almost surely uniformly d-continuous. In addition, if X in (i) has sample paths almost surely uniformly ρcontinuous for some semi-metric ρ that makes (T, ρ) totally bounded, then the semimetric d in condition (19) can be taken to be d = ρ. Condition (19) will be referred to as the asymptotic equicontinuity condition. For proving (ii) ⇒ (i) in Theorem 11, we will use the following version of the Portmanteau theorem. Theorem 12 (Portmanteau theorem). Let Xn be sample-bounded stochastic processes, and let X be a tight random element in `∞ (T ). Then the following statements are equivalent. w

(i) Xn → X in `∞ (T ). (ii) For every closed set F ⊂ `∞ (T ), lim supn P∗ (Xn ∈ F ) ≤ P(X ∈ F ). (iii) For every open set G ⊂ `∞ (T ), lim inf n P∗ (Xn ∈ G) ≥ P(X ∈ G). (iv) For every Borel set A ⊂ `∞ (T ) such that P(X ∈ ∂A) = 0, limn P∗ (Xn ∈ A) = limn P∗ (Xn ∈ A) = P(X ∈ A). 36

We defer the proof of Theorem 12 after the proof of Theorem 11. Proof of Theorem 11. (ii) ⇒ (i): The proof consists of two steps. Step 1: We shall prove that there exists a tight random element X in `∞ (T ) such that for every t1 , . . . , tm ∈ T and every m ∈ N, w

(Xn (t1 ), . . . , Xn (tm )) → (X(t1 ), . . . , X(tm ))

(20)

as n → ∞. It is not difficult to verify consistency of the weak limits of the finite dimensional distributions of Xn , and thus Kolmogorov’s extension theorem (Dudley, 2002, Theorem 12.1.3) ensures the existence of a stochastic process X(t), t ∈ T such that (20) holds. We need to verify that X admits a version that is a tight random element in `∞ (T ). Let T0 be a countable d-dense subsetS of T , and let Tk , k = 1, 2, . . . be an increasing sequence of finite subsets of T0 such that ∞ k=1 Tk = T0 . Then the Portmanteau theorem implies that     P max |X(t) − X(s)| > ε ≤ lim inf P max |Xn (t) − Xn (s)| > ε n→∞ d(s,t)<δ;s,t∈Tk d(s,t)<δ;s,t∈Tk   ≤ lim inf P max |Xn (t) − Xn (s)| > ε . n→∞

d(s,t)<δ;s,t∈T0

Taking k → ∞ on the far left hand side, we conclude that    P max |X(t) − X(s)| > ε ≤ lim inf P max n→∞

d(s,t)<δ;s,t∈T0

d(s,t)<δ;s,t∈T0

 |Xn (t) − Xn (s)| > ε .

By the asymptotic equicontinuity condition (19), there exists a sequence δr > 0 with δr ↓ 0 as r → ∞ such that   P max |X(t) − X(s)| > ε ≤ 2−r . d(s,t)<δr ;s,t∈T0

By the Borel-Cantelli lemma, there exists an event Ω0 ⊂ Ω with P(Ω0 ) = 1 such that for every ω ∈ Ω0 , there exists r = r(ω) with max

|X(t, ω) − X(s, ω)| ≤ ε.

d(s,t)<δr ;s,t∈T0

This shows that X|T0 , the restriction of X to T0 , has sample path almost surely uniformly ˜ ˜ ω) = X(t, ω), t ∈ T0 , ω ∈ d-continuous. Define the stochastic process X(t), t ∈ T0 by X(t, ˜ ˜ Ω0 and X(t, ω) = 0, t ∈ T0 , ω ∈ / Ω0 . The process X has the continuous extension to T by ˜ on T0 (cf. Lemma 5). Denote this extension by the same symbol uniform continuity of X ˜ Then X ˜ is a version of X and all sample paths of X ˜ are uniformly d-continuous. By X. Theorem 9, the desired claim follows. w Step 2: We now prove Xn → X in `∞ (T ). Since (T, d) is totally bounded, for SN (τ ) every τ > 0, there exists a finite set of points t1 , . . . , tN (τ ) such that T = i=1 B(ti , τ ), 37

where B(t, τ ) = {s ∈ T : d(s, t) < τ }. Then for every t ∈ T we can choose a point πτ (t) ∈ {t1 , . . . , tN (τ ) } such that d(πτ (t), t) < τ . Define the processes Xn,τ and Xτ by Xn,τ (t) = Xn (πτ (t)), Xτ (t) = X(πτ (t)), t ∈ T. Since πτ (t) takes only N (τ ) values, the finite dimensional convergence implies that w

Xn,τ → Xτ in `∞ (T ).

(21)

To see this, let Ti = {t ∈ T : ti = πτ (t)}. Then {Ti } forms a partition of T and PN (τ ) πτ (t) = ti whenever t ∈ Ti . Since the map Πτ : (a1 , . . . , aN (τ ) ) 7→ i=1 ai 1Ti (t) is continuous from RN (τ ) into `∞ (T ), the finite dimensional convergence implies that for any bounded continuous function H : `∞ (T ) → R, E[H(Xn,τ )] = E[H ◦ Πτ (Xn (t1 ), . . . , Xn (tN (τ ) ))] → E[H ◦ Πτ (X(t1 ), . . . , X(tN (τ ) ))] = E[H(Xτ )], which proves (21). In addition, since X has sample paths almost surely uniformly continuous, lim kXτ − XkT = 0 a.s. τ ↓0

Let H : `∞ (T ) → R be a bounded continuous function. Then |E∗ [H(Xn )] − E[H(X)]| ≤ |E∗ [H(Xn )] − E[H(Xn,τ )]| + |E[H(Xn,τ )] − E[H(Xτ )]| + |E[H(Xτ )] − E[H(X)]| =: In,τ + IIn,τ + IIIτ . We have seen that limn→∞ IIn,τ = 0 for each fixed τ > 0 and limτ ↓0 IIIτ = 0. Hence it remains to prove that limτ ↓0 lim supn→∞ In,τ = 0. We will use the following lemma. Lemma 14. Let (U, d) be a metric space, and let f : U → R be a continuous function. Let K ⊂ U be a compact subset. Then for every ε > 0 there exists δ > 0 such that d(u, v) < δ, u ∈ K, v ∈ U ⇒ |f (u) − f (v)| < ε. Proof of Lemma 14. Suppose on the contrary that the assertion is false. Then there exist ε > 0 and sequences un ∈ K and vn ∈ U such that d(un , vn ) → 0 and |f (un )−f (vn )| ≥ ε. Since K is compact, there exists a subsequence un0 of un such that un0 has a limit u in K. Then vn0 → u and, by continuity of f , |f (un0 ) − f (vn0 )| → |f (u) − f (u)| = 0, which is a contradiction. Pick any ε > 0. Since X is tight, there exists a compact set K ⊂ `∞ (T ) such that P(X ∈ K c ) ≤ ε. Choose δ > 0 in such a way that kf − gkT < δ, f ∈ K, g ∈ `∞ (T ) ⇒ |H(f ) − H(g)| < ε. 38

Let K δ/2 = {f ∈ `∞ (T ) : inf g∈K kf − gkT < δ/2}. Observe that (kXn − Xn,τ kT < δ/2) ∧ (Xn,τ ∈ K δ/2 ) ⇒ (kXn − Xn,τ kT < δ/2) ∧ (∃u ∈ K s.t. ku − Xn,τ kT < δ/2) ⇒ ∃u ∈ K s.t. (ku − Xn kT < δ) ∧ (ku − Xn,τ kT < δ/2) ⇒ ∃u ∈ K s.t. |H(Xn ) − H(Xn,τ )| ≤ |H(Xn ) − H(u)| + |H(u) − H(Xn,τ )| < 2ε, from which we have |E∗ [H(Xn )] − E[H(Xn,τ )]| h n oi ≤ 2kHk∞ P∗ {kXn − Xn,τ kT ≥ δ/2} + P Xn,τ ∈ (K δ/2 )c + 2ε. Then, on the one hand, the asymptotic equicontinuity condition (19) yields that lim lim sup P∗ (kXn − Xn,τ kT ≥ δ/2) = 0. τ ↓0

n→∞

On the other hand, since δ/2 c Xn,τ ∈ (K δ/2 )c ⇔ (Xn (t1 ), . . . , Xn (tN (τ ) )) ∈ Π−1 ) ), τ ((K δ/2 )c ) is a closed subset in RN (τ ) , the Portmanteau theorem yields that and Π−1 τ ((K o o n n lim sup P Xn,τ ∈ (K δ/2 )c ≤ P Xτ ∈ (K δ/2 )c . n→∞

Further, since limτ ↓0 kXτ − XkT = 0 almost surely, we have o n lim sup P Xτ ∈ (K δ/2 )c ≤ P(X ∈ K c ) ≤ ε. τ ↓0

Therefore, we conclude that lim sup lim sup |E∗ [H(Xn )] − E[H(Xn,τ )]| ≤ (2kHk∞ + 2)ε. τ ↓0

n→∞

This completes the proof for (ii) ⇒ (i). (i) ⇒ (ii): By Theorem 9, there exists a semi-metric d for which (T, d) is totally bounded and X has sample paths almost surely uniformly d-continuous. Let Fδ,ε = {f ∈ `∞ (T ) : supd(s,t)<δ |f (t) − f (s)| ≥ ε}. Then Fδ,ε is a closed subset in `∞ (T ), and so by the Portmanteau theorem stated above, ( ) lim sup P∗ n→∞

|Xn (t) − Xn (s)| ≥ ε

sup

= lim sup P∗ (Xn ∈ Fδ,ε )

d(s,t)<δ;s,t∈T

n→∞

(

)

≤ P(X ∈ Fδ,ε ) = P

|X(t) − X(s)| ≥ ε .

sup d(s,t)<δ;s,t∈T

The far right hand side goes to 0 as δ ↓ 0. This completes the proof for (i) ⇒ (ii). 39

We now turn to proving Theorem 12. To this end, we will need a few properties of the outer expectation, summarized as follows. Lemma 15. Let Y : Ω → R be any map. Then (i) there exists an a.s.-unique measurable map Y ∗ : Ω → R such that Y ∗ ≥ Y and if W : Ω → R is measurable and W ≥ Y a.s., then W ≥ Y ∗ a.s. We call Y ∗ the measurable cover of Y . (ii) Provided that E[Y ∗ ] exists, we have E∗ [Y ] = E[Y ∗ ]. (iii) For any x ∈ R, P∗ (Y > x) = P(Y ∗ > x). Proof of Lemma 15. (i). Let W = {W : Ω → R : W measurable and W ≥ Y }, and let  x = −∞  −1 x ϕ(x) = 1+|x| x ∈ R .   1 x = +∞ Choose a sequence Wm ∈ W such that E[ϕ(Wm )] ↓ inf W ∈W E[ϕ(W )] =: α ≥ −1. Define Y ∗ = lim min Wm . n 1≤m≤n

Then Y ∗ is a measurable map from Ω into R such that Y ∗ ≥ Y . Furthermore,      E ϕ min Wm = E min ϕ(Wm ) ≤ E[ϕ(Wn )], 1≤m≤n

1≤m≤n

and so the dominated convergence theorem yields that E[ϕ(Y ∗ )] ≤ α. Hence we have E[ϕ(Y ∗ )] = α by the definition of α. For any W ∈ W, min{W, W1 , . . . , Wn } ↓ W ∧ Y ∗ , and so we have E[ϕ(W ∧ Y ∗ )] = α = E[ϕ(Y ∗ )], which implies that W ≥ Y ∗ a.s. since ϕ is strictly increasing. (ii). If E[Y ∗ ] exists, by the definition of the outer expectation, we have E∗ [Y ] ≤ E[W ∗ ], but by the definition of the measurable cover, the reverse inequality also holds. (iii). We first note that for any A ⊂ Ω, there exists a set B ∈ A with B ⊃ A such that P∗ (A) = P(B). Indeed, T choose a sequence Bn ∈ A with Bn ⊃ A such that P(Bn ) ↓ P∗ (A), and take B = n Bn . Next, since {Y > x} ⊂ {Y ∗ > x}, we have P∗ (Y > x) ≤ P(Y ∗ > x). To prove the reverse inequality, pick a set A ∈ A with A ⊃ {Y > x} such that P∗ (Y > x) = P(A). Then Y ∗ ≤ x almost surely on Ac since otherwise W = Y ∗ 1A + x1Ac would be a measurable map from Ω into R such that W ≥ Y and W < Y ∗ with positive probability, which violates the definition of the measurable cover. Thus we have P(Y ∗ > x) ≤ P(A) = P∗ (Y > x). Proof of Theorem 12. We will prove the following implications: (i) ⇒ (ii) ⇔ (iii), (ii) + (iii) ⇒ (vi) ⇒ (i). The equivalence between (ii) and (iii) follows from taking complements. For (ii) + (iii) ⇒ (vi), observe that lim sup P∗ (Xn ∈ A) ≤ lim sup P∗ (Xn ∈ A) ≤ P(X ∈ A) n

n

= P(X ∈ A◦ ) ≤ lim inf P∗ (Xn ∈ A◦ ) ≤ lim inf P∗ (Xn ∈ A), n

n

40

where A and A◦ denote the closure and the interior of A, respectively. (i) ⇒ (ii): Let F ⊂ `∞ (T ) be closed, and define η : R → [0, 1] by   if x < 0 1 η(x) = 1 − x if 0 ≤ x ≤ 1 .   0 if x > 1 Further, for any ε > 0, let Hε (f ) = η(ε−1 inf g∈F kf − gkT ) for f ∈ `∞ (T ). Then Hε is bounded and continuous, and 1F ≤ Hε ≤ 1F ε] where F ε] = {f ∈ `∞ (T ) : inf g∈F kf − gkT ≤ ε}. Thus we have   lim sup P∗ (Xn ∈ F ) ≤ lim E∗ [Hε (Xn )] = E[Hε (X)] ≤ P X ∈ F ε] . n

n

Since F is closed, taking ε = εm ↓ 0, we have P(X ∈ F ε] ) ↓ P(X ∈ F ). (vi) ⇒ (i): Pick any bounded continuous function H : `∞ (T ) → R. By scaling, it suffices to consider the case where 0 ≤ H ≤ 1. Let H(Xn )∗ be the measurable cover of H(Xn ); then applying Fubini’s theorem, we have Z 1 Z 1 ∗ ∗ ∗ P∗ {H(Xn ) > x}dx, P{H(Xn ) > x}dx = E [H(Xn )] = E[H(Xn ) ] = 0

0

R1

and E[H(X)] = 0 P{H(X) > x}dx as well. Since ∂{H > x} ⊂ {H = x} and the set of x ∈ [0, 1] such that P{H(X) = x} > 0 is at most countable, P∗ {H(Xn ) > x} → P{H(X) > x} for all but countable x ∈ [0, 1], and hence the bounded convergence theorem yields that Z 1 Z 1 P∗ {H(Xn ) > x}dx → P{H(X) > x}dx. 0

0

This completes the overall proof.

3.2

Uniform law of large numbers

Before moving into the uniform CLT, we study a uniform version of the law of large numbers, namely, we study conditions under which kPn − P kF → 0 almost surely, in L1 , or in probability. We first show that the almost sure convergence and the L1 convergence of kPn − P kF → 0 follow from the convergence in probability, that is, kPn − P kF → 0 ⇒ kPn − P kF → 0 almost surely and in L1 . P

Note that in general convergence in probability does not imply almost sure convergence nor convergence in L1 . Recall that X1 , X2 , . . . are the coordinates of the infinite product space (Ω, A, P) = (S N , S N , P N ), and suppose that P F < ∞. Let Σn denote the set of all maps from S N into S N that are permutations of the first n coordinates, and define Sn := {A ∈ S N : 1A (x) = 1A (σn x) ∀σn ∈ Σn }. 41

It is not difficult to see that Sn is a non-increasing sequence of sub σ-fields of S N . By this definition, we see that Z Z A ∈ Sn ⇒ f (Xi )dP = f (Xj )dP, 1 ≤ ∀i, j ≤ n, A A Z Pn (f )dP. = A

Since Pn (f ) is invariant under any permutation of X1 , . . . , Xn , it is Sn -measurable. Hence we have E[f (Xi ) | Sn ] = Pn (f ) for all 1 ≤ i ≤ n, by which we conclude that E[kPn−1 − P kF | Sn ] ≥ kE[(Pn−1 − P )(f ) | Sn ]kF = kPn − P kF . This shows that if P F < ∞, then {kP−n − P kF , S−n : n ≤ −1} is a reversed submartingale. Theorem 10.6.4 of Dudley (2002) leads to the following lemma. Lemma 16. If P F < ∞, then kPn − P kF converges almost surely and in L1 . P

A consequence of this lemma is that whenever kPn − P kF → 0, then kPn − P kF → 0 P

almost surely and in L1 . So the problem reduces to showing that kPn − P kF → 0. For M > 0, define FM := {f 1{F ≤M } : f ∈ F}. The following is the main theorem of this subsection. ˇ Theorem 13 (Vapnik and Cervonenkis (1981); Pollard (1982); Gin´e and Zinn (1984)). If P F < ∞ and n−1 log N (FM , k · kPn ,1 , ε)∗ → 0 as n → ∞ for every M > 0 and every P

P

ε > 0, then kPn − P kF → 0, and thus kPn − P kF → 0 almost surely and in L1 . Proof. By the symmetrization inequality (Theorem 1), we have

# " n

1 X

E[kPn − P kF ] ≤ 2E εi f (Xi )

n

i=1

F #

# " n " n

1 X

1 X



≤ 2E εi f (Xi )1{F ≤M } (Xi ) + 2E εi f (Xi )1{F >M } (Xi ) .

n

n

i=1

i=1

F

The second term is bounded by P 2 ni=1 E[F (Xi )1{F >M } (Xi )] = 2P F 1{F >M } → 0, M → ∞. n Therefore, it is enough to show that for every M > 0, 

 n

1 X

εi f (Xi )  → 0, n → ∞. E 

n

i=1

FM

42

F

Fix X1 , . . . , Xn . For ε > 0, let {f1 , . . . , fN } be a minimal ε-net of FM with N = N (FM , k · kPn ,1 , ε); that is, for every f ∈ FM there exists a number P 1 ≤ k(f ) ≤ N such that kf − fk(f ) kPn ,1 ≤ ε. Hence we may bound the term Eε [kn−1 ni=1 εi f (Xi )kFM ] by 

 # n " n



1 X 1 X



Eε  εi {f (Xi ) − fk(f ) (Xi )}  + Eε max εi fk (Xi )



n 1≤k≤N n i=1 i=1 FM v r u n u1 X 1 + log N ≤ε+C max t fk2 (Xi ) 1≤k≤N n n i=1 r 1 + log N (FM , k · kPn ,1 , ε) ≤ ε + CM , n where we have used the inequality (17). Because of the hypothesis of the theorem, P P P n−1 log N (FM , k · kPn ,1 , ε)∗ → 0, which implies that Eε [kn−1 ni=1 εi f (Xi )kFM ] → 0; but since the left hand side bounded by M , we conclude from the dominated convergence Pis n −1 theorem that E[kn i=1 εi f (Xi )kFM ] → 0. This completes the proof. The following is a simple corollary of Theorem 13, which we will use in the proof of the uniform central limit theorem. Corollary 6. If P F < ∞ and n−1 log N (F, k · kPn ,1 , εkF kPn ,1 )∗ → 0 for every ε > 0, P

P

then kPn − P kF → 0, and thus kPn − P kF → 0 almost surely and in L1 . Proof. We have to check that n−1 log N (FM , k·kPn ,1 , ε)∗ → 0. Without loss of generality, we may assume that P F > 0. Then by the law of large numbers, kF kPn ,1 = Pn F → P F almost surely and thus P{kF kPn ,1 ≤ 2P F } → 1. Because N (FM , k · kPn ,1 , ε) ≤ N (F, k · kPn ,1 , ε/2), on the event that kF kPn ,1 ≤ 2P F , we have P

N (FM , k · kPn ,1 , ε) ≤ N (F, k · kPn ,1 , εkF kPn ,1 /(4P F )). This implies that n−1 log N (FM , k · kPn ,1 , ε)∗ → 0. P

3.3

Uniform central limit theorem

We are now in position to study conditions under which a class F of measurable functions S → R obeys the “uniform” central limit theorem, that is, the (scaled) empirical process √ n(Pn − P )f, f ∈ F converges weakly to a tight random element in `∞ (F). This √ statement makes sense only when n(Pn − P )f, f ∈ F is a sample-bounded stochastic process. In addition, as long as F ⊂ L2 (P ) which we alway assume, the limit process (if exists) must be Gaussian because of the multivariate central limit theorem. If this limit process admits a version that is a tight random element in `∞ (T ), we call the class F P -pre-Gaussian. The formal definition is as follows.

43

Definition 4 (Pre-Gaussian class). A class F of measurable functions S → R is called P -pre-Gaussian if F ⊂ L2 (P ) and if there exists a tight Gaussian random element GP (f ), f ∈ F in `∞ (F) such that E[GP (f )] = 0 for all f ∈ F and E[GP (f )GP (g)] = P (f −P f )(g−P g) for all f, g ∈ F. In this definition, F need not be pointwise measurable. Whenever F ⊂ L2 (P ), Kolmogorov’s extension theorem ensures that there exists a centered Gaussian process GP (f ), f ∈ F whose covariance function is E[GP (f )GP (g)] = P (f − P f )(g − P g). Consider the semi-metric ρP,2 on F defined by ρP,2 (f, g) := (E[GP (g) − GP (f ))2 ])1/2 = (P (f − P f − g + P g)2 )1/2 . Then by Theorem 10, F is P -pre-Gaussian if and only if (T, ρP,2 ) is totally bounded and GP has a version that has sample paths almost surely uniformly ρP,2 -continuous. In what follows, we denote by GP a centered Gaussian process whose covariance function is E[GP (f )GP (g)] = P (f − P f )(g − P g), and when F is P -pre-Gaussian, GP is understood as a tight random element in `∞ (F). Definition 5 (Donsker class). Let F be a class of measurable functions S → R, not necessarily pointwise measurable, such that F ⊂ L2 (P ), supf ∈F |f (x)| < ∞ for all x ∈ S, and supf ∈F |P f | < ∞. Then F is said to be P -Donsker if F is P -pre-Gaussian and the √ sequence of sample-bounded stochastic processes n(Pn f −P f ), f ∈ F converges weakly to GP in `∞ (F). Consider the semi-metric eP,2 defined by p eP,2 (f, g) = P (f − g)2 , f, g ∈ L2 (P ). The following lemma is a consequence of Theorem 11. Lemma 17. Let F be a class of measurable functions S → R, not necessarily pointwise measurable, such that F ⊂ L2 (P ), supf ∈F |f (x)| < ∞ for all x ∈ S, and supf ∈F |P f | < ∞. Then the following three statements are equivalent: (i) F is P -Donsker; (ii) (T, ρP,2 ) is totally bounded and for every ε > 0, ( ) n 1 X lim lim sup P∗ sup ((f − g)(Xi ) − P (f − g)) > ε = 0; √ δ↓0 n→∞ ρP,2 (f,g)<δ;f,g∈F n i=1

(iii) (T, eP,2 ) is totally bounded and for every ε > 0, ( ) n 1 X lim lim sup P∗ sup ((f − g)(Xi ) − P (f − g)) > ε = 0. √ δ↓0 n→∞ n eP,2 (f,g)<δ;f,g∈F i=1

44

Proof. Recall that F is P -pre-Gaussian if and only if (F, ρP,2 ) is totally bounded and GP has a version that has sample paths almost surely uniformly ρP,2 -continuous. Hence the equivalence between (i) and (ii) follows from Theorem 11. In addition, the direction (iii) ⇒ (i) also follows from the same theorem. We have to prove the converse direction (i) ⇒ (iii). Since ρP,2 ≤ eP,2 , GP has sample paths almost surely uniformly eP,2 -continuous. Hence by Theorem 11, all what we need is to prove that (T, eP,2 ) is totally bounded. Pick any ε > 0. Since (T, ρP,2 ) is totally bounded, there is an ε-net {f1 , . . . , fN } of F under ρP,2 , that is, for every f ∈ F, there is a function fi such that ρP,2 (f, fi ) ≤ ε. Since supf ∈F |P f | < ∞, for every 1 ≤ i ≤ N , {P f : f ∈ F, ρP,2 (f, fi ) ≤ ε} is a bounded subset in R. Choose a finite subset {fi,1 , . . . , fi,Ni } ⊂ {f ∈ F : ρP,2 (f, fi ) ≤ ε} such that for every f ∈ F with ρP,2 (f, fi ) ≤ ε, there exists a function fi,j with |P (f − fi,j )| ≤ ε. Then for every f ∈ F, there exist 1 ≤ i ≤ N and 1 ≤ j ≤ Ni such that ρP,2 (f, fi ) ≤ ε and |P (f − fi,j )| ≤ ε, so that e2P,2 (f, fi,j ) = ρ2P,2 (f, fi,j ) + |P (f − fi,j )|2

Therefore, {fi,j

≤ 2ρ2P,2 (f, fi ) + 2ρ2P,2 (fi , fi,j ) + |P (f − fi,j )|2 ≤ 5ε2 . √ : 1 ≤ i ≤ N, 1 ≤ j ≤ Ni } is a 5ε-net of F under eP,2 .

The following theorem gives a sufficient condition for F to be P -Donsker. Theorem 14 (Dudley (1978); Koltchinskii (1981); Pollard (1982)). Let F be a pointwise measurable class of functions S → R, to which a measurable envelope F is attached. Suppose that P F 2 < ∞ and Z 1 q sup log N (F, k · kQ,2 , εkF kQ,2 )dε < ∞, (22) 0

Q

where the supremum is taken over all finitely discrete distributions. Then the class F is P -Donsker. Remark 7. When kF kQ,2 = 0, all functions in F are 0 almost surely in Q, and so N (F, k · kQ,2 , 0) = 1. We present two proofs for Theorem 14; the first one relies on the ULLN, and the second one, which I think is simpler, does not. First proof of Theorem 14. By Lemma 17, we have to check (i) the total boundedness of (F, eP,2 ); and (ii) the asymptotic equicontinuity condition. The total boundedness of (F, eP,2 ) follows from the next lemma. Lemma 18. supf,g∈F |(Pn − P )(f − g)2 | → 0 almost surely. Proof of Lemma 18. Let H = {(f − g)2 : f, g ∈ F}. Then an envelope function for H is given by 4F 2 . Let us write F − F = {f − g : f, g ∈ F}. For f, g ∈ F − F, Pn |f 2 − g 2 | = Pn |f − g||f + g| ≤ Pn |f − g|(4F ) ≤ kf − gkPn ,2 k4F kPn ,2 , 45

by which we have N (H, k · kPn ,1 , εk4F 2 kPn ,1 ) ≤ N (F − F, k · kPn ,2 , εkF kPn ,2 ) ≤ N 2 (F, k · kPn ,2 , εkF kPn ,2 /2) ≤ sup N 2 (F, k · kQ,2 , εkF kQ,2 /2). Q

The far right hand side is finite and independent of n, so that n−1 log N (H, k · kPn ,1 , εk4F 2 kPn ,1 )∗ → 0. P

Therefore, by Corollary 6, we conclude that kPn − P kH → 0 almost surely. This lemma shows in particular that there exists ω ∈ Ω such that for every ε > 0 there exists n ∈ N with supf,g∈F |(Pn (ω) − P )(f − g)2 | ≤ ε2 . With this ω and n, we have N (F, k · kP,2 ,



2ε) ≤ N (F, k · kPn (ω),2 , ε) < ∞.

Therefore, (T, eP,2 ) is totally bounded. It remains to check the asymptotic equicontinuity condition. By the symmetrization inequality for probabilities (Theorem 2), it is enough to prove that ( ) n 1 X sup lim lim sup P εi (f (Xi ) − g(Xi )) > ε = 0, ∀ε > 0. √ δ↓0 n→∞ eP,2 (f,g)<δ;f,g∈F n i=1

Fix any ε > 0. The left probability is bounded by   n  1 X  P sup εi (f (Xi ) − g(Xi )) > ε √ e2 (f,g)<2δ2 ;f,g∈F n  i=1 Pn ,2 ) ( +P

sup |e2Pn ,2 (f, g) − e2P,2 (f, g)| > δ 2

f,g∈F

=: In,δ + IIn,δ . By Lemma 18, IIn,δ → 0 as n → ∞ for each fixed δ > 0. We have to bound In,δ . Without loss of generality, we may assume that F ≥ 1 (otherwise take F ∨ 1 instead of F ). In what follows, let C > 0 denote a universal constant of which the value may change from line to line. By Theorem 6 with ψ = ψ2 √ P applied to X(f ) = (1/ n) ni=1 εi f (Xi ) conditionally on X1 , . . . , Xn , together with

46

inequality (9), we have  Eε 

e2Pn ,2

 Z δq n 1 X  sup εi (f (Xi ) − g(Xi )) ≤ C log N (F, ePn ,2 , ε)dε √ 0 (f,g)<2δ 2 ;f,g∈F n i=1

Z

δ/kF kPn ,2

= CkF kPn ,2

q log N (F, ePn ,2 , εkF kPn ,2 )dε

0

Z ≤ CkF kPn ,2

δ

sup 0

q log N (F, k · kQ,2 , εkF kQ,2 )dε

Q

=: CkF kPn ,2 λ(δ). Hence we have  E

e2Pn ,2

 n 1 X sup εi (f (Xi ) − g(Xi ))  ≤ CkF kP,2 λ(δ). √ n 2 (f,g)<2δ ;f,g∈F i=1

Since the right hand side is independent of n and λ(δ) → 0 as δ ↓ 0, we have lim lim sup In,δ = 0. δ↓0

n→∞

This completes the overall proof. Before presenting the second proof, we introduce packing numbers. For any semimetric space (T, d), a collection of points in T is said to be ε-separated if the d-distance between each pair of points is greater than ε. The packing number D(T, d, ε) is the maximum number of ε-separated points in T . Lemma 19. For every ε > 0, N (T, d, ε) ≤ D(T, d, ε) ≤ N (T, d, ε/2). Proof. If {t1 , . . . , tD } is a maximal ε-separated subset of T , then by maximality it must be an ε-net, so that N (T, d, ε) ≤ D. Conversely, if {t01 , . . . , t0N 0 } is a minimal ε/2-net of T , then no two points in {t1 , . . . , tD } are not in the same ball with center t0i and radius ε/2, so that D ≤ N 0 . The following lemma shows that supQ in (22) can be replaced by the supremum over all probability measures Q such that QF 2 < ∞. Lemma 20. Let P be a probability measure on (S, S) such that P F r < ∞ for some r ≥ 1. Then D(F, k · kP,r , 2εkF kP,r ) ≤ sup D(F, k · kP,r , εkF kQ,r ), Q

where supQ means the supremum over all finitely discrete distributions. 47

Proof. Let {f1 , . . . , fm } be a maximal 2εkF kP,r -separated subset of (F, k · kP,r ), so that P |fi − fj |r > (2ε)r P F r for all i 6= j. Take X1 , X2 , · · · ∼ P i.i.d. Then by the strong law of large numbers, Pn |fi − fj |r → P |fi − fj |r for all i 6= j and Pn F r → P F r almost surely. Hence there exist ω and n such that Pn (ω)|fi − fj |r > (2ε)r P F r > εr Pn (ω)F r for all i 6= j, which leads to the desired conclusion. Second proof of Theorem 14. The total boundedness of (T, eP,2 ) follows from combining (22) and the previous lemma. We have to verify the asymptotic equicontinuity, to which end it is enough to show that √ lim lim sup E[ nkPn − P kFδ ] = 0, δ↓0

n→∞

where Fδ = {f − g : f, g ∈ F, eP,2 (f, g) < δkF kP,2 }. The class Fδ has envelope 2F , and since Fδ ⊂ {f − g : f, g ∈ F}, we have √ sup N (Fδ , k · kQ,2 , 2 2εk2F kQ,2 ) ≤ sup N 2 (F, k · kQ,2 , εkF kQ,2 ), Q

Q

which leads to J(ε, Fδ , 2F ) ≤ CJ(ε, F, F ). Hence by Theorem 8,   √ Bn J 2 (δ, F, F ) √ E[ nkPn − P kFδ ] ≤ C J(δ, F, F )kF kP,2 + , δ2 n p √ where Bn = E[max1≤i≤n F 2 (Xi )]. Because F ∈ L2 (P ), we have4 Bn = o( n), so that √ lim sup E[ nkPn − P kFδ ] ≤ CJ(δ, F, F )kF kP,2 . n→∞

Then the right hand side goes to 0 as δ ↓ 0, so that the desired conclusion follows.

3.4

Application: CLT in C-space

Let (K, d) be a compact metric space, and let C(K) denote the space of all continuous functions K → R, equipped with the uniform norm kxkK = supt∈K |x(t)|. The space (C(K), k·kK ) is a separable Banach space. In this subsection, we assume that X1 , X2 , . . . are i.i.d. random elements in C(K) with mean zero (that P is, E[Xi (t)] = 0), and study conditions under which the normalized sum Sn = n−1/2 ni=1 Xi obeys the central limit theorem, that is, Sn converges weakly to a Gaussian random element in C(K) as n → ∞. We first recall some definition. In general, a sequence Yn of random elements in a metric space U is said to converge weakly to a random element Y in U , denoted by w Yn → Y in U , if lim E[H(Yn )] = E[H(Y )] n→∞

for all bounded continuous functions H : U → R. A Gaussian process indexed by K whose sample paths are continuous may be viewed as a map from Ω into C(K), which is automatically Borel measurable (see Lemma 12) and we call it a Gaussian random element in C(K). 4

Recall that for i.i.d. random variables ξ1 , ξ2 , . . . , the following three statements are equivalent: (1) E[|ξ1 |] < ∞; (2) max1≤i≤n |ξi |/n → 0 almost surely; E[max1≤i≤n |ξi |] = o(n).

48

Theorem 15 (Jain and Marcus (1975)). Let X, X1 , X2 , . . . be i.i.d. random elements in C(K) such that E[X(t)] = 0 and E[X 2 (t)] < ∞ for all t ∈ K. Suppose that there exists a non-negative random variable M with E[M 2 ] < ∞ such that |X(t, ω) − X(s, ω)| ≤ M (ω)d(s, t), ω ∈ Ω, s, t ∈ K, and moreover, 1p

Z

log N (K, d, ε)dε < ∞.

0

P Then the normalized sum Sn = n−1/2 ni=1 Xi obeys the central limit theorem, that is, Sn converges weakly to a Gaussian random element G in C(K) with mean zero and covariance function Cov(G(t), G(s)) = Cov(X(t), X(s)). Proof. Let kgkBL := kgk∞ + sup s6=t

|g(t) − g(s)| , BL(K) := {g ∈ C(K) : kgkBL < ∞}. d(s, t)

Consider the following setting: S = BL(K), S = the σ-field consisting of all Borel subsets of C(K) contained in BL(K), P = the distribution of X (note that BL(K) is a Borel subset of C(K) and the distribution of X concentrates on BL(K)). Consider the class F of functions BL(K) → R defined by F = {ft : t ∈ K}, ft : g 7→ g(t). Then F ⊂ L2 (S, S, P ), P f = 0 for all f ∈ F, sup |f (g)| = sup |g(t)| = kgkK ≤ kgkBL =: F (g) < ∞, and f ∈F

t∈K 2

E[F (X)] ≤ 2(E[X 2 (t0 )] + D2 E[M 2 ]) < ∞, where t0 ∈ K is any point and D is the diameter of K. Since |ft (g) − fs (g)| = |g(t) − g(s)| ≤ F (g)d(s, t), we have sup N (F, k · kQ,2 , εkF kQ,2 ) ≤ N (K, d, ε). Q

Therefore, by Theorem 14, F is P -Donsker, which implies that n lim lim sup P δ↓0

n→∞

n 1 X o (ft − fs )(Xi ) > ε = 0. √ | {z } n eP,2 (ft ,fs )<δ;ft ,fs ∈F

sup

i=1

=Xi (t)−Xi (s)

Observe that eP,2 (fs , ft ) =

p P (ft − fs )2 = (E[(X(t) − X(s))2 ])1/2 ≤ (E[M 2 ])1/2 d(s, t), 49

so that we conclude that ( lim lim sup P δ↓0

n→∞

) |Sn (t) − Sn (s)| > ε

sup

= 0.

d(s,t)<δ;s,t∈K

Therefore, by Theorem 11, Sn , viewed as maps from Ω into `∞ (K), converge weakly to a tight Gaussian random element G in `∞ (K) with mean zero and covariance function Cov(G(t), G(s)) = Cov(X(t), X(s)), and G has samples paths almost surely continuous. We may take G in such a way that all sample paths of G are continuous. Since Sn and w G take values in C(K), the weak convergence takes place in C(K), that is, Sn → G in C(K).

50

4

Covering numbers

In order to apply the maximal inequalities in Theorems 7 and 8, the uniform law of large numbers (Theorem 13), or the uniform central limit theorem (Theorem 14), one has to bound the covering numbers. An important class of examples for which good estimates ˇ on the covering numbers are known are Vapnik-Cervonenkis (VC) subgraph classes and their suitable transformations. This section defines VC-subgraph and type classes, and investigates their basic properties.

4.1

VC subgraph classes

We begin with the definition of VC classes of sets. Let S be a set, and let C be a class of subsets of S. We say that C picks out a certain subset A of a finite set {x1 , . . . , xn } ⊂ S if it can be written as A = {x1 , . . . , xn } ∩ C for some C ∈ C. Let ∆C (x1 , . . . , xn ) denote the number of subsets of {x1 , . . . , xn } picked out by C, that is, ∆C (x1 , . . . , xn ) = |{{x1 , . . . , xn } ∩ C : C ∈ C}|. The class C is said to shatter {x1 , . . . , xn } if C picks out all of its 2n subsets, that is, ∆C (x1 , . . . , xn ) = 2n . The VC index V (C) is defined by the smallest n for which no set of size n is shattered by C, that is, letting mC (n) := maxx1 ,...,xn ∆C (x1 , . . . , xn ), (  inf n : mC (n) < 2n if the set is non-empty V (C) = . ∞ otherwise The class C is said to be a VC-class if V (C) < ∞. Note that V (C) ≤ n ⇔ for every n points x1 , . . . , xn , there exists a subset A of {x1 , . . . , xn } such that A is not picked out by C. Example 5. 1. Consider the collection of subsets C = {(−∞, c] : c ∈ R} of S = R. This C has VC index = 2. Indeed, clearly V (C) ≥ 2, and for every x1 < x2 , {x2 } is not picked out by C. 2. Consider the collection of subsets C = {(a, b] : a, b ∈ R, a < b} of S = R. This C has VC index = 3. Indeed, clearly V (C) ≥ 3, and for every x1 < x2 < x3 , {x1 , x3 } is not picked out by C. The concept of VC classes of sets can be used to define suitable classes of functions. The subgraph of a function f : S → R, denoted by sg(f ), is defined by sg(f ) := {(x, t) : t < f (x)}. A class F of functions on S is said to be a VC-subgraph class if the collection of all subgraphs of functions in F (which we denote by sg(F) := {sg(f ) : f ∈ F }) is a VC class of sets in S × R. The VC-index of F, denoted by V (F), is defined by the VC-index of the collection of the subgraphs, that is, V (F) = V (sg(F)). 51

For a VC-class C of sets, mC (n) < 2n for n ≥ V (C). Indeed, asymptotically (n → ∞), is much smaller than 2n :

mC (n)

mC (n) = O(nV (C)−1 ). This is a consequence of Sauer’s lemma (van der Vaart and Wellner, 1996, Lemma 2.6.2), which we do not prove here. By Sauer’s Lemma, together with an additional work, we can bound the covering numbers for VC-subgraph classes as follows. In what follows, supQ means the supremum over all finitely discrete distributions. Theorem 16. Let F be a VC-subgraph class of functions on S with envelope F . Let r ≥ 1. Then we have V (F )

sup N (F, k · kQ,r , εkF kQ,r ) ≤ KV (F)(16e) Q

 r(V (F )−1) 1 , 0 < ∀ε ≤ 1, ε

where K > 0 is a universal constant. Proof. See van der Vaart and Wellner (1996), Theorem 2.6.7. The theorem shows that for VC-subgraph classes, the covering numbers grow only polynomially in 1/ε, and hence, for example, condition (22) in the uniform central limit theorem (Theorem 14) is satisfied for VC-subgraph classes. Lemma 21. Let C be a VC class of sets, and let FC = {1C : C ∈ C} be the associated class of indicator functions. Then FC is a VC-subgraph class with VC index V (C). Proof. If {x1 , . . . , xn } is shattered by C, then (x1 , 0), . . . , (xn , 0) are shattered by sg(FC ). Conversely, suppose that (x1 , t1 ), . . . , (xn , tn ) are shattered by sg(FC ). Pick any subset {i1 , . . . , im } ⊂ {1, . . . , n}. We wish to show that {xi1 , . . . , xim } is picked out by C. By rearranging the labeling, we may take {i1 , . . . , im } = {1, . . . , m}. Since (x1 , t1 ), . . . , (xm , tm ) are picked out by sg(FC ), there is a set C1 ∈ C such that ti < 1C1 (xi ), 1 ≤ ∀i ≤ m, tm+j ≥ 1C1 (xm+j ), 1 ≤ ∀j ≤ n − m. Since (x1 , t1 ), . . . , (xn , tn ) are shattered by sg(FC ), (xm+1 , tm+1 ), . . . , (xn , tn ) must be picked out by sg(FC ), so that there is a set C2 ∈ C such that ti ≥ 1C2 (xi ), 1 ≤ ∀i ≤ m, tm+j < 1C2 (xm+j ), 1 ≤ ∀j ≤ n − m. Hence one finds that 0 ≤ tk < 1 for all 1 ≤ k ≤ n, so that xi ∈ C1 for all 1 ≤ i ≤ m and xm+j ∈ / C1 for all 1 ≤ j ≤ n − m, which means that {x1 , . . . , xm } = {x1 , · · · , xn } ∩ C1 . Therefore, {x1 , . . . , xn } is shattered by C. Combining the following lemma, we conclude that V (FC ) = V (C). Lemma 22. If sg(F) shatters n distinct points (x1 , t1 ), . . . , (xn , tn ) in S × R, then x1 , . . . , xn are distinct. 52

Proof. Suppose on the contrary some of x1 , . . . , xn are equal. Without loss of generality, we may assume x1 = x2 . Since sg(F) pikcs out (x1 , t1 ), there is an f1 ∈ F such that t1 < f1 (x1 ) and t2 ≥ f1 (x2 ) = f1 (x1 ). Hence t1 < t2 . On the other hand, since sg(F) pikcs out (x2 , t2 ), there is an f2 ∈ F such that t2 < f2 (x2 ) and t1 ≥ f2 (x1 ) = f2 (x2 ), which in turn implies t2 < t1 , a contradiction. Lemma 23. Any finite dimensional vector space F of functions is a VC-subgraph class with VC-index at most dim(F) + 2. Proof. Pick any collection of n := dim(F)+2 points (x1 , t1 ), . . . , (xn , tn ). By assumption, the vectors (f (x1 ) − t1 , . . . , f (xn ) − tn )), f ∈ F, are contained in a (n−1)-dimensional subspace of Rn . Pick any vector a = (a1 , . . . , an ) ∈ Rn that is orthogonal to this subspace. Then n X

ai (f (xi ) − ti ) = 0, ∀f ∈ F,

i=1

that is, X ai >0

ai (f (xi ) − ti ) =

X

(−ai )(f (xi ) − ti ), ∀f ∈ F.

ai <0

It is clear that there exists such a vector a with at least one positive coordinate. Then the set {(xi , ti ) : ai > 0} is not picked out by sg(F) since otherwise there P is an f ∈ F with ti < f (x ) for all a > 0 and t ≥ f (x ) for all a ≥ 0, in which case i i i i ai >0 ai (f (xi )−ti ) > Pi 0 and ai <0 (−ai )(f (xi ) − ti ) ≤ 0, a contradiction. Lemma 24. Let F be a VC-subgraph class, and let φ : R → R be a monotone function. Then the class φ ◦ F = {φ ◦ f : f ∈ F} is also a V C-subgraph class with VC-index at most V (F). Proof. Without loss of generality we may assume that φ is non-decreasing (consider −φ if φ is non-increasing). Suppose on the contrary that V (φ ◦ F) > V (F) =: n. Then there are n points (x1 , t1 ), . . . , (xn , tn ) in S × R that are shattered by {sg(φ ◦ f ) : f ∈ F}. Note that x1 , . . . , xn must be distinct. Choose functions f1 , . . . , fm in F with m = 2n such that {sg(φ ◦ fj ) : j = 1, . . . , m} shatters (x1 , t1 ), . . . , (xn , tn ). For each 1 ≤ i ≤ n, define si = max{fj (xi ) : φ(fj (xi )) ≤ ti }. Then si < fj (xi ) ⇔ ti < φ ◦ fj (xi ). (the direction “⇒” is trivial; for the converse direction, if ti < φ ◦ fj (xi ), for every 1 ≤ k ≤ m with φ(fk (xi )) ≤ ti , φ(fk (xi )) < φ(fj (xi )), which implies that fk (xi ) < fj (xi ) since φ is non-decreasing, so that si < fj (xi ).) This implies that (x1 , s1 ), . . . , (xn , sn ) are shattered by {sg(fj ) : j = 1, . . . , m}, a contradiction.

53

4.2

VC type classes

An important consequence of Theorem 16 is that there exist universal constants A ≥ 1 and c ≥ 1 such that for every VC-subgraph class F and every envelope F for F, we have sup N (F, k · kQ,2 , εkF kQ,2 ) ≤ (A/ε)cV (F ) , 0 < ∀ε ≤ 1.

(23)

Q

Indeed, this property is precisely what we need in many applications. Given these observations, we define VC-type classes of functions as follows. Definition 6. Let F be a class of functions S → R with envelope F . We say that F is VC-type with constants A ≥ 1, V ≥ 1 and envelope F , if sup N (F, k · kQ,2 , εkF kQ,2 ) ≤ (A/ε)V , 0 < ∀ε ≤ 1.

(24)

Q

Clearly, VC-subgraph classes are VC-type. A useful feature of the VC type property is that it is closed under Lipschitz transformations. Proposition 5. Let F1 , . . . , Fk be classes of functions S → R with envelopes F1 , . . . , Fk , respectively, and let φ : Rk → R be a map that is Lipschitz in the sense that for every f = (f1 , . . . , fk ), g = (g1 , . . . , gk ) ∈ F1 × · · · × Fk =: F, 2

|φ ◦ f (x) − φ ◦ g(x)| ≤

k X

L2j (x)|fj (x) − gj (x)|2 , x ∈ S,

j=1

where L1 , . . . , Lk are some P functions on S. Consider the class of functions φ(F) := {φ ◦ f : f ∈ F}. Denote ( kj=1 L2j Fj2 )1/2 by L · F . Then we have sup N (φ(F), k · kQ,2 , εkL · F kQ,2 ) ≤ Q

k Y

sup N (Fj , k · kRj ,2 , εkFj kRj ,2 )

j=1 Rj

for every 0 < ε ≤ 1, where the suprema are taken over all finitely discrete distributions. Proof. Let Q be any finitely discrete distribution, and define the measure dRj = L2j dQ, which may not be a probability measure. Then k Z k X X 2 kφ ◦ f − φ ◦ gkQ,2 ≤ |fj − gj |2 dRj = Rj (|fj − gj |2 ). j=1

j=1

For each j, construct a minimal εkFj kRj ,2 -net Gj for Fj with respect to k · kRj ,2 . Then the set of points φ(g1 , . . . , gk ) with (g1 , . . . , gk ) ranging over all possible combinations of gj from Gj forms an εkL · F kQ,2 -net for φ(F) with respect to k · kQ,2 , so that N (φ(F), k · kQ,2 , εkL · F kQ,2 ) ≤

k Y

N (Fj , k · kRj ,2 , εkFj kRj ,2 ).

j=1

Replacing Rj by Rj /Rj (1) (the latter is a probability measure), we obtain the desired result. 54

The following corollary is a direct consequence of Proposition 5. Corollary 7. (i) Let F and G be classes of functions S → R with envelopes F and G, respectively. Then √ sup N (F + G, k · kQ,2 , 2εkF + GkQ,2 ) Q

≤ sup N (F, k · kQ,2 , εkF kQ,2 ) sup N (G, k · kQ,2 , εkGkQ,2 ), Q

Q

sup N (FG, k · kQ,2 , 2εkF GkQ,2 ) Q

≤ sup N (F, k · kQ,2 , εkF kQ,2 ) sup N (G, k · kQ,2 , εkGkQ,2 ) Q

Q

for every 0 < ε ≤ 1, where FG = {f g : f ∈ F, g ∈ G}. (ii) Let F be a class of functions S → R with envelope F . For every q ≥ 1, let |F|q = {|f |q : f ∈ F}. Then sup N (|F|q , k · kQ,2 , qεkF q kQ,2 ) ≤ sup N (F, k · kQ,2 , εkF kQ,2 ) Q

Q

for every 0 < ε ≤ 1. Example 6. Let φ and ϕ be monotone functions R → R, and let F and G be VCsubgraph classes. Suppose that there exists a non-negative function H : S → R+ with supf ∈F |φ ◦ f (x)| ≤ H(x) and supg∈G |ϕ ◦ g(x)| ≤ H(x) for all x ∈ S. Then φ ◦ F + ϕ ◦ G is a VC-type class with constants A, c(V (F) + V (G)) and envelope 2H, where A and c are universal constants. The following proposition, essentially taken from Lemma 1 in Gin´e and Nickl (2009), is useful in analyzing kernel estimators. Recall that, for p ∈ [1, ∞), a function f : R → R is said to have bounded p-variation if (N ) X vp (f ) := sup |f (xi ) − f (xi−1 )|p : −∞ < x0 < x1 < · · · < xN < ∞ < ∞. i=1

A function of bounded 1-variation is said to be of bounded variation, and a function of 2-variation is said to be of bounded quadratic variation. If a function f : R → R is of bounded p-variation, then f is bounded, since |f (x)| ≤ 2p−1 |f (0)| + 2p−1 |f (x) − f (0)| ≤ 2p−1 |f (0)| + vp (f ). Proposition 6. Let f : R → R be a function of bounded p-variation for some p ∈ [1, 2], and consider the function class F = {x 7→ f (ax + b) : a, b ∈ R}. Then there exist constants A, V > 0 that depend only on p such that sup N (F, k · kQ,2 , εvp1/p (f )) ≤ (A/ε)V , 0 < ∀ε ≤ 1, Q

where supQ is taken over all finitely discrete distributions on R. 55

In what follows, we will prove Proposition 6. Define vp,f (x) := vp (f 1(−∞,x] ), x ∈ R. Note that vp,f (x) is non-decreasing in x. Lemma 25. Let f : R → R be of bounded p-variation for some p ∈ [1, ∞). Then there exist a non-decreasing function h : R → R with 0 ≤ h ≤ vp (f ), and a 1/p-H¨ older function g : [0, vp (f )] → R with H¨ older constant at most 1 (i.e., |g(x) − g(y)| ≤ |x − y|1/p for all x, y ∈ [0, vp (f )]) such that f = g ◦ h. The proof of this lemma relies on the following Kirszbraun-McShane extension theorem. Theorem 17 (Kirszbraun-McShane extension theorem). Let (T, d) be a metric space, and let S ⊂ T be non-empty. Let f : S → R be a bounded function, bounded by M > 0, and with the property that |f (s) − f (t)| ≤ ϕ(d(s, t)) for all x, y ∈ S, where ϕ : [0, ∞) → [0, ∞) is a non-decreasing function such that ϕ(0) = 0 and ϕ(x + y) ≤ ϕ(x) + ϕ(y) for all x, y ≥ 0. Then there exists a real-valued function F defined on T such that F |S = f, |F | ≤ M , and |F (s) − F (t)| ≤ ϕ(d(s, t)) for all s, t ∈ T . Proof. Let g(t) = inf s∈S [f (s) + ϕ(d(s, t))] ≥ −M and F (t) = min{g(t), M } for t ∈ T . By definition, |F | ≤ M . Observe that if t ∈ S, then f (t) ≤ f (s) + ϕ(d(s, t)) for all s ∈ S, and the equality takes place when s = t. Hence g|S = f , and since |f | ≤ M , we have that F |S = f . It remains to prove that |F (s) − F (t)| ≤ ϕ(d(s, t)) for all s, t ∈ T . Pick any s, t ∈ T . Since the function x 7→ min{x, M } is 1-Lipschitz, we have that |F (s) − F (t)| ≤ |g(s) − g(t)| ≤ sup |ϕ(d(u, s)) − ϕ(d(u, t))|. u∈S

Since ϕ(d(u, s)) ≤ ϕ(d(u, t)+d(t, s)) ≤ ϕ(d(u, t))+ϕ(d(s, t)) and ϕ(d(u, t)) ≤ ϕ(d(u, s))+ ϕ(d(s, t)), we have that |ϕ(d(u, s)) − ϕ(d(u, t))| ≤ ϕ(d(s, t)). This shows that |F (s) − F (t)| ≤ ϕ(d(s, t)). Proof of Lemma 25. We first verify that for any x < y, |f (y) − f (x)|p ≤ vp,f (y) − vp,f (x).

(25)

To see this, for a given ε > 0, pick −∞ < x0 < · · · < xN = x such that vp,f (x) ≤ PN PN +1 p p i=1 |f (xi ) − f (xi−1 )| + ε. Let xN +1 = y, and observe that i=1 |f (xi ) − f (xi−1 )| ≤ vp,f (y). This implies that vp,f (y) − vp,f (x) ≥ |f (y) − f (x)|p − ε. Letting ε ↓ 0, we have proved (25). Now, let h = vp,f with Rh = {h(x) : x ∈ R} ⊂ [0, vp (f )]. From (25), f is constant on −1 h ({u}) for every u ∈ Rh . So, for every u ∈ Rh , let g(u) be the value of f on h−1 ({u}). By construction, f = g ◦ h. Furthermore, it is seen that g is 1/p-H¨older continuous with 56

H¨older constant at most 1 on Rh . In fact, for any u, v ∈ Rh , pick any x ∈ h−1 ({u}) and y ∈ h−1 ({v}). Then |g(u) − g(v)| = |f (x) − f (y)| ≤ |h(x) − h(y)|1/p = |u − v|1/p . Finally, from the Kirszbraun-McShane extension theorem, g extends to a 1/p-H¨older continuous function with H¨ older constant at most 1 defined on [0, vp (f )]. Proof of Proposition 6. Let f = g ◦ h be the decomposition as in Lemma 25. Since the function class {x 7→ ax + b : a, b ∈ R} is a vector space of dimension 2, it is a VC subgraph class with VC index at most 4. Since h is monotone, the function class M := {x 7→ h(ax + b) : a, b ∈ R} is also a VC subgraph class with VC index at most 4. Recall that |h| ≤ vp (f ). Now, for any finitely discrete probability measure Q on R and for any ε > 0, let {m1 , . . . , mN } be a minimal εvp (f )-net of M under k · kQ,2/p with N = N (M, k · kQ,2/p , εvp (f )). Let fj = g ◦ mj , j = 1, . . . , N . By definition, for every m ∈ M, there exists mj such that km − mj kQ,2/p ≤ εvp (f ). Hence, for f = g ◦ m, we have that Z Z 2 2 kf − fj kQ,2 = (f − fj ) dQ ≤ |m − mj |2/p dQ ≤ (εvp (f ))2/p . R

R

This implies that N (F, k · kQ,2 , (εvp (f ))1/p ) ≤ N (M, k · kQ,2/p , εvp (f )). Since M is a VC subgraph class with VC index at most 4, and M is uniformly bounded by vp (f ), there exist constants A1 , V1 > 0 that depend only on p such that N (M, k · kQ,2/p , εvp (f )) ≤ (A1 /ε)V1 , 0 < ∀ε ≤ 1, which yields that 1/p

N (F, k · kQ,2 , εvp1/p (f )) ≤ (A1 /εp )V1 ≤ (A1 /ε)pV1 , 0 < ∀ε ≤ 1. This completes the proof. We have seen that suitable transformations of VC subgraph classes are VC type. In some cases, one can directly compute the covering numbers. The following is a simple example. Lemma 26. Let Θ ⊂ Rd be a non-empty bounded subset with diameter D, and let F = {fθ : θ ∈ Θ} be a class of functions on S indexed by Θ such that for some nonnegative function M : S → R+ , |fθ1 (x) − fθ2 (x)| ≤ M (x)|θ1 − θ2 |, x ∈ S, θ1 , θ2 ∈ Θ. Then we have

  4D d , ε > 0. sup N (F, k · kQ,2 , εkM kQ,2 ) ≤ 1 + ε Q

57

(26)

Proof. Let Q be any finitely discrete distribution. By (26), we can see that N (F, k · kQ,2 , εkM kQ,2 ) ≤ N (Θ, | · |, ε). Since Θ has diameter D, by Lemma 4, we have N (Θ, | · |, 2ε) ≤ N (BD , | · |, ε), where BD is a ball in Rd with radius D. We shall prove the following lemma, from which the desired conclusion follows. Lemma 27. Let Br be a ball in Rd with radius r < ∞. Then for every ε > 0,   2r d . N (Br , | · |, ε) ≤ 1 + ε Proof of Lemma 27. By homogeneity, we may assume r = 1. Let {x1 , . . . , xN } be a maximal subset of B1 such that |xi − xj | > ε for all i 6= j. By maximality, {x1 , . . . , xN } is an ε-net of B1 . Then the open balls with center xi and radius ε/2 are disjoint and contained in B1+ε/2 . Hence comparing the volumes, we conclude that N (ε/2)d ≤ (1 + ε/2)d , that is

  2 d N ≤ 1+ . ε

This completes the proof.

58

5

Gaussian processes

The study of Gaussian processes is one of central issues in probability theory. In the context of empirical process theory, Gaussian processes arise as limit processes. Not surprisingly, Gaussian processes possess many fine properties. We will review a few of them. For a Gaussian process X(t), t ∈ T indexed by a non-empty set T , we always equip T with the intrinsic semi-metric ρ2 defined by ρ2 (s, t) = (E[(X(t) − X(s))2 ])1/2 . In addition, when we say X is separable, we mean X is separable with respect to this intrinsic semi-metric.

5.1

Gaussian concentration inequality

Since a Gaussian process is a collection of jointly normal random variables, and a separable Gaussian process can be approximated (in a suitable sense) by a set of finite jointly normal random variables, the properties of separable Gaussian processes are typically deduced from studying Gaussian measures on finite dimensional Euclidean spaces. We begin with the simplest case: the standard normal distribution on R. One important feature of the standard normal distribution is its concentration property. More formally, we have the following lemma. Lemma 28. Let Z be a standard normal random variable in R. Then for every r > 0, P(|Z| ≥ r) ≤ e−r

2 /2

.

Proof. Let φ denote the density function of N (0, 1). Then ∞



r z 2φ(r) 1 2 −r2 /2 P(|Z| ≥ r) = 2 φ(z)dz ≤ 2 φ(z)dz = = e . r r r π r r p p 2 The far right hand side is at most e−r /2 for r > 2/π. For 0 < r ≤ 2/π, we make a direct calculation. Let Z ∞ −r2 /2 h(r) = e −2 φ(z)dz. Z

Z

r

Observe that h(0) = 0 and 0

−r2 /2

h (r) = −re which is non-negative for 0 < r ≤

r +

2 −r2 /2 e , π

p p 2/π, so that h(r) ≥ h(0) = 0 for 0 < r ≤ 2/π.

The message of this lemma is that the probability that |Z| is larger than r decreases quite fast as r → ∞; in other words, the standard normal distribution concentrates around its mean 0. Lemma 28 is a consequence of an elementary calculation and perhaps not surprising. What is remarkable is the following result. 59

Theorem 18 (Borell (1975); Sudakov and Tsirel’son (1978)). Let Z be a stantdard normal random vector in Rm , and let F : Rm → R be a Lipschitz continuous function with Lipschitz constant at most 1. Then for every r > 0, P{F (Z) ≥ E[F (Z)] + r} ≤ e−r

2 /2

.

(27)

First proof of Theorem 18. The theorem has several proofs. We present two approaches. The first proof, due to Pisier (1989, Theorem 4.7), is totally elementary but with a worse constant. The second proof, which we will present in the next subsection, is based on the log-Sobolev inequality for Gaussian measures and is more involved than the first proof, but with the sharp constant and can be generalized to other distributions. Without loss of generality, we may assume that F is continuously differentiable (otherwise, approximate F by a sequence of smooth Lipschitz functions). Let Z˜ be an independent copy of Z. Let Z(θ) = Z sin θ + Z˜ cos θ, Z 0 (θ) = Z cos θ − Z˜ sin θ. Then we have ˜ = F (Z(π/2)) − F (Z(0)) = F (Z) − F (Z)

Z

π/2

∇F (Z(θ))> Z 0 (θ)dθ.

0

Let Φλ (x) = exp(λx). Since Φλ is convex, we have Z π  2 π/2 ˜ Φλ (F (Z) − F (Z)) ≤ Φλ ∇F (Z(θ))> Z 0 (θ) dθ π 0 2 by Jensen’s inequality, and thus Z

π/2

h π i E Φλ ∇F (Z(θ))> Z 0 (θ) dθ 2 h 0 π i = E Φλ ∇F (Z)> Z˜ . 2

˜ ≤ 2 E[Φλ (F (Z) − F (Z))] π

d The last equality is because (Z(θ)> , Z 0 (θ)> )> = (Z > , Z˜ > )> . Now we have     h π  i 1 π 2 2 2 >˜ 2 2 E Φλ ∇F (Z) Z | Z = exp |∇F (Z)| λ ≤ eπ λ /8 2 2 2

˜ ≤ eπ2 λ2 /8 . Since Φλ is convex, the left hand side is bounded and so E[Φλ (F (Z) − F (Z))] ˜ from below by E[Φλ (F (Z) − E[F (Z)])] by Jensen’s inequality. Therefore, by Markov’s inequality, we have for every λ > 0, P {F (Z) − E[F (Z)] ≥ r} ≤ eπ

2 λ2 /8−λr

Choosing λ = 4r/π 2 , we find that P {F (Z) − E[F (Z)] ≥ r} ≤ e−2r This completes the first proof. 60

2 /π 2

.

.

Applying inequality (27) to −F , we also have P{|F (Z) − E[F (Z)]| ≥ r} ≤ 2e−r

2 /2

,

which shows that F (Z) sharply concentrates around its mean, like a standard normal random variable despite the possibility that F may be a complicated function. Theorem 18 is referred to as the Gaussian concentration inequality. We refer the reader to the monograph by Ledoux (2001) for a comprehensive account on concentration of measure phenomenon. The following is a simple application of Theorem 18 that is relevant in statistics. Proposition 7. Let ε1 , . . . , εn be independent normal random variables with mean 0 and variance σ 2 > 0, and let ε = (ε1 , . . . , εn )> . Let A be an arbitrary n × n matrix. Then for every η > 0, we have n o P |Aε|2 ≥ (1 + η)σ 2 Tr(A> A) + 2(1 + η −1 )σ 2 kAk2op r ≤ e−r , ∀r > 0. Proof. By homogeneity, we may assume σ 2 = 1. We shall apply Theorem 18 to F (z) = |Az| with z = (z1 , . . . , zn )> . The F is Lipschitz continuous with Lipschitz constant at most kAkop . Then for every r > 0, 2

P {|Aε| ≥ E[|Aε|] + kAkop r} ≤ e−r /2 . p p We have E[|Aε|] ≤ E[|Aε|2 ] = Tr(A> A). Furthermore, for every η > 0, (a + b)2 ≤ (1 + η)a2 + (1 + η −1 )b2 , and hence o n 2 P |Aε|2 ≥ (1 + η)σ 2 Tr(A> A) + (1 + η −1 )kAk2op r2 ≤ e−r /2 . The final conclusion follows from a simple change of variables. We shall study a few implications of the Gaussian concentration inequality to Gaussian processes. Recall that kXkT = supt∈T |X(t)|. Theorem 19. Let X(t), t ∈ T be a separable, centered Gaussian process indexed by a non-empty set T , such that kXkT < ∞ almost surely. Then automatically σ 2 := supt∈T E[X 2 (t)] < ∞, E[kXkT ] < ∞, and for every r > 0, P {kXkT ≥ E[kXkT ] + r} ≤ e−r

2 /(2σ 2 )

.

Furthermore, for every r > 0, P {|kXkT − E[kXkT ]| ≥ r} ≤ 2e−r

2 /(2σ 2 )

.

Proof. We begin with verifying that σ 2 < ∞. Let σt2 = E[X 2 (t)]. Since kXkT is finite almost surely, there exists a finite M with P(kXkT ≤ M ) ≥ 1/2 (say). Then for every t ∈ T , P(|X(t)| ≤ M ) ≥ 1/2. But since for every t with σt2 > 0, X(t)/σt ∼ N (0, 1), 61

there exists a universal constant c > 0 such that M/σt ≥ c whenever σt2 > 0. Hence we conclude that σ 2 = supt∈T σt2 ≤ (M/c)2 < ∞. Fix any t1 , . . . , tm ∈ T and consider the centered normal random vector (X(t1 ), . . . , X(tm ))> . Denote by Γ the covariance matrix of this random vector. Define the function F : Rm → R by F (z) = max1≤i≤m |(Γ1/2 z)i | for z ∈ Rm . Take Z ∼ N (0, Im ). Then d

d

(X(t1 ), . . . , X(tn ))> = Γ1/2 Z and thus max1≤i≤m |X(ti )| = F (Z). The Lipschitz constant of F is bounded by 1/2 m X max  (Γ1/2 )2ij  = max (E[X 2 (ti )])1/2 ≤ σ. 

1≤i≤m

1≤i≤m

j=1

Hence by Theorem 18, we have   2 2 P max |X(ti )| ≥ E[ max |X(ti )|] + r ≤ e−r /(2σ ) . 1≤i≤m

1≤i≤m

(28)

Applying the same theorem to −F , we also have   2 2 P max |X(ti )| ≤ E[ max |X(ti )|] − r ≤ e−r /(2σ ) . 1≤i≤m

1≤i≤m

2

2

Take r0 large enough so that e−r0 /(2σ ) < 1/2. Then E[max1≤i≤m |X(ti )|] − r0 must be bounded from above by M , that is, E[max1≤i≤m |X(ti )|] ≤ M + r0 . Note that M and r0 are chosen independently from t1 , . . . , tm . By separability, there exists an increasing sequence of finite subsets Tn of T such that max |X(t)| ↑ sup |X(t)| = kXkT t∈Tn

a.s.

t∈T

By the monotone convergence theorem, we conclude that E[kXkT ] = lim E[max |X(t)|] ≤ M + r0 < ∞. n→∞

t∈Tn

In addition, we have     2 2 P max |X(t)| > E[kXkT ] + r ≤ P max |X(t)| > E[max |X(t)|] + r ≤ e−r /(2σ ) t∈Tn

t∈Tn

t∈Tn

by inequality (28). Letting n → ∞, we conclude that P {kXkT > E[kXkT ] + r} ≤ e−r This completes the proof. 62

2 /(2σ 2 )

.

An inspection of the proof shows that the assumption that kXkT < ∞ almost surely can be relaxed to P(kXkT < ∞) > 0. Hence we have proved the 0-1 law: P(kXkT < ∞) = 0 or 1 for every separable Gaussian process X(t), t ∈ T . Theorem 19 shows that kXkT has finite moments of all orders. More precisely, we have the following corollary on the integrability of the supremum of a Gaussian process. Corollary 8 (Landau and Shepp (1970); Marcus and Shepp (1971)). Let X(t), t ∈ T be a separable, centered Gaussian process indexed by a non-empty set T , such that kXkT < ∞ almost surely. Let σ 2 := supt∈T E[X 2 (t)] < ∞. Then we have E[exp(αkXk2T )] < ∞ ⇔ 2ασ 2 < 1. Proof. The “⇐” part follows from Theorem 19. We wish to show the converse direction. If E[exp(αkXk2T )] < ∞, then E[exp(αX 2 (t))] < ∞ for every t ∈ T , and a direct calculation shows that 2αE[X 2 (t)] < 1. Taking the supremum, we have 2ασ ≤ 1. We have to show that 2ασ 2 6= 1. Suppose on the contrary that 2ασ 2 = 1. Choose a sequence tn in such a way that σn2 := E[X 2 (tn )] ↑ σ 2 . Then we have E[exp(αX 2 (tn ))] = 2/(1 − 2ασn2 ) ↑ ∞, a contradiction. In addition, all the Lp norms of kXkT are equivalent. Corollary 9. Let X(t), t ∈ T be a separable, centered Gaussian process indexed by a non-empty set T , such that kXkT < ∞ almost surely. Then for every 1 < p < ∞, we have E[kXkT ] ≤ (E[kXkpT ])1/p ≤ Cp E[kXkT ], where Cp > 0 is a constant that depends only on p. Proof. The absolute moment of the standard normal distribution is (2/π)1/2 . Hence E[|X(t)|] = (E[X 2 (t)])1/2 (2/π)1/2 , and so (E[X 2 (t)])1/2 = (π/2)1/2 E[|X(t)|] ≤ (π/2)1/2 E[kXkT ]. This shows that σ ≤ (π/2)1/2 E[kXkT ]. By Theorem 19, Z ∞ p E[|kXkT − E[kXkT ]| ] = prp−1 P{|kXkT − E[kXkT ]| > r}dr 0 Z ∞ 2 2 ≤ 2p rp−1 e−r /(2σ ) dr 0 Z ∞ 2 p = 2pσ rp−1 e−r /2 dr. 0

Taking these together, we obtain the desired conclusion.

63

5.2

Second proof of Borell-Sudakov-Tsirel’son theorem

The second proof of Theorem 18 uses Gross’s (1975) log-Sobolev inequality for Gaussian measures. Let (U, U, µ) be a probability space. For a non-negative measurable function f : U → R+ , define the entropy functional Z  Z Z Entµ (f ) := (f log f )dµ − f dµ · log f dµ . Note that the entropy functional is well-defined (with the convention that 0 log 0 = limx↓0 x log x = 0) whenever f ∈ L1 (µ) since x log x ≥ −1/e for x ≥ 0. In addition, the entropy functional is non-negative by Jensen’s inequality, as the map x 7→ x log x is convex on R+ . The log-Sobolev inequality for Gaussian measures is stated as follows. Let γ = γ1 = N (0, 1), and let γm = γ ⊗ · · · ⊗ γ = N (0, Im ). Let W21 (γm ) denote the set of all continuously differentiable functions f : Rm → R such that Z (|f |2 + |∇f |2 )dγm < ∞. Theorem 20 (Gross (1975)). For every f ∈ W21 (γm ), Z 2 Entγm (f ) ≤ 2 |∇f |2 dγm .

(29)

We will prove this theorem in the next subsection. Second proof of Theorem 18. We derive from the log-Sobolev inequality a suitable bound on the Laplace transform of F (Z) − E[F (Z)]. This derivation is called the Herbst argument (Ledoux, 2001, Chapter 5). As before, we may assume that F is smooth. In addition, we may assume that E[F (Z)] = 0 (replace F by F − E[F (Z)]). Let G(λ) = E[eλF (Z)−λ Define the function f : Rm → R by f 2 = eλF −λ (29) to f . On the one hand, we have Entγm (f 2 ) = E[eλF (Z)−λ

2 /2

2 /2

2 /2

].

, and apply the log-Sobolev inequality

(λF (Z) − λ2 /2)] − G(λ) log G(λ).

On the other hand, differentiating f 2 , we have ∇f = (λ/2)f ∇F and |∇f |2 ≤ (λ2 /4)eλF −λ

2 /2

2 /2

|∇F |2 ≤ (λ2 /4)eλF −λ

.

Therefore, we have E[eλF (Z)−λ

2 /2

(λF (Z) − λ2 /2)] − G(λ) log G(λ) ≤ 64

λ2 G(λ), 2

by which we conclude that E[eλF (Z)−λ

2 /2

(λF (Z) − λ2 )] ≤ G(λ) log G(λ),

that is, λG0 (λ) ≤ G(λ) log G(λ). Equivalently, letting H(λ) := λ−1 log G(λ) for λ > 0, we have H 0 (λ) =

1 λ2 G(λ)

{λG0 (λ) − G(λ) log G(λ)} ≤ 0, ∀λ > 0.

This implies that H(λ) ≤ lim H(λ) = λ↓0

G0 (0) = E[F (Z)] = 0. G(0)

In particular, G(λ) ≤ 1, that is, E[eλF (Z) ] ≤ eλ

2 /2

.

The rest of the proof is the same as in the first proof.

5.3

Proof of Gross’s log-Sobolev inequality

Gross’s log-Sobolev inequality has also several proofs. The original proof by Gross (1975) is lengthly and complicated. A shorter proof can be found in Ledoux (2001, Theorem 5.1), which relies on stochastic analysis and is still not elementary. We shall follow here an approach described in Massart (2007, p.62-p.64), which is based on the tensorization inequality of the entropy functional and a clever use of the central limit theorem. We first study the properties of the entropy functional. Lemma 29 (Duality and variational formulas for entropy functional). Let (U, U, µ) be a probability space. Let f : U → R+ be a non-negative measurable function such that f log f ∈ L1 (µ). Then we have (i) Z  Z g Entµ (f ) = sup f gdµ : e dµ ≤ 1 . (ii) Furthermore, Z Entµ (f ) = inf

u>0

[f (log f − log u) − (f − u)]dµ.

Proof. We use the notation Eµ [·] for the expectation under µ. If Eµ [f ] = 0, then f = 0 µa.e. and there is nothing to prove. Hence we assume Eµ [f ] > 0. (i). Note that the entropy functional is homogeneous in the sense that Entµ (cf ) = c Entµ (f ) for c > 0. Hence we may assume that Eµ [f ] = 1, in which case Entµ (f ) = Eµ [f log f ]. By Young’s inequality: uv ≤ u log u − u + ev , 65

we have, whenever Eµ [eg ] ≤ 1, Eµ [f g] ≤ Entµ (f ) − 1 + Eµ [eg ] ≤ Entµ (f ). On the other hand, choosing g = log f leads to Eµ [f g] = Entµ (f ). This completes the proof for the first assertion. (ii). To prove the second assertion, define Ψ(u) = Eµ [f (log f − log u) − (f − u)]. Note that  Ψ(u) = Entµ (f ) + Eµ [f ]

u − 1 − log Eµ [f ]



u Eµ [f ]

 .

Here the map x 7→ x − 1 − log x is non-negative and equal to 0 when x = 1. Hence Ψ(u) ≥ Entµ (f ) and the equality takes place when u = Eµ [f ]. This completes the proof for the second assertion. For a function f defined on the product space U1 ×· · ·×Um , let fi denote the function on Ui defined by “freezing” all variables except for the i-th coordinate, that is, fi (·) = f (x1 , . . . , xi−1 , ·, xi+1 , . . . , xm ). Lemma 30 (Tensorization inequality). Let (Ui , Ui , µi ), 1 ≤ i ≤ m be probability spaces, and let µ = µ1 ⊗ · · · ⊗ µm be the product probability measure on the product space U = U1 × · · · × Um equipped with the product σ-field. Then for every non-negative measurable function f : U → R+ such that f log f ∈ L1 (µ), we have Entµ (f ) ≤

m Z X

Entµi (fi )dµ.

i=1

Proof. The proof is by induction. The m = 1 case R is trivial. Suppose that the lemma holds for m − 1 and we will prove it for m.R If f dµ = 0, then f = 0 µ-.a.e. and there is nothing to prove. Hence we assume f dµ > 0. Let Φ(x) = x log x and µ−m = µ1 ⊗ · · · ⊗ µm−1 . Then by Fubini’s theorem,  Z Z Z Φ(f )dµ = Φ(f (·, xm ))dµ−m dµm (xm ) )  m−1 Z ( Z XZ ≤ Φ f (·, xm )dµ−m + Entµi (fi (·, xm ))dµ−m dµm (xm ). (30) i=1

Define

Z g(xm ) = log

Z f (·, xm )dµ−m − log

66

f dµ.

R Then eg dµm = 1, so that by the duality formula for the entropy functional (Lemma 29 (i)), we have  Z Z Z Entµm (fm )dµ ≥ fm gdµm dµ−m  Z  Z Z = Φ f (·, xm )dµ−m dµm (xm ) − Φ f dµ . (31) Combining (30) and (31) leads to the desired result. Proof of Theorem 18. By the tensorization inequality (Lemma 30), it is enough to prove inequality (29) for m = 1, that is, we shall prove that for every f ∈ W21 (γ), Z Entγ (f ) ≤ 2 |f 0 |2 dγ. Recall that γ = N (0, 1). We first prove the following lemma. Lemma 31. Let ν denote the distribution of a Rademacher random variable, that is, ν({−1}) = ν({1}) = 1/2. Then for every function f on {−1, 1}, we have 1 Entν (f 2 ) ≤ (f (1) − f (−1))2 . 2 Proof of Lemma 31. Let ε be a Rademacher random variable. Since the inequality is trivial when f is constant (in which case Entν (f ) = 0), and ||f (1)| − |f (−1)|| ≤ |f (1) − f (−1)|, without loss of generality, we may assume that f is non-negative and either f (−1) > f (1) or f (1) > f (−1). Assume here that f (−1) > f (1). Then, by homogeneity of the entropy functional, it is enough to prove 2



2

2 Entν (f /f (−1)) ≤

2 f (1) −1 . f (−1)

The left hand side is calculated as 2E[(f 2 (ε)/f 2 (−1)) log(f 2 (ε)/f 2 (−1))] − 2E[(f 2 (ε)/f 2 (−1))] log(E[(f 2 (ε)/f 2 (−1))])  2   2    2  f 2 (1) f (1) f (1) 1 f (1) = 2 log − + 1 log · +1 . f (−1) f 2 (−1) f 2 (−1) 2 f 2 (−1) Putting x = f 2 (1)/f 2 (−1) and h(x) = x log x, we shall prove √ h(x) − (x + 1)h((x + 1)/2) ≤ ( x − 1)2 , 0 < x ≤ 1,

67

from which the desired claim follows. Let √ ψ(x) = ( x − 1)2 − h(x) + (x + 1)h((x + 1)/2). Then we have ψ(1) = 0, and ψ 0 (x) = −x−1/2 − log x + h((x + 1)/2) + 2−1 (x + 1)(1 + log((x + 1)/2)), ψ 00 (x) = x−1 (x−1/2 /2 − 1) + 1/2. For 0 < x ≤ 1, ψ 00 (x) ≥ 0, and hence ψ 0 (x) ≤ ψ 0 (1) = 0. This implies that ψ(x) ≥ ψ(1) = 0 for 0 < x ≤ 1, completing the proof. Going back to the proof of Theorem 18, we may indeed assume that f is twice continuously differentiable and of compact support (otherwise use approximations). P Let ε1 , . . . , εn be independent Rademacher random variables, and let Sn = n−1/2 ni=1 εi . w The central limit theorem yields that Sn → Z ∼ N (0, 1). Hence we have E[f 2 (Sn ) log f 2 (Sn )] − E[f 2 (Sn )] log(E[f 2 (Sn )]) → E[f 2 (Z) log f 2 (Z)] − E[f 2 (Z)] log(E[f 2 (Z)]) = Entγ (f 2 ). On the other hand, by Lemma 31 together with the tensorization inequality (Lemma 30), the far left hand side is bounded by "    2 # n 1X 1 − εi 1 + εi , E f Sn + √ − f Sn − √ 2 n n i=1

which is, by Taylor’s theorem, expanded as 2E[|f 0 (Sn )|2 ] + O(n−1/2 ). Taking these together, we conclude that Z 2 0 2 0 2 Entγ (f ) ≤ 2 lim E[|f (Sn )| ] = 2E[|f (Z)| ] = 2 |f 0 |2 dγ. n→∞

This completes the proof.

5.4

Size of expected suprema

So far in this section, we have not discussed the size of the expectation E[kXkT ]. Example 3 in Section 2 gives an upper bound, namely Dudley’s entropy integral bound, on the expectation. There is also a reverse inequality due to Sudakov, which we wish to prove here. Theorem 21 (Dudley (1967); Sudakov (1971)). Let X(t), t ∈ T be a separable, centered Gaussian process indexed by a non-empty set T . Let σ 2 := supt∈T E[X 2 (t)]. Then Z σp p 1 + log N (T, ρ2 , ε)dε, c sup ε log N (T, ρ2 , ε) ≤ E[kXkT ] ≤ C ε>0

0

where c, C > 0 are universal constants. 68

As noted, we have already shown the upper bound in Section 2, Example 3. To prove the lower bound, we shall prove the following lemma. Lemma 32 (Slepian (1962)). Let Y = (Y1 , . . . , YN )> and Z = (Z1 , . . . , ZN )> be two centered normal random vectors in RN . If E[(Zi − Zj )2 ] ≤ E[(Yi − Yj )2 ] for all i 6= j, then E[max1≤i≤N Zi ] ≤ E[max1≤i≤N Yi ]. Proof of Lemma 32. We shall prove a less sharp inequality:     E max Zi ≤ 2E max Yi . 1≤i≤N

1≤i≤N

For a proof for the precise inequality, see Dudley (1999), Theorem 2.3.7. The current proof is divided into two steps. Step 1. Suppose that E[Yi2 ] = E[Zi2 ] for all 1 ≤ i ≤ N . Under this additional assumption, we shall prove that for every x1 , . . . , xN ∈ R, P{Y1 ≤ x1 , . . . , YN ≤ xN } ≤ P{Z1 ≤ x1 , . . . , ZN ≤ xN }. Without loss of generality, we may assume that two random vectors Y = (Y1 , . . . , YN )> and Z = (Z1 , . . . , ZN )> are independent. We first assume that the covariance matrices of Y and Z are non-singular. For 0 ≤ λ ≤ 1, define the random vector Y (λ) = (Y1 (λ), . . . , YN (λ))> by Yi (λ) = (1 − λ)1/2 Yi + λ1/2 Zi . Fix any x1 , . . . , xN ∈ R. Let p(λ) = P{Y1 (λ) ≤ x1 , . . . , YN (λ) ≤ xN }. We shall prove that the map λ 7→ p(λ) is non-decreasing. Let gλ denote the density function of Y (λ), which exists since the covariance matrix of Y (λ) is non-singular. Let σλ2 (s) = E[(s> Y (λ))2 ] for s ∈ RN . Since s> Y (λ) ∼ N (0, σλ2 (s)), the characteristic function of Y (λ) is given by h √ > i E e −1s Y (λ) = exp{−σλ2 (s)/2}. Hence the Fourier inversion formula gives Z √ 1 gλ (y) = exp{− −1s> y − σλ2 (s)/2}ds. N (2π) Note that σλ2 (s) = E[((1 − λ)1/2 s> Y + λ1/2 s> Z)2 ] = σ02 (s) + 2λ

X

si sj (E[Zi Zj ] − E[Yi Yj ]),

i
where we have used the fact that E[Yi2 ] = E[Zi2 ] for all 1 ≤ i ≤ N . Setting γij = E[Zi Zj ] − E[Yi Yj ], we have Z n √ o P 1 > 2 exp − −1s y − σ (s)/2 − λ s s γ ds, gλ (y) = i j ij 0 i
and hence ∂ gλ (y) ∂λ Z X o n √ P −1 > 2 = si sj γij exp − −1s y − σ0 (s)/2 − λ k<` sk s` γk` ds (2π)N =

X i
γij

i
∂yi ∂yj

gλ (y).

Note that γij ≥ 0 since E[(Zi − Zj )2 ] ≤ E[(Yi − Yj )2 ] for all i 6= j and E[Yi2 ] = E[Zi2 ] for all 1 ≤ i ≤ N . Then Z xN Z x1 ∂ 0 ··· p (λ) = gλ (y)dy −∞ ∂λ −∞ Z xN Z x1 X ∂2 gλ (y)dy. ··· = γij −∞ ∂yi ∂yj −∞ i
The far right hand side is non-negative, since, for example for i = 1 and j = 2, Z x1 Z xN Z x3 Z xN ∂2 ··· gλ (y)dy = ··· gλ (x1 , x2 , y3 , . . . , yN )dy3 · · · dyN ≥ 0. −∞ −∞ ∂y1 ∂y2 −∞ −∞ This shows that p(λ) is non-decreasing, and hence P{Y1 ≤ x1 , . . . , YN ≤ xN } = p(0) ≤ p(1) = P{Z1 ≤ x1 , . . . , ZN ≤ xN }. Consider now the general case where the covariance matrices of Y and Z may be singular. It is enough to consider the case where E[Yi2 ] = E[Zi2 ] > 0 for all 1 ≤ i ≤ N . Let η1 , . . . , ηN , ζ1 , . . . , ζN be independent standard normal random variables, independent of Y and Z, and let Yiε = Yi + εηi , Ziε = Zi + εζi for ε > 0. Then E[(Ziε − Zjε )2 ] ≤ E[(Yiε − Yjε )2 ] for all i 6= j, E[(Yiε )2 ] = E[(Ziε )2 ] for all ε )> 1 ≤ i ≤ N , and the covariance matrices of Y ε = (Y1ε , . . . , YNε )> and Z ε = (Z1ε , . . . , ZN are non-singular, so that ε P{Y1ε ≤ x1 , . . . , YNε ≤ xN } ≤ P{Z1ε ≤ x1 , . . . , ZN ≤ xN }. w

w

Now, as ε ↓ 0, Y ε → Y and Z ε → Z, and for A = (−∞, x1 ] × · · · × (−∞, xN ], we have P(Y ∈ ∂A) = P(Z ∈ ∂A) = 0 because E[Yi2 ] and E[Zi2 ] are positive for all 1 ≤ i ≤ N . Therefore, by the Portmanteau theorem, P(Y ∈ A) = lim P(Y ε ∈ A) ≤ lim P(Z ε ∈ A) = P(Z ∈ A). ε↓0

ε↓0

Step 2. Without loss of generality, we may assume that Y1 = Z1 = 0 (since otherwise we can replace Yi by Yi − Y1 and Zi by Zi − Z1 ), in which case E[Zi2 ] ≤ E[Yi2 ] for all 70

1 ≤ i ≤ N . Let ζ be another standard normal random variable independent of Y and Z. Define C 2 = max1≤i≤N E[Yi2 ] and let Y˜i = Yi + ζ(C 2 − E[Yi2 ] + E[Zi2 ])1/2 ,

Z˜i = Zi + ζC.

It is clear that E[(Z˜i − Z˜j )2 ] = E[(Zi − Zj )2 ] ≤ E[(Yi − Yj )2 ] ≤ E[(Y˜i − Y˜j )2 ] and E[Z˜i2 ] = E[Y˜i2 ]. Hence by Step 1, we have     ˜ ˜ P max Zi > x ≤ P max Yi > x , 1≤i≤N

1≤i≤N

that is, max1≤i≤N Y˜i stochastically dominates max1≤i≤N Z˜i , so that     E max Z˜i ≤ E max Y˜i . 1≤i≤N

1≤i≤N

Note that max1≤i≤N Z˜i = max1≤i≤N Zi +ζC and thus E[max1≤i≤N Z˜i ] p = E[max1≤i≤N Zi ]. On the other hand, max1≤i≤N Y˜i ≤ max1≤i≤N Yi + ζ+ C and E[ζ+ ] = 1/(2π). Observe that E[|Yi |] = (2/π)1/2 (E[Yi2 ])1/2 , and     √ 1/2 1/2 C ≤ (π/2) max E[|Yi |] ≤ (π/2) E max |Yi | ≤ 2πE max Yi . 1≤i≤N

1≤i≤N

The last inequality is because |Yi | = Yi ∨ (−Yi ) and     E max |Yi | = E max Yi ∨ max (−Yi ) 1≤i≤N 1≤i≤N 1≤i≤N     ≤ E max Yi + E max (−Yi ) 1≤i≤N 1≤i≤N   d = 2E max Yi . (Y = −Y )

1≤i≤N



 max ±Yi ≥ ±Y1 = 0

1≤i≤N

1≤i≤N

Therefore, we conclude that  E

   ˜ max Yi ≤ 2E max Yi .

1≤i≤N

1≤i≤N

This completes the proof. Lemma 33. There exists a universal constant c > 0 such that for every independent standard normal random variables Z1 , . . . , ZN , p E[ max Zi ] ≥ c log N . 1≤i≤N

Proof of Lemma 33. Let Φ(·) and φ(·) denote the distribution and density functions of ¯ N (0, 1), respectively, and let Φ(x) = 1 − Φ(x). For N = 1, E[Z1 ] = 0, and for N = 2, r Z Z 1 2 E[max{Z1 , Z2 }] = 2 zφ(z)Φ(z)dz = 2 φ (z)dz = > 0. π 71

y −2

Let Y1 , . . . , YN be i.i.d. random variables taking values in [1, ∞) with density f (y) = and distribution function F (y) = 1−y −1 . Then the density function of min{Y1 , . . . , YN }

is N f (y)(1 − F (y))N −1 = N y −N −1 , so that

Z E[min{Y1 , . . . , YN }] =



N y −N dy =

1

N . N −1

d

Note that Yi = 1/Φ(Zi ). Hence N = E[min{Y1 , . . . , YN }] = E[min{Φ(Zi )−1 , . . . , Φ(ZN )−1 }] N −1 = E[Φ(max{Z1 , . . . , ZN })−1 ]. Observe that  d d2 Φ(x)−1 = −φ(x)Φ(x)−2 = xφ(x)Φ(x)−2 + 2φ2 (x)Φ(x)−3 , 2 dx dx and the last expression is positive for all x ∈ R. Hence Φ(x)−1 is convex and Jensen’s inequality yields that N −1 Φ(E[max{Z1 , . . . , ZN }]) ≥ , N that is, ¯ −1 (N −1 ). E[max{Z1 , . . . , ZN }] ≥ Φ−1 (1 − N −1 ) = Φ The right hand side is positive for N ≥ 3. Furthermore, ¯ −1 (x) Φ y lim √ = lim p , ¯ x→0 −2 log x y→∞ −2 log Φ(y) and ¯ y2 y Φ(y) = lim = 1. ¯ y→∞ −2 log Φ(y) y→∞ φ(y) lim

Hence there exists a constant c > 0 such that p ¯ −1 (N −1 ) ≥ c 2 log N , ∀N ≥ 3. Φ This completes the proof. We are now in position to prove Theorem 21. Proof of Theorem 21. We only have to prove the lower inequality. Pick any ε > 0. Let {t1 , . . . , tN } be a maximal subset of T such that (E[(X(ti ) − X(tj ))2 ])1/2 = ρ2 (ti , tj ) > ε for all i 6= j. By maximality, N (T, ρ2 , ε) ≤ N . Let Z1 , . . . , ZN be independent standard

72

√ normal random variables, and let Vi = εZi / 2. Then E[(Vi − Vj )2 ] = ε2 < E[(X(ti ) − X(tj ))2 ] for all i 6= j. Hence by Lemmas 32 and 33, we have     E sup |X(t)| ≥ E max X(ti ) 1≤i≤N t∈T   ≥ E max Vi 1≤i≤N   √ = εE max Zi / 2 1≤i≤N p ≥ cε log N . This completes the proof. The proof of Theorem 21 implies that for any centered, possibly non-separable Gaussian process X(t), t ∈ T index by a non-empty set T ,   p c sup ε N (T, ρ2 , ε) ≤ sup E max |X(t)| , t∈F

F ⊂T,F :finite

ε>0

and so if the right hand side is finite, the left hand side is finite. In particular, we obtain the following corollary. Corollary 10 (Sudakov-Chevet). Let X(t), t ∈ T be a centered Gaussian process index by a non-empty set T , and suppose that supF E[maxt∈F |X(t)|] < ∞ where the supremum is taken over all finite subsets F of T . Then (T, ρ2 ) is totally bounded. It is natural to ask whether Dudley’s entropy integral bound is sharp. The following is an example in which Dudley’s entropy integral bound is not sharp. Example 7. Let X1 , X2 , . . . be a sequence of independent, centered normal random variables such that σn2 = E[Xn2 ] = (1 + log n)−1 . Then E[supn≥1 |Xn |] < ∞ but the entropy integral becomes infinite. Proof. For n ≥ 1 and r ≥ 2, P(|Xn | ≥ r) ≤ e−r

2 /(2σ 2 ) n

1 2 (1+log n)

≤ e− 2 r = n−r

2 /2

e−r

2 /2

,

by which we have 



P sup |Xn | > r

≤e

n≥1

−r2 /2

∞ X

n=1 −r2 /2

≤ Ce 73

.

n−r

2 /2

This implies that E[supn≥1 |Xn |] < ∞. On the other hand, pick any ε > 0. For n < nε = exp(−1 + 1/(2ε2 )), we have 2 σn > 2ε2 . Since Xn are independent, p 2 > 2ε, ∀n, m < n , ρ2 (n, m) = σn2 + σm ε which means that n and m can not belong to the same ε-ball as long as n, m < nε . Hence N (N, ρ2 , ε) ≥ nε − 1, so that p inf ε log N (N, ρ2 , ε) > 0, ε>0

and especially Z

1p

log N (N, ρ2 , ε)dε = ∞.

0

This example shows that Dudley’s entropy integral bound may not be sharp in some cases. A shaper bound can be obtained by using the “generic chaining” method. See Talagrand (2005).

5.5

Absolute continuity of Gaussian suprema

Let X(t), t ∈ T be a separable, centered Gaussian process such that 0 < kXkT < ∞ a.s.

(32)

We are interested in the distribution of kXkT . Let F be the distribution function of kXkT , namely, F (r) = P{kXkT ≤ r}. Define r0 = inf{r ≥ 0 : F (r) > 0}, that is, r0 is the left endpoint of the support of F . The following is taken from Theorem 11.1 of Davydov et al. (1998). Theorem 22. The distribution function F is absolutely continuous on (r0 , ∞), and the derivative F 0 , which exists on (r0 , ∞) except on an at most countable subset ∆ ⊂ (r0 , ∞), is positive on (r0 , ∞) \ ∆. The function F may have a jump at r0 . Denote by q its jump size, i.e., q = F (r0 ). Note that q < 1 (because F is strictly increasing on (r0 , ∞) by Theorem 22). The following corollary, which is important in statistical applications, is a direct consequence of Theorem 22. 74

Corollary 11. The quantile function F −1 is continuous and strictly increasing on (q, 1). The q may take any value in [0, 1) even under the assumption (32). See Example 3.2 in Hoffmann-Jørgensen et al. (1979). Nonetheless, we have the following lemma. Lemma 34. Let X(t), t ∈ T be a tight Gaussian random element in `∞ (T ) with mean zero. Then r0 = 0. Hence q = 0 unless kXkT = 0 almost surely. Proof. The proof is due to Ledoux and Talagrand (1991), p.57-p.58. To ease the notation, write kXk = kXkT . Step 1. We first show that for every ε > 0 and x ∈ `∞ (T ), √ (P{kX − xk ≤ ε})2 ≤ P{kXk ≤ 2ε}. To see this, if Y is an independent copy of X, d

(P{kX − xk ≤ ε})2 = P{kX − xk ≤ ε}P{kY + xk ≤ ε} (−Y = X) ≤ P{kX − x + Y + xk ≤ 2ε} √ d √ ≤ P{kXk ≤ 2ε}. (X + Y = 2X) Step 2. Suppose on the contrary there exists an ε0 > 0 such that P{kXk ≤ ε0 } = 0. By Theorem 10, (T, ρ2 ) is totally bounded and almost all sample paths of X are uniformly ρ2 -continuous, that is, P{X ∈ Cu (T, ρ2 )} = 1. Since Cu (T, ρ2 ) is separable (see Lemma 12), choose a countable dense subset {xn } of Cu (T, ρ2 ), so that √ P{∃n, kX − xn k ≤ ε0 / 2} = 1. But then 1≤

X

X √ P{kX − xn k ≤ ε0 / 2} ≤ (P{kXk ≤ ε0 })1/2 = 0,

n

n

a contradiction. We now turn to the proof of Theorem 22. The proof relies on the log-concavity of Gaussian measures. For A, B ⊂ Rn and λ ∈ R, we write λA = {λx : x ∈ A}, A + B = {x + y : x ∈ A, y ∈ B}. Generally, a Borel probability measure µ on Rn is said to be log-concave if for all Borel subsets A, B ⊂ Rn and λ ∈ [0, 1], µ(λA + (1 − λ)B) ≥ µ(A)λ µ(B)1−λ . [There is a subtle measurability problem in this definition, namely, λA + (1 − λ)B need not be Borel measurable (Erd¨ os and Stone, 1970), but since λA + (1 − λ)B is the image of the Borel set A × B in R2n by the continuous map R2n 3 (x, y) 7→ λx + (1 − λ)y, it is an analytic set and thus universally measurable (see Dudley, 2002, Section 13.2). Hence µ(λA + (1 − λ)B) makes sense.] The following theorem shows that the canonical Gaussian measure on Rn is log-concave. 75

Theorem 23. Let γn be the canonical Gaussian measure on Rn . Then for all Borel subsets A, B ⊂ Rn and λ ∈ [0, 1], γn (λA + (1 − λ)B) ≥ γn (A)λ γn (B)1−λ . This theorem is a direct consequence of the following Pr´ekopa-Leindler inequality. Theorem 24 (Pr´ekopa-Leindler). Let f, g and h be non-negative, integrable functions on Rn , and let λ ∈ [0, 1]. If for all x, y ∈ Rn , h(λx + (1 − λ)y) ≥ f λ (x)g 1−λ (y), then we have

λ Z

Z

Z h(x)dx ≥

f (x)dx

Rn

1−λ g(x)dx

Rn

.

Rn

Proof of Theorem 23. In Theorem 24, take f = φn 1A , g = φn 1B and h = φn 1λA+(1−λ)B 2 where φn (x) = (2π)−n/2 e−|x| /2 , x ∈ Rn . Since log φn is concave, these f, g, h verify the hypothesis of Theorem 24, so that the desired conclusion follows. Proof of Theorem 24. We only need to consider the case where 0 < λ < 1. The proof is by induction. Suppose that n = 1. By the hypothesis of the theorem, we have for t > 0, Ct := {x : h(x) > t} ⊃ {x : f (λ−1 x) > t} + {x : g((1 − λ)−1 x) > t} =: At + Bt , so that, Z



Z Z h(x)dx =

1(t < h(x))dtdx Z ∞ µ(Ct )dt ≥ µ(At + Bt )dt,

ZR∞ 0

R

= 0

(33)

0

where µ denotes the Lebesgue measure on R. We shall prove the following lemma. Lemma 35. For all Borel subsets A, B ⊂ R, µ(A + B) ≥ µ(A) + µ(B). Proof of Lemma 35. We first prove the lemma when A and B are compact. We may assume that A and B are non-empty. Since A + B ⊃ (sup A + B) ∪ (A + inf B) and (sup A + B) ∩ (A + inf B) = {sup A + inf B}, we have µ(A + B) ≥ µ(sup A + B) + µ(A + inf B) = µ(A) + µ(B). We now prove the lemma for general Borel subsets A, B of R. By regularity of the Lebesgue measure, there exist sequences Am and Bm of compact subsets of R with Am ⊂ A and Bm ⊂ B such that µ(Am ) ↑ µ(A) and µ(Bm ) ↑ µ(B). Then µ(A + B) ≥ µ(Am + Bm ) ≥ µ(Am ) + µ(Bm ). Taking the limit, we obtain the desired conclusion. 76

We go back to the proof of Theorem 24. By Lemma 35, we have Z ∞ Z ∞ µ(Bt )dt µ(At )dt + (33) ≥ 0 0Z Z = λ f (x)dx + (1 − λ) g(x)dx R

R

λ Z

Z ≥

f (x)dx

1−λ g(x)dx

,

where the last inequality follows from the weighted arithmetic geometric mean inequality λa + (1 − λ)b ≥ aλ b1−λ . Suppose that the lemma holds up to some n and we would like to show that it holds for n + 1. By assumption, for x, y ∈ Rn , u, v ∈ R, h(λx + (1 − λ)y, λu + (1 − λ)v) ≥ f λ (x, u)g 1−λ (y, v). For a while fix u and v, and let us define h1 (x) = h(x, λu + (1 − λ)v), f1 (x) = f (x, u), g1 (x) = g(x, v). Then by the induction hypothesis, λ Z

Z

Z h1 (x)dx ≥

f1 (x)dx

Rn

1−λ g1 (x)dx

Rn

.

Rn

Define Z h2 (u) =

Z

Z

h(x, u)dx, f2 (u) = Rn

f (x, u)dx, g2 (u) = Rn

g(x, u)dx. Rn

Then the inequality (34) implies that h2 (λu + (1 − λ)v) ≥ f2λ (u)g21−λ (v), so that by the induction hypothesis, Z

Z

Z h2 (u)du ≥

h(x, u)dxdu = Rn+1

R

λ Z λ f2 (u)du g2 (u)du

R

Z =

R

λ Z f (x, u)dxdu

Rn+1

1−λ g(x, u)dxdu .

Rn+1

This completes the proof. The log-concavity of Gaussian measures implies the following lemma. Lemma 36. The function log F is concave on (r0 , ∞).

77

(34)

Proof. By separability of X, there exist t1 , t2 , · · · ∈ T such that lim max |Xti | = sup |Xt | a.s.

n→∞ 1≤i≤n

(35)

t∈T

Denote by Fn the distribution function of max1≤i≤n |Xti |. For a while fix n, and denote by Γ the covariance matrix of (Xt1 , . . . , Xtn )> . Then d

(Xt1 , . . . , Xtn )> = Γ1/2 Z, Z ∼ N (0, In ). Let r1 , r2 ∈ R and λ ∈ [0, 1], and set A = {x ∈ Rn : max1≤i≤n |(Γ1/2 x)i | ≤ r1 } and B = {x ∈ Rn : max1≤i≤n |(Γ1/2 x)i | ≤ r2 }. Since λA + (1 − λ)B ⊂ {x ∈ Rn : max |(Γ1/2 x)i | ≤ λr1 + (1 − λ)r2 }, 1≤i≤n

we conclude by Theorem 23 that Fn (λr1 + (1 − λ)r2 ) ≥ Fn (r1 )λ Fn (r2 )1−λ .

(36)

By (35), Fn (r) → F (r) as n → ∞ for every continuity point of F . Denote by D the set of jump points of F . The set D is countable. Choose and fix r1 , r2 ∈ R \ D with r1 6= r2 , and let ΛD = {λ ∈ [0, 1] : λr1 + (1 − λ)r2 ∈ D}. The set ΛD is also countable. Taking n → ∞ in (36), we have for all λ ∈ [0, 1] \ ΛD , F (λr1 + (1 − λ)r2 ) ≥ F (r1 )λ F (r2 )1−λ .

(37)

We shall verify that the above inequality holds for all λ ∈ [0, 1]. Indeed, for λ ∈ ΛD , take a sequence λm in [0, 1] \ ΛD with λm → λ such that λm r1 + (1 − λm )r2 ↓ λr1 + (1 − λ)r2 . Then by the right continuity of F , we see that (37) also holds for such λ. In the similar manner, we see that (37) holds for all r1 , r2 ∈ R. Therefore, by taking logarithm of both sides of (37), we obtain the desired conclusion. We recall the following (well-known) fact on convex/concave functions. Lemma 37. Let f : (a, b) → R be a convex (or concave) function (a < b; a = −∞ and b = ∞ are allowed). Then f is locally absolutely continuous on (a, b), i.e., f is absolutely continuous on each compact subinterval of (a, b). Proof. We only need to consider convex f . Take any compact subinterval [a1 , b1 ] ⊂ (a, b), and moreover take a < a2 < a1 < b1 < b2 < b. Since f is convex, for all x, y ∈ [a1 , b1 ] with x 6= y, we have f (a2 ) − f (a1 ) f (y) − f (x) f (b2 ) − f (b1 ) ≤ ≤ , a2 − a1 y−x b2 − b1 so that   f (a2 ) − f (a1 ) f (b2 ) − f (b1 ) . |f (y) − f (x)| ≤ |y − x| · max , b2 − b1 a2 − a1 This implies that f is Lipschitz continuous on [a1 , b1 ]. Every Lipschitz continuous function on a compact interval is absolutely continuous on the same interval, so that the desired conclusion follows. 78

We are now in position to prove Theorem 22. Proof of Theorem 22. Let G = log F so that F = eG . By Lemma 36, G is concave on (r0 , ∞). By Lemma 37, G is locally absolutely continuous on (r0 , ∞), and so is F . Since F is a probability distribution function, F is absolutely continuous on (r0 , ∞). To prove the second assertion, we first verify that G is strictly increasing. Observe that for t0 ∈ T such that E[Xt20 ] > 0 (such t0 is assumed to exist), F (r) ≤ P(X(t0 ) ≤ r) < 1 for all r ∈ R. Suppose on the contrary that there exist points r2 > r1 (> r0 ) such that F (r1 ) = F (r2 ) =: p. Note that p < 1, and take r3 > r2 such that F (r3 ) > p. Write r2 as a convex combination of r1 and r3 : r2 = λr1 + (1 − λ)r3 for some λ ∈ (0, 1). Then G(r2 ) = log p < λG(r1 ) + (1 − λ)G(r3 ), which contradicts concavity of G. By concavity of G, it is routine to see that the map t 7→

G(r + t) − G(r) , (0, ∞) → R, t

is non-increasing, so that the right derivative G0+ (r) exists and is positive (the latter follows from the fact that G is strictly increasing), i.e., G0+ (r) = lim t↓0

G(r + t) − G(r) > 0. t

Then G0+ (r) is finite and map r 7→ G0+ (r) is non-increasing. Hence G0+ is continuous except on at most countably many points, which implies that, except on at most countably many points, G is differentiable and its derivative is positive. This completes the proof.

79

6

Rademacher processes

This section studies Rademacher processes. Let T ⊂ Rn be a non-empty bounded subset, and let ε1 , . . . , εn be independent Rademacher random variables. A stochastic process of the form n X X(t) = εi ti , t = (t1 , . . . , tn )> ∈ T i=1

is called a Rademacher process. Rademacher processes share many fine properties with Gaussian processes. As in the Gaussian case, many of these properties follow from an analogue of Theorem 18 to the Rademacher case. A function Rm → R is said to be separately convex if it is convex in each coordinate. Theorem 25. Let ε1 , . . . , εm be independent Rademacher random variables, and let ε = (ε1 , . . . , εm )> . Let F : Rm → R be a separately convex and Lipschitz continuous function with Lipschitz constant at most 1. Then for every r > 0 we have P{F (ε) ≥ E[F (ε)] + r} ≤ e−r

2 /4

.

In view of the second proof of Theorem 18, Theorem 25 follows from the next lemma, which is an easy consequence of Lemma 31. Lemma 38. Let νm denote the distribution of ε = (ε1 , . . . , εm )> with ε1 , . . . , εm independent Rademacher random variables. Then for every separately convex and continuously differentiable f : Rm → R, νm satisfies the log-Sobolev inequality of the form Z Entνm (f 2 ) ≤ 4 |∇f |2 dνm . Proof. By the tensorization inequality (Lemma 30), it is enough to prove the theorem for m = 1, ε = ε1 and ν = ν1 . By Lemma 31, we have 1 1 Entν (f 2 ) ≤ (f (1) − f (−1))2 = (f (ε) − f (−ε))2 . 2 2 If f is convex and smooth, f (ε) − f (−ε) ≥ 2f 0 (−ε)ε. Replacing ε by −ε, we also have f (−ε) − f (ε) ≥ −2f 0 (ε)ε. This shows that (f (ε) − f (−ε))2 ≤ 4(|f 0 (ε)|2 + |f 0 (−ε)|2 ), by which we have Entν (f 2 ) ≤ 2E[|f 0 (ε)|2 + |f 0 (−ε)|2 ] = 4E[|f 0 (ε)|2 ] Z = 4 |f 0 |2 dν. This completes the proof. 80

Given Theorem 25, we can obtain an analogue of Theorem 19 to Rademacher processes. Theorem 26. Let T ⊂ Rn be a non-empty bounded subset, and let ε1 , . . . , εn be indePn pendent Rademacher random variables. Let X(t) = i=1 εi ti , t = (t1 , . . . , tn )> ∈ T . Then for every r > 0 we have P{kXkT ≥ E[kXkT ] + r} ≤ e−r

2 /(4σ 2 )

,

where σ 2 := supt∈T |t|2 . Furthermore, all the Lp norms of kXkT are equivalent for 1 ≤ p < ∞, that is, E[kXkT ] ≤ (E[kXkpT ])1/p ≤ Cp E[kXkT ], (38) where Cp > 0 is a constant depending only on p. The inequality in (38) is (a version of) the Khinchin-Kahane inequality. P Proof of Theorem 26. Let F (ε) = supt∈T | ni=1 εi ti |, t = (t1 , . . . , tn )> . Then F is convex and Lipschitz continuous with Lipschitz constant bounded by supt∈T |t| = σ. Hence by Theorem 25, we obtain the first assertion. To prove the second assertion, we shall prove the following classical Khinchin inequality. √ P Lemma 39. |t| ≤ 3E[| ni=1 εi ti |]. P Proof of Lemma 39. The proof is due to Littlewood. Let Z = ni=1 εi ti . Then a simple computation gives E[Z 2 ] = |t|2 , E[Z 4 ] =

n X

t4i + 3

i=1

X

t2i t2j ≤ 3|t|4 = 3(E[Z 2 ])2 .

i6=j

For 0 < α < 1, H¨ older’s inequality yields E[Z 2 ] = E[|Z|α |Z|2−α ] ≤ (E[|Z|])α (E[|Z|(2−α)/(1−α) )1−α . Choosing α in such a way that

2−α = 4, 1−α

that is, α = 2/3, we have E[Z 2 ] ≤ (E[|Z|])2/3 (E[Z 4 )1/3 ≤ 31/3 (E[|Z|])2/3 (E[Z 2 ])2/3 , by which we conclude that |t|2 = E[Z 2 ] ≤ 3(E[|Z|])2 . Going back to the proof of Proposition 26, we have √ √ σ = sup |t| ≤ 3 sup E[|X(t)|] ≤ 3E[kXkT ]. t∈T

t∈T

The rest of the proof is exactly the same as the proof of Corollary 9. 81

Remark 8. The Khinchin-Kahane inequality does not lead to an inequality of the form (E[kPn − P kpF ])1/p ≤ Cp E[kPn − P kF ].

(39)

Compare this with the Hoffmann-Jørgensen inequality (see Theorem 4). By the symmetrization inequality, one has

p #

p # " n " n

X

X



p E (f (Xi ) − P f ) ≤2 E εi f (Xi ) .



i=1

i=1

F

F

Conditionally on X1 , . . . , Xn , the Khinchin-Kahane inequality yields

#!p

p # " n " n

X

X



p ≤ Cp Eε εi f (Xi ) . Eε εi f (Xi )



i=1

i=1

F

However, Jensen’s inequality yields that

#!p # " " n

X

Eε E εi f (Xi ) ≥

i=1

F

and so we can not conclude (39).

82

F

#!p " n

X

E εi f (Xi ) ,

i=1

F

7

Talagrand’s concentration inequality

Talagrand (1996) proved a remarkable concentration inequality for a general empirical process indexed by a uniformly bounded class of functions. Subsequently, Massart (2000), Bousquet (2002) and Krein and Rio (2005) (and others not cited here) refined Talagland’s original result. We state here Bousquet’s version. Theorem 27 (Talagrand (1996); in this form Bousquet (2002)). Let F be a pointwise measurable class of functions S → R. Suppose that there exists a constant B > 0 such that kf k∞ ≤ B for all f ∈ F. Suppose further that P f = 0 for all fP∈ F. Let σ 2 > 0 be any positive constant such that σ 2 ≥ supf ∈F P f 2 . Let Z := k ni=1 f (Xi )kF and V := nσ 2 + 2BE[Z]. Then for every x > 0,   √ Bx P Z ≥ E[Z] + 2V x + ≤ e−x . (40) 3 Consequently, for every x > 0,      √ 1 1 P Z ≥ inf (1 + α)E[Z] + σ 2nx + + Bx ≤ e−x . α>0 3 α

(41)

The second inequality (41) is a direct consequence of the first inequality (40). Indeed, using the simple inequalities √ √ √ a + b ≤ a + b, 2ab ≤ ca2 + c−1 b2 , ∀c > 0, we have √

p √ 2V x ≤ σ 2nx + 2 BxE[Z] √ ≤ σ 2nx + αE[Z] + α−1 Bx.

Inequality (40) is a deviation inequality rather than a concentration inequality. There is an analogous lower tail inequality, n o √ P Z ≤ E[Z] − 2V x − Bx ≤ e−x , ∀x > 0, (42) proved by Krein and Rio (2005). Combining (40) and (42) leads to a concentration inequality for Z around E[Z]. However, we focus here on the upper tail inequality. We shall prove Theorem 27, but with worse constants. For a proof for the better constants, see the original paper by Bousquet (2002). We shall follow here the proof described in Massart (2007), which is based on Massart (2000) and Boucheron et al. (2003). A major inspiration of their approach, called the “entropy method”, comes from Ledoux (1996).

83

7.1

Two technical theorems

We begin with recalling the setting: each Xi is the i-th coordinate of the product probability space (S n , S n , P n ). To obtain an inequality of type (40), it is enough to have a suitable bound on the Laplace transform of Z − E[Z] (see also the proofs of the Gaussian concentration inequality). We consider here a more general problem of bounding the Laplace transform of a generic statistic of X1 , . . . , Xn . Let X = (X1 , . . . , Xn ). We also use the notation X−i = (X1 , . . . , Xi−1 , Xi+1 , . . . , Xn ). Let X10 , . . . , Xn0 be independent copies of X1 , . . . , Xn . ζ(X1 , . . . , Xn ) of X1 , . . . , Xn , we write

For a generic statistic Z =

Z ∨i = ζ(X1 , . . . , Xi−1 , Xi0 , Xi+1 , . . . , Xn ). Define

# n X ∨i 2 =E (Z − Z )+ | X . "

V

+

i=1

Boucheron et al. (2003) proved a remarkable inequality that relates the Laplace transform of Z − E[Z] to that of V + . The usefulness of this inequality comes from the fact that V + is typically easier to control. Theorem 28 (Boucheron et al. (2003)). Let Z be a bounded statistic of X. Then for every θ > 0 and λ ∈ (0, 1/θ), we have log E[exp(λ(Z − E[Z])] ≤

λθ log E[exp(λV + /θ)]. 1 − λθ

Theorem 28 may be viewed as an exponential version of the Efron-Stein inequality that states n 1X Var(Z) ≤ E[(Z − Z ∨i )2 ] = E[V + ]. 2 i=1

The Efron-Stein inequality has been found useful to bound the variance of complicated statistics of independent random variables. See Efron and Stein (1981) and Steele (1986). One may find that Theorem 28 is useful only when a convenient bound on the Laplace transform of V + is available. For that purpose, the following theorem will be useful. Theorem 29 (Massart (2000); Boucheron et al. (2000)). Let Z be a non-negative bounded statistic of X, and suppose that there exist bounded statistics Zi of X−i for i = 1, . . . , n such that 0 ≤ Z − Zi ≤ 1, ∀i = 1, . . . , n,

n X i=1

84

(Z − Zi ) ≤ Z.

Then for every λ ∈ R, we have log E[exp(λZ)] ≤ (eλ − 1)E[Z]. Consequently, for every x > 0,   p 2 P Z ≥ E[Z] + 2E[Z]x + x ≤ e−x , 3

(43)

(44)

and n o p P Z ≤ E[Z] − 2E[Z]x ≤ e−x .

(45)

The last two inequalities (44) and (45) will not be used in the proof of Talagrand’s inequality. However, they are of interest in their own right. We shall prove these theorems in what follows. The following modified log-Sobolev inequalities will play a key role. Proposition 8 (Massart (2000)). Let Z be a bounded statistic of X, and for each i = 1, . . . , n, let Zi be a bounded statistic of (X1 , . . . , Xi−1 , Xi0 , Xi+1 , . . . , Xn ). Then for every λ ∈ R, we have λZ

λE[Ze

λ

λZi

] − E[e Z] log E[e

]≤

n X

E[eλZ ϕ(−λ(Z − Zi ))],

i=1

where ϕ(x) = ex − x − 1. Furthermore, if we take Zi = Z ∨i for i = 1, . . . , n, we have λE[ZeλZ ] − E[eλ Z] log E[eλZ ] ≤

n X

E[eλZ ψ(−λ(Z − Z ∨i )+ )],

i=1

where ψ(x) = x(ex − 1). Proof. Let G be a non-negative bounded statistic of X, and let Φ(u) = u log u. We use the notation E∨i [·] = E[· | X−i ]. First, the tensorization inequality (Lemma 30) yields " n # X E[Φ(G)] − Φ(E[G]) ≤ E (E∨i [Φ(G)] − Φ(E∨i [G])) . (46) i=1

Further, the variational formula for the entropy functional (Lemma 29 (ii)) yields that E∨i [Φ(G)] − Φ(E∨i [G])) = inf E∨i [G(log G − log u) − (G − u)]. u>0

Let G = eλZ and Gi = eλZi for i = 1, . . . , n. Since Gi is a statistic of (Xi0 , X−i ), we may take u = Gi for fixed Xi0 in the above inequality, and using Fubini’s theorem, we have E∨i [Φ(G)] − Φ(E∨i [G])) ≤ E∨i [G(log G − log Gi ) − (G − Gi )]. 85

Because G(log G − log Gi ) − (G − Gi ) = eλZ (λZ − λZi ) − (eλZ − eλZi ) = eλZ (λ(Z − Zi ) − 1 + eλ(Zi −Z) ) = eλZ ϕ(−λ(Z − Zi )), we conclude that E∨i [Φ(eλZ )] − Φ(E∨i [eλZ ])) ≤ E∨i [eλZ ϕ(−λ(Z − Zi ))]. Combining inequality (46), we obtain the first inequality. The second inequality is deduced from the first inequality. Observe that, as a = a+ − a− and ϕ(0) = 0, ϕ(−λ(Z − Z ∨i )) = ϕ(−λ(Z − Z ∨i )+ ) + ϕ(λ(Z − Z ∨i )− ). Since conditionally on X−i , Z and Z ∨i have the same distribution, we have ∨i

E∨i [eλZ ϕ(λ(Z − Z ∨i )− )] = E∨i [eλZ ϕ(λ(Z ∨i − Z)− )] = E∨i [eλZ eλ(Z

∨i −Z)

= E∨i [eλZ e−λ(Z

ϕ(λ(Z ∨i − Z)− )]

∨i −Z) −

= E∨i [eλZ e−λ(Z−Z

∨i ) +

ϕ(λ(Z ∨i − Z)− )] ϕ(λ(Z − Z ∨i )+ )].

Since ψ(x) = ex ϕ(−x) + ϕ(x), we conclude that E∨i [eλZ ϕ(−λ(Z − Z ∨i ))] = E∨i [eλZ ψ(−λ(Z − Z ∨i )+ )], which leads to the second inequality. Proof of Theorem 28. Let λ > 0 and F (λ) = E[exp(λZ)]. Observe that, in Proposition 8, ψ(−x) = x(1 − e−x ) ≤ x2 for x > 0, so that 0

λF (λ) − F (λ) log F (λ) ≤ λ

2

n X

E[eλZ (Z − Z ∨i )2+ ]

i=1 2

= λ E[V + eλZ ]. We shall prove the following lemma. Lemma 40. Let Z and W be bounded statistics of X. Then for every λ ∈ R, E[λW eλZ ] E[λZeλZ ] ≤ − log E[eλZ ] + log E[eλW ]. E[eλZ ] E[eλZ ]

86

Proof of Lemma 40. Let Q be the distribution on S n defined by dQ = (eλZ /E[eλZ ])dP n . Denote by EQ [·] the expectation under Q. Then the left hand side is EQ [λW ], and the right hand side is EQ [λZ] + log EQ [eλ(W −Z) ]. Now, by Jensen’s inequality, log EQ [eλ(W −Z) ] ≥ log(eEQ [λ(W −Z)] ) = EQ [λ(W − Z)], that is EQ [λW ] ≤ EQ [λZ] + log EQ [eλ(W −Z) ], completing the proof. Going back to the proof of Theorem 28, let G(λ) = log E[exp(λV + )]. Note that G(0) = 0 and G is convex. Applying Lemma 40 to W = V + /θ, we have E[λ(V + /θ)eλZ ] ≤ E[λZeλZ ] − E[eλZ ] log E[eλZ ] + E[eλZ ] log E[eλV

+ /θ

]

0

= λF (λ) − F (λ) log F (λ) + F (λ)G(λ/θ), by which we have   1 1 λF 0 (λ) − F (λ) log F (λ) ≤ λ2 θ F 0 (λ) + F (λ)G(λ/θ) − F (λ) log F (λ) . λ λ Dividing both sides by λ2 F (λ), this inequality is rearranged as 1 F 0 (λ) 1 θG(λ/θ) − 2 log F (λ) ≤ . λ F (λ) λ λ(1 − λθ) Setting H(λ) = λ−1 log F (λ), we may observe that the left hand side is the derivative of H(λ). Since limλ↓0 H(λ) = E[Z], we have Z H(λ) ≤ E[Z] + 0

λ

θG(s/θ) ds. s(1 − sθ)

We shall prove the following lemma. Lemma 41. The map s 7→ G(s/θ)/s(1 − sθ) is non-decreasing on (0, 1/θ). Proof of Lemma 41. Let 0 < s < t < 1/θ. Write s as s = αt with α ∈ (0, 1). Then since G is convex and G(0) = 0, G(s/θ) ≤ αG(t/θ). Since α = s/t, and the map s 7→ 1/(1 − sθ) is non-decreasing, G(s/θ)/s(1 − sθ) ≤ G(t/θ)/t(1 − sθ) ≤ G(t/θ)/t(1 − tθ), completing the proof.

87

Going back to the proof of Theorem 28, we have shown that log F (λ) ≤ λE[Z] +

λθG(λ/θ) . (1 − λθ)

This is the conclusion of the theorem. Proof of Theorem 29. In Proposition 8, since ϕ is convex with ϕ(0) = 0, ϕ(−λu) ≤ uϕ(−λ) for every λ ∈ R and u ∈ [0, 1]. Hence we have ϕ(−λ(Z − Zi )) ≤ (Z − Zi )ϕ(−λ) (as 0 ≤ Z − Zi ≤ 1), and " # n X λE[ZeλZ ] − E[eλZ ] log E[eλZ ] ≤ E ϕ(−λ)eλZ (Z − Zi ) i=1 λZ

≤ ϕ(−λ)E[Ze

].

P The second inequality is due to the fact that ni=1 (Z − Zi ) ≤ Z. Let Z˜ = Z − E[Z], and ˜ define F (λ) = E[eλZ ]. Then the previous inequality becomes {λ − ϕ(−λ)}

F 0 (λ) − log F (λ) ≤ aϕ(−λ), F (λ)

where a = E[Z]. Setting G(λ) = log F (λ), we have (1 − e−λ )G0 (λ) − G(λ) ≤ aϕ(−λ). Let G0 (λ) = aϕ(λ). Then G0 is a solution of the ordinaly differential equation (1 − e−λ )f 0 (λ) − f (λ) = aϕ(−λ). We want to show that G(λ) ≤ G0 (λ). Let G1 = G − G0 . Then (1 − e−λ )G01 (λ) − G1 (λ) ≤ 0. Setting g(λ) = G1 (λ)/(eλ − 1) for λ 6= 0, we have G1 (λ) = (eλ − 1)g(λ) and hence (1 − e−λ ){eλ g(λ) + (eλ − 1)g 0 (λ)} − (eλ − 1)g(λ) ≤ 0, that is, (1 − e−λ )(eλ − 1)g 0 (λ) ≤ 0. ˜ = 0. Furthermore, G0 (0) = So g 0 (λ) ≤ 0 for λ 6= 0. Since Z˜ is centered, G0 (0) = E[Z] 0 0 aϕ (0) = 0. Using the l’Hˆ opital rule, we have G01 (λ) G1 (λ) = lim = 0. λ→0 eλ − 1 λ→0 eλ

lim g(λ) = lim

λ→0

Taking these together, g(λ) ≥ 0 on (0, ∞) and g(λ) ≤ 0 on (−∞, 0), by which we conclude that G1 (λ) ≤ 0 for all λ ∈ R. This completes the proof for (43). 88

Proving the remaining two inequalities (44) and (45) is rather a routine work. We first prove (44). Let x > 0. For λ > 0, by Markov’s inequality together with inequality (43) just proved, P{Z − E[Z] ≥ x} ≤ e−λx E[eλ(Z−E[Z]) ] ≤ e−λx+aϕ(λ) . We have to minimize the right hand side. Let h(λ) = −λx + aϕ(λ) = aeλ − (x + a)λ − a. Differentiating h(λ) yields h0 (λ) = aeλ − (x + a), and the solution of h0 (λ) = 0 is given by λ = log(1 + x/a) =: λ∗ . Clearly, h(λ) is minimized at λ = λ∗ and h(λ∗ ) = a(1 + x/a) − (x + a) log(1 + x/a) − a = −a {(1 + x/a) log(1 + x/a) − x/a} = −aq(x/a), where q(y) = (1 + y) log(1 + y) − y. Lemma 42. For y > 0, q(y) ≥

y2 . 2(1 + y/3)

Proof of Lemma 42. By Fubini’s theorem, we have Z 1 1−s ds q(y) = y 2 0 1 + ys  Z 1 Z 1 1 2 =y 1(0 ≤ s ≤ t ≤ 1)dt ds 1 + ys 0 0 Z 1Z 1 y2 1 = ·2 1(0 ≤ s ≤ t ≤ 1) dsdt. 2 1 + ys 0 0 Let (S, T ) be a random vector uniformly distributed on the triangle region {(s, t) : 0 ≤ s ≤ t ≤ 1}. Then   Z 1Z 1 1 1 2 1(0 ≤ s ≤ t ≤ 1) dsdt = E . 1 + sy 1 + yS 0 0 Since the map s 7→ 1/(1 + ys) is convex, and E[S] = 1/3, we have   1 1 1 E ≥ = . 1 + yS 1 + yE[S] 1 + y/3

89

Hence we have

2

x − 2(a+x/3)

P{Z − E[Z] ≥ x} ≤ e

.

Solving x2 =y 2(a + x/3) p √ gives x = y/3 + y 2 /9 + 2ay ≤ 2y/3 + 2ay. Therefore, we have   p 2y ≤ e−y . P Z − E[Z] ≥ 2ay + 3 For the opposite inequality (45), observe that P{−Z + E[Z] ≥ x} ≤ e−λx E[e−λ(Z−E[Z]) ] ≤ e−λx+aϕ(−λ) . A straightforward calculation gives min{−λx + aϕ(−λ)} = −aq(−x/a) ≥ − λ>0

x2 2a

when x < a, where we have used the inequality q(−t) ≥ t2 /2 for t ∈ (0, 1). So we have x2

P{Z ≤ E[Z] − x} ≤ e− 2a , for x < a, but this inequality trivially holds for x ≥ a = E[Z] and so for all x > 0.

7.2

Proof of Talagrand’s concentration inequality

Proof of Theorem 27. Instead of (40) itself, we shall prove a less sharp inequality: p P{Z ≥ E[Z] + 2 (2nσ 2 + 16BE[Z])x + 2Bx} ≤ e−x . Because of pointwise measurability, we may assume here that F is countable. Using approximations, we may further assume that F is finite. In addition, by homogeneity, we may assume that B = 1, that is, kf k∞ ≤ 1 for every f ∈ F. Conditionally on X1 , . . . , Xn , let f˜ be a function in F at which the maximum in Z is attained, that is, n n X X f˜(Xi ) . Z = max f (Xi ) = f ∈F i=1

i=1

Under the notation of the previous subsection, one can see that X X n n 0 ∨i ˜ ˜ ˜ f (Xj ) + f (Xi ) ≤ |f˜(Xi ) − f˜(Xi0 )|, f (Xj ) − Z − Z ≤ j6=i j=1 90

and hence n X ≤E (f˜(Xi ) − f˜(Xi0 ))2 | X

"

V

+

#

i=1

=

n X

f˜2 (Xi ) + nP f˜2

i=1

≤ max f ∈F

n X

f 2 (Xi ) + nσ 2

i=1

=: W + v. We shall apply Theorem 29 to Z = W . Let fˇ be a function P in F at which the maximum in W is attained. For each 1 ≤ i ≤ n, let Wi = maxf ∈F j6=i f 2 (Xj ). Then we have 0 ≤ W − Wi ≤

n X

fˇ2 (Xj ) −

j=1

X

fˇ2 (Xj ) = fˇ2 (Xi ) ≤ 1,

j6=i

and n X

(W − Wi ) ≤

i=1

n X

fˇ2 (Xi ) = W.

i=1

Hence by Theorem 29, log E[exp(λW )] ≤ (eλ − 1)E[W ]. Combining Theorem 28 with θ = 1, for every λ ∈ (0, 1), we have log E[exp(λ(Z − E[Z]))] ≤

λ (λv + (eλ − 1)E[W ]). 1−λ

Using the simple inequality (eλ − 1)(1 − λ) ≤ λ, we conclude that for every λ ∈ (0, 1/2), log E[exp(λ(Z − E[Z]))] ≤

λ2 λ2 (v + E[W ]) ≤ (v + E[W ]). (1 − λ)2 1 − 2λ

Therefore, by Markov’s inequality, for every x > 0, P{Z − E[Z] ≥ x} ≤ exp{−λx + (λ/(1 − 2λ))a}, λ ∈ (0, 1/2), where a = v + E[W ]. We have to minimize the right hand side with respect to λ. Fix x > 0. Define λ2 a (a + 2x)λ2 − xλ h(λ) = −λx + = . 1 − 2λ 1 − 2λ

91

Differentiating h(λ), we have by setting b = a + 2x, (1 − 2λ)2 h0 (λ) = (2bλ − x)(1 − 2λ) + 2(bλ2 − xλ) = −2bλ2 + 2bλ − x   1 2 = −2b λ − + 2   1 2 + = −2b λ − 2

b −x 2 a . 2

p Hence the solution of h0 (λ) = 0 on (0, 1/2) is given by λ = 1/2 − a/(4b) =: λ∗ . Furthermore, h0 (λ) < 0 for λ ∈ (0, λ∗ ) and h0 (λ) > 0 for λ ∈ (λ∗ , 1/2), by which we conclude that h(λ) is minimized at λ = λ∗ on (0, 1/2). A straightforward calculation shows that r 2 r  r r   1 1 b a a a a x a b −x = −b − − + − +x 2 4b 2 4b 4 4b 4 2 4b r r a a a = −b + +x , 4b 2 4b and r a 1 − 2λ = 2 , 4b by which we have ∗

h(λ ) =

−b +



ab + x

2

=

−x − a +

Hence we have

a2 + 2ax

2 √

P{Z ≥ E[Z] + x} ≤ e



−x−a+ 2

a2 +2ax

.

.

√ Setting x = 2y + 2 ay for y > 0, we have x+a−

p

a2



q √ + 2ax = 2y + 2 ay + a − a2 + 4ay + 4a ay = 2y,

so that P{Z ≥ E[Z] + 2

p (v + E[W ])y + 2y} ≤ e−y .

Applying Corollary 1, we also have E[W ] ≤ nσ 2 + 16E[Z] = v + 16E[Z]. Therefore, we conclude that P{Z ≥ E[Z] + 2

p (2v + 16E[Z])y + 2x} ≤ e−y .

This completes the proof. 92

7.3

A “statistical version” of Talagrand’s inequality

Suppose here that f are not P -centered, and kf k∞ ≤ B and σ 2 ≥ P (f − P f )2 for all f ∈ F. Then inequalities (40) and (42) are rephrased as   √ 2Bx −1 2V x + P kPn − P kF ≥ E[kPn − P kF ] + n ≤ e−x , (47) 3n   √ 2Bx ≤ e−x , (48) P kPn − P kF ≤ E[kPn − P kF ] − n−1 2V x − n where V = nσ 2 + 4BnE[kPn − P kF ]. Note that we have used the simple inequality kf − P f k∞ ≤ 2B. In statistical applications, it is of importance to have data-dependent “confidence intervals” for the random quantity kPn − P kF . This quantity is a natural measure of the accuracy of the approximation of an unknown distribution by the empirical distribution Pn . However, kPn − P kF itself depends on the unknown distribution P and is not directly available. It is a reasonable idea to construct data-dependent intervals [Lα (X), Uα (X)] such that P{kPn − P kF ∈ [Lα (X), Uα (X)]} ≥ 1 − α, with a given confidence level 1 − α. Inequalities (47) and (48) may be used to assess this problem. Of course, we have to replace the unknown quantities E[kPn − P kF ], σ 2 and B by suitable estimates or bounds. Suppose for the sake of simplicity, σ 2 and B are known, and the only problem is to estimate or bound the expectation E[kPn − P kF ]. We have discussed so far how to bound the expectation E[kPn − P kF ]. However, such bounds typically depend on other unknown constants and may not be sharp. An interesting idea, discussed in Koltchinskii (2011), is to compare

# " n

X

E (f (Xi ) − P f ) (49)

i=1

F

with the random quantity

n

X

ε f (X )

i i ,

i=1

F

where ε1 , . . . , εn are independent Rademacher random variables independent of X1 , . . . , Xn . The latter quantity is computable. However, the latter quantity varies with different draws of ε1 , . . . , εn , and from a statistical perspective, it is more natural to compare (49) with

# " n

X

εi f (Xi ) . (50) Eε

i=1

F

Based on these observations, we consider a version of inequality (47) with (49) replaced by (50), and with different constants, which may be coined as a “statistical version” of Talagrand’s inequality. 93

Theorem 30 (Bartlett et al. (2005)). Let $\mathcal{F}$ be a pointwise measurable class of functions $S \to \mathbb{R}$. Suppose that there exists a constant $B > 0$ such that $\|f\|_{\infty} \leq B$ for all $f \in \mathcal{F}$, and let $\sigma^{2} > 0$ be any positive constant such that $\sigma^{2} \geq \sup_{f \in \mathcal{F}} P(f - Pf)^{2}$. Let $\varepsilon_{1}, \dots, \varepsilon_{n}$ be independent Rademacher random variables independent of $X_{1}, \dots, X_{n}$. Then for every $x > 0$,
\[
P\left\{ \left\| \sum_{i=1}^{n} (f(X_{i}) - Pf) \right\|_{\mathcal{F}} \geq \inf_{\alpha \in (0,1)} \left[ \frac{2(1+\alpha)}{1-\alpha} E_{\varepsilon}\left\| \sum_{i=1}^{n} \varepsilon_{i} f(X_{i}) \right\|_{\mathcal{F}} + \sigma\sqrt{2nx} + 2\left( \frac{1}{3} + \frac{1}{\alpha} + \frac{1+\alpha}{2\alpha(1-\alpha)} \right) Bx \right] \right\} \leq 2e^{-x}.
\]

The following proposition will be crucial.

Proposition 9 (Boucheron et al. (2003)). Let $\mathcal{F}$ be a pointwise measurable class of functions $S \to \mathbb{R}$. Suppose that there exists a constant $B > 0$ such that $\|f\|_{\infty} \leq B$ for all $f \in \mathcal{F}$. Let $\varepsilon_{1}, \dots, \varepsilon_{n}$ be independent Rademacher random variables independent of $X_{1}, \dots, X_{n}$, and let $Z = E_{\varepsilon}[\| \sum_{i=1}^{n} \varepsilon_{i} f(X_{i}) \|_{\mathcal{F}}]$. Then for every $x > 0$,
\[
P\left\{ Z \geq E[Z] + \sqrt{2B E[Z] x} + \frac{2}{3} Bx \right\} \leq e^{-x},
\]
and
\[
P\left\{ Z \leq E[Z] - \sqrt{2B E[Z] x} \right\} \leq e^{-x}.
\]

Proof. The proof is an application of Theorem 29. As before, we may assume that $\mathcal{F}$ is finite and $B = 1$. We first observe that, by setting $\check{\mathcal{F}} = \mathcal{F} \cup (-\mathcal{F}) = \{f, -f : f \in \mathcal{F}\}$,
\[
\left\| \sum_{i=1}^{n} \varepsilon_{i} f(X_{i}) \right\|_{\mathcal{F}} = \max_{f \in \check{\mathcal{F}}} \sum_{i=1}^{n} \varepsilon_{i} f(X_{i}).
\]
For each $1 \leq i \leq n$, let $Z_{i} = E_{\varepsilon}[\max_{f \in \check{\mathcal{F}}} \sum_{j \neq i} \varepsilon_{j} f(X_{j})]$. Fix $X_{1}, \dots, X_{n}$. For every $1 \leq i \leq n$, let $\tilde{f}_{i}$ be a function (that may depend on $\varepsilon_{-i}$ but not on $\varepsilon_{i}$) in $\check{\mathcal{F}}$ at which the maximum in $\max_{f \in \check{\mathcal{F}}} \sum_{j \neq i} \varepsilon_{j} f(X_{j})$ is attained. Then
\[
\max_{f \in \check{\mathcal{F}}} \sum_{j=1}^{n} \varepsilon_{j} f(X_{j}) - \max_{f \in \check{\mathcal{F}}} \sum_{j \neq i} \varepsilon_{j} f(X_{j}) \geq \sum_{j=1}^{n} \varepsilon_{j} \tilde{f}_{i}(X_{j}) - \sum_{j \neq i} \varepsilon_{j} \tilde{f}_{i}(X_{j}) = \varepsilon_{i} \tilde{f}_{i}(X_{i}).
\]
Since $\tilde{f}_{i}$ is independent of $\varepsilon_{i}$, we have
\[
Z - Z_{i} \geq E_{\varepsilon}[\varepsilon_{i} \tilde{f}_{i}(X_{i})] = E[\varepsilon_{i}]\, E_{\varepsilon}[\tilde{f}_{i}(X_{i})] = 0.
\]
On the other hand, let $\tilde{f}$ be a function (that may depend on $\varepsilon_{1}, \dots, \varepsilon_{n}$) in $\check{\mathcal{F}}$ at which the maximum in $\max_{f \in \check{\mathcal{F}}} \sum_{j=1}^{n} \varepsilon_{j} f(X_{j})$ is attained. Then
\[
\max_{f \in \check{\mathcal{F}}} \sum_{j=1}^{n} \varepsilon_{j} f(X_{j}) - \max_{f \in \check{\mathcal{F}}} \sum_{j \neq i} \varepsilon_{j} f(X_{j}) \leq \sum_{j=1}^{n} \varepsilon_{j} \tilde{f}(X_{j}) - \sum_{j \neq i} \varepsilon_{j} \tilde{f}(X_{j}) = \varepsilon_{i} \tilde{f}(X_{i}) \leq 1,
\]
so that $Z - Z_{i} \leq 1$. Furthermore,
\[
\sum_{i=1}^{n} (Z - Z_{i}) \leq E_{\varepsilon}\left[ \sum_{i=1}^{n} \varepsilon_{i} \tilde{f}(X_{i}) \right] = E_{\varepsilon}\left[ \max_{f \in \check{\mathcal{F}}} \sum_{j=1}^{n} \varepsilon_{j} f(X_{j}) \right] = Z.
\]
Therefore, Theorem 29 can be applied to $Z$. The desired conclusion directly follows from the theorem.
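The self-bounding structure just verified ($0 \leq Z - Z_{i} \leq 1$ and $\sum_{i}(Z - Z_{i}) \leq Z$ after normalizing $B = 1$) is easy to probe numerically. The following sketch (my own illustration; the finite class and all names are hypothetical) takes $n$ small enough to enumerate all $2^{n}$ sign vectors, so that $Z$ and the leave-one-out quantities $Z_{i}$ are computed exactly and both properties can be checked without simulation error.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)

# Hypothetical finite setting with |f| <= B = 1.
n, d = 10, 15                                # sample size, number of functions in F
fX = rng.uniform(-1.0, 1.0, size=(d, n))     # values f(X_i)

signs = np.array(list(product([-1.0, 1.0], repeat=n)))  # all 2^n sign patterns

def Z_exact(active):
    """E_eps max_{f in F-check} sum_{j in active} eps_j f(X_j), where
    F-check = F u (-F); this equals the sup-norm over F."""
    sums = signs[:, active] @ fX[:, active].T   # (2^n, d) partial sums
    return np.abs(sums).max(axis=1).mean()      # max over F-check = max_f |sum|

Z = Z_exact(list(range(n)))
Zi = np.array([Z_exact([j for j in range(n) if j != i]) for i in range(n)])

diff = Z - Zi
print("0 <= Z - Z_i <= 1:", (diff >= -1e-12).all() and (diff <= 1.0 + 1e-12).all())
print("sum_i (Z - Z_i) <= Z:", diff.sum() <= Z + 1e-12)
```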

Proof of Theorem 30. Let $x > 0$. By Theorem 27 and the symmetrization inequality, with probability greater than $1 - e^{-x}$,
\[
\left\| \sum_{i=1}^{n} (f(X_{i}) - Pf) \right\|_{\mathcal{F}} < 2(1+\alpha) E\left[ \left\| \sum_{i=1}^{n} \varepsilon_{i} f(X_{i}) \right\|_{\mathcal{F}} \right] + \sigma\sqrt{2nx} + 2\left( \frac{1}{3} + \frac{1}{\alpha} \right) Bx, \quad \forall \alpha > 0.
\]
On the other hand, by Proposition 9, with probability greater than $1 - e^{-x}$,
\[
E_{\varepsilon}\left[ \left\| \sum_{i=1}^{n} \varepsilon_{i} f(X_{i}) \right\|_{\mathcal{F}} \right] > E\left[ \left\| \sum_{i=1}^{n} \varepsilon_{i} f(X_{i}) \right\|_{\mathcal{F}} \right] - \sqrt{2B E\left[ \left\| \sum_{i=1}^{n} \varepsilon_{i} f(X_{i}) \right\|_{\mathcal{F}} \right] x}.
\]
Using the inequality $\sqrt{2ab} \leq \alpha a + (2\alpha)^{-1} b$, we have
\[
E\left[ \left\| \sum_{i=1}^{n} \varepsilon_{i} f(X_{i}) \right\|_{\mathcal{F}} \right] - \sqrt{2B E\left[ \left\| \sum_{i=1}^{n} \varepsilon_{i} f(X_{i}) \right\|_{\mathcal{F}} \right] x} \geq (1-\alpha) E\left[ \left\| \sum_{i=1}^{n} \varepsilon_{i} f(X_{i}) \right\|_{\mathcal{F}} \right] - \frac{Bx}{2\alpha}, \quad \forall \alpha \in (0,1),
\]
so that with probability greater than $1 - e^{-x}$,
\[
E\left[ \left\| \sum_{i=1}^{n} \varepsilon_{i} f(X_{i}) \right\|_{\mathcal{F}} \right] < \frac{1}{1-\alpha} E_{\varepsilon}\left[ \left\| \sum_{i=1}^{n} \varepsilon_{i} f(X_{i}) \right\|_{\mathcal{F}} \right] + \frac{Bx}{2\alpha(1-\alpha)}, \quad \forall \alpha \in (0,1).
\]
Combining these results, we obtain the desired conclusion.
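The infimum over $\alpha \in (0,1)$ in Theorem 30 is one-dimensional and can simply be evaluated on a grid. The following sketch (my own illustration; the helper name and all numerical values are hypothetical, and the Rademacher average would come from an estimate of (50) such as the earlier snippet) computes the resulting data-driven upper confidence bound for $\|\sum_{i}(f(X_{i}) - Pf)\|_{\mathcal{F}}$ at level $1 - 2e^{-x}$, assuming known $B$ and $\sigma$.

```python
import numpy as np

def theorem30_upper_bound(rademacher_avg, n, B, sigma, x, n_grid=99):
    """Data-driven bound of Theorem 30: for given x > 0, with probability at
    least 1 - 2*exp(-x), || sum_i (f(X_i) - P f) ||_F lies below this value.
    `rademacher_avg` is an estimate of the conditional Rademacher average (50)."""
    alphas = np.linspace(0.01, 0.99, n_grid)
    bounds = (2 * (1 + alphas) / (1 - alphas) * rademacher_avg
              + sigma * np.sqrt(2 * n * x)
              + 2 * (1/3 + 1/alphas + (1 + alphas) / (2 * alphas * (1 - alphas))) * B * x)
    return bounds.min()

# Example with illustrative values (x = log 40 gives level 1 - 2e^{-x} = 0.95).
print(theorem30_upper_bound(rademacher_avg=12.3, n=200, B=1.0, sigma=0.5, x=np.log(40)))
```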

7.4 A Fuk-Nagaev type inequality

Talagrand’s inequality is not applicable to the case where the envelope function $F$ is unbounded. However, by using truncation, we are able to prove the following Fuk-Nagaev type inequality for the supremum of the empirical process.

Theorem 31 (Einmahl and Li (2008); Adamczak (2010)). Let $\mathcal{F}$ be a pointwise measurable class of functions $f : S \to \mathbb{R}$ with measurable envelope $F$ such that $F \in L^{p}(P)$ for some $1 \leq p < \infty$. Suppose that $Pf = 0$ for all $f \in \mathcal{F}$ and that $\mathcal{F}$ is bounded in $L^{2}(P)$ (which of course follows if $p \geq 2$). Further, let $Z = \| \sum_{i=1}^{n} f(X_{i}) \|_{\mathcal{F}}$ and let $\sigma^{2} > 0$ be any positive constant such that $\sigma^{2} \geq \sup_{f \in \mathcal{F}} P f^{2}$. Then for every $\alpha, x > 0$,
\[
P\{ Z \geq (1+\alpha) E[Z] + x \} \leq e^{-c x^{2}/(n\sigma^{2})} + c' E[M^{p}]/x^{p}, \tag{51}
\]
where $M := \max_{1 \leq i \leq n} F(X_{i})$ and $c, c'$ are positive constants that depend only on $p, \alpha$.

We will use the following version of Talagrand’s inequality: under the assumptions of Theorem 27, for every $\alpha, x > 0$,
\[
P\{ Z \geq (1+\alpha) E[Z] + x \} \leq e^{-c_{1} x^{2}/(n\sigma^{2})} + e^{-c_{2} x/B}, \tag{52}
\]
where $c_{1}, c_{2}$ are positive constants that depend only on $\alpha$. This follows from Talagrand’s inequality by considering the cases where $\sigma\sqrt{nx} > Bx$ or not.
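A sketch of this derivation (my addition, with unoptimized constants; I assume Talagrand’s inequality is available in the Bousquet-type form below, which matches the form used for (47)): suppose that for every $t > 0$,
\[
P\{ Z \geq E[Z] + \sqrt{2vt} + Bt/3 \} \leq e^{-t}, \qquad v = n\sigma^{2} + 2B E[Z].
\]
Given $x > 0$, take $t = c \min\{ x^{2}/(n\sigma^{2}), x/B \}$ with $c = c(\alpha) > 0$ small. Then $\sqrt{2n\sigma^{2}t} \leq \sqrt{2c}\, x$ and $Bt \leq cx$, while
\[
\sqrt{2vt} \leq \sqrt{2n\sigma^{2}t} + \sqrt{4B E[Z] t} \leq \sqrt{2c}\, x + \alpha E[Z] + \frac{Bt}{\alpha},
\]
using $2\sqrt{uw} \leq \alpha u + w/\alpha$. Hence, choosing $c$ so small that $\sqrt{2c} + c/\alpha + c/3 \leq 1$, we get $\sqrt{2vt} + Bt/3 \leq \alpha E[Z] + x$, and therefore
\[
P\{ Z \geq (1+\alpha) E[Z] + x \} \leq e^{-t} \leq e^{-c x^{2}/(n\sigma^{2})} + e^{-c x/B},
\]
since $e^{-c\min\{u,w\}} \leq e^{-cu} + e^{-cw}$. The two regimes of the minimum correspond, up to constants, to whether $\sigma\sqrt{nx} > Bx$ or not.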

Proof. It is enough to prove the inequality when $12(1+\alpha) E[M] \leq x/4$, since otherwise we can make the right hand side larger than $1$ by taking $c'$ large enough. In what follows, $c_{1}, c_{2}, \dots$ are constants that depend only on $p, \alpha$; their values should be understood from the context. Set $\rho = 24 E[M]$, and let
\[
Z_{1} = \left\| \sum_{i=1}^{n} \{ (f 1_{\{F \leq \rho\}})(X_{i}) - P(f 1_{\{F \leq \rho\}}) \} \right\|_{\mathcal{F}}, \quad
Z_{2} = \left\| \sum_{i=1}^{n} \{ (f 1_{\{F > \rho\}})(X_{i}) - P(f 1_{\{F > \rho\}}) \} \right\|_{\mathcal{F}}.
\]
Because $Pf = 0$ for all $f \in \mathcal{F}$, we have $Z \leq Z_{1} + Z_{2}$. Further, since $f = f 1_{\{F \leq \rho\}} + f 1_{\{F > \rho\}}$ and thus $f 1_{\{F \leq \rho\}} = f + (-f 1_{\{F > \rho\}})$, we have $E[Z_{1}] \leq E[Z] + E[Z_{2}]$, and so $E[Z] \geq E[Z_{1}] - E[Z_{2}]$. Hence
\[
P\{ Z \geq (1+\alpha) E[Z] + x \} \leq P\{ Z_{1} \geq (1+\alpha) E[Z] + 3x/4 \} + P\{ Z_{2} \geq x/4 \}
\]
\[
\leq P\{ Z_{1} \geq (1+\alpha)(E[Z_{1}] - E[Z_{2}]) + 3x/4 \} + P\{ Z_{2} \geq x/4 \}.
\]
Now, by the symmetrization inequality, for independent Rademacher random variables $\varepsilon_{1}, \dots, \varepsilon_{n}$ independent of $X_{1}, \dots, X_{n}$,
\[
E[Z_{2}] \leq 2 E\left[ \left\| \sum_{i=1}^{n} \varepsilon_{i} (f 1_{\{F > \rho\}})(X_{i}) \right\|_{\mathcal{F}} \right],
\]
and because
\[
P\left\{ \left\| \sum_{i=1}^{n} \varepsilon_{i} (f 1_{\{F > \rho\}})(X_{i}) \right\|_{\mathcal{F}} > 0 \right\} \leq P(M > \rho) \leq \rho^{-1} E[M] \leq 1/24
\]
by our choice of $\rho$, the Hoffmann-Jørgensen inequality (Proposition 2) yields that
\[
E[Z_{2}] \leq 12 E[M] \leq \frac{x}{4(1+\alpha)}.
\]
Hence we have
\[
P\{ Z_{1} \geq (1+\alpha)(E[Z_{1}] - E[Z_{2}]) + 3x/4 \} \leq P\{ Z_{1} \geq (1+\alpha) E[Z_{1}] + x/2 \}.
\]
Applying Talagrand’s inequality of the form (52) to $Z_{1}$, we conclude that
\[
P\{ Z_{1} \geq (1+\alpha) E[Z_{1}] + x/2 \} \leq e^{-c_{1} x^{2}/(n\sigma^{2})} + e^{-c_{2} x / E[M]}.
\]
In addition, using the Hoffmann-Jørgensen inequality again, we have
\[
(E[Z_{2}^{p}])^{1/p} \leq c_{3} \{ E[Z_{2}] + (E[M^{p}])^{1/p} \} \leq c_{4} (E[M^{p}])^{1/p},
\]
and so Markov’s inequality yields that $P\{ Z_{2} \geq x/4 \} \leq c_{5} E[M^{p}]/x^{p}$. Finally, since $E[M] \leq (E[M^{p}])^{1/p}$, we have $e^{-c_{2} x/E[M]} \leq e^{-c_{2} x/(E[M^{p}])^{1/p}}$, and since $x^{p} e^{-x} \to 0$ as $x \to \infty$, we have
\[
e^{-c_{2} x / E[M]} \leq c_{6} E[M^{p}]/x^{p}
\]
when $x/(E[M^{p}])^{1/p} \geq 1$. But when $x/(E[M^{p}])^{1/p} < 1$, inequality (51) becomes trivial by taking $c'$ large enough. This completes the proof.

8 Rudelson’s inequality

The purpose of this section is to prove the following remarkable inequality by Rudelson (1999). For a matrix $A$, let $\|A\|_{\mathrm{op}}$ be the operator norm of $A$; that is, when $A$ has $d$ columns, $\|A\|_{\mathrm{op}} := \sup_{x \in \mathbb{R}^{d}, |x|=1} |Ax|$.

Theorem 32 (Rudelson (1999)). Let $X$ be a random vector of dimension $d \geq 2$ with $\Sigma := E[XX^{\top}]$. Let $X_{1}, \dots, X_{n}$ be independent copies of $X$. Then we have
\[
E\left[ \left\| \frac{1}{n} \sum_{i=1}^{n} X_{i} X_{i}^{\top} - \Sigma \right\|_{\mathrm{op}} \right] \leq \max\{ \|\Sigma\|_{\mathrm{op}}^{1/2}\, \delta,\ \delta^{2} \}, \quad \delta := C \sqrt{\frac{E[\max_{1 \leq i \leq n} |X_{i}|^{2}] \log d}{n}},
\]
where $C > 0$ is a universal constant.

The theorem indeed follows from the following proposition, which we will prove later.

Proposition 10. Let $A_{1}, \dots, A_{n}$ be fixed $d \times d$ symmetric matrices. Let $\varepsilon_{1}, \dots, \varepsilon_{n}$ be independent Rademacher random variables. Let $\sigma^{2} := \| \sum_{i=1}^{n} A_{i}^{2} \|_{\mathrm{op}}$. Then we have
\[
P\left\{ \left\| \sum_{i=1}^{n} \varepsilon_{i} A_{i} \right\|_{\mathrm{op}} > t \right\} \leq 2d\, e^{-t^{2}/(2\sigma^{2})}, \quad \forall t > 0, \tag{53}
\]
and consequently
\[
E\left[ \left\| \sum_{i=1}^{n} \varepsilon_{i} A_{i} \right\|_{\mathrm{op}} \right] \leq C_{d}\, \sigma, \tag{54}
\]
where $C_{d} := \sqrt{2\log(2d)} + \sqrt{\pi/2}$.
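Proposition 10 is easy to probe by simulation. The sketch below (my own illustration, not from the notes) draws Rademacher sums of a few fixed symmetric matrices and compares the observed mean of $\|\sum_{i} \varepsilon_{i} A_{i}\|_{\mathrm{op}}$ with the bound $C_{d}\sigma$ of (54).

```python
import numpy as np

rng = np.random.default_rng(2)

d, n = 5, 30
# Fixed symmetric matrices A_1, ..., A_n (symmetrized Gaussian draws).
A = rng.normal(size=(n, d, d))
A = (A + A.transpose(0, 2, 1)) / 2

# sigma^2 = || sum_i A_i^2 ||_op (operator norm of the PSD sum).
sigma = np.sqrt(np.linalg.norm((A @ A).sum(axis=0), ord=2))
C_d = np.sqrt(2 * np.log(2 * d)) + np.sqrt(np.pi / 2)

# Monte Carlo estimate of E || sum_i eps_i A_i ||_op.
n_mc = 2000
norms = np.empty(n_mc)
for k in range(n_mc):
    eps = rng.choice([-1.0, 1.0], size=n)
    norms[k] = np.linalg.norm((eps[:, None, None] * A).sum(axis=0), ord=2)

print(f"E||sum eps_i A_i||_op ~ {norms.mean():.3f}  vs  C_d * sigma = {C_d * sigma:.3f}")
```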

 # " n n

1 X

1 X

E  Xi Xi> − Σ  = E {(α> Xi )2 − E[(α> X)2 ]} sup

n

n d α∈R ,|α|=1 i=1 i=1 op " # n 1 X > 2 ≤ 2E sup εi (α Xi ) α∈Rd ,|α|=1 n i=1  

n

1 X

> = 2E  εi Xi Xi  .

n i=1

op

We shall apply Proposition 10 to the right hand side with Ai = Xi Xi> conditionally on

98

X1 , . . . , Xn . Then 



1/2 n n

X

X



> > 2   Eε εi Xi Xi ≤ Cd (Xi Xi )



i=1

i=1

op

op

1/2 n

X

2 > = Cd |Xi | Xi Xi

i=1 op

n

1/2

X

≤ Cd max |Xi | Xi Xi> .

1≤i≤n i=1

op

Hence we have
\[
D := E\left[ \left\| \frac{1}{n} \sum_{i=1}^{n} X_{i} X_{i}^{\top} - \Sigma \right\|_{\mathrm{op}} \right]
\leq \frac{2C_{d}}{\sqrt{n}} E\left[ \max_{1 \leq i \leq n} |X_{i}| \left\| \frac{1}{n} \sum_{i=1}^{n} X_{i} X_{i}^{\top} \right\|_{\mathrm{op}}^{1/2} \right]
\leq \frac{2C_{d}}{\sqrt{n}} \sqrt{E\left[ \max_{1 \leq i \leq n} |X_{i}|^{2} \right]} \sqrt{E\left[ \left\| \frac{1}{n} \sum_{i=1}^{n} X_{i} X_{i}^{\top} \right\|_{\mathrm{op}} \right]}
\leq \frac{2C_{d}}{\sqrt{n}} \sqrt{E\left[ \max_{1 \leq i \leq n} |X_{i}|^{2} \right]} \sqrt{D + \|\Sigma\|_{\mathrm{op}}},
\]
where the second inequality is the Cauchy-Schwarz inequality. Solving this inequality gives the desired conclusion: writing $\delta_{0} := 2C_{d}\sqrt{E[\max_{1 \leq i \leq n}|X_{i}|^{2}]/n}$, the bound reads $D^{2} \leq \delta_{0}^{2}(D + \|\Sigma\|_{\mathrm{op}})$, whence
\[
D \leq \frac{\delta_{0}^{2} + \sqrt{\delta_{0}^{4} + 4\delta_{0}^{2}\|\Sigma\|_{\mathrm{op}}}}{2} \leq \delta_{0}^{2} + \delta_{0}\|\Sigma\|_{\mathrm{op}}^{1/2} \leq 2\max\{ \|\Sigma\|_{\mathrm{op}}^{1/2}\delta_{0}, \delta_{0}^{2} \},
\]
which is the assertion of the theorem with $\delta = 2\delta_{0}$ (note that $C_{d} \leq C\sqrt{\log d}$ for $d \geq 2$).

The remainder of this section is devoted to proving Proposition 10. The original proof uses the non-commutative Khinchin inequality; a simpler (to me) proof is given by Oliveira (2010). We shall follow here Tropp (2012b).

To prove Proposition 10, we have to prepare some background material on matrix analysis. Let $\mathrm{Sym}_{d}$ denote the space of all $d \times d$ symmetric matrices, and let $\mathrm{Sym}_{d}^{+}$ denote the space of all $d \times d$ symmetric positive definite matrices. For $A \in \mathrm{Sym}_{d}$, let $A = Q\Lambda Q^{\top}$ be the spectral expansion of $A$; that is, $Q$ is an orthogonal matrix and $\Lambda = \mathrm{diag}(\lambda_{1}, \dots, \lambda_{d})$ is a diagonal matrix whose diagonal entries are the eigenvalues of $A$. A function $f : \mathbb{R} \to \mathbb{R}$ (or $(0,\infty) \to \mathbb{R}$) can be extended to a function on $\mathrm{Sym}_{d}$ (or $\mathrm{Sym}_{d}^{+}$) by
\[
f(A) := Q\, \mathrm{diag}(f(\lambda_{1}), \dots, f(\lambda_{d}))\, Q^{\top}.
\]
For example, the exponential of $A$, $e^{A}$, is defined by $e^{A} := Q\, \mathrm{diag}(e^{\lambda_{1}}, \dots, e^{\lambda_{d}})\, Q^{\top}$.

The Taylor expansion of $x \mapsto e^{x}$ leads to
\[
e^{A} = \sum_{p=0}^{\infty} \frac{A^{p}}{p!}.
\]
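As a quick numerical check (my own illustration; the helper names are hypothetical), the spectral definition of $e^{A}$ and the power series agree:

```python
import numpy as np

def expm_spectral(A):
    """Matrix exponential of a symmetric A via A = Q diag(lam) Q^T."""
    lam, Q = np.linalg.eigh(A)          # eigh: eigendecomposition for symmetric A
    return (Q * np.exp(lam)) @ Q.T      # Q diag(e^lam) Q^T

def expm_series(A, terms=30):
    """Partial sum of the power series sum_p A^p / p!."""
    out, term = np.eye(A.shape[0]), np.eye(A.shape[0])
    for p in range(1, terms):
        term = term @ A / p
        out += term
    return out

rng = np.random.default_rng(3)
A = rng.normal(size=(4, 4)); A = (A + A.T) / 2
print(np.allclose(expm_spectral(A), expm_series(A)))   # True
```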

Another example is the logarithm of $A$, defined for $A \in \mathrm{Sym}_{d}^{+}$ by $\log A := Q\, \mathrm{diag}(\log\lambda_{1}, \dots, \log\lambda_{d})\, Q^{\top}$. Observe that $\log(e^{A}) = A$ for all $A \in \mathrm{Sym}_{d}$. For $A \in \mathrm{Sym}_{d}$, let $\lambda_{\max}(A)$ denote the largest eigenvalue of $A$, and observe that $\|A\|_{\mathrm{op}} = \max\{\lambda_{\max}(A), \lambda_{\max}(-A)\}$. We introduce the partial ordering $\geq$ on the space of symmetric matrices by
\[
A \geq B \iff A - B \text{ is positive semi-definite.}
\]
We now move to proving Proposition 10. At first sight, one might think that the proposition could be proved by directly mimicking the proof of Hoeffding’s inequality, using the fact that $e^{\lambda_{\max}(A)} = \lambda_{\max}(e^{A}) \leq \mathrm{Tr}(e^{A})$ for $A \in \mathrm{Sym}_{d}$. However, the situation is not so simple, as we discuss below. For the matrix exponential, although the equality $\mathrm{Tr}\, e^{A+B} = \mathrm{Tr}(e^{A} e^{B})$ does not hold in general, the one-sided inequality
\[
\mathrm{Tr}\, e^{A+B} \leq \mathrm{Tr}(e^{A} e^{B}), \quad \forall A, B \in \mathrm{Sym}_{d},
\]
is still valid; this is called the Golden-Thompson inequality (Bhatia, 1997, p. 261). However, the version of the Golden-Thompson inequality for three matrices is false, that is, $\mathrm{Tr}\, e^{A+B+C} \not\leq \mathrm{Tr}(e^{A} e^{B} e^{C})$ in general (Bhatia, 1997, Problem IX.8.4). A consequence of this fact is that we cannot directly extend the proof of Hoeffding’s inequality to the proof of inequality (53). Instead, we make use of Lieb’s concavity theorem.

Theorem 33 (Lieb (1973)). Let $B \in \mathrm{Sym}_{d}$. Then the map $\mathrm{Sym}_{d}^{+} \ni A \mapsto \mathrm{Tr}\exp(B + \log A)$ is concave.

We do not prove this theorem here; see Tropp (2012a) for a simple proof. An important consequence of Lieb’s theorem is the following.

Lemma 43. Let $Y_{1}, \dots, Y_{n}$ be independent random $d \times d$ symmetric matrices. Then we have
\[
P\left\{ \lambda_{\max}\left( \sum_{i=1}^{n} Y_{i} \right) > t \right\} \leq \inf_{\theta > 0} \left\{ e^{-\theta t}\, \mathrm{Tr}\exp\left( \sum_{i=1}^{n} \log E[e^{\theta Y_{i}}] \right) \right\}, \quad \forall t > 0.
\]
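Before turning to the proof, the two-matrix Golden-Thompson inequality itself is easy to probe numerically; the following sketch (my own illustration) verifies it on random symmetric matrices. (Counterexamples to the three-matrix analogue exist but are not easy to hit by naive random search, so only the two-matrix version is tested here.)

```python
import numpy as np

rng = np.random.default_rng(4)

def expm_sym(M):
    """Matrix exponential of a symmetric matrix via eigendecomposition."""
    lam, Q = np.linalg.eigh(M)
    return (Q * np.exp(lam)) @ Q.T

def rand_sym(d):
    M = rng.normal(size=(d, d))
    return (M + M.T) / 2

# Golden-Thompson: Tr e^{A+B} <= Tr(e^A e^B) for all symmetric A, B.
violations = 0
for _ in range(1000):
    A, B = rand_sym(4), rand_sym(4)
    if np.trace(expm_sym(A + B)) > np.trace(expm_sym(A) @ expm_sym(B)) * (1 + 1e-10):
        violations += 1
print("violations of Golden-Thompson:", violations)   # expected: 0
```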

Proof. By Markov’s inequality, for $\theta > 0$, we have
\[
P\left\{ \lambda_{\max}\left( \sum_{i=1}^{n} Y_{i} \right) > t \right\}
\leq e^{-\theta t} E[e^{\lambda_{\max}(\theta \sum_{i=1}^{n} Y_{i})}]
= e^{-\theta t} E[\lambda_{\max}(e^{\theta \sum_{i=1}^{n} Y_{i}})]
\leq e^{-\theta t} E[\mathrm{Tr}\, e^{\theta \sum_{i=1}^{n} Y_{i}}].
\]
Applying Lieb’s concavity theorem (Theorem 33) with $A = e^{\theta Y_{1}}$ and $B = \theta \sum_{i=2}^{n} Y_{i}$, conditionally on $Y_{2}, \dots, Y_{n}$, we have
\[
E[\mathrm{Tr}\, e^{\theta \sum_{i=1}^{n} Y_{i}}] = E[\mathrm{Tr}\exp(B + \log A)]
= E[\, E[\mathrm{Tr}\exp(B + \log A) \mid Y_{2}, \dots, Y_{n}]\, ]
\leq E[\mathrm{Tr}\exp(B + \log E[A])] \quad \text{(Jensen's inequality)}
= E\left[ \mathrm{Tr}\exp\left( \theta \sum_{i=2}^{n} Y_{i} + \log E[e^{\theta Y_{1}}] \right) \right].
\]
Iterating this step leads to the inequality
\[
E[\mathrm{Tr}\, e^{\theta \sum_{i=1}^{n} Y_{i}}] \leq \mathrm{Tr}\exp\left( \sum_{i=1}^{n} \log E[e^{\theta Y_{i}}] \right).
\]
This completes the proof.

Proof of Proposition 10. We make use of Lemma 43. For $\theta > 0$, we have
\[
E[e^{\theta \varepsilon_{i} A_{i}}] = \frac{e^{\theta A_{i}} + e^{-\theta A_{i}}}{2} = I + \frac{\theta^{2} A_{i}^{2}}{2} + \frac{\theta^{4} A_{i}^{4}}{4!} + \cdots \leq e^{\theta^{2} A_{i}^{2}/2},
\]
and hence
\[
\mathrm{Tr}\exp\left( \sum_{i=1}^{n} \log E[e^{\theta \varepsilon_{i} A_{i}}] \right)
\leq \mathrm{Tr}\exp\left( \frac{\theta^{2}}{2} \sum_{i=1}^{n} A_{i}^{2} \right)
\leq d\, \lambda_{\max}\left( \exp\left( \frac{\theta^{2}}{2} \sum_{i=1}^{n} A_{i}^{2} \right) \right)
= d \exp\left( \frac{\theta^{2}}{2} \lambda_{\max}\left( \sum_{i=1}^{n} A_{i}^{2} \right) \right)
= d\, e^{\theta^{2}\sigma^{2}/2}.
\]
Therefore, we have
\[
P\left\{ \lambda_{\max}\left( \sum_{i=1}^{n} \varepsilon_{i} A_{i} \right) > t \right\} \leq d \inf_{\theta > 0} e^{-t\theta + \theta^{2}\sigma^{2}/2} = d\, e^{-t^{2}/(2\sigma^{2})}.
\]
Likewise, applying the same argument to $-A_{1}, \dots, -A_{n}$, we also have
\[
P\left\{ \lambda_{\max}\left( -\sum_{i=1}^{n} \varepsilon_{i} A_{i} \right) > t \right\} \leq d \inf_{\theta > 0} e^{-t\theta + \theta^{2}\sigma^{2}/2} = d\, e^{-t^{2}/(2\sigma^{2})}.
\]
The first assertion (53) follows from combining these two inequalities.

The second assertion (54) follows from the first assertion (53). Indeed, by setting $Z := \| \sum_{i=1}^{n} \varepsilon_{i} A_{i} \|_{\mathrm{op}}$, we have
\[
E[(Z/\sigma - \sqrt{2\log(2d)})_{+}] = \int_{0}^{\infty} P\{ (Z/\sigma - \sqrt{2\log(2d)})_{+} > t \}\, dt
= \int_{0}^{\infty} P\{ Z > (\sqrt{2\log(2d)} + t)\sigma \}\, dt
\]
\[
\leq 2d \int_{0}^{\infty} e^{-(\sqrt{2\log(2d)} + t)^{2}/2}\, dt
\leq 2d \int_{0}^{\infty} e^{-(2\log(2d) + t^{2})/2}\, dt
= \int_{0}^{\infty} e^{-t^{2}/2}\, dt = \sqrt{\pi/2}.
\]
The final conclusion follows from the inequality
\[
Z \leq \sqrt{2\log(2d)}\, \sigma + (Z/\sigma - \sqrt{2\log(2d)})_{+}\, \sigma.
\]
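To close the section, here is a small simulation (my own illustration, not from the notes) contrasting the empirical covariance error $\|n^{-1}\sum_{i} X_{i}X_{i}^{\top} - \Sigma\|_{\mathrm{op}}$ with the rate $\sqrt{\log d / n}$ suggested by Theorem 32; the constant is not calibrated, only the scaling in $n$ is illustrated.

```python
import numpy as np

rng = np.random.default_rng(5)
d = 20

for n in [100, 400, 1600, 6400]:
    errs = []
    for _ in range(50):
        X = rng.normal(size=(n, d))            # here Sigma = I_d
        errs.append(np.linalg.norm(X.T @ X / n - np.eye(d), ord=2))
    rate = np.sqrt(np.log(d) / n)              # Theorem 32 scaling, up to constants
    print(f"n={n:5d}: mean error {np.mean(errs):.4f}, sqrt(log d / n) = {rate:.4f}")
```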

References

Adamczak, R. (2010). A few remarks on the operator norm of random Toeplitz matrices. J. Theoret. Probab. 23 85-108.
Adler, R.J. (1990). An Introduction to Continuity, Extrema, and Related Topics for General Gaussian Processes (IMS Lecture Notes-Monograph Series). Institute of Mathematical Statistics.
Andersen, N.T. and Dobrić, V. (1987). The central limit theorem for stochastic processes. Ann. Probab. 15 164-177.
Bartlett, P., Bousquet, O. and Mendelson, S. (2005). Local Rademacher complexities. Ann. Statist. 33 1497-1537.
Bhatia, R. (1997). Matrix Analysis. Springer.
Billingsley, P. (1968). Convergence of Probability Measures. Wiley.
Borell, C. (1975). The Brunn-Minkowski inequality in Gauss space. Invent. Math. 30 205-216.
Boucheron, S., Lugosi, G. and Massart, P. (2000). A sharp concentration inequality with applications. Random Structures Algorithms 16 277-292.
Boucheron, S., Lugosi, G. and Massart, P. (2003). Concentration inequalities using the entropy method. Ann. Probab. 31 1583-1614.
Boucheron, S., Lugosi, G. and Massart, P. (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press.
Bousquet, O. (2002). A Bennett concentration inequality and its application to suprema of empirical processes. C.R. Acad. Sci. Paris 334 495-500.
Chernozhukov, V., Chetverikov, D. and Kato, K. (2014). Gaussian approximation of suprema of empirical processes. Ann. Statist. 42 1564-1597.
Davydov, Y., Lifshits, M. and Smorodina, N. (1998). Local Properties of Distributions of Stochastic Functionals (Translations of Mathematical Monographs, Vol. 173). American Mathematical Society.
Dudley, R.M. (1967). The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. J. Functional Anal. 1 290-330.
Dudley, R.M. (1978). Central limit theorems for empirical measures. Ann. Probab. 6 899-929. (Correction: Ann. Probab. 7 909-911.)
Dudley, R.M. (1999). Uniform Central Limit Theorems. Cambridge University Press.
Dudley, R.M. (2002). Real Analysis and Probability, second edition. Cambridge University Press.
Efron, B. and Stein, C. (1981). The jackknife estimate of variance. Ann. Statist. 9 586-596.
Einmahl, U. and Li, D. (2008). Characterization of LIL behavior in Banach space. Trans. Amer. Math. Soc. 360 6677-6693.
Erdős, P. and Stone, A.H. (1970). On the sum of two Borel sets. Proc. Amer. Math. Soc. 25 304-306.
Gikhman, I.I. and Skorohod, A.V. (1974). The Theory of Stochastic Processes I. Springer.
Giné, E. (2007). Empirical Processes and Some of Their Applications. Lecture notes available from the author's website.
Giné, E. and Guillou, A. (2001). On consistency of kernel density estimators for randomly censored data: rates holding uniformly over adaptive intervals. Ann. Inst. H. Poincaré Probab. Statist. 37 503-522.
Giné, E. and Nickl, R. (2009). Uniform central limit theorems for wavelet density estimators. Ann. Probab. 37 1605-1646.
Giné, E. and Nickl, R. (2016). Mathematical Foundations of Infinite-Dimensional Statistical Models. Cambridge University Press.
Giné, E. and Zinn, J. (1984). Some limit theorems for empirical processes. Ann. Probab. 12 929-989.
Gross, L. (1975). Logarithmic Sobolev inequalities. Amer. J. Math. 97 1061-1078.
Hoffmann-Jørgensen, J. (1991). Stochastic Processes on Polish Spaces. Various Publication Series 39, Aarhus University. (The manuscript was available in 1984.)
Hoffmann-Jørgensen, J., Shepp, L.A. and Dudley, R.M. (1979). On lower tails of Gaussian seminorms. Ann. Probab. 7 319-342.
Jain, N. and Marcus, M.B. (1975). Central limit theorems for C(S)-valued random variables. J. Functional Analysis 19 216-231.
Klein, T. and Rio, E. (2005). Concentration around the mean for maxima of empirical processes. Ann. Probab. 33 1060-1077.
Koltchinskii, V.I. (1981). On the central limit theorem for empirical measures. Theor. Probab. Math. Statist. 24 71-82.
Koltchinskii, V. (2011). Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems. Lecture Notes in Math. 2033, Springer.
Landau, H.J. and Shepp, L.A. (1970). On the supremum of Gaussian processes. Sankhya 32 369-378.
Ledoux, M. (1996). On Talagrand's deviation inequalities for product measures. ESAIM Probab. Statist. 1 63-87.
Ledoux, M. (2001). The Concentration of Measure Phenomenon. American Mathematical Society.
Ledoux, M. and Talagrand, M. (1991). Probability in Banach Spaces. Springer.
Li, W.V. and Shao, Q.-M. (2001). Gaussian processes: inequalities, small ball probabilities and applications. In: Stochastic Processes: Theory and Methods, Handbook of Statistics, Vol. 19, pp. 533-598.
Lieb, E.H. (1973). Convex trace functions and the Wigner-Yanase-Dyson conjecture. Adv. Math. 11 267-288.
Marcus, M.B. and Shepp, L.A. (1971). Sample behavior of Gaussian processes. Proc. Sixth Berkeley Symp. Math. Statist. Probab. 2 423-442.
Massart, P. (2000). About the constants in Talagrand's concentration inequalities for empirical processes. Ann. Probab. 28 863-884.
Massart, P. (2007). Concentration Inequalities and Model Selection. Lecture Notes in Math. 1896, Springer.
Oliveira, R.I. (2010). Sums of random Hermitian matrices and an inequality by Rudelson. Electron. Commun. Probab. 15 203-212.
Panchenko, D. (2003). Symmetrization approach to concentration inequalities for empirical processes. Ann. Probab. 31 2068-2081.
Pisier, G. (1986). Some applications of the metric entropy condition to harmonic analysis. In: Banach Spaces, Harmonic Analysis, and Probability Theory. Lecture Notes in Math. 995, pp. 123-154. Springer.
Pisier, G. (1989). The Volume of Convex Bodies and Banach Space Geometry. Cambridge University Press.
Pollard, D. (1982). A central limit theorem for empirical processes. J. Austral. Math. Soc. Ser. A 33 235-248.
Rudelson, M. (1999). Random vectors in the isotropic position. J. Functional Anal. 164 60-72.
Slepian, D. (1962). The one-sided barrier problem for Gaussian noise. Bell System Tech. J. 41 463-501.
Steele, J.M. (1986). An Efron-Stein inequality for nonsymmetric statistics. Ann. Statist. 14 753-756.
Sudakov, V.N. (1971). Gaussian random processes and measures of solid angles in Hilbert space. Soviet Math. Dokl. 12 412-415.
Sudakov, V.N. and Tsirel'son, B.S. (1978). Extremal properties of half-spaces for spherically invariant measures. J. Soviet Math. 9 9-18.
Talagrand, M. (1996). New concentration inequalities in product spaces. Invent. Math. 126 505-563.
Talagrand, M. (2005). The Generic Chaining. Springer.
Tropp, J.A. (2012a). From the joint convexity of quantum relative entropy to a concavity theorem of Lieb. Proc. Amer. Math. Soc. 140 1757-1760.
Tropp, J.A. (2012b). User-friendly tail bounds for sums of random matrices. Found. Comput. Math. 12 389-434.
van der Vaart, A.W. and Wellner, J.A. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer.
van der Vaart, A.W. and Wellner, J.A. (2011). A local maximal inequality under uniform entropy. Electron. J. Statist. 5 192-203.
Vapnik, V.N. and Červonenkis, A.Ya. (1981). Necessary and sufficient conditions for uniform convergence of means to mathematical expectations. Theory Probab. Appl. 26 147-163.
Vershynin, R. (2010). An introduction to the non-asymptotic analysis of random matrices. arXiv:1011.3027.
