A Brief Introduction to Large Deviations Theory

Gilles Wainrib

Abstract In this paper we introduce the main concepts of large deviations theory. We state some of the main theorems, illustrated with several examples, from Cramér's theorem for sums of independent random variables to the Freidlin-Wentzell theory of random perturbations of dynamical systems.

1 Introduction

Large deviations theory is concerned with an asymptotic description of the fluctuations of a system around its most probable behavior. The first example of such a description goes back to Boltzmann's 1877 calculation [4] for a system of independent particles, establishing a fundamental link between the notion of entropy and the asymptotic exponential behavior of multinomial probabilities. The entropy of a system at equilibrium measures the number of microscopic configurations leading to a given macroscopic state, and the state of maximum entropy corresponds to the most probable state. Not only at the core of thermodynamics and statistical physics, entropy has played a major role in many areas of science. In the life sciences, entropy is an important concept, from evolution theory to protein unfolding, self-assembly and molecular motors, not to mention its links with information theory, which is widely applied in genetics and in neuroscience. Sharing the perspective of [10], large deviations theory may be viewed as a mathematical investigation of the concept of entropy. Describing fluctuations beyond the Central Limit Theorem (CLT), this theory provides exponential estimates for rare events analysis, which is a field of growing interest in many applied sciences. Let us give an illustration with an elementary example. If one throws n coins, then for large n the proportion of heads will be close to 1/2 with high probability.

Gilles Wainrib: 1 CREA (Ecole polytechnique - CNRS), 2 IJM (CNRS - Paris 7 - Paris 6), 3 LPMA (Paris 6 - Paris 7 - CNRS), e-mail: [email protected]


This is the law of large numbers. The CLT states that the typical fluctuations of order 1/√n around this value are asymptotically normally distributed. This result is valid to evaluate, for instance, the probability of having between 480 and 510 heads if n = 1000. However, if one wants to evaluate the probability of having over 700 heads, which is exponentially small in n, then it is necessary to use the information contained in the higher moments of the "coin toss" random variable, whereas the CLT only uses the first two moments. Large deviations theory provides answers to this question through an appropriate transform of the moment generating function (exponential moments) that is related to the concept of relative entropy. Historically, the first mathematical result describing the large fluctuations around its mean of a sum of independent random variables is due to Cramér [6] (Section 2). The mathematical counterpart to Boltzmann's calculation is Sanov's theorem [17] (Section 2) for the empirical measure of independent random variables. The general theoretical framework (Section 3) for this type of asymptotic result was developed afterward, in particular by Stroock and Varadhan. A key result in this framework is the Gärtner-Ellis theorem (Section 3), generalizing the results of Sanov and Cramér to the case of dependent random variables. Small random perturbations of dynamical systems (Section 4) have been investigated by Freidlin and Wentzell [13] within the framework of large deviations for sample paths, and have many applications, for example to the problem of exit from a domain. This paper is not intended to be a detailed and precise account of large deviations results, but rather an introductory guide, and we encourage the curious reader to consult the standard mathematical textbooks [8, 7] on this topic.

2 Sum of independent random variables

Consider a sequence of independent and identically distributed (i.i.d.) real random variables ξ_1, ξ_2, .... If m := E(ξ_1) < ∞, then by the strong Law of Large Numbers (LLN), the empirical average

A_n = (1/n) ∑_{i=1}^n ξ_i    (1)

converges almost surely to m when n → ∞. When n is large but finite, it is of interest to characterize the fluctuations of A_n around m. A first answer to this question is given by the Central Limit Theorem (CLT), and concerns typical fluctuations of order O(1/√n) around m. More precisely, if σ² := Var(ξ_1) < ∞, then √n(A_n − m) converges in law to a Gaussian random variable N(0, σ²). However, the CLT does not properly describe fluctuations larger than O(1/√n). From the LLN, we know that for a > m, p_n(a) := P[A_n > a] converges to 0 when n → ∞, and we would like to estimate the speed of this convergence according to the value of a. The event {A_n > a} is often called a rare event, since we will see that p_n(a) becomes exponentially small when n is large. A first historical example comes from a problem related to the insurance industry: if X_i is the claim of policy


holder i, what is the probability that the total claim exceeds na, with a > m? That is, the focus is on the distribution tail of the total claim. Such a question is crucial, since the insurance company may not be able to refund policy holders above a critical value na*, and p_n(a*) is then the probability of ruin. Contrary to the CLT, where only the first two moments of X_1 characterize the asymptotic rescaled behavior, describing rare events requires exponential moments in order to capture the distribution tail behavior. On the exponential scale, rare events have on average a significant contribution.

Theorem 2.1 (Cramér [6]) Assume E[e^{θX_1}] < ∞ for all θ ∈ R and define:

Λ(θ) := ln E[e^{θX_1}]  and  I(x) := sup_{θ∈R} {θx − Λ(θ)}    (2)

the Legendre transform of Λ. Then, for all a ∈ ]m, 1] and a' ∈ [0, m[:

lim_{n→∞} (1/n) ln P[A_n > a] = − inf_{x>a} I(x)  and  lim_{n→∞} (1/n) ln P[A_n < a'] = − inf_{x<a'} I(x)    (3)

The complete proof can be found in [8]. An upper bound on n^{−1} ln p_n(a) can be obtained from

E[e^{nθA_n}] = E[e^{θX_1}]^n ≥ e^{nθa} p_n(a)  for θ ≥ 0,

so that n^{−1} ln p_n(a) ≤ Λ(θ) − θa, and optimizing over θ,

n^{−1} ln p_n(a) ≤ − sup_{θ∈R} {θa − Λ(θ)} = −I(a).

The lower bound is less straightforward and can be derived using an appropriate change of probability measure. We will see in the next section that A_n is said to satisfy a large deviation principle of speed n and rate function I.

Example 2.1 (Coin tossing) We want to estimate the probability of having k = na heads in n throws. The random variables ξ_i are then Bernoulli variables with P[ξ_i = 1] = 1/2. To apply Cramér's Theorem, we compute Λ(θ) = ln((e^θ + 1)/2) < ∞; then I(x) = sup_{θ∈R} {θx − Λ(θ)} is obtained by solving x = Λ'(θ), whose solution is θ*(x) = ln(x/(1 − x)), which gives

I(x) = x ln(x) + (1 − x) ln(1 − x) + ln(2)    (4)

A plot of I is given in Fig. 1: I is non-negative and has a unique zero at x = 1/2. Thus, by Cramér's Theorem, for a > 1/2:

lim_{n→∞} (1/n) ln P[A_n > a] = − inf_{x>a} I(x) = −a ln(a) − (1 − a) ln(1 − a) − ln(2)


Fig. 1 Rate function I(x) = x ln(x) + (1 − x) ln(1 − x) + ln(2) for the coin tossing example.

This asymptotic result for P[A_n > a] can be obtained directly, since P(nA_n = k) = 2^{−n} n!/(k!(n−k)!), and using Stirling's approximation n! ≈ n^n e^{−n} one retrieves the same expression for I(x). An elementary calculation shows that nI(x) is also the relative entropy, or Kullback-Leibler divergence, between the Binomial(n, x) and Binomial(n, 1/2) distributions. A given macroscopic state A_n = a can be achieved by many microscopic states (X_1, ..., X_n) ∈ {0, 1}^n. Essentially, entropy counts the number of those microscopic states. Saying that the maximum of entropy corresponds to the most likely realization A_n = 1/2 is equivalent to the fact that I(x) has its minimum (zero) at x = 1/2.
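The exponential decay rate in Cramér's theorem for coin tossing can be checked numerically against the exact binomial tail. The following sketch (plain Python, standard library only, not part of the original text) compares −(1/n) ln P[A_n > a] with I(a) for a = 0.7:

```python
import math

def rate(x):
    """Cramér rate function I(x) for fair coin tossing, Eq. (4)."""
    if x in (0.0, 1.0):
        return math.log(2)
    return x * math.log(x) + (1 - x) * math.log(1 - x) + math.log(2)

def tail_decay(n, a):
    """Exact -(1/n) ln P[A_n > a] computed from the binomial distribution."""
    p = sum(math.comb(n, k) for k in range(int(n * a) + 1, n + 1)) / 2 ** n
    return -math.log(p) / n

# The decay rate approaches I(0.7) ~ 0.0823 as n grows,
# up to O(log n / n) corrections coming from the prefactor.
print(rate(0.7), tail_decay(1000, 0.7))
```

For n = 1000 the exact decay rate already agrees with I(0.7) to within a few thousandths, while a Gaussian (CLT) estimate of the same tail would be off by orders of magnitude on the exponential scale.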

Empirical measure

It is possible to generalize Cramér's Theorem to the empirical measure associated with the sequence (ξ_i)_{i≥1}. We assume that the ξ_i take values in a finite set E = {1, ..., d}, are i.i.d. with distribution (ρ_k)_{k∈E}, and that each ρ_k > 0. We define the empirical measure:

L_n = (1/n) ∑_{i=1}^n δ_{ξ_i}    (5)

The empirical measure is a random probability measure on E: it belongs to the probability simplex M(E) = {ν ∈ [0,1]^d ; ∑_{k=1}^d ν_k = 1}. Our purpose is to estimate the probability that L_n is away from ρ. We thus need a distance on M(E): we consider the total variation distance d(µ, ν) = (1/2) ∑_{s=1}^d |µ_s − ν_s|. The strong LLN implies that lim_{n→∞} d(L_n, ρ) = 0 with probability one. We define the ball of radius a > 0 with respect to this distance, B_a(ρ) = {ν ∈ M(E); d(ν, ρ) ≤ a}, and its complement B_a^c(ρ) = M(E) \ B_a(ρ).


Theorem 2.2 (Sanov) For all a > 0, writing B_a^c(ρ) = M(E) \ B_a(ρ) for the complement of the ball,

lim_{n→∞} (1/n) ln P[L_n ∈ B_a^c(ρ)] = − inf_{ν∈B_a^c(ρ)} I_ρ(ν)    (6)

with

I_ρ(ν) := ∑_{s=1}^d ν_s ln(ν_s / ρ_s)    (7)

This result can be proved directly, using Stirling’s approximation to the multinomial law satisfied by Ln . The quantity Iρ (ν) is actually the relative entropy H(ν, ρ) of ν with respect to ρ. More details can be found in [17, 8].
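Sanov's theorem can likewise be checked on a single "type": the probability that L_n equals a fixed empirical measure ν decays at rate H(ν, ρ). A minimal sketch (the distribution ρ and the counts below are illustrative choices, not taken from the text):

```python
import math

def rel_entropy(nu, rho):
    """Relative entropy H(nu, rho) = sum_s nu_s ln(nu_s / rho_s), Eq. (7)."""
    return sum(n * math.log(n / r) for n, r in zip(nu, rho) if n > 0)

def type_decay(counts, rho):
    """Exact -(1/n) ln P(L_n = nu) for the type with the given counts,
    using log-Gamma to avoid huge factorials in the multinomial law."""
    n = sum(counts)
    logp = math.lgamma(n + 1) - sum(math.lgamma(k + 1) for k in counts)
    logp += sum(k * math.log(r) for k, r in zip(counts, rho))
    return -logp / n

rho = (0.5, 0.25, 0.25)
counts = (1200, 900, 900)          # n = 3000, nu = (0.4, 0.3, 0.3)
nu = tuple(k / 3000 for k in counts)
print(rel_entropy(nu, rho), type_decay(counts, rho))
```

The two printed values agree up to the usual O(log n / n) Stirling correction, which is the direct-proof route mentioned above.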

3 General theory

Cramér's and Sanov's Theorems presented in the previous section can be seen as specific examples within a wider theory of asymptotic results concerning large fluctuations of various random objects. Large deviations theory has been developed by several authors, in particular Stroock and Varadhan. A common general framework is to consider a sequence of probability spaces (Ω_n, F_n, P_n) and a sequence of random variables (X_n) taking values in a complete separable metric space S. To the sequence (X_n) is associated the sequence of its laws (P_n), defined by P_n(C) = P_n(X_n ∈ C). For instance, in the case of stochastic processes, S is a function space and P_n is the law of the process X_n. Let (a_n) be such that lim_{n→∞} a_n = +∞.

Definition 3.1 (Large Deviation Principle (LDP)) The sequence (X_n) satisfies a large deviation principle of speed a_n and rate function I(x) if:

1. For all closed subsets C of S, lim sup_{n→∞} (1/a_n) ln P_n(C) ≤ − inf_{x∈C} I(x) =: −I(C)
2. For all open subsets O of S, lim inf_{n→∞} (1/a_n) ln P_n(O) ≥ − inf_{x∈O} I(x) =: −I(O)
3. I is lower semi-continuous with compact level sets.

In terms of notation, instead of 1. and 2., we will write P_n(K) ≈ e^{−a_n I(K)}. As a first example, one can show that Cramér's Theorem can be reformulated as: (A_n) satisfies a LDP of speed n and rate function I(x) given in (4). The natural questions arising after this definition are how to prove a LDP and how to compute the rate function I. A rather general answer is given by the fundamental theorem of Gärtner-Ellis [14, 11], originally stated in finite dimension and later generalized to infinite dimension [2]. We consider the case where the X_n are R^d-valued. Let λ_n(θ) := a_n^{−1} ln E[e^{a_n θ·X_n}] be the scaled cumulant generating function, for θ ∈ R^d.

Theorem 3.1 (Gärtner-Ellis) If λ(θ) := lim_{n→∞} λ_n(θ) exists, is finite, and is differentiable for all θ ∈ R^d, then (X_n) satisfies a LDP of speed a_n and rate function


I(x) = sup_{θ∈R^d} {θ·x − λ(θ)}    (8)

To give some heuristics behind the derivation of the rate function, suppose that a LDP effectively holds, and denote by ρ_n(x) the density of X_n. Then:

E[e^{a_n θ·X_n}] = ∫ e^{a_n θ·x} ρ_n(x) dx ≈ ∫ e^{a_n θ·x} e^{−a_n I(x)} dx    (9)
               ≈ exp( a_n sup_x {θ·x − I(x)} )    (10)

where the last line is obtained by the Laplace principle. This principle is a general result enabling one to approximate integrals of the form ∫_A exp(ξφ(x)) dx by exp(ξ sup_{x∈A} φ(x)) for large ξ, in the sense that:

lim_{ξ→∞} (1/ξ) ln ∫_A exp(ξφ(x)) dx = sup_{x∈A} φ(x)

We refer to [7] for more details about the Laplace principle. As a consequence, λ_n(θ) → λ(θ) = sup_x {θ·x − I(x)}, and as λ(θ) is differentiable, one can show that I is strictly convex and the Legendre transform is involutive, in the sense that λ(θ) = sup_x {θ·x − I(x)} is equivalent to (8).

Example 3.1 Applying the Gärtner-Ellis theorem to A_n as defined in (1), with a_n = n, yields λ_n(θ) = n^{−1} ln E[e^{nθA_n}] → ln E[e^{θξ_1}] = Λ(θ). In this case of a sum of i.i.d. random variables, Λ(θ) is analytic and in particular differentiable. The strength of the Gärtner-Ellis Theorem is that it deals with sums of dependent variables as well.
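The Legendre transform in (8) is easy to evaluate numerically. As a sanity check (a sketch, not part of the original text), one can recover the coin-tossing rate function (4) by maximizing θx − Λ(θ) over a grid of θ values:

```python
import math

def cgf(theta):
    """Cumulant generating function of a fair Bernoulli variable: ln((e^theta + 1)/2)."""
    return math.log((math.exp(theta) + 1) / 2)

def legendre(x, lo=-20.0, hi=20.0, steps=200000):
    """Numerical Legendre transform I(x) = sup_theta {theta*x - cgf(theta)} on a grid."""
    h = (hi - lo) / steps
    return max((lo + i * h) * x - cgf(lo + i * h) for i in range(steps + 1))

# Closed form from Eq. (4):
closed = lambda x: x * math.log(x) + (1 - x) * math.log(1 - x) + math.log(2)
print(legendre(0.7), closed(0.7))
```

The grid maximum matches the closed form to high precision; the same few lines apply verbatim to any cumulant generating function for which no closed-form transform is available.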

Remark on convex and non-convex rate functions: We have presented here a weak version of the Gärtner-Ellis theorem: the differentiability condition on λ(θ) can be weakened. It is related to the convexity properties of the rate function. At this stage, remark that if a LDP is obtained with the Gärtner-Ellis theorem, then the rate function is necessarily strictly convex. This comes from the fact that the Legendre transform of a differentiable function (here λ) is necessarily strictly convex ([16]). Hence, this theorem cannot be used to obtain non-convex rate functions, in particular rate functions with several minima. A detailed discussion of this question, as well as several interesting examples, can be found in [20].


Varadhan's Lemma and change of measure: Note that a more general convergence result, known as Varadhan's Lemma, extends the Laplace approximation to a wider setting, namely:

λ(f) = lim_{n→∞} a_n^{−1} ln E[e^{a_n f(X_n)}] = sup_x {f(x) − I(x)}

Another form of this result can be used to derive a LDP from another one: if (X_n) satisfies a LDP of speed n with rate function I_X, and if (Y_n) is such that its law P_n^Y is defined by

P_n^Y(A) = ( ∫_A e^{nF(x)} P_n^X(dx) ) / ( ∫_S e^{nF(x)} P_n^X(dx) )    (11)

(this means that the relative entropy between P_n^X and P_n^Y is of order n), then (Y_n) satisfies a LDP of speed n with rate function

I_Y(x) = sup_{y∈S} {F(y) − I_X(y)} − (F(x) − I_X(x))    (12)

Contraction principle: Another useful tool to obtain a LDP deals with the case of a sequence (Y_n) defined as Y_n = F(X_n), with F continuous, knowing that (X_n) satisfies a LDP of speed a_n and rate function I_X. The contraction principle states that (Y_n) satisfies a LDP of the same speed a_n and rate function

I_Y(y) = inf_{x: F(x)=y} I_X(x)

Remark that with the contraction principle, one can deduce Cramér's Theorem from Sanov's Theorem, with the function F : M(E) → R such that F(ν) = ∑_{k=1}^d k ν_k.

Relationship between LDP, LLN and CLT: To conclude this section, we go back to the LLN and the CLT, which can be derived from a large deviation principle. We consider the case where the assumptions of the Gärtner-Ellis theorem hold. If the rate function I(x) has a global minimum at x* with I(x*) = 0, then x* = λ'(0) = lim_{n→∞} E(X_n). More precisely, X_n concentrates around x*, since the P_n-probability of any neighborhood of x* converges to 1 exponentially fast when n → ∞. Note that in this case, one also has I'(x*) = 0. Moreover, if I is twice differentiable at x*, then I(x) ≅ (1/2) I''(x*)(x − x*)², so that P_n(dx) ≈ e^{−(n/2) I''(x*)(x−x*)²}. For i.i.d. sums, I''(x*) = 1/λ''(0) = 1/σ², as expected from the CLT. However, this relationship between LDP and CLT requires some specific assumptions to be valid [3], and the two following examples show that this question may be delicate.
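For the fair-coin example this Gaussian link is easy to verify numerically: a second finite difference of the rate function (4) at x* = 1/2 should give 1/σ² = 4, since σ² = 1/4 for a fair Bernoulli variable. A small sketch:

```python
import math

def rate(x):
    """Rate function I(x) of Eq. (4) for fair coin tossing."""
    return x * math.log(x) + (1 - x) * math.log(1 - x) + math.log(2)

def second_derivative(f, x, h=1e-4):
    """Central second finite difference of f at x."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2

# I''(1/2) should equal 1/sigma^2 = 4.
print(second_derivative(rate, 0.5))
```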


Example 3.2

1. A large deviations principle does not imply the central limit theorem [5]. Consider symmetric random variables {X_t}_{t≥1} with distributions P(|X_t| > x) = exp(−x²t). The moment generating functions

E[exp(tyX_t)] = 1 + (1/2) y t^{−1/2} exp(ty²/4) ∫_{−y√t/2}^{y√t/2} e^{−u²} du

are analytic; their normalized logarithms are real-analytic and converge to the (analytic) limit L(y) = (1/4)y², but the convergence holds for real arguments y only, and the central limit theorem fails. On the other hand, the large deviation principle holds with the Gaussian rate function.

2. The central limit theorem does not imply that the rate function has a quadratic minimum [20]. Let S_n be the mean of n i.i.d. random variables X_1, X_2, ..., X_n distributed according to the Pareto density

p(x) = a / (|x| + b)^β

with β > 3 and a, b > 0. For β > 3, the variance is finite and the CLT holds for n^{1/2} S_n. However, the rate function of S_n is everywhere equal to zero (since the density of S_n has the same power-law tails as p(x)).

4 Some large deviations principles for stochastic processes

4.1 Sanov Theorem for Markov chains

Let ξ_1, ξ_2, ... be a Markov chain on a finite state space E = {1, ..., d}, with transition matrix Q = (Q_{ij})_{i,j∈E}. We assume that Q_{ij} > 0 for all i, j ∈ E. We keep the same notation as in Section 2, and define the empirical measure:

L_n = (1/n) ∑_{i=1}^n δ_{ξ_i}    (13)

If n is seen as time, L_n(k) is the proportion of time the chain spends in state k ∈ E. Under our assumptions, the stationary distribution π of the Markov chain is unique, and we know that L_n converges almost surely to π. We ask the question of the deviations of L_n from π. Here, the appropriate space is M(E), with the total variation distance, which constitutes a complete separable metric space on which the general theory applies. We present here a theorem for Markov chains in discrete time, but a similar result exists in the continuous-time setting.

Theorem 4.1 The sequence (L_n) satisfies a LDP on M(E) of speed n with rate function:

I_Q(ν) = sup_{u∈(0,∞)^d} [ − ∑_{k=1}^d ν_k ln( (Qu)_k / u_k ) ]    (14)

The rate function I_Q is finite, nonnegative, continuous and strictly convex on M(E), and the stationary distribution π is the only zero of I_Q. There are two main ways to prove this theorem. For both, the idea is to introduce the pair empirical measure Z_n = n^{−1} ∑_{i=1}^n δ_{(ξ_i, ξ_{i+1})} and then to go back to the empirical measure by applying the contraction principle. The first way [10] is based on the relative entropy method: to obtain a LDP for the Markov chain (a dependent sequence), one uses an existing LDP in the independent case and then shifts the rate function of the independent case by the relative entropy between the dependent and independent laws. This method of relative entropy is also useful to prove more difficult large deviation principles for systems of interacting particles [21]. The other method [8] is to apply the Gärtner-Ellis Theorem to the sequence Z_n: with θ ∈ R^d × R^d,

λ_n(θ) = (1/n) ln E[e^{nθ·Z_n}] = (1/n) ln ∑_{i,j=1}^d π_i (P(θ)^n)_{ij}    (15)

where P_{ij}(θ) = Q_{ij} e^{θ_{ij}}. By Perron-Frobenius theory, one can show that λ_n(θ) converges, as n → ∞, to λ(θ), the logarithm of the unique largest eigenvalue of P(θ). Then, working on the Legendre transform of λ(θ), one ends up with a formula for the rate function, and one concludes by the contraction principle.

Example 4.1 Consider a Markov chain with two states 0 and 1 and transition probabilities Q_00 = Q_11 = p and Q_01 = Q_10 = 1 − p. Then the rate function for the empirical measure is obtained by finding the supremum in (14), which can be rewritten by setting v = u_0/u_1:

I_Q(ν) = sup_v [ −ν_0 ln(p + (1 − p)/v) − ν_1 ln(p + (1 − p)v) ]

The supremum is attained for v solving pν_1v² − (1 − p)(ν_0 − ν_1)v − pν_0 = 0, which gives a complicated expression for I_Q. In the case p = 1/2, the Markov chain is actually just a sequence of i.i.d. random variables, and one finds that I_Q(ν) is the relative entropy between the distribution ν and the distribution ρ = (1/2, 1/2).
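The optimization above is explicit enough to code directly. The sketch below (an illustration, not part of the original text) solves the quadratic stationarity condition for v and checks that, for p = 1/2, the Markov-chain rate function collapses to the i.i.d. relative entropy of Sanov's theorem; the test value ν = (0.7, 0.3) is an arbitrary choice:

```python
import math

def rate_two_state(nu0, p):
    """Rate function I_Q(nu) of the symmetric two-state chain, from Eq. (14)
    reduced to a scalar supremum over v = u_0/u_1."""
    nu1 = 1 - nu0
    # Stationarity condition: p*nu1*v^2 - (1-p)*(nu0-nu1)*v - p*nu0 = 0
    b = (1 - p) * (nu0 - nu1)
    v = (b + math.sqrt(b * b + 4 * p * p * nu0 * nu1)) / (2 * p * nu1)
    return -nu0 * math.log(p + (1 - p) / v) - nu1 * math.log(p + (1 - p) * v)

def rel_entropy_half(nu0):
    """Relative entropy of (nu0, 1-nu0) with respect to (1/2, 1/2)."""
    nu1 = 1 - nu0
    return nu0 * math.log(2 * nu0) + nu1 * math.log(2 * nu1)

print(rate_two_state(0.7, 0.5), rel_entropy_half(0.7))  # agree when p = 1/2
```

For p ≠ 1/2 the same function evaluates the "complicated expression" mentioned above, and it vanishes at the stationary distribution ν = (1/2, 1/2) as Theorem 4.1 requires.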

4.2 Small noise and Freidlin-Wentzell theory In the above discussion on Markov chains, the asymptotic parameter n can be interpreted as time. Here, we are interested in a situation where the level of noise is the asymptotic parameter and is going to zero. Thus, the focus is on the behavior of a system around its deterministic trajectory. Two such situations arise naturally in


biological models. First, dynamical systems may be subject to small external random perturbations, which leads in particular to the study of stochastic differential equations as ε → 0:

dX_t^ε = b(X_t^ε) dt + ε σ(X_t^ε) dW_t

Another common situation arises when considering a population of continuous-time Markov chains, which appears for instance in chemical kinetics or epidemiological models. In this case, the "noise" parameter is the inverse size of the population: indeed, when the population is infinite, the law of large numbers ensures a deterministic limit for an aggregate variable (such as the proportion of individuals in a given state); however, when the population size is finite, fluctuations remain, and large deviations theory may help in characterizing these finite-size intrinsic fluctuations around the deterministic limit. Both cases are included in a more general setting, where the process X_t^h is a Markov process with initial distribution P_x^h and infinitesimal generator defined, for f ∈ C² with compact support, by:

A^h f(x) = ∑_i b_i(x) f'_i(x) + (h/2) ∑_{i,j} a_{ij}(x) f''_{ij}(x) + (1/h) ∫_{R^r \ {0}} [ f(x + hβ) − f(x) − h ∑_i β_i f'_i(x) ] µ_x(dβ)

where µ_x is a measure on R^r \ {0} such that ∫ |β|² µ_x(dβ) < ∞. The first term in the above sum corresponds to the drift, the second term to the diffusion, and the third term to the jumps. The small noise assumption is twofold here: the diffusion is multiplied by a small parameter h, and the jumps are assumed to be small, of order h, with a frequency of order 1/h, which is typically the case when studying proportions in a population of size 1/h. In the context of stochastic processes, the state space is a function space and the rate function is a functional of a given trajectory. With A a set of trajectories, we are interested in quantities of the form:

lim_{h→0} κ(h)^{−1} ln P[ (X_t^h)_{t≥0} ∈ A ]    (16)

If X^h satisfies a LDP when h → 0, with speed κ(h) and rate function I, then the above limit will be roughly − inf_{φ∈A} I(φ).

4.2.1 Action functional

Our aim is to show how to construct the rate function I from the generator A^h. Of course, several technical conditions will be required for the LDP to be valid. We are going to consider exponential moments and Legendre transforms, by analogy with Cramér's theorem for sums of independent variables (here the role


played by those independent variables is played by the independent increments of the process), or with the Gärtner-Ellis theorem.

Definition 4.1 Let H(x, α) = h e^{−h^{−1}α·x} (A^h e^{h^{−1}α·})(x), called the Hamiltonian:

H(x, α) = ∑_i b_i(x)α_i + (1/2) ∑_{i,j} a_{ij}(x)α_iα_j + ∫_{R^r \ {0}} [ e^{(α,β)} − 1 − (α,β) ] µ_x(dβ)

Then we denote by L(x, β) the Lagrangian, defined as the Legendre transform of H(x, α) in its second variable:

L(x, β) = sup_α {(α, β) − H(x, α)}

Definition 4.2 For an R^r-valued function φ_t, T_1 ≤ t ≤ T_2, we define the action functional:

S_{T_1T_2}(φ) = ∫_{T_1}^{T_2} L(φ_t, φ̇_t) dt  if φ is absolutely continuous and the integral converges, and S_{T_1T_2}(φ) = +∞ otherwise.

Under some restrictions on H and L, Freidlin and Wentzell prove a theorem (Thm 2.1 p. 146 [13]) that establishes a LDP for X^h:

Theorem 4.2 (Freidlin-Wentzell) Under the following assumptions:

1. There exists an everywhere finite nonnegative convex function Ĥ(α) such that Ĥ(0) = 0 and H(x, α) ≤ Ĥ(α) for all x, α.
2. The function L(x, β) is finite for all values of the arguments; for any R > 0 there exist positive constants M and m such that L(x, β) ≤ M, |∂_β L(x, β)| ≤ M and ∑_{i,j} (∂²L/∂β_i∂β_j)(x, β) c_i c_j ≥ m ∑_i c_i² for all x, all c ∈ R^r and all β with |β| < R.
3. Δ_L(δ') = sup_{|y−y'|<δ'} sup_β [L(y', β) − L(y, β)] / (1 + L(y, β)) → 0 as δ' → 0.

Then the process (X_t^h)_{t∈[0,T]} satisfies a LDP with action functional S_{0T}(φ) and speed h^{−1} as h → 0, uniformly in the initial point x.

Case 1: Diffusion process

For the Gaussian perturbation case, under the assumptions that the drift and diffusion coefficients are bounded and uniformly continuous, and that the diffusion matrix is uniformly non-degenerate, we can apply the above theorem and we find, with (a^{ij}) the inverse of the diffusion matrix (a_{ij}):

S_{0,T}(φ) = (1/2) ∫_0^T ∑_{i,j} a^{ij}(φ_t) (φ̇_t^i − b_i(φ_t)) (φ̇_t^j − b_j(φ_t)) dt

This remains true when the drift b^h depends on h, provided b^h converges uniformly to b (Thm 3.1 Chap. 5 in [13]). Note that if φ_t is a solution of φ̇_t = b(φ_t), then S_{0,T}(φ) = 0, which is consistent with the limit ε → 0. The value of S_{0,T}(φ) quantifies on a logarithmic scale the "probability" that X^ε follows a given trajectory φ between 0 and T. A special case is X_t^ε = εW_t. This case was considered before Freidlin-Wentzell by Schilder [18], and the action functional reads:

S_{0,T}(φ) = (1/2) ∫_0^T |φ̇_t|² dt

It is possible to use the contraction principle to derive a LDP for a wide class of Stochastic Differential Equations (SDEs) from Schilder's LDP for Brownian motion. Notice also that by scaling, εW_t has the same law as W_{ε²t}, so that the LDP also provides information about the small-time behavior of Brownian motion (see [1]).
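Schilder's theorem can be illustrated on the level-crossing event {sup_{t≤1} εW_t > 1}: the cheapest path in the action (1/2)∫|φ̇_t|² dt reaching level 1 by time 1 is the straight line, with cost 1/2, so −ε² ln P should tend to 1/2. The sketch below (an illustration added here, using the exact reflection-principle formula P(sup_{t≤T} W_t > a) = 2P(W_T > a) rather than simulation) checks this:

```python
import math

def log_crossing_prob(eps):
    """ln P(sup_{t<=1} eps*W_t > 1) = ln erfc(1/(eps*sqrt(2))), by the reflection principle."""
    return math.log(math.erfc(1 / (eps * math.sqrt(2))))

for eps in (0.2, 0.1, 0.05):
    # -eps^2 * ln P should approach the action of the straight-line path, 1/2.
    print(eps, -eps**2 * log_crossing_prob(eps))
```

The printed values decrease monotonically toward 1/2 as ε shrinks, which is exactly the Schilder rate for this event.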

Case 2: Markov jump process

For h > 0, let X_t^h be a Markov jump process with state space E_h (the points that are multiples of h), intensity λ_h(x) = h^{−1}[r(x) + l(x)], and jump law µ_h(x, x + h) = r(x)/(r(x) + l(x)) and µ_h(x, x − h) = l(x)/(r(x) + l(x)), where r and l are two nonnegative and bounded real functions. This means that the process jumps to x + h with rate h^{−1} r(x) and to x − h with rate h^{−1} l(x). Consider as an example a population of N = 1/h individuals, each one jumping between states 0 and 1 with rates A_{ij}, for i, j ∈ {0, 1}, i ≠ j, and define X^h(t) as the proportion of individuals in state 1 at time t. In this case r(x) = (1 − x)A_{0,1} and l(x) = xA_{1,0}. Here,

H(x, α) = (e^α − 1) r(x) + (e^{−α} − 1) l(x)

L(x, β) = β ln( (β + √(β² + 4r(x)l(x))) / (2r(x)) ) − √(β² + 4r(x)l(x)) + l(x) + r(x)

and conditions 1, 2, 3 of Theorem 4.2 are satisfied, so that the action functional for this process is given by

S(φ) = ∫_0^T L(φ_t, φ̇_t) dt  if φ is absolutely continuous and the integral converges, and S(φ) = +∞ otherwise.

One checks without difficulty that L(φ_t, φ̇_t) = 0 when φ̇_t = r(φ_t) − l(φ_t), which corresponds to the deterministic limit h → 0 coming from the law of large numbers.
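As a consistency check (a sketch added here, with illustrative rates A_{0,1} = 2 and A_{1,0} = 1), the closed-form Lagrangian above can be compared with a direct numerical Legendre transform of H, and one can verify that it vanishes on the deterministic flow β = r(x) − l(x):

```python
import math

A01, A10 = 2.0, 1.0                      # illustrative jump rates
r = lambda x: (1 - x) * A01              # upward jump rate
l = lambda x: x * A10                    # downward jump rate

def lagrangian(x, beta):
    """Closed-form L(x, beta) for the birth-death jump process."""
    s = math.sqrt(beta**2 + 4 * r(x) * l(x))
    return beta * math.log((beta + s) / (2 * r(x))) - s + l(x) + r(x)

def lagrangian_numeric(x, beta, lo=-10.0, hi=10.0, steps=100000):
    """L(x, beta) = sup_alpha {alpha*beta - H(x, alpha)} on a grid."""
    H = lambda a: (math.exp(a) - 1) * r(x) + (math.exp(-a) - 1) * l(x)
    h = (hi - lo) / steps
    return max((lo + i * h) * beta - H(lo + i * h) for i in range(steps + 1))

print(lagrangian(0.3, 0.5), lagrangian_numeric(0.3, 0.5))
print(lagrangian(0.3, r(0.3) - l(0.3)))   # zero on the deterministic flow
```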


4.2.2 Quasipotential and asymptotic estimates for the problem of exit from a domain

The purpose of this section is to show how a LDP can be used to solve asymptotically exit problems, which arise very frequently in applications. For instance, in population models this is related to the question of extinction, and in neuronal models to the threshold crossing responsible for spike generation [19]. Let D be a domain of R^r with a smooth boundary ∂D. Let us distinguish two different cases:

1. If the deterministic limit x_t starting at a point x ∈ D exits from D in a finite time T, then the stochastic process will also leave D in finite time, with probability tending to 1 as h → 0, through a point of the boundary close to the deterministic exit point x_T.
2. In the case where, for all x ∈ ∂D, (b(x), n(x)) < 0 with n the exterior normal, x_t does not leave D, but X_t^h will leave D with probability 1 as h → 0.

To determine the exit time and the exit point, we introduce the quasipotential, the infimum of the action functional over trajectories starting at x ∈ D and ending on the boundary. Suppose we are in the second case, and suppose that O is an asymptotically stable equilibrium point such that, for all x ∈ D, x_t(x) → O as t → ∞ without leaving D (we say that D is attracted to O). In the case of gradient systems, the problem of exit from a domain is well studied, and the most famous result is of course Kramers' escape rate for a double-well potential. However, in the non-gradient case, a quantity called the quasipotential plays the role of a "probabilistic landscape", with reference to the language of energy landscapes for potentials.

Definition 4.3 We define the quasipotential as:

V(x, y) = inf { S_{0T}(φ); φ absolutely continuous, T > 0, φ_0 = x, φ_T = y }

Note that for gradient systems perturbed by additive white noise (constant diffusion coefficient), the quasipotential is just twice the actual potential. With this setting we have the following theorem:

Theorem 4.3 (Freidlin-Wentzell)

1. For the mean exit time τ^h := inf{t; X_t^h ∉ D}: for all x ∈ D,

lim_{h→0} κ(h)^{−1} ln E_x[τ^h] = inf_{y∈∂D} V(O, y) =: V_0

2. For the exit point: if there exists a unique y_0 ∈ ∂D such that V(O, y_0) = inf_{y∈∂D} V(O, y), then ∀δ > 0, ∀x ∈ D,

lim_{h→0} P_x[ |X^h_{τ^h} − y_0| < δ ] = 1

Our aim is now to consider a situation where one can obtain an analytical expression for the quasipotential, which is generally not possible. Numerical methods are presented in [23]. We also want to show, in the following example, what we announced in the introduction, namely that a LDP is a sharper result than the CLT, and that the difference can be dramatic when considering rare events. This example and further theoretical results can be found in [15].

Example 4.2 We recall the example stated above in Case 2. With N = 1/h, consider a population of N individuals, each one jumping between states 0 and 1 with rates A_{ij}, for i, j ∈ {0, 1}, i ≠ j, and define X^h(t) as the proportion of individuals in state 1 at time t. Our aim is to compare the behavior of this jump process with a diffusion approximation obtained in the asymptotic regime of large population size. First, we recall from [12] that, when N → ∞, X^h converges in probability on finite time intervals to the solution of the deterministic differential equation ẋ = (1 − x)A_{0,1} − xA_{1,0}. Moreover, it is possible to build a diffusion approximation, also called Langevin approximation, X̃^h of the process X^h as:

dX̃^h(t) = [r(X̃^h(t)) − l(X̃^h(t))] dt + √h √(r(X̃^h(t)) + l(X̃^h(t))) dW_t

where r(x) = (1 − x)A_{0,1} and l(x) = xA_{1,0}. To compare X^h and X̃^h, we consider the problem of exit from a domain and apply Theorem 4.3. To this end, we need to compute the quasipotentials associated with X^h and X̃^h. Obtaining the Hamiltonians H_M and H_L associated respectively with X^h and X̃^h is the first step towards this computation.
By Theorem 4.3 Chap. 5 p. 159 of [13], we have a way to compute the quasipotential: find a function U, vanishing at x_0 := A_{0,1}/(A_{0,1} + A_{1,0}), continuously differentiable, satisfying H(x, U'(x)) = 0 for x ≠ x_0, and such that U'(x) ≠ 0 for x ≠ x_0, where:

• in the jump Markov case: H_M(x, α) = (e^α − 1) r(x) + (e^{−α} − 1) l(x)
• in the Langevin approximation: H_L(x, α) = (r(x) − l(x))α + (1/2)(r(x) + l(x))α²

Here we note that H_L is the second-order expansion of H_M in α. Solving H_L(x, U_L'(x)) = 0 and H_M(x, U_M'(x)) = 0, we can find the quasipotentials U_L and U_M explicitly:

• in the jump Markov case:

U_M(x) = ∫_{x_0}^x ln( l(u)/r(u) ) du

• in the Langevin approximation:

U_L(x) = −2 ∫_{x_0}^x ( r(u) − l(u) ) / ( r(u) + l(u) ) du


Then, consider the double-barrier exit problem. Define the first passage times τ_a^h := inf{t ≥ 0, X^h(t) < a} and τ_b^h := inf{t ≥ 0, X^h(t) > b}, with 0 < a < x_0 < b < 1, and suppose X^h(0) = x_0, the stable equilibrium point of the deterministic equation. Then, from Theorem 4.3, the probability P[τ_a^h < τ_b^h] of escaping first through a tends to 1 if the value of the quasipotential at a is strictly below its value at b, and tends to 0 otherwise. For some values of the parameters, the following situation arises: U_M(a) < U_M(b) but U_L(a) > U_L(b). This means that for small h, the original Markov jump process will almost always escape through a, whereas its diffusion approximation, derived from a CLT, will almost always escape through b, as shown in more detail in [15].
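Such a disagreement between U_M and U_L can be exhibited numerically. In the sketch below the rates and barriers (A_{0,1} = 1, A_{1,0} = 4, a = 0.003, b = 0.48) are illustrative choices made for this note, not the parameters of [15]; with them the same phenomenon appears with the roles of a and b exchanged: the jump process prefers the barrier b while its Langevin approximation prefers a.

```python
import math

A01, A10 = 1.0, 4.0                 # illustrative jump rates (not from [15])
r = lambda u: (1 - u) * A01
l = lambda u: u * A10
x0 = A01 / (A01 + A10)              # stable equilibrium, here 0.2

def quad(f, a, b, n=20000):
    """Midpoint rule for the integral of f from a to b; the sign handles b < a."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

def U_M(x):
    """Quasipotential of the Markov jump process."""
    return quad(lambda u: math.log(l(u) / r(u)), x0, x)

def U_L(x):
    """Quasipotential of the Langevin (diffusion) approximation."""
    return -2 * quad(lambda u: (r(u) - l(u)) / (r(u) + l(u)), x0, x)

a, b = 0.003, 0.48                  # barriers chosen so the two models disagree
print(U_M(a), U_M(b))               # jump process: U_M(b) < U_M(a), exit at b
print(U_L(a), U_L(b))               # Langevin:     U_L(a) < U_L(b), exit at a
```

For small h the two models thus predict opposite exit points, even though they share the same deterministic limit and the same CLT-scale fluctuations.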

5 Conclusion

In this brief note, we have presented some of the key results of large deviations theory, from Cramér's Theorem for sums of independent random variables to the Freidlin-Wentzell theory of small random perturbations of dynamical systems. We have chosen to make a synthetic presentation without proofs, to give a concise overview of this theory, which has many ramifications. The results presented here are only a fraction of those available, and many sharpenings and generalizations exist. We deeply encourage the reader to refer to the mathematical textbooks [8, 7, 9, 22], to [13] for small noise problems, and to [10, 20] for the relationship with statistical physics and entropy. This relationship seems to be one of the key paths towards applications of large deviations theory in biology. For instance, statistical physics techniques have been introduced successfully in the last decade to study complex biological networks, such as neuronal networks. Moreover, the issues raised by small random perturbations are of course of great interest for the study of many biological processes, especially when rare events may be amplified by feedback loops and non-linearities. Large deviations tools are a first step towards the analysis of such events, and can also help in designing efficient simulation techniques, as discussed in [23] in the present volume.

References

1. Azencott R. Grandes déviations et applications, in: École d'Été de Probabilités de Saint-Flour VIII-1978, Lecture Notes in Math., Springer, Berlin, 774:1–176, 1980.
2. Baldi P. Large deviations and stochastic homogenization. Ann. Mat. Pura Appl., 151:161–177, 1988.
3. Bolthausen E. Laplace approximations for sums of independent random vectors. Probability Theory and Related Fields, 71(2):167–206, 1987.
4. Boltzmann L. Über die Beziehung zwischen dem zweiten Hauptsatze der mechanischen Wärmetheorie und der Wahrscheinlichkeitsrechnung respektive den Sätzen über das Wärmegleichgewicht (On the relationship between the second law of the mechanical theory of heat and the probability calculus). Wiener Berichte, 2(76):373–435, 1877.


5. Bryc W. A remark on the connection between the large deviation principle and the central limit theorem. Statist. Probab. Lett., 18(4):253–256, 1993.
6. Cramér H. Sur un nouveau théorème limite dans la théorie des probabilités, in: Colloque consacré à la théorie des probabilités, Hermann, Paris, 3:2–29, 1938.
7. Dembo A., Zeitouni O. Large deviations techniques and applications, 2nd edition. Springer, New York, 1998.
8. den Hollander F. Large deviations. Fields Institute Monographs, Amer. Math. Soc., Providence, R.I., 2000.
9. Deuschel J.D., Stroock D.W. Large deviations. Academic Press, Boston, 1989.
10. Ellis R.S. Entropy, large deviations, and statistical mechanics. Springer, New York, 1985.
11. Ellis R.S. Large deviations for a general class of random vectors. Ann. Probab., 12:1–12, 1984.
12. Ethier S.N., Kurtz T.G. Markov processes. John Wiley and Sons, 1986.
13. Freidlin M., Wentzell A.D. Random perturbations of dynamical systems, 2nd edition. Springer-Verlag, New York, 1998.
14. Gärtner J. On large deviations from the invariant measure. Th. Prob. Appl., 22:24–39, 1977.
15. Pakdaman K., Thieullen M., Wainrib G. Diffusion approximation of birth-death processes: comparison in terms of large deviations and exit points. Stat. Probab. Lett., 80(13-14):1121–1127, 2010.
16. Rockafellar R.T. Convex analysis. Princeton Univ. Press, 1997.
17. Sanov I.N. On the probability of large deviations of random variables. Selected Translations in Mathematical Statistics and Probability I, 213–244, 1961.
18. Schilder M. Some asymptotic formulae for Wiener integrals. Trans. Amer. Math. Soc., 125:63–85, 1966.
19. Thieullen M. Deterministic and stochastic FitzHugh-Nagumo systems, in: Stochastic Differential Equations Models with Applications to the Insulin-Glucose System and Neuronal Modelling, Springer Lecture Notes in Mathematical Biosciences.
20. Touchette H. The large deviations approach to statistical mechanics. Physics Reports, 478:1–69, 2009.
21. Varadhan S.R.S. Lectures on hydrodynamic scaling, in: Hydrodynamic Limits and Related Topics, Fields Institute Communications, Amer. Math. Soc., Providence, R.I., 27:3–42, 2009.
22. Varadhan S.R.S. Large deviations and applications. SIAM, Philadelphia, 1984.
23. Wainrib G. Some numerical methods for rare events simulation and analysis, in: Stochastic Differential Equations Models with Applications to the Insulin-Glucose System and Neuronal Modelling, Springer Lecture Notes in Mathematical Biosciences.