Calibration and Internal No-Regret with Random Signals

Vianney Perchet
Équipe Combinatoire et Optimisation, FRE 3232 CNRS, Université Pierre et Marie Curie - Paris 6, 175 rue du Chevaleret, 75013 Paris
[email protected]

Abstract. A calibrated strategy can be obtained by performing a strategy that has no internal regret in some auxiliary game. Such a strategy can be constructed explicitly with the use of Blackwell's approachability theorem, in another auxiliary game. We establish the converse: a strategy that approaches a convex B-set can be derived from the construction of a calibrated strategy. We develop these tools in the framework of a game with partial monitoring, where players do not observe the actions of their opponents but receive random signals, to define a notion of internal regret and construct strategies that have no such regret.

1 Introduction

Consider an agent trying to predict a sequence of outcomes. For example, a meteorologist announces each day the probability that it will rain the following day. He does this with a given accuracy (for instance, he chooses a value in {0, 0.1, 0.2, ..., 1}). The predictions are considered successful if, on the days when the meteorologist forecasts 0.5, nearly half of these days are rainy and half sunny; and this should be true for every possible prediction. Foster and Vohra [6] called this property calibration and proved the existence of calibrated strategies, without any assumption on the sequence of outcomes or on the knowledge of the predictor. The first section deals with the connections between three tools: calibration, approachability and no-regret. The notion of regret in full monitoring was introduced by Hannan [9]: a player has asymptotically no external regret if his average payoff could not have been better had he known in advance the empirical distribution of moves of the other players. Hannan [9] proved the existence of such strategies and Blackwell [4] gave an alternative proof using his approachability theorem. Foster and Vohra [7] (see also Fudenberg and Levine [8]) extended Hannan's result by proving the existence of strategies with no internal regret, which is a more precise notion: a player has asymptotically no internal regret if, for each of his actions, he has no external regret on the set of stages where he played it. We refer to Cesa-Bianchi and Lugosi [5] for a survey on sequential prediction and regret.


A calibrated strategy can be obtained through the construction of a strategy with no internal regret in an auxiliary game (see Sorin [17]). And this construction can be done explicitly using Blackwell's approachability theorem [3] for an orthant in ℝ^d (see Hart and Mas-Colell [10]). We will provide a kind of converse result: we derive an explicit construction of an approachability strategy for a convex B-set through the use of a calibrated strategy, in some auxiliary game. In the second section, we consider repeated games with partial monitoring, where players do not observe the actions of their opponents but only receive random signals, and we focus on strategies that have no regret, in the following sense. A player has asymptotically no external regret if his average payoff could not have been better by knowing in advance the empirical distribution of signals (see Rustichini [15]). The existence of strategies with no external regret was proved by Rustichini [15], and Lugosi, Mannor and Stoltz [14] constructed such strategies explicitly. Lehrer and Solan [13] defined a notion of internal regret in the partial monitoring framework and proved the existence of strategies with no such regret. We will generalize these results by constructing strategies that have no regret, for a more precise notion of regret.

2 Full Monitoring Case: Approachability Implies Calibration

This section is devoted to the full monitoring case. We recall the main results about calibration of Foster and Vohra [6], approachability of Blackwell [3] and regret of Hart and Mas-Colell [10]. We will prove some of these results in detail, since they give the main ideas behind the construction of strategies in the partial monitoring framework, given in Section 4.

2.1 Calibration

Let S be a finite set of states. We consider a two-person repeated game where, at stage n ∈ ℕ, Nature (Player 2) chooses a state s_n ∈ S and Predictor (Player 1) chooses µ_n ∈ ∆(S), the set of probabilities over S. We assume that µ_n belongs to a finite set M = {µ(l), l ∈ L}. Let ε > 0 be such that for every probability µ ∈ ∆(S) there exists µ(l) ∈ M with ∥µ − µ(l)∥ ≤ ε, where ∆(S) is seen as a subset of ℝ^{|S|}. Then M is called an ε-grid of ∆(S). With these notations, the prediction at stage n is the choice of an element l_n ∈ L, called the type of that stage. The choices of l_n and s_n are functions of the past observations (or the finite history) h_{n−1} = (l_1, s_1, ..., l_{n−1}, s_{n−1}) and may be random. Explicitly, the set of finite histories is denoted by H = ⋃_{n∈ℕ} (L × S)^n, with (L × S)^0 = ∅, and a strategy σ of Player 1 (resp. τ of Player 2) is a function from H to ∆(L) (resp. ∆(S)); σ(h_n) (resp. τ(h_n)) is the law of l_{n+1} (resp. s_{n+1}) after h_n. A couple of strategies (σ, τ) generates a probability, denoted by ℙ_{σ,τ}, over H_∞ = (L × S)^ℕ, the set of plays endowed with the cylinder σ-field.


We will use the following notations. For any families {a_m ∈ ℝ^d, l_m ∈ L}_{m∈ℕ} and n ∈ ℕ, N_n(l) = {1 ≤ m ≤ n : l_m = l} is the set of stages of type l (before the n-th), ā_n(l) = Σ_{m∈N_n(l)} a_m / |N_n(l)| is the average of {a_m} on this set, and ā_n = Σ_{m=1}^{n} a_m / n is the average over all the stages (before the n-th).

Definition 1 (Foster and Vohra [6]). A strategy σ of Player 1 is calibrated (with respect to the ε-grid M) if for every l ∈ L and every strategy τ of Player 2:

$$\limsup_{n \to +\infty} \frac{|N_n(l)|}{n} \left( \left\| \bar{s}_n(l) - \mu(l) \right\|^2 - \varepsilon^2 \right) \le 0, \qquad \mathbb{P}_{\sigma,\tau}\text{-a.s.}$$

In words, a strategy of Player 1 is calibrated if, on the set of stages where µ(l) is forecast, the empirical distribution of states is asymptotically close to µ(l) (as long as the frequency of l is not too small). Foster and Vohra [6] proved the existence of such strategies with an algorithm based on the Expected Brier Score.
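To make Definition 1 concrete, here is a minimal sketch (not from the paper; all names are illustrative) of how a play history can be checked against it: for each type l it computes the frequency |N_n(l)|/n, the empirical distribution of states s̄_n(l), and the corresponding calibration score.

```python
import numpy as np

def calibration_scores(types, states, grid, n_states, eps):
    """For each type l: (|N_n(l)|/n) * (||sbar_n(l) - mu(l)||^2 - eps^2)."""
    n = len(types)
    scores = {}
    for l, mu in grid.items():
        idx = [m for m in range(n) if types[m] == l]          # N_n(l)
        if not idx:
            scores[l] = 0.0
            continue
        sbar = np.zeros(n_states)                             # empirical distribution of the
        for m in idx:                                         # states on the stages of type l
            sbar[states[m]] += 1.0
        sbar /= len(idx)
        scores[l] = (len(idx) / n) * (np.sum((sbar - mu) ** 2) - eps ** 2)
    return scores

# toy example with two states and the grid {0, 0.5, 1} for the probability of state 1
grid = {0: np.array([1.0, 0.0]), 1: np.array([0.5, 0.5]), 2: np.array([0.0, 1.0])}
print(calibration_scores([1, 1, 2, 0, 1, 2], [0, 1, 1, 0, 0, 1], grid, n_states=2, eps=0.5))
```

A strategy is calibrated precisely when every such score is asymptotically non-positive, whatever the sequence of states.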

2.2 Approachability

We will prove that calibration follows from no-regret and that no-regret follows from approachability (following respectively Sorin [17] and Hart and Mas-Colell [10]). We present here the notion of approachability introduced by Blackwell [3]. Consider a two-person repeated game in discrete time with vector payoffs, where at stage n ∈ ℕ, Player 1 (resp. Player 2) chooses the action i_n ∈ I (resp. j_n ∈ J), with both I and J finite. The corresponding vector payoff is ρ_n = ρ(i_n, j_n), where ρ : I × J → ℝ^d. As usual, a strategy σ (resp. τ) of Player 1 (resp. Player 2) is a function from the set of finite histories H = ⋃_{n∈ℕ} (I × J)^n to ∆(I) (resp. ∆(J)). For a closed set E ⊂ ℝ^d and δ ≥ 0, we denote by E^δ = {z ∈ ℝ^d : d_E(z) ≤ δ} the δ-neighborhood of E and by Π_E(z) = {e ∈ E : d_E(z) = ∥z − e∥} the set of closest points to z in E, where d_E(z) = inf_{e∈E} ∥z − e∥.

Definition 2. i) A closed set E ⊂ ℝ^d is approachable by Player 1 if for every ε > 0 there exist a strategy σ of Player 1 and N ∈ ℕ such that, for every strategy τ of Player 2 and every n ≥ N:

$$\mathbb{E}_{\sigma,\tau}\left[ d_E(\bar{\rho}_n) \right] \le \varepsilon \qquad \text{and} \qquad \mathbb{P}_{\sigma,\tau}\left( \sup_{n \ge N} d_E(\bar{\rho}_n) \ge \varepsilon \right) \le \varepsilon.$$

Such a strategy is called an approachability strategy of E. ii) A set E is excludable by Player 2 if there exists δ > 0 such that the complement of E^δ is approachable by Player 2.

In words, a set E ⊂ ℝ^d is approachable by Player 1 if he has a strategy such that the average payoff converges almost surely to E, uniformly with respect to the strategies of Player 2. Blackwell [3] gave a sufficient geometric condition for a closed set E to be approachable by Player 1. Denote by P¹(x) = {ρ(x, y) : y ∈ ∆(J)} the set of expected payoffs compatible with x ∈ ∆(I), and define P²(y) similarly.


Definition 3. A closed subset E of ℝ^d is a B-set if, for every z ∈ ℝ^d, there exist p ∈ Π_E(z) and x (= x(z, p)) ∈ ∆(I) such that the hyperplane through p and perpendicular to z − p separates z from P¹(x), or formally:

$$\forall z \in \mathbb{R}^d,\ \exists p \in \Pi_E(z),\ \exists x \in \Delta(I):\qquad \langle \rho(x, y) - p,\ z - p \rangle \le 0, \qquad \forall y \in \Delta(J). \tag{1}$$

Informally, from any point z outside E there are a closest point p and a probability x ∈ ∆(I) such that, whatever the choice of Player 2, the expected payoff and z are on different sides of the hyperplane through p and perpendicular to z − p. In fact, this definition (and the following theorem) does not require that J is finite: one can assume that Player 2 chooses an outcome vector U ∈ [−1, 1]^{|I|} so that the expected payoff is ⟨x, U⟩.

Theorem 1 (Blackwell [3]). If E is a B-set, then E is approachable by Player 1. Moreover, the strategy σ of Player 1 defined by σ(h_n) = x(ρ̄_n) is such that, for every strategy τ of Player 2:

$$\mathbb{E}_{\sigma,\tau}\left[ d_E(\bar{\rho}_n) \right] \le \sqrt{\frac{4B}{n}} \qquad \text{and} \qquad \mathbb{P}_{\sigma,\tau}\left( \sup_{n \ge N} d_E(\bar{\rho}_n) \ge \eta \right) \le \frac{8B}{\eta^2 N}, \tag{2}$$

with B = sup_{i,j} ∥ρ(i, j)∥².

In the case of a convex set C, there is a complete characterization:

Corollary 1 (Blackwell [3]). A closed convex set C ⊂ ℝ^d is approachable by Player 1 if and only if:

$$P^2(y) \cap C \neq \emptyset, \qquad \forall y \in \Delta(J). \tag{3}$$

In particular, a closed convex set C is either approachable by Player 1 or excludable by Player 2.

Remark 1. Corollary 1 implies that there are (at least) two different ways to prove that a convex set is approachable. The first one, called the direct proof, consists in proving that C is a B-set, while the second one, called the indirect proof, consists in proving that C is not excludable by Player 2, which reduces to finding, for every y ∈ ∆(J), some x ∈ ∆(I) such that ρ(x, y) ∈ C.
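The direct proof also yields an algorithm: by Theorem 1 above, playing x(ρ̄_n) at every stage approaches any B-set. The sketch below (an illustration, not from the paper) computes such an x for a convex target set by solving the auxiliary zero-sum game with scalar payoffs ⟨ρ(i, j) − p, z − p⟩ as a linear program; the scipy dependency, the helper names and the toy payoff are assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def blackwell_action(rho_bar, payoff, proj, tol=1e-9):
    """Mixed action x in Delta(I) whose expected payoff is separated from rho_bar
    by the hyperplane through p = proj(rho_bar), perpendicular to rho_bar - p."""
    z = np.asarray(rho_bar, dtype=float)
    p = np.asarray(proj(z), dtype=float)
    u = z - p                       # normal direction of the separating hyperplane
    n_i, n_j = len(payoff), len(payoff[0])
    if np.linalg.norm(u) < tol:     # already (almost) in C: any action works
        return np.ones(n_i) / n_i
    # scalar game M[i, j] = <rho(i, j) - p, u>; Player 1 minimizes its value
    M = np.array([[np.dot(np.asarray(payoff[i][j]) - p, u) for j in range(n_j)]
                  for i in range(n_i)])
    # LP over (x, v): minimize v subject to (x^T M)_j <= v, sum x = 1, x >= 0
    c = np.zeros(n_i + 1); c[-1] = 1.0
    A_ub = np.hstack([M.T, -np.ones((n_j, 1))])
    A_eq = np.hstack([np.ones((1, n_i)), np.zeros((1, 1))])
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n_j), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * n_i + [(None, None)])
    return res.x[:n_i]

# toy example: approach the negative orthant of R^2 (projection: componentwise min with 0)
payoff = [[np.array([-1.0, 1.0]), np.array([-1.0, -1.0])],
          [np.array([1.0, -1.0]), np.array([-1.0, -1.0])]]
print(blackwell_action([0.5, 0.2], payoff, lambda z: np.minimum(z, 0.0)))
```

When condition (3) holds, the value of this scalar game is non-positive (this is exactly the minmax argument recalled in Remark 4 below), so the LP solution satisfies the separation required by Definition 3.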

2.3 Approachability Implies Internal No-Regret

Consider a two-person repeated game in discrete time where, at stage n ∈ ℕ, Player 1 chooses i_n ∈ I as above and Player 2 chooses a vector U_n ∈ [−1, 1]^c (with c = |I|). The associated payoff is U_n^{i_n}, the i_n-th coordinate of U_n. The internal regret of the stage is the matrix R_n = R(i_n, U_n), where the function R : I × [−1, 1]^c → ℝ^{c²} is defined by:

$$R(i, U)^{(i', j)} = \begin{cases} 0 & \text{if } i' \neq i, \\ U^j - U^i & \text{otherwise.} \end{cases}$$


With this definition, the average internal regret R̄_n is given by:

$$\bar{R}_n = \left( \frac{\sum_{m \in N_n(i)} U_m^j - U_m^i}{n} \right)_{i, j \in I} = \left( \frac{|N_n(i)|}{n} \left( \bar{U}_n(i)^j - \bar{U}_n(i)^i \right) \right)_{i, j \in I}.$$
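For concreteness, a small sketch (not from the paper; variable names are illustrative) computing R̄_n from a play history:

```python
import numpy as np

def average_internal_regret(actions, vectors):
    """actions = (i_1, ..., i_n); vectors = the outcome vectors U_1, ..., U_n in [-1,1]^I."""
    n = len(actions)
    c = len(vectors[0])
    R = np.zeros((c, c))                  # R[i, j] accumulates (U_m^j - U_m^i) over N_n(i)
    for i_m, U in zip(actions, vectors):
        R[i_m, :] += np.asarray(U) - U[i_m]
    return R / n                          # Rbar_n = (|N_n(i)|/n) * (Ubar_n(i)^j - Ubar_n(i)^i)

# Definition 4 below asks that every entry of this matrix is asymptotically non-positive
print(average_internal_regret([0, 1, 0], [[0.2, -0.1], [0.5, 1.0], [-0.3, 0.4]]))
```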

Definition 4 (Foster and Vohra [7]). A strategy σ of Player 1 is internally consistent if for any strategy τ of Player 2:

$$\limsup_{n \to \infty} \bar{R}_n \le 0, \qquad \mathbb{P}_{\sigma,\tau}\text{-a.s.}$$

The existence of such strategies has been proved by Foster and Vohra [7] and Fudenberg and Levine [8].

Theorem 2. There exist internally consistent strategies.

Note that an internally consistent strategy can be obtained by constructing a strategy that approaches the negative orthant Ω = ℝ^{c²}_− in the auxiliary game where the vector payoff at stage n is R_n. The proof of Hart and Mas-Colell [10] of the fact that Ω is a B-set relies on the following two lemmas: Lemma 1 gives a geometric property of Ω and Lemma 2 gives a property of the function R.

Lemma 1. Let Π_Ω(·) be the projection onto Ω. Then, for every A ∈ ℝ^{c²}:

$$\langle \Pi_\Omega(A),\ A - \Pi_\Omega(A) \rangle = 0. \tag{4}$$

Proof. Note that since Ω = ℝ^{c²}_−, we have A⁺ = A − Π_Ω(A), where A⁺_{ij} = max(A_{ij}, 0), and similarly A⁻ = Π_Ω(A). The result is just a rewriting of ⟨A⁻, A⁺⟩ = 0. ⊓⊔

For every non-negative (c × c)-matrix A = (a_{ij})_{i,j∈I}, λ ∈ ∆(I) is an invariant probability of A if for every i ∈ I:

$$\sum_{j \in I} \lambda(j)\, a_{ji} = \lambda(i) \sum_{j \in I} a_{ij}.$$

The existence of an invariant probability follows from the similar result for Markov chains.

Lemma 2. Let A = (a_{ij})_{i,j∈I} be a non-negative matrix. Then for every invariant probability λ of A and every U ∈ ℝ^c:

$$\langle A,\ \mathbb{E}_\lambda[R(\cdot, U)] \rangle = 0. \tag{5}$$

Proof. The (i, j)-th coordinate of E_λ[R(·, U)] is λ(i)(U^j − U^i), therefore:

$$\langle A,\ \mathbb{E}_\lambda[R(\cdot, U)] \rangle = \sum_{i,j \in I} a_{ij}\, \lambda(i)\, \left( U^j - U^i \right),$$

and the coefficient of each U^i is Σ_{j∈I} a_{ji} λ(j) − λ(i) Σ_{j∈I} a_{ij} = 0, because λ is an invariant measure of A. Therefore ⟨A, E_λ[R(·, U)]⟩ = 0. ⊓⊔


Proof of Theorem 2. Combining equations (4) (with A = R̄_n) and (5) (with A = (R̄_n)⁺) gives:

$$\left\langle \mathbb{E}_{\lambda_n}[R(\cdot, U)] - \Pi_\Omega(\bar{R}_n),\ \bar{R}_n - \Pi_\Omega(\bar{R}_n) \right\rangle = 0,$$

for every λ_n invariant probability of (R̄_n)⁺ and every U ∈ [−1, 1]^I. Define the strategy σ of Player 1 by σ(h_n) = λ_n. The expected payoff at stage n + 1 (given h_n and U_{n+1} = U) is E_{λ_n}[R(·, U)], so Ω is a B-set and is approachable by Player 1. ⊓⊔

Remark 2. The construction of the strategy is based on approachability properties, therefore the convergence is uniform with respect to the strategies of Player 2. Theorem 1 implies that for every η > 0 and for every strategy τ of Player 2:

$$\mathbb{P}_{\sigma,\tau}\left( \exists n \ge N,\ \exists i, j \in I,\ \frac{|N_n(i)|}{n}\left( \bar{U}_n(i)^j - \bar{U}_n(i)^i \right) > \eta \right) = O\!\left( \frac{1}{\eta^2 N} \right)$$

and

$$\mathbb{E}_{\sigma,\tau}\left[ \sup_{i, j \in I} \frac{|N_n(i)|}{n}\left( \bar{U}_n(i)^j - \bar{U}_n(i)^i \right)^+ \right] = O\!\left( \frac{1}{\sqrt{n}} \right).$$
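Putting Lemmas 1 and 2 together gives the explicit procedure of Hart and Mas-Colell: at each stage, play an invariant probability of the positive part of the current average regret matrix. A runnable sketch of this loop (my own illustration; the adversary callback and all names are assumptions, and the invariant-probability helper of the previous sketch is restated for self-containedness):

```python
import numpy as np

rng = np.random.default_rng(0)

def invariant_probability(A, iters=200):
    """Stationary distribution of the chain with off-diagonal transitions a_ij / mu."""
    c = A.shape[0]
    mu = A.sum(axis=1).max()
    if mu <= 0:                                      # zero matrix: any distribution works
        return np.ones(c) / c
    P = A / mu
    P[np.arange(c), np.arange(c)] += 1.0 - P.sum(axis=1)
    P = 0.5 * (np.eye(c) + P)                        # lazy chain, same invariant measures
    lam = np.ones(c) / c
    for _ in range(iters):
        lam = lam @ P
    return lam

def play_no_internal_regret(next_vector, n_actions, horizon):
    """Strategy of Theorem 2: sigma(h_n) = invariant probability of (Rbar_n)^+."""
    R_sum = np.zeros((n_actions, n_actions))         # sum of the stage regret matrices
    for n in range(1, horizon + 1):
        lam = invariant_probability(np.maximum(R_sum / max(n - 1, 1), 0.0))
        i = rng.choice(n_actions, p=lam)             # draw i_n with law lambda_n
        U = np.asarray(next_vector(n), dtype=float)  # adversary's vector U_n in [-1,1]^I
        R_sum[i, :] += U - U[i]                      # stage regret R(i_n, U_n)
    return R_sum / horizon                           # Rbar_n: entries should be nearly <= 0

# toy adversary playing i.i.d. uniform outcome vectors
print(play_no_internal_regret(lambda n: rng.uniform(-1, 1, size=3), 3, 500))
```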

2.4 Internal Regret Implies Calibration

Sorin [17] proved that the construction of a calibrated strategy can be reduced to the construction of an internally consistent strategy. The proof relies on the following lemma:

Lemma 3. Let (a_m)_{m∈ℕ} be a sequence in ℝ^d and α, β two points of ℝ^d. Then for every n ∈ ℕ*:

$$\frac{\sum_{m=1}^{n} \left( \|a_m - \beta\|_2^2 - \|a_m - \alpha\|_2^2 \right)}{n} = \|\bar{a}_n - \beta\|_2^2 - \|\bar{a}_n - \alpha\|_2^2, \tag{6}$$

with ∥·∥₂ the L²-norm of ℝ^d.

Proof. Develop the sums in equation (6) to get the result. ⊓⊔
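A quick numerical check of identity (6) (illustration only, not part of the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=(10, 4))                         # a_1, ..., a_n in R^d
alpha, beta = rng.normal(size=4), rng.normal(size=4)

lhs = np.mean(np.sum((a - beta) ** 2, axis=1) - np.sum((a - alpha) ** 2, axis=1))
abar = a.mean(axis=0)
rhs = np.sum((abar - beta) ** 2) - np.sum((abar - alpha) ** 2)
print(np.isclose(lhs, rhs))                          # True: both sides agree
```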

Now, we can prove the following:

Theorem 3 (Foster and Vohra [6]). For every finite grid of ∆(S), there exist calibrated strategies of Player 1.

Proof. We start with the framework described in Section 2.1. Consider the auxiliary two-person game with vector payoffs defined as follows. At stage n ∈ ℕ, Player 1 (resp. Player 2) chooses the action l_n ∈ L (resp. s_n ∈ S), which generates the payoff R_n = R(l_n, U_n), where R is as in Section 2.3 (with L playing the role of I), with:

$$U_n = \left( - \|s_n - \mu(l)\|_2^2 \right)_{l \in L} \in \mathbb{R}^{|L|}.$$


By definition of R and using Lemma 3, for every n ∈ ℕ*:

$$\bar{R}_n^{(l,k)} = \frac{|N_n(l)|}{n} \cdot \frac{\sum_{m \in N_n(l)} \left( \|s_m - \mu(l)\|_2^2 - \|s_m - \mu(k)\|_2^2 \right)}{|N_n(l)|} = \frac{|N_n(l)|}{n} \left( \|\bar{s}_n(l) - \mu(l)\|_2^2 - \|\bar{s}_n(l) - \mu(k)\|_2^2 \right).$$

Let σ be an internally consistent strategy in this auxiliary game; then for every l ∈ L and k ∈ L:

$$\limsup_{n \to \infty} \frac{|N_n(l)|}{n} \left( \|\bar{s}_n(l) - \mu(l)\|_2^2 - \|\bar{s}_n(l) - \mu(k)\|_2^2 \right) \le 0, \qquad \mathbb{P}_{\sigma,\tau}\text{-a.s.}$$

Since {µ(k), k ∈ L} is an ε-grid of ∆(S), for every l ∈ L and every n ∈ ℕ*, there exists k ∈ L such that ∥s̄_n(l) − µ(k)∥₂² ≤ ε², hence:

$$\limsup_{n \to \infty} \frac{|N_n(l)|}{n} \left( \|\bar{s}_n(l) - \mu(l)\|_2^2 - \varepsilon^2 \right) \le 0, \qquad \mathbb{P}_{\sigma,\tau}\text{-a.s.} \qquad \text{⊓⊔}$$

Remark 3. We have proved that σ is such that, for every l ∈ L, s̄_n(l) is closer to µ(l) than to any other µ(k), as soon as |N_n(l)|/n is not too small. The fact that s_n belongs to a finite set S and that the {µ(l)} are probabilities over S is irrelevant: one can show that for any finite set {a(l) ∈ ℝ^d, l ∈ L}, Player 1 has a strategy σ such that for any bounded sequence (a_m)_{m∈ℕ} in ℝ^d and for every l and k:

$$\limsup_{n \to \infty} \frac{|N_n(l)|}{n} \left( \|\bar{a}_n(l) - a(l)\|^2 - \|\bar{a}_n(l) - a(k)\|^2 \right) \le 0.$$
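The proof of Theorem 3 is constructive; the following sketch composes it with the no-internal-regret procedure of Section 2.3 (illustration only: it is not the Expected-Brier-Score algorithm of Foster and Vohra, and all names are mine). At each stage it feeds the auxiliary vector U_n = (−∥s_n − µ(l)∥²)_{l∈L} to that procedure and forecasts µ(l_n).

```python
import numpy as np

rng = np.random.default_rng(2)

def invariant_probability(A, iters=200):
    c = A.shape[0]
    mu = A.sum(axis=1).max()
    if mu <= 0:
        return np.ones(c) / c
    P = A / mu
    P[np.arange(c), np.arange(c)] += 1.0 - P.sum(axis=1)
    P = 0.5 * (np.eye(c) + P)
    lam = np.ones(c) / c
    for _ in range(iters):
        lam = lam @ P
    return lam

def calibrated_forecasts(next_state, grid, n_states, horizon):
    """grid: array of shape (|L|, |S|), the epsilon-grid {mu(l)} of Delta(S)."""
    L = len(grid)
    R_sum = np.zeros((L, L))
    history = []
    for n in range(1, horizon + 1):
        lam = invariant_probability(np.maximum(R_sum / max(n - 1, 1), 0.0))
        l = rng.choice(L, p=lam)                     # type l_n: forecast mu(l_n)
        s = next_state(n)                            # Nature's state (an index in S)
        e_s = np.eye(n_states)[s]                    # state seen as a vertex of Delta(S)
        U = -np.sum((e_s - grid) ** 2, axis=1)       # auxiliary payoff vector of the stage
        R_sum[l, :] += U - U[l]
        history.append((l, s))
    return history

# forecast an i.i.d. Bernoulli(0.7) state with the grid {0, 0.1, ..., 1}
grid = np.array([[1 - q, q] for q in np.linspace(0, 1, 11)])
hist = calibrated_forecasts(lambda n: int(rng.random() < 0.7), grid, 2, 2000)
types = np.array([l for l, _ in hist]); states = np.array([s for _, s in hist])
l_star = np.bincount(types, minlength=len(grid)).argmax()
print(grid[l_star], states[types == l_star].mean())  # forecast vs empirical frequency
```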

3 Calibration Implies Approachability

The proof of Theorem 3 shows that the construction of a calibrated strategy can be obtained through an approachability strategy of an orthant in an auxiliary game. Conversely, we will show that the approachability of a convex B-set can be reduced to the existence of a calibrated strategy in an auxiliary game, and so give a new proof of Corollary 1.

Alternative proof of Corollary 1. The idea of the proof is very natural: given ε > 0, we construct a finite covering {Y(l), l ∈ L} of ∆(J) and associate to Y(l) a probability x(l) ∈ ∆(I) such that ρ(x(l), y) ∈ C^ε for every y ∈ Y(l). Player 1 will always choose his action according to one of the {x(l)}. Assume that on the stages when Player 1 played x(l), the empirical action of Player 2 is in Y(l); then the average payoff on these stages is in the convex set C^ε (by linearity of ρ). And if this property is true for every l ∈ L, then the average payoff is also in C^ε (by convexity).
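Before the formal argument, here is a structural sketch of this strategy (illustration only: the calibrated `forecaster` over the grid {y(l)}, for instance the one built in Section 2.4, the map `x_of` from types to the mixed actions given by condition (7) stated just below, and `play_round` are all assumed interfaces).

```python
import numpy as np

rng = np.random.default_rng(3)

def approach_C(forecaster, x_of, play_round, horizon):
    """Approachability of a convex set C via calibration.

    forecaster.predict() returns the current type l (an index into the grid {y(l)}),
    forecaster.update(j) feeds it Player 2's observed action; x_of[l] is a mixed
    action with rho(x_of[l], y(l)) in C; play_round(i) plays i and returns j_n.
    """
    for n in range(horizon):
        l = forecaster.predict()        # forecast: Player 2's average action is ~ y(l)
        x = x_of[l]                     # x(l), a reply such that rho(x(l), y(l)) in C
        i = rng.choice(len(x), p=x)     # draw i_n with law x(l)
        j = play_round(i)               # observe Player 2's action j_n
        forecaster.update(j)            # calibration guarantees jbar_n(l) ~ y(l)
```

A genuinely calibrated forecaster over the grid {y(l)} must be plugged in for the guarantee of Corollary 1 to hold.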


Formally, assume that condition (3) is satisfied, rephrased as:

$$\forall y \in \Delta(J),\ \exists x\, (= x_y) \in \Delta(I),\qquad \rho(x_y, y) \in C. \tag{7}$$

Since ρ is multilinear and therefore continuous on ∆(I) × ∆(J), for every ε > 0 there exists δ > 0 such that:

$$\forall y, y' \in \Delta(J),\qquad \|y - y'\|_2 \le 2\delta \ \Rightarrow\ \rho(x_y, y') \in C^{\varepsilon}.$$

We introduce the auxiliary game Γ where Player 2 chooses an action (or state) j ∈ J and Player 1 forecasts it, using {y(l), l ∈ L}, a finite grid of ∆(J) whose diameter is smaller than δ. Let σ be a calibrated strategy for Player 1, so that ȷ̄_n(l), the empirical distribution of actions of Player 2 on N_n(l), is asymptotically δ-close to y(l). Define the strategy of Player 1 in the initial game by performing σ and, if l_n = l, by playing according to x_{y(l)} = x(l) ∈ ∆(I), as described in (7). Since the choices of actions of the two players are independent, ρ̄_n(l) will be close to ρ(x(l), ȷ̄_n(l)), hence close to ρ(x(l), y(l)) and finally close to C^ε, as soon as |N_n(l)| is not too small. Indeed, by construction of σ, for every η > 0 there exists N¹ ∈ ℕ such that, for every strategy τ of Player 2:

$$\mathbb{P}_{\sigma,\tau}\left( \forall l \in L,\ \forall n \ge N^1,\quad \frac{|N_n(l)|}{n}\left( \|\bar{\jmath}_n(l) - y(l)\|_2^2 - \delta^2 \right) \le \eta \right) \ge 1 - \eta. \tag{8}$$

The Hoeffding–Azuma inequality for sums of bounded martingale differences (see [2,11]) implies that, for any η ∈ (0, 1), with probability at least 1 − η,

$$\left| \bar{\rho}_n(l) - \rho\left(x(l), \bar{\jmath}_n(l)\right) \right| \le \sqrt{\frac{2}{|N_n(l)|} \ln\frac{2}{\eta}},$$

and therefore there exists N² ∈ ℕ such that for every l ∈ L:

$$\mathbb{P}_{\sigma,\tau}\left( \forall m \ge n,\ \left| \bar{\rho}_m(l) - \rho\left(x(l), \bar{\jmath}_m(l)\right) \right| \le \eta \ \middle|\ |N_n(l)| \ge N^2 \right) \ge 1 - \eta. \tag{9}$$

Equations (8) and (9), taken with η ≤ ε/L, imply that, with probability at least 1 − 2ε, for every n ≥ max{N¹, LN²/ε}: |ρ̄_n(l) − ρ(x(l), ȷ̄_n(l))| ≤ η ≤ ε and, if |N_n(l)|/n ≥ ε/L, then |N_n(l)| ≥ N², so ∥ȷ̄_n(l) − y(l)∥² ≤ 2δ², and therefore d_C(ρ̄_n(l)) ≤ 2ε. Since C is a convex set, d_C(·) is convex and, with probability at least 1 − 2ε:

$$d_C(\bar{\rho}_n) = d_C\left( \sum_{l \in L} \frac{|N_n(l)|}{n}\, \bar{\rho}_n(l) \right) \le \sum_{l \in L} \frac{|N_n(l)|}{n}\, d_C(\bar{\rho}_n(l)) \le \sum_{l : |N_n(l)|/n \ge \varepsilon/L} \frac{|N_n(l)|}{n}\, d_C(\bar{\rho}_n(l)) \ + \sum_{l : |N_n(l)|/n < \varepsilon/L} \frac{|N_n(l)|}{n} \le 2\varepsilon + \varepsilon = 3\varepsilon.$$


Therefore C is approachable by Player 1. On the other hand, if there exists y such that P²(y) ∩ C = ∅, then Player 2 can approach P²(y) by playing at every stage according to y. Therefore C is not approachable by Player 1. ⊓⊔

Remark 4. Blackwell's proof of this result is not explicit. He showed that condition (7) implies that C is a B-set, and his proof relies on the use of von Neumann's minmax theorem. In words, let z be a fixed point outside C. Assume that if Player 1 knew y ∈ ∆(J), the law of the action of Player 2, then there would be a law x_y ∈ ∆(I) such that the expected payoff ρ(x_y, y) and z are on different sides of the hyperplane described in the definition of a B-set. The minmax theorem implies that there exists x ∈ ∆(I) such that for every y ∈ ∆(J), z and ρ(x, y) are on different sides, and therefore C is a B-set. This gives the existence of an approachability strategy of C. One of the major interests of calibration is that it transforms this implicit proof into an explicit constructive proof: while performing a calibrated strategy (in an auxiliary game where J plays the role of the set of states), Player 1 can enforce the property that, for every l ∈ L, the average move of Player 2 is almost y(l) on N_n(l). So he just has (and could not do better than) to play x_{y(l)} on these stages.

Remark 5. 1) The Hoeffding–Azuma inequality for sums of bounded martingale differences implies that for every strategy τ of Player 2:

$$\mathbb{E}_{\sigma,\tau}\left[ \sup_{l \in L} \frac{|N_n(l)|}{n} \left| \bar{\rho}_n(l) - \rho\left(x(l), \bar{\jmath}_n(l)\right) \right| \right] = O\!\left( \sqrt{\frac{\ln(n)}{n}} \right).$$

The strategy σ is based on approachability properties and on the Hoeffding–Azuma inequality, so one can show that:

$$\mathbb{E}_{\sigma,\tau}\left[ d_C(\bar{\rho}_n) - \varepsilon \right] \le O\!\left( \sqrt{\frac{\ln(n)}{n}} \right).$$

2) To deduce that ρ̄_n is in C^ε from the fact that ρ̄_n(l) is in C^ε for every l ∈ L, it is necessary that C (or d_C(·)) is convex.

4 Internal Regret in the Partial Monitoring Framework

Consider a two-person repeated game in discrete time. At stage n ∈ ℕ, Player 1 (resp. Player 2) chooses i_n ∈ I (resp. j_n ∈ J), which generates the payoff ρ_n = ρ(i_n, j_n) with ρ : I × J → ℝ. Player 1 does not observe this payoff; he just receives a signal s_n ∈ S whose law is s(i_n, j_n), with s : I × J → ∆(S). The three sets I, J and S are finite, the two functions ρ and s are extended multilinearly to ∆(I) × ∆(J), and we define s : ∆(J) → ∆(S)^I by s(y) = (s(i, y))_{i∈I}, where ∆(S)^I is the set of vectors of probabilities over S. We call any such vector a flag. As usual, a strategy σ of Player 1 (resp. τ of Player 2) is a function from the


set of finite histories for Player 1, H¹ = ⋃_{n∈ℕ} (I × S)^n, to ∆(I) (resp. from H² = ⋃_{n∈ℕ} (I × S × J)^n to ∆(J)). A couple (σ, τ) generates a probability ℙ_{σ,τ} over H_∞ = (I × S × J)^ℕ.

4.1 External Regret

Rustichini [15] defined the regret in the partial monitoring framework as follows: a strategy σ of Player 1 has no external regret if, ℙ_{σ,τ}-a.s.:

$$\limsup_{n \to +\infty}\ \max_{x \in \Delta(I)}\ \min_{\substack{y \in \Delta(J) \\ s(y) = s(\bar{\jmath}_n)}} \rho(x, y)\ -\ \bar{\rho}_n \ \le\ 0,$$

where s(ȷ̄_n) ∈ ∆(S)^I is the average flag. In words, the average payoff of Player 1 could not have been uniformly better if he had known the average distribution of flags before the beginning of the game. In this framework, given a flag µ ∈ ∆(S)^I, the function min_{y∈s^{-1}(µ)} ρ(·, y) may not be linear. So the best response of Player 1 might not be a pure action in I but a mixed action x ∈ ∆(I), and any pure action in the support of x may be a bad response. Note that this also appears in Rustichini's definition, since the maximum is taken over ∆(I) and not just over I as in the usual definition of external regret in full monitoring.

4.2 Internal Regret

We consider here a generalization of the previous framework: at stage n ∈ ℕ, Player 2 chooses a flag µ_n ∈ ∆(S)^I while Player 1 chooses an action i_n and receives a signal s_n whose law is the i_n-th coordinate of µ_n. Given a flag µ and x ∈ ∆(I), Player 1 evaluates the payoff through an evaluation function G : ∆(I) × ∆(S)^I → ℝ, which is not necessarily linear. There are two requirements to define internal regret: we have to define a finite partition of ℕ and, for every element of that partition, Player 1 must choose a point in ∆(I) that is a best response (or at least an ε-best response) to some flag. Hence we have to distinguish the stages not as a function of the action played, but as a function of the law of the action. We also assume that the strategy of Player 1 can be described by a finite family {x(l) ∈ ∆(I), l ∈ L} such that, at stage n ∈ ℕ, Player 1 chooses a type l_n and the law of his action i_n is x(l_n).

Definition 5 (Lehrer and Solan [13]). For every n ∈ ℕ and every l ∈ L, the average internal regret of type l at stage n is

$$\bar{R}_n(l) = \sup_{x \in \Delta(I)} \left[ G\left(x, \bar{\mu}_n(l)\right) - G\left(\bar{\imath}_n(l), \bar{\mu}_n(l)\right) \right].$$

A strategy σ of Player 1 is (L, ε)-internally consistent if for every strategy τ of Player 2:

$$\limsup_{n \to +\infty} \frac{|N_n(l)|}{n} \left( \bar{R}_n(l) - \varepsilon \right) \le 0, \qquad \forall l \in L,\quad \mathbb{P}_{\sigma,\tau}\text{-a.s.}$$


Remark 6. Note that this definition is not intrinsic (unlike in the full monitoring case), since it depends on the choice of {x(l), l ∈ L}, and it is based uniquely on the potential observations (i.e., the sequences of flags (µ_n)_{n∈ℕ}) of Player 1.

In order to construct (L, ε)-internally consistent strategies, some regularity of G is required:

Assumption 1. For every ε > 0, there exist two finite families {µ(l) ∈ ∆(S)^I, x(l) ∈ ∆(I), l ∈ L} and η, δ > 0 such that:
1. ∆(S)^I ⊂ ⋃_{l∈L} B(µ(l), δ);
2. for every l ∈ L, if ∥x − x(l)∥ ≤ 2η and ∥µ − µ(l)∥ ≤ 2δ, then x ∈ BR_ε(µ),
where BR_ε(µ) = {x ∈ ∆(I) : G(x, µ) ≥ sup_{z∈∆(I)} G(z, µ) − ε} is the set of ε-best responses to µ ∈ ∆(S)^I and B(µ, δ) = {µ′ ∈ ∆(S)^I : ∥µ′ − µ∥ ≤ δ}.

In words, Assumption 1 states that G is regular with respect to µ and with respect to x: given ε, the set of flags can be covered by a finite number of balls centered at the {µ(l)}, such that x(l), a best response to µ(l), is an ε-best response to any µ in this ball; and if x is close enough to x(l), then x is also an ε-best response to any µ close to µ(l).

Theorem 4. If G fulfills Assumption 1, there exist (L, ε)-internally consistent strategies.

Some parts of the proof are quite technical, but the insight is very simple, so we first give the main ideas. Assume that, in the one-stage game, µ ∈ ∆(S)^I is observed by Player 1; then there exists x ∈ ∆(I) such that x ∈ BR_ε(µ). Using a minmax argument, as Blackwell did for the proof of Corollary 1, one could prove that Player 1 has an (L, ε)-internally consistent strategy (as did Lehrer and Solan [13]). The idea is to use calibration, as in the alternative proof of Corollary 1, to transform this implicit proof into a constructive one. Fix ε > 0 and assume for the moment that Player 1 observes each µ_n. Consider the game where Player 1 predicts the sequence (µ_n)_{n∈ℕ} using the δ-grid {µ(l), l ∈ L} given by Assumption 1. A calibrated strategy of Player 1 chooses a sequence (l_n)_{n∈ℕ} in such a way that µ̄_n(l) is asymptotically δ-close to µ(l). Hence Player 1 just has to play according to x(l) ∈ BR_ε(µ(l)) on these stages. Indeed, since the choices of actions are independent, ı̄_n(l) will be asymptotically η-close to x(l), and the regularity of G will then imply that ı̄_n(l) ∈ BR_ε(µ̄_n(l)), so the strategy will be (L, ε)-internally consistent. The only issue is that, in the current framework, the signal depends on the action of Player 1, who does not observe µ_n. The existence of calibrated strategies is therefore not straightforward. However, it is well known that, up to a slight perturbation of x(l), the information available to Player 1 after a long time is close to µ̄_n(l) (as in the multi-armed bandit problem and some calibration and no-regret frameworks; see chapter 6 in [5] for a survey of these techniques).


For every x ∈ ∆(I), define x_η ∈ ∆(I), the η-perturbation of x, by x_η = (1 − η)x + ηu, with u the uniform probability over I, and for every stage n of type l define ŝ_n by:

$$\hat{s}_n = \left( 0, \ldots, 0, \frac{s_n}{x_\eta(l)[i_n]}, 0, \ldots, 0 \right),$$

with x_η(l)[i_n] the weight put by x_η(l) on i_n, and denote by $\overline{\hat{s}}_n(l)$ the average of {ŝ_m} on N_n(l).

Lemma 4. For every θ > 0, there exists N ∈ ℕ such that, for every l ∈ L:

$$\mathbb{P}_{\sigma,\tau}\left( \forall m \ge n,\ \left\| \overline{\hat{s}}_m(l) - \bar{\mu}_m(l) \right\| \le \theta \ \middle|\ |N_n(l)| \ge N \right) \ge 1 - \theta.$$

Proof. Since for every n ∈ ℕ the choices of i_n and µ_n are independent:

$$\mathbb{E}_{\sigma,\tau}\left[ \hat{s}_n \,\middle|\, h_{n-1}, l_n, \mu_n \right] = \sum_{i \in I} \sum_{s \in S} \mu_n^i[s]\, x_\eta(l_n)[i] \left( 0, \ldots, \frac{s}{x_\eta(l_n)[i]}, \ldots, 0 \right) = \sum_{i \in I} \sum_{s \in S} \mu_n^i[s]\, (0, \ldots, s, \ldots, 0) = \sum_{i \in I} \left( 0, \ldots, \mu_n^i, \ldots, 0 \right) = \left( \mu_n^1, \ldots, \mu_n^I \right) = \mu_n.$$

Therefore $\overline{\hat{s}}_n(l)$ is an unbiased estimator of µ̄_n(l), and the Hoeffding–Azuma inequality implies that for every θ > 0 there exists N ∈ ℕ such that, for every l ∈ L:

$$\mathbb{P}_{\sigma,\tau}\left( \forall m \ge n,\ \left\| \overline{\hat{s}}_m(l) - \bar{\mu}_m(l) \right\| \le \theta \ \middle|\ |N_n(l)| \ge N \right) \ge 1 - \theta. \qquad \text{⊓⊔}$$
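A minimal sketch (not from the paper; names are illustrative) of the estimate ŝ_n used in Lemma 4: only the observed signal, the action actually drawn and its sampling probability under x_η(l) are needed.

```python
import numpy as np

def flag_estimate(i_n, s_n, x_eta_l, n_actions, n_signals):
    """s_hat_n, represented as an (n_actions, n_signals) array: a Dirac mass on the
    observed signal in the i_n-th block, importance-weighted by 1 / x_eta(l)[i_n]."""
    est = np.zeros((n_actions, n_signals))
    est[i_n, s_n] = 1.0 / x_eta_l[i_n]
    return est

# averaging these estimates over the stages of type l gives the unbiased estimator
# of the average flag mu_bar_n(l) appearing in Lemma 4
print(flag_estimate(i_n=1, s_n=0, x_eta_l=np.array([0.4, 0.3, 0.3]), n_actions=3, n_signals=2))
```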

Assume now that Player 1 uses a calibrated strategy to predict the sequence of ŝ_n (this game has full monitoring); then he knows that asymptotically $\overline{\hat{s}}_n(l)$ is closer to µ(l) than to any µ(k) (as soon as the frequency of l is big enough), therefore it is δ-close to µ(l). Lemma 4 implies that µ̄_n(l) is asymptotically close to $\overline{\hat{s}}_n(l)$ and therefore 2δ-close to µ(l). Note that, instead of trying to compute the sequence of payoffs from the signals, we consider an auxiliary game defined on the signal space (i.e., the observations), so that this new game is in fact (almost) a full monitoring game.

Proof of Theorem 4. Consider the families {x(l) ∈ ∆(I), µ(l) ∈ ∆(S)^I, l ∈ L} and δ > 0 given by Assumption 1 for a fixed ε > 0. Let Γ′ be the auxiliary repeated game where at stage n Player 1 (resp. Player 2) chooses l_n ∈ L (resp. µ_n ∈ ∆(S)^I). Given these choices, i_n (resp. s_n) is drawn according to x_η(l_n) (resp. µ_n^{i_n}). By Lemma 4, for every θ > 0, there exists N¹ ∈ ℕ such that for every l ∈ L:

$$\mathbb{P}_{\sigma,\tau}\left( \forall m \ge n,\ \left\| \overline{\hat{s}}_m(l) - \bar{\mu}_m(l) \right\| \le \theta \ \middle|\ |N_n(l)| \ge N^1 \right) \ge 1 - \theta. \tag{10}$$

Let σ be a calibrated strategy associated to $(\hat{s}_n)_{n\in\mathbb{N}}$ in Γ′. For every θ > 0, there exists N² ∈ ℕ such that with ℙ_{σ,τ}-probability greater than 1 − θ:


$$\forall n \ge N^2,\ \forall l, k \in L, \qquad \frac{|N_n(l)|}{n} \left( \left\| \overline{\hat{s}}_n(l) - \mu(l) \right\|^2 - \left\| \overline{\hat{s}}_n(l) - \mu(k) \right\|^2 \right) \le \theta. \tag{11}$$

Since {µ(k), k ∈ L} is a grid of ∆(S)^I, for every n ∈ ℕ and l ∈ L there exists k ∈ L such that $\|\overline{\hat{s}}_n(l) - \mu(k)\| \le \delta$. Therefore, combining equations (10) and (11), for every θ > 0 there exists N³ ∈ ℕ such that:

$$\mathbb{P}_{\sigma,\tau}\left( \forall n \ge N^3,\ \forall l \in L,\quad \frac{|N_n(l)|}{n}\left( \left\| \bar{\mu}_n(l) - \mu(l) \right\|^2 - \delta^2 \right) \le \theta \right) \ge 1 - \theta. \tag{12}$$

For every stage of type l ∈ L, i_n is drawn according to x_η(l) and by definition ∥x_η(l) − x(l)∥ ≤ η. Therefore the Hoeffding–Azuma inequality implies that, for every θ > 0, there exists N⁴ ∈ ℕ such that:

$$\mathbb{P}_{\sigma,\tau}\left( \forall n \ge N^4,\ \forall l \in L,\quad \frac{|N_n(l)|}{n}\left( \left\| \bar{\imath}_n(l) - x(l) \right\| - \eta \right) \le \theta \right) \ge 1 - \theta. \tag{13}$$

Combining equations (12) and (13) and using Assumption 1, for every θ > 0 there exists N ∈ ℕ such that for every strategy τ of Player 2:

$$\mathbb{P}_{\sigma,\tau}\left( \forall n \ge N,\ \forall l \in L,\quad \frac{|N_n(l)|}{n}\left( \bar{R}_n(l) - \varepsilon \right) \le \theta \right) \ge 1 - \theta, \tag{14}$$

and σ is (L, ε)-internally consistent. ⊓⊔
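A structural sketch of the strategy just constructed (illustration only: `forecaster` stands for any calibrated forecaster over the grid {µ(l)} of Assumption 1, fed with the estimates ŝ_n, and `x_of[l]` for an ε-best response to µ(l); both, as well as `play_round`, are assumed interfaces).

```python
import numpy as np

rng = np.random.default_rng(4)

def play_partial_monitoring(forecaster, x_of, play_round, n_signals, eta, horizon):
    """(L, eps)-internally consistent play: calibrate on flag estimates, reply with x_eta(l)."""
    n_actions = len(x_of[0])
    uniform = np.ones(n_actions) / n_actions
    for n in range(horizon):
        l = forecaster.predict()                        # type l_n
        x_eta = (1.0 - eta) * x_of[l] + eta * uniform   # eta-perturbation of x(l_n)
        i = rng.choice(n_actions, p=x_eta)
        s = play_round(i)                               # observed signal (an index in S)
        s_hat = np.zeros((n_actions, n_signals))
        s_hat[i, s] = 1.0 / x_eta[i]                    # unbiased flag estimate (Lemma 4)
        forecaster.update(s_hat)                        # calibration on the estimates s_hat_n
```

The calibration guarantee on the estimates ŝ_n is what drives the bound (14).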

Remark 7. The strategy constructed is based on δ-calibration and on the Hoeffding–Azuma inequality, therefore one can show that:

$$\mathbb{E}_{\sigma,\tau}\left[ \sup_{l \in L} \frac{|N_n(l)|}{n} \left( \bar{R}_n(l) - \varepsilon \right)^+ \right] \le O\!\left( \sqrt{\frac{\ln(n)}{n}} \right).$$

4.3 Back to Payoff Space

Assumption 1 can be fulfilled under some continuity assumptions on G:

Proposition 1. Let G : ∆(I) × ∆(S)^I → ℝ be such that, for every µ ∈ ∆(S)^I, G(·, µ) is continuous and the family of functions {G(x, ·), x ∈ ∆(I)} is equicontinuous. Then G fulfills Assumption 1.

Proof. Since {G(x, ·), x ∈ ∆(I)} is equicontinuous and ∆(S)^I is compact, for every ε > 0 there exists δ > 0 such that:

$$\forall x \in \Delta(I),\ \forall \mu, \mu' \in \Delta(S)^I,\qquad \|\mu - \mu'\| \le 2\delta \ \Rightarrow\ \left| G(x, \mu) - G(x, \mu') \right| \le \frac{\varepsilon}{2}.$$

Let {µ(l), l ∈ L} be a finite δ-grid of ∆(S)^I and, for every l ∈ L, let x(l) ∈ BR(µ(l)), so that G(x(l), µ(l)) = max_{z∈∆(I)} G(z, µ(l)). Since G(·, µ(l)) is continuous, there exists η(l) > 0 such that:

$$\|x - x(l)\| \le \eta(l) \ \Rightarrow\ \left| G(x, \mu(l)) - G(x(l), \mu(l)) \right| \le \varepsilon/2.$$


Define η = min_{l∈L} η(l) and let x ∈ ∆(I), µ ∈ ∆(S)^I and l ∈ L be such that ∥x − x(l)∥ ≤ η and ∥µ − µ(l)∥ ≤ δ; then:

$$G(x, \mu) \ \ge\ G(x, \mu(l)) - \frac{\varepsilon}{2} \ \ge\ G(x(l), \mu(l)) - \varepsilon \ =\ \max_{z \in \Delta(I)} G(z, \mu(l)) - \varepsilon,$$

and x ∈ BR_ε(µ). ⊓⊔

This proposition implies that the evaluation function used by Rustichini fulfills Assumption 1 (Lugosi, Mannor and Stoltz [14]). Before proving that, we introduce S, the range of s, which is a closed convex subset of ∆(S)^I, and Π_S(·), the projection onto it.

Corollary 2. Define G : ∆(I) × ∆(S)^I → ℝ by:

$$G(x, \mu) = \begin{cases} \inf_{y \in s^{-1}(\mu)} \rho(x, y) & \text{if } \mu \in \mathcal{S}, \\ G\left(x, \Pi_{\mathcal{S}}(\mu)\right) & \text{otherwise.} \end{cases}$$

Then G fulfills Assumption 1.

Proof. The function s can be extended linearly to ℝ^{|J|} by s(y) = Σ_{j∈J} y(j) s(j), where y = (y(j))_{j∈J}. Therefore, by Aubin and Frankowska [1] (Theorem 2.2.1, page 57), the multivalued application s^{-1}, from S to the subsets of ∆(J), is λ-Lipschitz, and since Π_S is 1-Lipschitz (because S is convex), G(x, ·) is also λ-Lipschitz, for every x ∈ ∆(I). Therefore {G(x, ·), x ∈ ∆(I)} is equicontinuous. For every µ ∈ ∆(S)^I, G(·, µ) is 1-Lipschitz (see [14]), therefore continuous. Hence, by Proposition 1, G fulfills Assumption 1. ⊓⊔

Concluding Remarks

The definitions and proofs rely uniquely on Assumption 1: it is not relevant to assume that Player 1 faces only one opponent, nor that the action set of his opponent is finite. The only requirement is that, given his information (a probability in ∆(I) and a flag in ∆(S)^I), Player 1 can evaluate his payoff, no matter how this payoff is obtained: for example, we could have assumed that Player 2 chooses at each stage an (unobserved) outcome vector U ∈ [−1, 1]^{|I|} and Player 1 chooses a coordinate, which is his observed payoff.

In the full monitoring framework, many improvements have been made in the past years about calibration and regret (see for instance [12,16,18]). Here, we aimed to clarify the links between the original notions of approachability, internal regret and calibration in order to extend applications (in particular, to get rid of the finiteness of J), to define internal regret (with signals) as calibration over an appropriate space, and to give a proof derived from no-internal-regret (in full monitoring), itself derived from the approachability of an orthant in this space.

Acknowledgments. I deeply thank my advisor Sylvain Sorin for his great help and numerous comments. I also acknowledge helpful remarks from Eilon Solan and Gilles Stoltz.


References

1. Aubin, J.-P., Frankowska, H.: Set-valued Analysis. Birkhäuser Boston Inc., Basel (1990)
2. Azuma, K.: Weighted sums of certain dependent random variables. Tôhoku Math. J. 19(2), 357–367 (1967)
3. Blackwell, D.: An analog of the minimax theorem for vector payoffs. Pacific J. Math. 6, 1–8 (1956)
4. Blackwell, D.: Controlled random walks. In: Proceedings of the International Congress of Mathematicians, 1954, Amsterdam, vol. III, pp. 336–338 (1956)
5. Cesa-Bianchi, N., Lugosi, G.: Prediction, Learning, and Games. Cambridge University Press, Cambridge (2006)
6. Foster, D.P., Vohra, R.V.: Asymptotic calibration. Biometrika 85, 379–390 (1998)
7. Foster, D.P., Vohra, R.V.: Regret in the on-line decision problem. Games Econom. Behav. 29, 7–35 (1999)
8. Fudenberg, D., Levine, D.K.: Conditional universal consistency. Games Econom. Behav. 29, 104–130 (1999)
9. Hannan, J.: Approximation to Bayes risk in repeated play. In: Contributions to the Theory of Games. Annals of Mathematics Studies, vol. 3(39), pp. 97–139. Princeton University Press, Princeton (1957)
10. Hart, S., Mas-Colell, A.: A simple adaptive procedure leading to correlated equilibrium. Econometrica 68, 1127–1150 (2000)
11. Hoeffding, W.: Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc. 58, 13–30 (1963)
12. Lehrer, E.: A wide range no-regret theorem. Games Econom. Behav. 42, 101–115 (2003)
13. Lehrer, E., Solan, E.: Learning to play partially-specified equilibrium (manuscript, 2007)
14. Lugosi, G., Mannor, S., Stoltz, G.: Strategies for prediction under imperfect monitoring. Math. Oper. Res. 33, 513–528 (2008)
15. Rustichini, A.: Minimizing regret: the general case. Games Econom. Behav. 29, 224–243 (1999)
16. Sandroni, A., Smorodinsky, R., Vohra, R.V.: Calibration with many checking rules. Math. Oper. Res. 28, 141–153 (2003)
17. Sorin, S.: Lectures on Dynamics in Games. Unpublished Lecture Notes (2008)
18. Vovk, V.: Non-asymptotic calibration and resolution. Theoret. Comput. Sci. 387, 77–89 (2007)
