Yevgeny Seldin, University of Copenhagen
ECML-PKDD 2015 Tutorial
What is Online Learning?
• A subfield of Machine Learning studying problems that involve interaction with an environment
Examples
• Investment in the stock market
• Online advertising / personalization / routing / …
• Games
• Robotics
• …
How is Online Learning different from “batch”?
• Batch Learning: Collect Data → Analyze → Apply
• Online Learning: a closed loop of Act → Analyze → Get More Data → Act → …
When do we need Online Learning?
• Interactive learning
• “Adversarial” game-theoretic settings
  – No i.i.d. assumptions
• Large-scale data analysis
The Space of Online Learning Problems

The feedback axis: full info → limited (bandit) feedback → partial monitoring. Examples along the axis:
• full info: stock market
• limited (bandit) feedback: medical treatments, advertising
• partial monitoring: censored feedback, dynamic pricing
The environmental resistance axis: stochastic (i.i.d.) → adversarial (crossed with the feedback axis: full / bandit / partial). Examples:
• adversarial: spam filtering, stock market
• stochastic (i.i.d.): medical treatments, weather prediction
The structural complexity axis: stateless → side info (state) → Markov Decision Processes (MDPs). Examples:
• medical records of different patients: side info (i.i.d. states)
• subsequent treatments of the same patient: an MDP (the state evolves with our actions)
The Space of Online Learning Problems: a metaphor. Environmental resistance is the opponent, structural complexity is the battlefield, and feedback is what the player gets to see.
Part I: “classical” algorithms

Three corners of the cube (axes: environmental resistance × structural complexity × feedback):
• Prediction with expert advice (adversarial, stateless, full feedback)
• Adversarial bandits (adversarial, stateless, bandit feedback)
• Stochastic bandits (i.i.d., stateless, bandit feedback)
Prediction with Expert Advice

Examples
• Experts = financial advisers
• Experts = different algorithms

Game Definition
For $t = 1, 2, \ldots$:
1. Observe the advice $\xi_t^1, \ldots, \xi_t^K$ of $K$ experts
2. Pick an expert $A_t$ to follow
3. Observe the losses $\ell_t^1, \ldots, \ell_t^K$ and suffer $\ell_t^{A_t}$

The advice ($\xi_t^a$) and the losses ($\ell_t^a$) form two $K \times T$ matrices whose columns fill in over time as the game unfolds.

Performance Measure: Regret
$$R_T = \sum_{t=1}^{T} \ell_t^{A_t} - \min_a \sum_{t=1}^{T} \ell_t^a$$
Prediction with Expert Advice – Simplification

Since only the losses matter for the analysis, drop the advice matrix and identify each expert with a row of the loss matrix.

Game Definition
For $t = 1, 2, \ldots$:
1. Pick a row $A_t$
2. Observe $\ell_t^1, \ldots, \ell_t^K$ and suffer $\ell_t^{A_t}$

Performance Measure: Regret
$$R_T = \sum_{t=1}^{T} \ell_t^{A_t} - \min_a \sum_{t=1}^{T} \ell_t^a$$
Assumptions
• The $\ell_t^a$-s are in $[0,1]$ and selected arbitrarily (adversarially)

Learning Goal
$$R_T = o(T)$$

Why compare to the best row rather than the best path? Against an adversary, competing with the best sequence of experts is hopeless (the adversary can make every algorithm suffer a constant loss per round while some path suffers none); competing with the best single expert in hindsight is achievable.
Prediction with Expert Advice – Algorithms

• The algorithm needs protection against the adversary
• The protection is randomization
• The adversary may know the algorithm, but not its random bits (the “oblivious setting”: the $\ell_t^a$-s are written down before the game starts)
The Hedge Algorithm (a.k.a. Exponential Weights)
[Vovk, 1990, Littlestone & Warmuth, 1994, …]

Input: learning rates $\eta_1 \geq \eta_2 \geq \cdots > 0$
$\forall a: \hat{L}_0(a) = 0$
for $t = 1, 2, \ldots$ do
  $\forall a: p_t(a) = \dfrac{e^{-\eta_t \hat{L}_{t-1}(a)}}{\sum_{a'} e^{-\eta_t \hat{L}_{t-1}(a')}}$
  Sample $A_t$ according to $p_t$ and play it
  Observe $\ell_t^1, \ldots, \ell_t^K$
  $\forall a: \hat{L}_t(a) = \hat{L}_{t-1}(a) + \ell_t^a$
end
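To make the pseudocode concrete, here is a minimal runnable sketch in Python (numpy assumed; the Bernoulli loss matrix and the seed are illustrative, not from the tutorial):

```python
import numpy as np

def hedge(losses, eta):
    """Run Hedge on a T x K loss matrix with a constant learning rate eta.

    Returns the expected loss at each round (playing the exponential-weights
    distribution p_t exactly instead of sampling A_t).
    """
    T, K = losses.shape
    L = np.zeros(K)                           # cumulative losses L_hat_t(a)
    expected_loss = np.empty(T)
    for t in range(T):
        w = np.exp(-eta * (L - L.min()))      # shift by min for numerical stability
        p = w / w.sum()                       # p_t(a) proportional to exp(-eta L_hat_{t-1}(a))
        expected_loss[t] = p @ losses[t]
        L += losses[t]                        # full information: all K losses observed
    return expected_loss

# Example: K = 5 experts, T = 10000 rounds, Bernoulli losses; expert 0 is best.
rng = np.random.default_rng(0)
T, K = 10_000, 5
biases = np.array([0.4, 0.5, 0.5, 0.5, 0.5])
losses = (rng.random((T, K)) < biases).astype(float)
eta = np.sqrt(2 * np.log(K) / T)              # the tuning from the analysis below
regret = hedge(losses, eta).sum() - losses.sum(axis=0).min()
print(f"regret: {regret:.1f}  vs  bound sqrt(2 T ln K) = {np.sqrt(2*T*np.log(K)):.1f}")
```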
Analysis (simplified for known $T$ and constant $\eta$)

Reminder: $p_t(a) = \dfrac{e^{-\eta \hat{L}_{t-1}(a)}}{\sum_{a'} e^{-\eta \hat{L}_{t-1}(a')}}$ and $\hat{L}_t(a) = \hat{L}_{t-1}(a) + \ell_t^a$.

Useful inequalities: for $x \leq 0$, $e^x \leq 1 + x + \frac{1}{2}x^2$; for any $x$, $1 + x \leq e^x$.

Let $W_t = \sum_a e^{-\eta \hat{L}_t(a)} = \sum_a e^{-\eta \ell_t^a} e^{-\eta \hat{L}_{t-1}(a)}$.

Calculation:
$$\frac{W_t}{W_{t-1}} = \sum_a e^{-\eta \ell_t^a} \frac{e^{-\eta \hat{L}_{t-1}(a)}}{\sum_{a'} e^{-\eta \hat{L}_{t-1}(a')}} = \sum_a e^{-\eta \ell_t^a} p_t(a)$$
$$\leq \sum_a \left(1 - \eta \ell_t^a + \frac{1}{2}\left(\eta \ell_t^a\right)^2\right) p_t(a) = 1 - \eta \sum_a \ell_t^a\, p_t(a) + \frac{\eta^2}{2} \sum_a \left(\ell_t^a\right)^2 p_t(a)$$
$$\leq e^{-\eta \sum_a \ell_t^a\, p_t(a) + \frac{\eta^2}{2} \sum_a \left(\ell_t^a\right)^2 p_t(a)}.$$
From the last step, telescoping over $t$:
$$\ln \frac{W_T}{W_0} \leq -\eta \sum_{t=1}^{T} \sum_a \ell_t^a\, p_t(a) + \frac{\eta^2}{2} \sum_{t=1}^{T} \sum_a \left(\ell_t^a\right)^2 p_t(a).$$

On the other hand, since $W_0 = K$:
$$\ln \frac{W_T}{W_0} = \ln \frac{\sum_a e^{-\eta \hat{L}_T(a)}}{K} \geq \ln \frac{\max_a e^{-\eta \hat{L}_T(a)}}{K} = -\eta \min_a \hat{L}_T(a) - \ln K.$$

Combining the two bounds:
$$-\eta \min_a \hat{L}_T(a) - \ln K \leq -\eta \sum_{t=1}^{T} \sum_a \ell_t^a\, p_t(a) + \frac{\eta^2}{2} \sum_{t=1}^{T} \sum_a \left(\ell_t^a\right)^2 p_t(a).$$
Calculation Summary
$$\underbrace{\sum_{t=1}^{T} \underbrace{\sum_a \ell_t^a\, p_t(a)}_{=\,\mathbb{E}\left[\ell_t^{A_t}\right]} - \min_a \hat{L}_T(a)}_{=\,\mathbb{E}[R_T]} \;\leq\; \frac{\ln K}{\eta} + \frac{\eta}{2} \underbrace{\sum_{t=1}^{T} \sum_a \underbrace{\left(\ell_t^a\right)^2}_{\leq 1} p_t(a)}_{\leq\, T}$$

Minimizing the right-hand side with respect to $\eta$ gives $\eta = \sqrt{\frac{2 \ln K}{T}}$.

Final Result
$$\mathbb{E}[R_T] \leq \sqrt{2 T \ln K}$$
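The minimization in the last step, spelled out (a standard single-variable calculation; $f$ is just a name we give the bound here):
$$f(\eta) = \frac{\ln K}{\eta} + \frac{\eta}{2} T, \qquad f'(\eta) = -\frac{\ln K}{\eta^2} + \frac{T}{2} = 0 \;\Longrightarrow\; \eta = \sqrt{\frac{2 \ln K}{T}}, \qquad f\!\left(\sqrt{\tfrac{2 \ln K}{T}}\right) = \sqrt{2 T \ln K}.$$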
Lower Bound (high-level idea)

Construction: the $\ell_t^a$-s are independent Bernoulli with bias $\frac{1}{2}$.

Lemma
$$\lim_{K \to \infty,\, T \to \infty} \frac{\overbrace{T/2 - \mathbb{E}\left[\min_a \hat{L}_T(a)\right]}^{\mathbb{E}[R_T]}}{\sqrt{\frac{1}{2} T \ln K}} = 1$$

(On this sequence any algorithm suffers expected loss $T/2$, so the numerator is exactly the expected regret.)

Conclusion
$$\mathbb{E}[R_T] = \Omega\left(\sqrt{T \ln K}\right)$$
Part I: “classical” algorithms – up next: adversarial bandits (the adversarial, stateless, bandit-feedback corner of the cube).
Adversarial Multiarmed Bandits

Game Definition
For $t = 1, 2, \ldots$:
1. Play an action $A_t$
2. Observe and suffer the loss $\ell_t^{A_t}$ (the other actions' losses stay hidden)

Performance Measure: Regret
$$R_T = \sum_{t=1}^{T} \ell_t^{A_t} - \min_a \sum_{t=1}^{T} \ell_t^a$$
Reminder: the Hedge algorithm updates $\hat{L}_t(a)$ with the observed losses of all $K$ experts. Under bandit feedback only $\ell_t^{A_t}$ is available, so EXP3 replaces the true losses with importance-weighted estimates.
The EXP3 Algorithm for Adversarial Bandits
[Auer et al., 2002; Bubeck, 2010]

Input: learning rates $\eta_1 \geq \eta_2 \geq \cdots > 0$
$\forall a: \tilde{L}_0(a) = 0$
for $t = 1, 2, \ldots$ do
  $\forall a: p_t(a) = \dfrac{e^{-\eta_t \tilde{L}_{t-1}(a)}}{\sum_{a'} e^{-\eta_t \tilde{L}_{t-1}(a')}}$
  Sample $A_t$ according to $p_t$ and play it
  Observe $\ell_t^{A_t}$
  $\forall a: \tilde{\ell}_t^a = \dfrac{\ell_t^a \mathbf{1}\{A_t = a\}}{p_t(a)} = \begin{cases} \ell_t^a / p_t(a), & \text{if } A_t = a \\ 0, & \text{otherwise} \end{cases}$ (importance-weighted sampling)
  $\forall a: \tilde{L}_t(a) = \tilde{L}_{t-1}(a) + \tilde{\ell}_t^a$
end
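A minimal runnable EXP3 sketch in Python, mirroring the pseudocode above. The anytime schedule $\eta_t = \sqrt{\ln K / (tK)}$ is the one quoted in the EXP3 reminder later in the deck; the analysis below instead fixes $\eta$ for a known $T$:

```python
import numpy as np

def exp3(loss_fn, K, T, rng):
    """Minimal EXP3 sketch: exponential weights on importance-weighted losses.

    loss_fn(t, a) returns the loss of arm a at round t in [0, 1]; it is
    queried only for the played arm, matching bandit feedback.
    """
    L = np.zeros(K)                           # L_tilde_t(a)
    total = 0.0
    for t in range(1, T + 1):
        eta = np.sqrt(np.log(K) / (t * K))    # anytime schedule (assumed)
        w = np.exp(-eta * (L - L.min()))      # shift for numerical stability
        p = w / w.sum()
        a = rng.choice(K, p=p)                # sample A_t ~ p_t and play it
        loss = loss_fn(t, a)
        total += loss
        L[a] += loss / p[a]                   # importance-weighted, unbiased estimate
    return total

rng = np.random.default_rng(0)
biases = np.array([0.4, 0.5, 0.5, 0.5, 0.5])
T = 100_000
total = exp3(lambda t, a: float(rng.random() < biases[a]), len(biases), T, rng)
print(f"EXP3 loss: {total:.0f}, best arm's expected loss: {biases.min() * T:.0f}")
```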
Properties of Importance-Weighted Sampling

Notation: $\mathbb{E}_t[\cdot] = \mathbb{E}[\cdot \mid \text{everything up to round } t]$.

Expectation (unbiasedness):
$$\mathbb{E}_t\left[\tilde{\ell}_t^a\right] = \mathbb{E}_t\left[\frac{\ell_t^a \mathbf{1}\{A_t = a\}}{p_t(a)}\right] = \frac{\ell_t^a \overbrace{\mathbb{E}_t\left[\mathbf{1}\{A_t = a\}\right]}^{=\, p_t(a)}}{p_t(a)} = \ell_t^a$$

Second moment:
$$\mathbb{E}_t\left[\left(\tilde{\ell}_t^a\right)^2\right] = \frac{\overbrace{\left(\ell_t^a\right)^2}^{\leq 1} \overbrace{\mathbb{E}_t\left[\mathbf{1}\{A_t = a\}^2\right]}^{=\, p_t(a)}}{p_t(a)^2} \leq \frac{1}{p_t(a)}$$

and therefore
$$\mathbb{E}_t\left[\sum_a p_t(a) \left(\tilde{\ell}_t^a\right)^2\right] \leq K.$$
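A quick Monte Carlo sanity check of the two properties above (unbiasedness, and the bound $K$ on the weighted second moment); the particular $p_t$ and loss vector are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4
p = np.array([0.1, 0.2, 0.3, 0.4])           # the playing distribution p_t
ell = rng.random(K)                          # a fixed loss vector in [0, 1]

n = 1_000_000                                # replay the single round n times
A = rng.choice(K, size=n, p=p)
tilde = np.zeros((n, K))
tilde[np.arange(n), A] = ell[A] / p[A]       # importance-weighted loss vectors

print("E[l_tilde] ~", np.round(tilde.mean(axis=0), 3), "vs l =", np.round(ell, 3))
weighted_second_moment = p @ (tilde ** 2).mean(axis=0)
print("sum_a p(a) E[(l_tilde^a)^2] ~", round(weighted_second_moment, 3), "<= K =", K)
```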
Analysis (simplified for known $T$ and constant $\eta$)

Following the calculation in the analysis of Hedge, applied to the estimated losses:
$$\sum_{t=1}^{T} \sum_a \tilde{\ell}_t^a\, p_t(a) - \min_a \tilde{L}_T(a) \leq \frac{\ln K}{\eta} + \frac{\eta}{2} \sum_{t=1}^{T} \sum_a \left(\tilde{\ell}_t^a\right)^2 p_t(a).$$

Taking expectations on both sides and propagating them inside (using $\mathbb{E}[\min(\cdot)] \leq \min \mathbb{E}[\cdot]$ and the tower rule):
$$\underbrace{\sum_{t=1}^{T} \mathbb{E}\Big[\sum_a \ell_t^a\, p_t(a)\Big] - \min_a L_T(a)}_{=\,\mathbb{E}[R_T]} \;\leq\; \frac{\ln K}{\eta} + \frac{\eta}{2} \underbrace{\sum_{t=1}^{T} \mathbb{E}\Big[\underbrace{\mathbb{E}_t\Big[\sum_a \big(\tilde{\ell}_t^a\big)^2 p_t(a)\Big]}_{\leq\, K}\Big]}_{\leq\, KT},$$
where we used $\mathbb{E}_t[\tilde{\ell}_t^a] = \ell_t^a$ (so that $\mathbb{E}[\tilde{L}_T(a)] = L_T(a)$ and $\mathbb{E}[\sum_a \tilde{\ell}_t^a\, p_t(a)] = \mathbb{E}[\ell_t^{A_t}]$).

Calculation summary:
$$\mathbb{E}[R_T] \leq \frac{\ln K}{\eta} + \frac{\eta}{2} K T$$

Minimizing with respect to $\eta$: $\eta = \sqrt{\frac{2 \ln K}{K T}}$.

Final Result
$$\mathbb{E}[R_T] \leq \sqrt{2 K T \ln K}$$

• In comparison with full information we pay an extra $\sqrt{K}$ factor
• The $\sqrt{\ln K}$ factor can be eliminated with more sophisticated algorithms (e.g., INF)
Lower Bound (high-level idea)

Construct $K + 1$ games, each a $K \times T$ matrix of Bernoulli losses:
• 0-th game: all $\ell_t^a$-s are Bernoulli with bias $\frac{1}{2}$
• $i$-th game: the $\ell_t^i$-s are Bernoulli with bias $\frac{1}{2} - \varepsilon$, and for $a \neq i$ the $\ell_t^a$-s are Bernoulli with bias $\frac{1}{2} + \varepsilon$

Claim
For small $\varepsilon$, the 0-th game is indistinguishable from the $i$-th game based on $T$ observations. For $\varepsilon = \theta\left(\sqrt{K/T}\right)$:
$$\mathbb{E}[R_T] = \Omega\left(\sqrt{K T}\right)$$
Part I: “classical” algorithms – up next: stochastic bandits (the i.i.d., stateless, bandit-feedback corner of the cube).
Stochastic Multiarmed Bandits

Game Definition
The $\ell_t^a$-s are independent, with $\mathbb{E}[\ell_t^a] = \mu(a)$.
For $t = 1, 2, \ldots$:
1. Play an action $A_t$
2. Observe and suffer $\ell_t^{A_t}$

Notations
• $N_t(a)$: the number of times $a$ was played up to round $t$
• Gaps: $\Delta(a) = \mu(a) - \min_{a'} \mu(a')$

Performance: Expected Regret
$$\mathbb{E}[R_T] = \mathbb{E}\left[\sum_{t=1}^{T} \ell_t^{A_t}\right] - T \min_a \mu(a) = \sum_a \mathbb{E}[N_T(a)]\, \Delta(a)$$

Historical remark: the problem was originally formulated with gains, $r_t^a = 1 - \ell_t^a$.
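The second equality in the regret definition, spelled out via the tower rule (each play of arm $a$ contributes $\Delta(a)$ to the expected regret):
$$\mathbb{E}[R_T] = \sum_{t=1}^{T}\left(\mathbb{E}[\mu(A_t)] - \min_a \mu(a)\right) = \sum_{t=1}^{T} \mathbb{E}[\Delta(A_t)] = \sum_a \Delta(a)\, \mathbb{E}\left[\sum_{t=1}^{T} \mathbf{1}\{A_t = a\}\right] = \sum_a \mathbb{E}[N_T(a)]\, \Delta(a).$$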
LCB1 (Lower Confidence Bound) Algorithm
Originally UCB1 (Upper Confidence Bound), [Auer et al., 2002]; stated here for losses.

Initialization: play each arm once.
for $t = 1, 2, \ldots$ do
  Let $\hat{L}_t(a)$ be the average loss of $a$ up to $t$
  Play $A_t = \arg\min_a \underbrace{\hat{L}_t(a) - \sqrt{\dfrac{3 \ln t}{2 N_t(a)}}}_{\mathrm{LCB}_t(a)}$
end
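A minimal runnable LCB1 sketch in Python (losses in $[0,1]$; the two-armed Bernoulli instance is an illustrative assumption):

```python
import numpy as np

def lcb1(pull, K, T):
    """Minimal LCB1 sketch (UCB1 stated for losses in [0, 1]).

    pull(a) draws one loss of arm a.
    """
    N = np.zeros(K)                          # N_t(a): number of plays
    S = np.zeros(K)                          # cumulative observed loss
    for a in range(K):                       # initialization: play each arm once
        S[a] += pull(a)
        N[a] += 1
    for t in range(K + 1, T + 1):
        lcb = S / N - np.sqrt(3 * np.log(t) / (2 * N))   # LCB_t(a)
        a = int(np.argmin(lcb))              # optimism in the face of uncertainty
        S[a] += pull(a)
        N[a] += 1
    return N

rng = np.random.default_rng(0)
biases = np.array([0.4, 0.5])                # two Bernoulli arms, gap 0.1
N = lcb1(lambda a: float(rng.random() < biases[a]), 2, 100_000)
print("plays per arm:", N)                   # the bad arm gets O(ln T / gap^2) plays
```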
Hoeffding’s Inequality (simplified)
Let $X_1, \ldots, X_N$ be i.i.d. with $\mathbb{E}[X_i] = \mu$. Then:
$$\mathbb{P}\left\{\frac{1}{N}\sum_{i=1}^{N} X_i - \mu \geq \sqrt{\frac{3 \ln t}{2N}}\right\} \leq \frac{1}{t^3}, \qquad \mathbb{P}\left\{\mu - \frac{1}{N}\sum_{i=1}^{N} X_i \geq \sqrt{\frac{3 \ln t}{2N}}\right\} \leq \frac{1}{t^3}.$$
Key Properties of LCB

Optimism in the face of uncertainty:
$$\mathrm{LCB}_t(a) = \hat{L}_t(a) - \sqrt{\frac{3 \ln t}{2 N_t(a)}}$$

• $\mathbb{E}\left[\hat{L}_t(a)\right] = \mu(a)$
• Warning: $N_t(a)$ is a random variable, so Hoeffding cannot be applied directly. Instead, let $X_1, \ldots, X_t$ be i.i.d. with $\mathbb{E}[X_s] = \mu(a)$ and $\hat{\mu}_s = \frac{1}{s}\sum_{r=1}^{s} X_r$, and take a union bound over the possible values of $N_t(a)$:
$$\mathbb{P}\{\mathrm{LCB}_t(a) \geq \mu(a)\} \leq \mathbb{P}\left\{\exists s \in \{1, \ldots, t\}: \hat{\mu}_s - \sqrt{\tfrac{3 \ln t}{2 s}} \geq \mu(a)\right\} \leq \sum_{s=1}^{t} \mathbb{P}\left\{\hat{\mu}_s - \mu(a) \geq \sqrt{\tfrac{3 \ln t}{2 s}}\right\} \leq t \times \frac{1}{t^3} = \frac{1}{t^2}$$
• Bottom line: the probability that $\mathrm{LCB}_t(a) \geq \mu(a)$ is small
LCB1 Analysis Highlights (for two arms, $a^*$ and $a$)

• $\Delta = \mu(a) - \mu(a^*)$
• W.p. $\geq 1 - \frac{1}{t^2}$: $\mathrm{LCB}_t(a^*) \leq \mu(a^*)$
• W.p. $\geq 1 - \frac{1}{t^2}$: $\hat{L}_t(a) \geq \mu(a) - \sqrt{\frac{3 \ln t}{2 N_t(a)}}$ (by a mirror calculation) …
• … and thus $\mathrm{LCB}_t(a) = \hat{L}_t(a) - \sqrt{\frac{3 \ln t}{2 N_t(a)}} \geq \mu(a) - 2\sqrt{\frac{3 \ln t}{2 N_t(a)}}$
• In expectation, the number of rounds on which either of the two confidence bounds fails is bounded by $2 \sum_{s=1}^{t} \frac{1}{s^2} \leq \frac{\pi^2}{3}$
• Fix the time horizon $T$
• Once $N_t(a) > \frac{6 \ln T}{\Delta^2}$ we have $2\sqrt{\frac{3 \ln T}{2 N_t(a)}} < \Delta$
• Thus $\mathrm{LCB}_t(a) > \mu(a) - \Delta = \mu(a^*) \geq \mathrm{LCB}_t(a^*)$ (whenever both bounds hold), so $a$ is no longer played
• Therefore, in expectation, arm $a$ is played no more than $\frac{6 \ln T}{\Delta^2} + 1 + \frac{\pi^2}{3}$ times
• $\mathbb{E}[R_T] = \Delta\, \mathbb{E}[N_T(a)] \leq \frac{6 \ln T}{\Delta} + \left(1 + \frac{\pi^2}{3}\right)\Delta$
Lower Bound [Lai & Robbins, 1985]
$$\liminf_{T \to \infty} \frac{\mathbb{E}[R_T]}{\ln T} \geq \sum_{a: \Delta(a) > 0} \frac{\Delta(a)}{K_{\inf}(\nu_a, \mu(a^*))},$$
where $K_{\inf}(\nu_a, \mu(a^*))$ is the minimal KL-divergence between the reward distribution $\nu_a$ of arm $a$ and a suitable distribution with mean lower bounded by $\mu(a^*)$.

Simplified
• $K_{\inf}(\nu_a, \mu(a^*)) \geq 2\Delta(a)^2$
• When the $\ell_t^a$-s are Bernoulli with $\mu(a)$ close to $\frac{1}{2}$, $K_{\inf}(\nu_a, \mu(a^*)) \approx 2\Delta(a)^2$, and
$$\mathbb{E}[R_T] = \theta\left(\sum_{a: \Delta(a) > 0} \frac{\ln T}{\Delta(a)}\right)$$
Other Popular Algorithms

KL-UCB
• Cappé, Garivier, Maillard, Munos, Stoltz. Kullback-Leibler upper confidence bounds for optimal sequential allocation. Annals of Statistics, 2013
• Replaces Hoeffding’s inequality with a tighter KL concentration inequality
• Matches the lower bound

Thompson Sampling
• Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 1933
• Kaufmann, Korda, Munos. Thompson sampling: an asymptotically optimal finite time analysis. ALT, 2012
• A Bayesian playing strategy
• Matches the lower bound
Part I: “classical” algorithms – up next: bandits with side information and MDPs (moving up the structural-complexity axis).
Bandits with Side Information (a.k.a. Contextual Bandits)

Version 1: Multiarmed Bandits with Expert Advice
For $t = 1, 2, \ldots$:
1. Observe the advice of $N$ experts in the form of $N$ distributions $p_{t,h}(\cdot)$ over $K$ arms, where $h \in \{1, \ldots, N\}$ indexes the experts
2. Play one arm, observe and suffer the loss of that arm

Algorithm: EXP4 [Auer et al., 2002]
For $t = 1, 2, \ldots$:
• Mix the advice into $p_t(a) \propto \sum_h p_{t,h}(a)\, e^{-\eta_t \tilde{L}_{t-1}(h)}$
• Pick an arm $A_t$ according to $p_t(a)$
• Use importance-weighted sampling to track the expert losses $\tilde{L}_t(h)$

Regret Bound
$$\mathbb{E}[R_T] = O\left(\sqrt{K T \ln N}\right)$$

Version 2: Multiarmed Bandits with Side Info
For $t = 1, 2, \ldots$:
1. Observe side info (= state) $S_t$
2. Play arm $A_t$, observe and suffer the loss $\ell(A_t, S_t)$

Expert advice is a special case of side info: side info = the advice vector.

Inverse reduction for a finite state space $S$: take the experts to be the set of all possible functions $h: S \to \{1, \ldots, K\}$, so $N = K^{|S|}$ and
$$\mathbb{E}[R_T] = O\left(\sqrt{K T |S| \ln K}\right).$$

Structural complexity enters as $\ln N = |S| \ln K$.
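A minimal EXP4 sketch in Python, following the three bullets above; the anytime tuning $\eta_t = \sqrt{2 \ln N / (K t)}$ is an assumed stand-in for a known-horizon tuning:

```python
import numpy as np

def exp4(advice_fn, loss_fn, N, K, T, rng):
    """Minimal EXP4 sketch.

    advice_fn(t) returns an N x K matrix whose rows are the experts'
    advice distributions p_{t,h}(.); loss_fn(t, a) returns the loss of
    the played arm only (bandit feedback).
    """
    L = np.zeros(N)                          # tracked expert losses L_tilde_t(h)
    for t in range(1, T + 1):
        eta = np.sqrt(2 * np.log(N) / (t * K))   # assumed anytime tuning
        advice = advice_fn(t)                # shape (N, K)
        w = np.exp(-eta * (L - L.min()))
        q = w / w.sum()                      # distribution over experts
        p = q @ advice                       # mixed arm distribution p_t(a)
        a = rng.choice(K, p=p)
        tilde = loss_fn(t, a) / p[a]         # importance-weighted loss of the played arm
        L += advice[:, a] * tilde            # expert h is charged p_{t,h}(a) * tilde
    return L

rng = np.random.default_rng(0)
N, K, T = 8, 3, 50_000
fixed_advice = rng.dirichlet(np.ones(K), size=N)   # static experts, for illustration
L = exp4(lambda t: fixed_advice, lambda t, a: rng.random(), N, K, T, rng)
```

Each expert is charged the importance-weighted loss of the played arm in proportion to the probability its advice assigned to that arm, which keeps the tracked expert losses unbiased.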
Markov Decision Processes (MDPs)

Game Definition
Start from state $S_1$. For $t = 1, 2, \ldots$:
1. Play an action $A_t$
2. Observe and suffer the loss $\ell(A_t, S_t)$
3. Transfer to state $S_{t+1} \sim p(S_{t+1} \mid A_t, S_t)$

Major difference with bandits with side info: $S_{t+1}$ depends on $A_t$.

Complexity of MDPs
1. The size of the state space (same as in bandits with side info)
2. The mixing time
“Classical” Algorithms Summary
• Prediction with expert advice: $\sqrt{T \ln K}$
• Adversarial bandits: $\sqrt{KT}$
• Stochastic bandits: $\sum_{a \neq a^*} \frac{\ln T}{\Delta(a)}$
• Adversarial bandits with expert advice: $\sqrt{KT \ln N}$, or $\sqrt{KT |S| \ln K}$ with side info
• MDPs
Simplicities Along the Axes
• Feedback: the more, the simpler
• Environmental resistance: gaps (between action outcomes), variance (of action outcomes)
• Structural complexity: reducibility of the state space (relevance of side info), mixing time (in MDPs)
“Classical” algorithms assume a certain form of simplicity and exploit it. The remaining topics move between the corners of the cube:

• Environmental resistance in full info [Koolen & van Erven, COLT 2015; Luo & Schapire, COLT 2015; Wintenberger, 2015; van Erven, Kotłowski & Warmuth, COLT 2014; Gaillard, Stoltz & van Erven, COLT 2014; …]
• Prediction with limited advice [Seldin, Bartlett, Crammer, Abbasi-Yadkori, ICML 2014; Kale, COLT 2014]
• Bandits with paid observations [Seldin, Bartlett, Crammer, Abbasi-Yadkori, ICML 2014]
• Contaminated stochastic bandits [Seldin & Slivkins, ICML 2014]
• Filtering of relevant side info [Seldin, Auer, Laviolette, Shawe-Taylor, Ortner, NIPS 2011]

Now in detail.
Putting All in One Language

All of these problems share the same $K \times T$ loss matrix; they differ in the feedback and in how the losses are generated.

Feedback (entries observed per column):
• Expert advice: $K/K$
• Limited advice: $M/K$
• Bandits: $1/K$
• Paid observations: $0/K$ (observations must be bought)

Loss generation:
• Adversarial (deterministic)
• Contaminated stochastic
• Stochastic ($\mathbb{E}[\ell_t^a] = \mu(a)$)

Regret
$$\mathbb{E}[R_T] = \mathbb{E}\left[\sum_{t=1}^{T} \ell_t^{A_t}\right] - \min_a \mathbb{E}\left[\sum_{t=1}^{T} \ell_t^a\right]$$
Prediction with limited advice
[Seldin, Bartlett, Crammer, Abbasi-Yadkori, ICML 2014; Kale, COLT 2014]
Prediction with Limited Advice

Motivation: we can observe the advice of $M$ out of $K$ experts, for $M \leq K$.

Examples
• Experts are computationally expensive functions (or algorithms) and we have a constraint on the response time
• Experts are humans who have to be paid

Notations
• $O_t \subseteq \{1, \ldots, K\}$: the set of observed experts
• $|O_t| = M_t$: the number of observed experts, $1 \leq M_t \leq K$

Game Definition
For $t = 1, 2, \ldots$:
1. Pick $(O_t, A_t)$ such that $A_t \in O_t$ and follow the advice of $A_t$
2. Observe $\ell_t^a$ for $a \in O_t$ and suffer $\ell_t^{A_t}$
General Picture

Observations per round:
• Prediction with expert advice ($M = K$): $\ell_t^1, \ldots, \ell_t^K$
• Prediction with limited advice: $\{\ell_t^a : a \in O_t, |O_t| = M\}$
• Bandits ($M = 1$): $\ell_t^{A_t}$

Regret upper bounds: $O\big(\sqrt{T \ln K}\big)$, $O\big(\sqrt{\tfrac{K}{M} T \ln K}\big)$, and $O\big(\sqrt{KT}\big)$, respectively.
Regret lower bounds: $\Omega\big(\sqrt{T \ln K}\big)$, $\Omega\big(\sqrt{\tfrac{K}{M} T}\big)$, and $\Omega\big(\sqrt{KT}\big)$.

• The $(\ln K)$ gaps can be closed
• For time-dependent $M_t$ the regret is $O\left(\sqrt{K \left(\sum_{t=1}^{T} \tfrac{1}{M_t}\right) \ln K}\right)$
Reminder: Hedge observes all $K$ losses each round and adds them to $\hat{L}_t(a)$; EXP3 observes only $\ell_t^{A_t}$ and adds an importance-weighted estimate to $\tilde{L}_t(a)$. The algorithm below interpolates between the two.
Algorithm for Prediction with Limited Advice

Input: $M_1, M_2, \ldots$ and learning rates $\eta_1 \geq \eta_2 \geq \cdots$
$\forall a: \tilde{L}_0(a) = 0$
for $t = 1, 2, \ldots$ do
  $\forall a: p_t(a) = \dfrac{e^{-\eta_t \tilde{L}_{t-1}(a)}}{\sum_{a'} e^{-\eta_t \tilde{L}_{t-1}(a')}}$
  Sample $A_t$ according to $p_t$ and play it ($A_t \in O_t$)
  Sample $M_t - 1$ additional experts (the rest of $O_t$) uniformly
  Observe $\ell_t^a$ for $a \in O_t$
  $\forall a: \tilde{\ell}_t^a = \dfrac{\ell_t^a\, \mathbf{1}\{a \in O_t\}}{p_t(a) + (1 - p_t(a)) \frac{M_t - 1}{K - 1}}$
  $\forall a: \tilde{L}_t(a) = \tilde{L}_{t-1}(a) + \tilde{\ell}_t^a$
end
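A runnable sketch of this algorithm in Python; constant $M_t = M$ and the anytime schedule $\eta_t = \sqrt{M \ln K / (Kt)}$ (an assumed analogue of the tuning in the analysis below) are used for simplicity:

```python
import numpy as np

def limited_advice(losses, M, rng):
    """Minimal sketch for prediction with limited advice (constant M_t = M).

    losses is a T x K matrix; only the M observed entries per round are used.
    """
    T, K = losses.shape
    L = np.zeros(K)                          # L_tilde_t(a)
    total = 0.0
    for t in range(1, T + 1):
        eta = np.sqrt(M * np.log(K) / (t * K))   # assumed anytime tuning
        w = np.exp(-eta * (L - L.min()))
        p = w / w.sum()
        a_play = rng.choice(K, p=p)          # A_t, always in O_t
        others = [a for a in range(K) if a != a_play]
        extra = rng.choice(others, size=M - 1, replace=False)
        total += losses[t - 1, a_play]
        # marginal P(a in O_t) = p(a) + (1 - p(a)) (M - 1) / (K - 1)
        obs_prob = p + (1 - p) * (M - 1) / (K - 1)
        for a in np.append(extra, a_play).astype(int):
            L[a] += losses[t - 1, a] / obs_prob[a]
    return total

rng = np.random.default_rng(0)
T, K, M = 50_000, 10, 3
biases = np.full(K, 0.5); biases[0] = 0.45
losses = (rng.random((T, K)) < biases).astype(float)
print("regret:", limited_advice(losses, M, rng) - losses.sum(axis=0).min())
```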
Analysis Idea

Upper bound: by the analysis of Hedge/EXP3,
$$\mathbb{E}[R_T] \leq \frac{\ln K}{\eta_T} + \mathbb{E}\left[\sum_{t=1}^{T} \frac{\eta_t}{2} \sum_{a=1}^{K} p_t(a) \left(\tilde{\ell}_t^a\right)^2\right]$$
and we have
$$\mathbb{E}_t\left[\sum_{a=1}^{K} p_t(a) \left(\tilde{\ell}_t^a\right)^2\right] \leq \frac{K}{M_t}.$$
By tuning $\eta_t$:
$$\mathbb{E}[R_T] \leq O\left(\sqrt{K \left(\sum_{t=1}^{T} \frac{1}{M_t}\right) \ln K}\right) = O\left(\sqrt{\frac{K}{M} T \ln K}\right).$$

Lower bound: similar to bandits (indistinguishability of $K$ games), just with $MT$ observations:
$$\mathbb{E}[R_T] = \Omega\left(\sqrt{\frac{K}{M} T}\right).$$
Bandits with paid observations
[Seldin, Bartlett, Crammer, Abbasi-Yadkori, ICML 2014]
Multiarmed Bandits with Paid Observations

Motivation: how to deal with a situation where we have to pay for observations? The loss of any arm can be observed, but each observation has a known cost $c_t(a)$.

Example: signing contracts with service providers; $c_t(a)$ is the inspection cost.

Notations
• $A_t$: the arm played
• $O_t \subseteq \{1, \ldots, K\}$: the set of observed arms

Game Definition
For $t = 1, 2, \ldots$:
1. Pick $(A_t, O_t)$ and play $A_t$ ($A_t$ is not necessarily in $O_t$)
2. Observe $\ell_t^a$ for $a \in O_t$ and suffer $\ell_t^{A_t} + \sum_{a \in O_t} c_t(a)$
Performance Measure: Cost-Sensitive Expected Regret
$$\mathbb{E}[R_T^c] = \underbrace{\mathbb{E}\left[\sum_{t=1}^{T} \ell_t^{A_t}\right] - \min_a \sum_{t=1}^{T} \ell_t^a}_{\mathbb{E}[R_T]} + \mathbb{E}\left[\sum_{t=1}^{T} \sum_{a \in O_t} c_t(a)\right]$$

Lower Bound
Assume the algorithm makes $MT$ observations and $c_t(a) = c$:
$$\mathbb{E}[R_T^c] = \mathbb{E}[R_T] + cMT = \Omega\left(\sqrt{\frac{K}{M} T} + cMT\right) \geq \Omega\left((cK)^{1/3} T^{2/3}\right)$$
Algorithm
$\forall a: \tilde{L}_0(a) = 0$
for $t = 1, 2, \ldots$ do
  $\forall a: p_t(a) = \dfrac{e^{-\eta_t \tilde{L}_{t-1}(a)}}{\sum_{a'} e^{-\eta_t \tilde{L}_{t-1}(a')}}$
  Sample $A_t$ according to $p_t$ and play it
  $\forall a$: query the loss of $a$ with probability
  $$q_t(a) = \min\left\{1, \sqrt{\frac{\eta_t\, p_t(a)}{2\, c_t(a)}}\right\}$$
  (a trade-off between relative arm quality $p_t(a)$ and observation cost $c_t(a)$)
  $\forall a: \tilde{\ell}_t^a = \dfrac{\ell_t^a}{q_t(a)} \mathbf{1}\{a \in O_t\}$
  $\forall a: \tilde{L}_t(a) = \tilde{L}_{t-1}(a) + \tilde{\ell}_t^a$
end
The learning rate $\eta_t$ is tuned based on $p_1(\cdot), \ldots, p_{t-1}(\cdot)$ and $c_1(\cdot), \ldots, c_{t-1}(\cdot)$.
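A runnable sketch in Python; the learning-rate schedule is left as a caller-supplied function, since the data-dependent tuning of $\eta_t$ is the one part this sketch does not reproduce (the schedule in the usage example is an arbitrary illustration):

```python
import numpy as np

def paid_obs(loss_fn, costs, K, T, rng, eta_fn):
    """Minimal sketch of multiarmed bandits with paid observations.

    loss_fn(t, a): loss of arm a at round t, revealed only if we pay costs[a].
    eta_fn(t): learning-rate schedule (the paper tunes eta_t from the past
    p_s(.) and c_s(.); not reproduced here).
    """
    L = np.zeros(K)
    total_cost = 0.0                         # suffered losses plus observation costs
    for t in range(1, T + 1):
        eta = eta_fn(t)
        w = np.exp(-eta * (L - L.min()))
        p = w / w.sum()
        a_play = rng.choice(K, p=p)
        total_cost += loss_fn(t, a_play)     # A_t is played but not necessarily observed
        q = np.minimum(1.0, np.sqrt(eta * p / (2 * costs)))   # query probabilities q_t(a)
        observed = rng.random(K) < q
        total_cost += costs[observed].sum()  # pay for every observation
        for a in np.flatnonzero(observed):
            L[a] += loss_fn(t, a) / q[a]     # importance-weighted update
    return total_cost

rng = np.random.default_rng(0)
K, T, c = 5, 50_000, 0.1
biases = np.array([0.4, 0.5, 0.5, 0.5, 0.5])
out = paid_obs(lambda t, a: float(rng.random() < biases[a]),
               np.full(K, c), K, T, rng, lambda t: 0.05 * t ** (-2 / 3))
print("total loss + observation costs:", out)
```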
Analysis
By the analysis of Hedge/EXP3:
$$\mathbb{E}[R_T] \leq \frac{\ln K}{\eta_T} + \mathbb{E}\left[\sum_{t=1}^{T} \frac{\eta_t}{2} \sum_{a=1}^{K} p_t(a) \left(\tilde{\ell}_t^a\right)^2\right]$$
and
$$\mathbb{E}_t\left[\left(\tilde{\ell}_t^a\right)^2\right] \leq \frac{1}{q_t(a)}.$$
Thus:
$$\mathbb{E}[R_T^c] \leq \frac{\ln K}{\eta_T} + \mathbb{E}\left[\sum_{t=1}^{T} \sum_{a=1}^{K} \left(\frac{\eta_t}{2} \frac{p_t(a)}{q_t(a)} + c_t(a)\, q_t(a)\right)\right].$$
We have to minimize
$$\sum_{a=1}^{K} \left(\frac{\eta_t\, p_t(a)}{2\, q_t(a)} + c_t(a)\, q_t(a)\right)$$
over $q_t(\cdot)$, which is achieved by
$$q_t(a) = \min\left\{1, \sqrt{\frac{\eta_t\, p_t(a)}{2\, c_t(a)}}\right\}.$$
Results

Simplified upper bound for $c_t(a) = c$:
$$R_T^c \lesssim (32 c \ln K)^{1/3} \Bigg(\sum_{t=1}^{T} \underbrace{\sum_{a=1}^{K} \sqrt{p_t(a)}}_{1 \,\leq\, \cdot \,\leq\, \sqrt{K}}\Bigg)^{2/3} + 2\sqrt{T \ln K}$$
• Worst case: $R_T^c \leq (32 c K \ln K)^{1/3}\, T^{2/3} + 2\sqrt{T \ln K}$
• Favorable case (one dominating arm): $R_T^c \to (32 c \ln K)^{1/3}\, T^{2/3} + 2\sqrt{T \ln K}$

General upper bound:
$$R_T^c \lesssim (32 \ln K)^{1/3} \Bigg(\sum_{t=1}^{T} \underbrace{\sum_{a=1}^{K} \sqrt{p_t(a)\, c_t(a)}}_{\sqrt{c_t(a')} \,\leq\, \cdot \,\leq\, \sqrt{\sum_a c_t(a)}}\Bigg)^{2/3} + 2\sqrt{T \ln K}$$
• Worst case: $R_T^c \lesssim (32 \ln K)^{1/3} \left(\sum_{t=1}^{T} \sqrt{\sum_{a=1}^{K} c_t(a)}\right)^{2/3} + 2\sqrt{T \ln K}$
• Favorable case (one dominating arm $a^*$): $R_T^c \to (32 \ln K)^{1/3} \left(\sum_{t=1}^{T} \sqrt{c_t(a^*)}\right)^{2/3} + 2\sqrt{T \ln K}$
Bandits with Paid Observations: Summary
• Adaptation to the cost of information gathering
• Balance between problem complexity and information cost: $q_t(a) = \min\left\{1, \sqrt{\frac{\eta_t\, p_t(a)}{2\, c_t(a)}}\right\}$
• Automatic tuning of the learning rate $\eta_t$
Stochastic and adversarial bandits
[Seldin & Slivkins, ICML 2014]
Loss Generation Models

Adversarial Regime
The $\ell_t^a$-s are picked by an adversary in an arbitrary way.

Stochastic Regime
The $\ell_t^a$-s are drawn independently at random, so that $\mathbb{E}[\ell_t^a] = \mu(a)$.
• $\Delta(a) = \mu(a) - \min_{a'} \mu(a')$: the gap
• $\Delta = \min_{a: \Delta(a) > 0} \Delta(a)$: the minimal gap

Moderately Contaminated Stochastic Regime (NEW)
A stochastic regime where the adversary can contaminate
• up to $t\Delta(a)/4$ entries of suboptimal actions
• up to $t\Delta/4$ entries of optimal actions

Adversarial Regime with a Gap (NEW)
Let $\lambda_t(a) = \sum_{s=1}^{t} \ell_s^a$. There exists a consistent minimizer $a_\tau^*$ of $\lambda_t(a)$ for all $t \geq \tau$, and $\Delta(\tau, a) = \min_{t \geq \tau} \frac{1}{t}\left(\lambda_t(a) - \lambda_t(a_\tau^*)\right)$ is the deterministic gap.
Can we have one algorithm that performs “well” in all the regimes? (without prior knowledge of the regime type)
Classical Results

Adversarial Regime
• Lower bound: $\Omega\left(\sqrt{Kt}\right)$ [Auer et al., 1995]
• EXP3: $O\left(\sqrt{Kt \ln K}\right)$ [Auer et al., 2002]
• INF: $O\left(\sqrt{Kt}\right)$ [Audibert & Bubeck, 2009]

Stochastic Regime
• Lower bound: $\Omega\left(\sum_{a: \Delta(a) > 0} \frac{\ln t}{\Delta(a)}\right)$ [Lai & Robbins, 1985]
• UCB1: $O\left(\sum_{a: \Delta(a) > 0} \frac{\ln t}{\Delta(a)}\right)$ [Auer et al., 2002]
• KL-UCB, Thompson sampling, EwS, …: $O\left(\sum_{a: \Delta(a) > 0} \frac{\ln t}{\Delta(a)}\right)$

• Algorithms for the stochastic regime are inapplicable in the adversarial regime (linear regret)
• Algorithms for the adversarial regime are suboptimal in the stochastic regime

SAO [Bubeck & Slivkins, 2012]
+ $O\left(\sqrt{TK}\, (\ln T)^{3/2} \ln K\right)$ regret in the adversarial regime
+ $O\left(\frac{K (\ln T)^2 \ln K}{\Delta}\right)$ regret in the stochastic regime
− Does not cover the intermediate regimes
− Relatively complicated and unnatural for the problem
− Relies on the knowledge of the time horizon $T$
− Based on a one-time irreversible transition from stochastic to adversarial operation mode
The EXP3++ Algorithm [Seldin & Slivkins, 2014]
+ A simple and natural generalization of the EXP3 algorithm
+ $O\left(\sqrt{Kt \ln K}\right)$ regret in the adversarial regime
+ $O\left(\sum_{a: \Delta(a) > 0} \frac{(\ln t)^3}{\Delta(a)}\right)$ regret in the stochastic regime
+ $O\left(\sum_{a: \Delta(a) > 0} \frac{(\ln t)^3}{\Delta(a)}\right)$ regret in the moderately contaminated stochastic regime
+ $O\left(\min_\tau \left\{\tau + \sum_{a: \Delta(\tau, a) > 0} \frac{(\ln t)^3}{\Delta(\tau, a)}\right\}\right)$ regret in the adversarial regime with a gap
Reminder: EXP3
Control lever: $\eta_t = \sqrt{\frac{\ln K}{tK}}$. The algorithm is exponential weights over the importance-weighted cumulative losses $\tilde{L}_t(a)$, as above.
The EXP3++ Algorithm
Control levers: $\eta_t$ and the exploration parameters $\varepsilon_t(a)$.

$\forall a: \tilde{L}_0(a) = 0$
for $t = 1, 2, \ldots$ do
  $\forall a: p_t(a) = \dfrac{e^{-\eta_t \tilde{L}_{t-1}(a)}}{\sum_{a'} e^{-\eta_t \tilde{L}_{t-1}(a')}}$
  $\forall a: \tilde{p}_t(a) = \left(1 - \sum_{a'} \varepsilon_t(a')\right) p_t(a) + \varepsilon_t(a)$
  Sample $A_t$ according to $\tilde{p}_t$ and play it; observe and suffer $\ell_t^{A_t}$
  $\forall a: \tilde{\ell}_t^a = \dfrac{\ell_t^{A_t} \mathbf{1}\{A_t = a\}}{\tilde{p}_t(a)}$
  $\forall a: \tilde{L}_t(a) = \tilde{L}_{t-1}(a) + \tilde{\ell}_t^a$
end
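A runnable EXP3++ sketch in Python. The exploration floor follows the form $\varepsilon_t(a) = \min\{1/(2K),\ \beta_t,\ 18(\ln t)^2/(t\hat{\Delta}_t(a)^2)\}$ with $\beta_t = \frac{1}{2}\sqrt{\ln K/(tK)}$, which is an assumed concrete instantiation of the two control levers (the caps keep $\sum_a \varepsilon_t(a) \leq 1/2$):

```python
import numpy as np

def exp3pp(loss_fn, K, T, rng):
    """Minimal EXP3++ sketch: EXP3 plus per-arm exploration floors eps_t(a)
    driven by an empirical gap estimate."""
    L = np.zeros(K)                          # importance-weighted cumulative losses
    for t in range(1, T + 1):
        eta = 0.5 * np.sqrt(np.log(K) / (t * K))     # eta_t = beta_t
        gap = np.clip((L - L.min()) / t, 1e-12, 1.0)  # empirical gap estimate
        eps = np.minimum(np.full(K, 1.0 / (2 * K)),
                         np.minimum(eta, 18.0 * np.log(t) ** 2 / (t * gap ** 2)))
        w = np.exp(-eta * (L - L.min()))
        p = w / w.sum()
        pt = (1.0 - eps.sum()) * p + eps     # mixed playing distribution p_tilde_t
        a = rng.choice(K, p=pt)
        L[a] += loss_fn(t, a) / pt[a]        # importance-weighted update
    return L

rng = np.random.default_rng(0)
biases = np.array([0.4, 0.5, 0.5, 0.5, 0.5])
L = exp3pp(lambda t, a: float(rng.random() < biases[a]), len(biases), 100_000, rng)
print("estimated per-round losses:", np.round(L / 100_000, 3))
```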
Analysis: Adversarial Regime
EXP3++ plays the distribution
$$\tilde{p}_t(a) = \left(1 - \sum_{a'} \varepsilon_t(a')\right) p_t(a) + \varepsilon_t(a).$$
For $\varepsilon_t(a) = O\left(\sqrt{\frac{\ln K}{Kt}}\right)$ the regret in the adversarial regime satisfies
$$\mathbb{E}[R_T] = O\left(\sqrt{KT \ln K}\right)$$
($\mathbb{E}[R_T]$ is unaffected by such small $\varepsilon_t(a)$).
Analysis: Stochastic Regime
Properties of importance-weighted sampling:
• $\mathbb{E}\left[\tilde{\ell}_t^a\right] = \mathbb{E}[\ell_t^a] = \mu(a)$
• $t\mu(a) - \tilde{L}_t(a) = \sum_{s=1}^{t} \left(\mu(a) - \tilde{\ell}_s^a\right)$ is a martingale
• Instantaneous variance: $\mathbb{E}_t\left[\left(\mu(a) - \tilde{\ell}_t^a\right)^2\right] \leq \frac{1}{\tilde{p}_t(a)} \leq \frac{1}{\varepsilon_t(a)}$
• Cumulative variance over $t$ rounds: $\nu_t(a) \approx \frac{t}{\varepsilon_t(a)}$
The Fundamental Trade-off of the Algorithm (Stochastic Regime)

By Bernstein’s inequality, w.p. $\geq 1 - \frac{1}{t}$:
$$t\mu(a) - \tilde{L}_t(a) \leq \sqrt{2 \nu_t(a) \ln t} + \frac{\ln t}{3 \varepsilon_t} \approx \sqrt{\frac{2 t \ln t}{\varepsilon_t(a)}} + \frac{\ln t}{3 \varepsilon_t}$$

[Figure: $\tilde{L}_t(a_1)$ and $\tilde{L}_t(a_2)$ with their UCB/LCB envelopes around $t\mu(a_1)$ and $t\mu(a_2)$; the arms separate once the confidence envelopes stop overlapping.]

• For separation of the arms we need $\sqrt{\frac{2t \ln t}{\varepsilon_t(a)}} = O\left(t\Delta(a)\right)$, i.e., $\varepsilon_t(a) = \Omega\left(\frac{1}{t\Delta(a)^2}\right)$ (up to logarithmic factors)
• On the other hand, $N_t(a) \geq \sum_{s=1}^{t} \varepsilon_s(a)$, so keeping the regret logarithmic requires $\varepsilon_t(a) = O\left(\frac{1}{t\Delta(a)^2}\right)$ (again up to logarithmic factors)
• We take $\varepsilon_t(a) = \frac{18(\ln t)^2}{t \hat{\Delta}_t(a)^2}$ with an empirical gap estimate $\hat{\Delta}_t(a)$, and show that $\hat{\Delta}_t(a) \to \Delta(a)$
Main Results
• $\eta_t = \frac{1}{2}\sqrt{\frac{\ln K}{tK}}$ and $\varepsilon_t(a) = O\left(\sqrt{\frac{\ln K}{tK}}\right)$ $\Rightarrow$ $O\left(\sqrt{tK \ln K}\right)$ regret in the adversarial regime
• Let $\hat{\Delta}_t(a)$ be the empirical estimate of $\Delta(a)$. Then $\varepsilon_t(a) = \frac{18(\ln t)^2}{t\hat{\Delta}_t(a)^2}$ and $\eta_t \geq \frac{1}{2}\sqrt{\frac{\ln K}{tK}}$ $\Rightarrow$ $O\left(\frac{(\ln t)^3}{\Delta(a)}\right)$ regret in the stochastic regime, the moderately contaminated stochastic regime, and the adversarial regime with a gap
• $\eta_t = \frac{1}{2}\sqrt{\frac{\ln K}{tK}}$ and $\varepsilon_t(a) = \frac{18(\ln t)^2}{t\hat{\Delta}_t(a)^2}$ $\Rightarrow$ good for all four regimes
Experiments in the Stochastic Regime

[Figure: cumulative regret over up to $10^7$ rounds in six settings, $K \in \{2, 10, 100\}$ and $\Delta \in \{0.1, 0.01\}$, comparing UCB1, Thompson sampling, EXP3, and two EXP3++ variants ($\eta = \beta$ and $\eta = 1$).]
EXP3++ Summary
• EXP3++ is a simple and natural extension of EXP3
• Two control levers: $\eta_t$ and the $\varepsilon_t(a)$-s
• Almost optimal performance in both the stochastic and adversarial regimes
• “Logarithmic” regret in two new regimes:
  – the moderately contaminated stochastic regime
  – the adversarial regime with a gap
• In the stochastic regime, empirically comparable to UCB1

Punch line: EXP3++ is a powerful tool for exploiting gaps in a variety of regimes without compromising on the worst-case performance!
Other Popular Problems We Have Not Touched
• Linear bandits
• Combinatorial bandits
• Dueling bandits
• And many, many more variations…
Further Reading

Part 1
• Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006
• Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2012
• Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 2002
• Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 2002
• Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 2012

Part 2
• Wouter M. Koolen and Tim van Erven. Second-order quantile methods for experts and combinatorial games. COLT, 2015
• Haipeng Luo and Robert E. Schapire. Achieving all with no parameters: AdaNormalHedge. COLT, 2015
• Yevgeny Seldin and Aleksandrs Slivkins. One practical algorithm for both stochastic and adversarial bandits. ICML, 2014
• Yevgeny Seldin, Peter L. Bartlett, Koby Crammer, and Yasin Abbasi-Yadkori. Prediction with limited advice and multiarmed bandits with paid observations. ICML, 2014
• Satyen Kale. Multiarmed bandits with limited expert advice. COLT, 2014
• Yevgeny Seldin, Peter Auer, François Laviolette, John Shawe-Taylor, and Ronald Ortner. PAC-Bayesian analysis of contextual bandits. NIPS, 2011