Inferring bounds on the performance of a control policy from a sample of trajectories
28th Benelux Meeting – Spa, March 16th, 2009
Raphaël Fonteneau, Susan Murphy, Louis Wehenkel, Damien Ernst
Université de Liège – University of Michigan, Ann Arbor
Outline
> Introduction
> Approach
> Results
> Conclusion and Future Work
2009 IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (IEEE ADPRL 2009) conference proceedings published annually by IEEE
Introduction: General problem
> Discrete-time dynamical system: $x_{t+1} = f(x_t, u_t)$
> Reward function: $r_t = \rho(x_t, u_t)$
> Deterministic system dynamics $f$ and reward function $\rho$ unknown, replaced by ...
> ... collected data: (parts of) trajectories
$$\mathcal{F} = \left\{ \left( x^l, u^l, r^l = \rho(x^l, u^l), y^l = f(x^l, u^l) \right) \right\}_{l=1}^{n}$$
> Goal: finding a (near-optimal) control policy $h$ that maximises
$$J_T^h(x_0) = \sum_{t=0}^{T-1} \rho(x_t, h(t, x_t))$$
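To make the objective concrete, here is a minimal Python sketch of the T-stage return of a policy. It assumes f and ρ are known, which is precisely what the problem setting does not grant, so it is only useful for checking bounds on synthetic examples. The scalar system, reward and policy below are hypothetical and not taken from the slides.

```python
def finite_horizon_return(f, rho, h, x0, T):
    """Sum of rho(x_t, h(t, x_t)) for t = 0..T-1 along the trajectory from x0."""
    x, total = x0, 0.0
    for t in range(T):
        u = h(t, x)          # action chosen by the policy at time t
        total += rho(x, u)   # reward collected at time t
        x = f(x, u)          # deterministic transition
    return total


# Hypothetical 1-D example (assumed for illustration only)
f = lambda x, u: 0.5 * x + 0.5 * u
rho = lambda x, u: -abs(x)
h = lambda t, x: -0.25 * x

J = finite_horizon_return(f, rho, h, x0=1.0, T=2)
print(J)
```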
Introduction: Existing solutions
> Model Predictive Control, Dynamic Programming, Reinforcement Learning ...
But
> Limited amount of data in $\mathcal{F} = \{ (x^l, u^l, r^l, y^l) \}_{l=1}^{n}$
> High-dimensional / continuous spaces
> Need for approximation structures
> Guarantees on the performance of a computed policy $h$?
$$J_T^h(x_0) = \sum_{t=0}^{T-1} \rho(x_t, h(t, x_t))$$
Introduction: What we propose
> A tool for evaluating the quality of a policy
> Computation of a lower bound on the return of a given policy $h$, starting from a given initial state
> Use of the (assumed non-noisy) database without approximation
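Approach: Assumptions
> Deterministic framework, continuous spaces and functions, non-noisy database
> Lipschitz continuity of $f$, $\rho$ and $h$:
$$\exists L_f, L_\rho, L_h \in \mathbb{R}, \ \forall (x, x') \in X^2, \ (u, u') \in U^2,$$
$$\|f(x,u) - f(x',u')\| \le L_f \left( \|x - x'\| + \|u - u'\| \right)$$
$$|\rho(x,u) - \rho(x',u')| \le L_\rho \left( \|x - x'\| + \|u - u'\| \right)$$
$$\|h(t,x) - h(t,x')\| \le L_h \|x - x'\|, \quad 0 \le t \le T-1$$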
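As an illustration only, the Lipschitz constants can be read off directly for a scalar linear system with linear feedback. The system and policy below, with coefficients a, b and k, are hypothetical and not taken from the slides.

```latex
\begin{align*}
f(x,u) = a x + b u:\quad
  |f(x,u) - f(x',u')| &\le |a|\,|x - x'| + |b|\,|u - u'| \\
                      &\le \max(|a|,|b|)\,\bigl(|x - x'| + |u - u'|\bigr),
  \quad\text{so } L_f = \max(|a|,|b|), \\
h(t,x) = k x:\quad
  |h(t,x) - h(t,x')| &= |k|\,|x - x'|,
  \quad\text{so } L_h = |k|.
\end{align*}
```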
Approach: General idea
> Using a sequence of $T$ tuples to compute an exact lower bound on
$$J_T^h(x_0) = \sum_{t=0}^{T-1} \rho(x_t, h(t, x_t))$$
using the exact rewards and the exact dynamics given by the tuples
> Maximizing this lower bound by choosing the best sequence of tuples
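Approach: An illustration
[Figure: the trajectory $x_0$, $x_1 = f(x_0, h(0, x_0))$, $x_2$, ..., $x_{T-1}$, $x_T$ under policy $h$, with first reward $r_0 = \rho(x_0, h(0, x_0))$, drawn alongside a sequence of tuples $(x^{l_0}, u^{l_0}, r^{l_0}, y^{l_0})$, $(x^{l_1}, u^{l_1}, r^{l_1}, y^{l_1})$, ..., $(x^{l_{T-1}}, u^{l_{T-1}}, r^{l_{T-1}}, y^{l_{T-1}})$. The distance between the first tuple and the start of the trajectory is $\|x^{l_0} - x_0\| + \|u^{l_0} - h(0, x_0)\|$; the distance between the first and second tuples is $\|x^{l_1} - y^{l_0}\| + \|u^{l_1} - h(1, y^{l_0})\|$.]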
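Results: Theorem (lower bound associated with a sequence of tuples $\tau$)
> A computable lower bound on $J_T^h(x_0)$:
$$B^h(\tau, x_0) = \sum_{t=0}^{T-1} \left[ r^{l_t} - L_{Q_{T-t}} \left( \|x^{l_t} - y^{l_{t-1}}\| + \|u^{l_t} - h(t, y^{l_{t-1}})\| \right) \right]$$
with
$$y^{l_{-1}} = x_0, \qquad L_{Q_N} = L_\rho \sum_{t=0}^{N-1} \left[ L_f (1 + L_h) \right]^t$$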
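The bound for one given sequence of tuples can be evaluated in a single pass. The Python sketch below assumes scalar states and actions (so norms reduce to absolute values) and Lipschitz constants that are known in advance. The function names, the policy and the single tuple in the usage lines are hypothetical.

```python
def lipschitz_LQ(N, L_f, L_rho, L_h):
    """L_{Q_N} = L_rho * sum_{t=0}^{N-1} [L_f * (1 + L_h)]**t."""
    return L_rho * sum((L_f * (1.0 + L_h)) ** t for t in range(N))


def sequence_lower_bound(tuples, h, x0, L_f, L_rho, L_h):
    """Bound for a sequence of T scalar tuples (x, u, r, y), starting from x0."""
    T = len(tuples)
    y_prev, bound = x0, 0.0   # convention: y^{l_{-1}} = x0
    for t, (x, u, r, y) in enumerate(tuples):
        # distance between tuple t and the end point of the previous tuple
        dist = abs(x - y_prev) + abs(u - h(t, y_prev))
        bound += r - lipschitz_LQ(T - t, L_f, L_rho, L_h) * dist
        y_prev = y
    return bound


# Hypothetical usage: one tuple consistent with f(x,u) = 0.5x + 0.5u, rho(x,u) = -|x|
h = lambda t, x: -0.25 * x
tau = [(1.5, -0.25, -1.5, 0.625)]
B = sequence_lower_bound(tau, h, x0=1.0, L_f=0.5, L_rho=1.0, L_h=0.25)
print(B)
```

For this hypothetical system the true one-step return from x0 = 1 is -1, and the bound, which pays a penalty for the tuple starting 0.5 away from x0, stays below it.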
Results: Maximizing the lower bound $B^h(\tau, x_0)$
> Maximizing over the set $\mathcal{F}^T$ of all possible sequences of tuples:
$$B^*_{\mathcal{F}^T}(x_0) = \max_{\tau \in \mathcal{F}^T} B^h(\tau, x_0) \quad \text{with} \quad J_T^h(x_0) \ge B^*_{\mathcal{F}^T}(x_0)$$
> Exhaustive search quickly becomes prohibitive
> Viterbi-like algorithm: $O(T n^2)$
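Because each term of the bound depends only on two consecutive tuples, the maximization can be done by dynamic programming over the time index. The following is a sketch of one such Viterbi-like recursion, not necessarily the authors' implementation. It assumes scalar states and actions, and all names and the three-tuple database are hypothetical.

```python
def lq(N, L_f, L_rho, L_h):
    """L_{Q_N} = L_rho * sum_{t=0}^{N-1} [L_f * (1 + L_h)]**t."""
    return L_rho * sum((L_f * (1.0 + L_h)) ** t for t in range(N))


def best_lower_bound(F, h, x0, T, L_f, L_rho, L_h):
    """Maximize the bound over all length-T sequences of tuples (x, u, r, y) in F.

    Returns (best bound, list of tuple indices). Cost: O(T * n^2).
    """
    n = len(F)
    coef = [lq(T - t, L_f, L_rho, L_h) for t in range(T)]  # L_{Q_{T-t}}

    def term(t, y_prev, l):
        x, u, r, _ = F[l]
        return r - coef[t] * (abs(x - y_prev) + abs(u - h(t, y_prev)))

    # value[l]: best partial bound over sequences of length t+1 ending with tuple l
    value = [term(0, x0, l) for l in range(n)]
    parents = []
    for t in range(1, T):
        new_value, parent = [], []
        for l in range(n):
            scores = [value[j] + term(t, F[j][3], l) for j in range(n)]
            j_best = max(range(n), key=scores.__getitem__)
            new_value.append(scores[j_best])
            parent.append(j_best)
        value = new_value
        parents.append(parent)

    # backtrack from the best final tuple to recover the whole sequence
    last = max(range(n), key=value.__getitem__)
    best, seq = value[last], [last]
    for parent in reversed(parents):
        seq.append(parent[seq[-1]])
    seq.reverse()
    return best, seq


# Hypothetical database consistent with f(x,u) = 0.5x + 0.5u, rho(x,u) = -|x|
h = lambda t, x: -0.25 * x
F = [(1.0, -0.25, -1.0, 0.375), (0.375, -0.09375, -0.375, 0.140625), (0.0, 0.0, 0.0, 0.0)]
best, seq = best_lower_bound(F, h, x0=1.0, T=2, L_f=0.5, L_rho=1.0, L_h=0.25)
print(best, seq)
```

In this hypothetical database the first two tuples lie exactly on the trajectory of the policy from x0 = 1, so the recursion selects them and the bound coincides with the true return of -1.375.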
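Results: Theorem (tightness of the computed lower bound $B^*_{\mathcal{F}^T}(x_0)$)
> Hypothesis on the density of the database (with $X$ and $U$ bounded):
$$\exists \alpha \in \mathbb{R}: \ \sup_{(x,u) \in X \times U} \left\{ \min_{l \in \{1, \dots, n\}} \|x - x^l\| + \|u - u^l\| \right\} \le \alpha$$
"For each couple (x,u), the nearest tuple is not farther than α"
$$\exists C \in \mathbb{R}: \ J_T^h(x_0) - B^*_{\mathcal{F}^T}(x_0) \le C \alpha$$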
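The density hypothesis can be probed numerically. The sketch below approximates the supremum by a maximum over a finite grid, so it can only underestimate the true α. It assumes scalar states and actions, and the function name, database and grid are hypothetical.

```python
def database_dispersion(F, x_grid, u_grid):
    """Largest distance from a grid point (x, u) to its nearest tuple in F.

    F holds scalar tuples (x, u, r, y). The grid maximum is only an
    approximation (from below) of the supremum over X x U.
    """
    return max(
        min(abs(x - xl) + abs(u - ul) for (xl, ul, _, _) in F)
        for x in x_grid
        for u in u_grid
    )


# Hypothetical database with two tuples on X = U = [0, 1]
F = [(0.0, 0.0, 0.0, 0.0), (1.0, 1.0, -1.0, 1.0)]
alpha = database_dispersion(F, x_grid=[0.0, 0.5, 1.0], u_grid=[0.0, 0.5, 1.0])
print(alpha)
```

Adding tuples can only decrease this quantity, which is the sense in which a denser database tightens the bound.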
Results: Toy example: 1-dimensional linear system
> From 100 to 40 000 tuples in $\mathcal{F}$
Conclusion and Future Work
> An approach for computing an exact lower bound
> Simple algorithm
> Linear relationship with the database density
To improve
> Strong correlation with the Lipschitz constants
> The lower bound can be very low (!)
> Extension to a stochastic framework
> Developing new algorithms
Thank you!