

Inferring bounds on the performance of a control policy from a sample of trajectories
28th Benelux Meeting – Spa, March 16th, 2009

Raphaël Fonteneau, Susan Murphy, Louis Wehenkel, Damien Ernst

Université de Liège – University of Michigan, Ann Arbor


Outline
> Introduction
> Approach
> Results
> Conclusion and Future Work

2009 IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (IEEE ADPRL 2009), conference proceedings published annually by IEEE


Introduction
General problem
> Discrete-time system dynamics $x_{t+1} = f(x_t, u_t)$
> Reward function $r_t = \rho(x_t, u_t)$
> Deterministic system dynamics $f$ and reward function $\rho$ unknown, replaced by ...
> ... Collected data: (parts of) trajectories

$$F = \{(x^l, u^l, r^l = \rho(x^l, u^l), y^l = f(x^l, u^l))\}_{l=1}^{n}$$

> Goal: finding a (near-optimal) control policy $h$ that maximises

$$J_T^h(x_0) = \sum_{t=0}^{T-1} \rho(x_t, h(t, x_t))$$
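For intuition, if $f$, $\rho$ and $h$ were all available in closed form, the return $J_T^h(x_0)$ could be computed by a plain roll-out. The sketch below shows this under that assumption; the function name and the scalar example system are hypothetical and not taken from the slides. In the setting of the talk, $f$ and $\rho$ are unknown, so this direct computation is exactly what is not available.

```python
def finite_horizon_return(f, rho, h, x0, T):
    """Sum of rho(x_t, h(t, x_t)) for t = 0..T-1 along the closed-loop trajectory."""
    x, total = x0, 0.0
    for t in range(T):
        u = h(t, x)          # action chosen by the policy at time t
        total += rho(x, u)   # reward collected at time t
        x = f(x, u)          # deterministic transition to x_{t+1}
    return total


# Hypothetical scalar example (assumed for illustration only).
f = lambda x, u: 0.5 * x + u
rho = lambda x, u: -abs(x)
h = lambda t, x: 0.0
print(finite_horizon_return(f, rho, h, x0=1.0, T=2))
```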


Introduction
Existing solutions
> Model Predictive Control, Dynamic Programming, Reinforcement Learning ...
But
> Limited amount of data in $F = \{(x^l, u^l, r^l, y^l)\}_{l=1}^{n}$
> High-dimensional / continuous spaces
> Need for approximation structures
> Guarantees on the performance of a computed policy $h$?

$$J_T^h(x_0) = \sum_{t=0}^{T-1} \rho(x_t, h(t, x_t))$$


Introduction
What we propose
> A tool for evaluating the quality of a policy
> Computation of a lower bound on the return of a given policy $h$, starting from a given initial state
> Use of the (assumed non-noisy) database without approximation


Approach
Assumptions
> Deterministic framework, continuous spaces and functions, non-noisy database
> Lipschitz continuity of $f$, $\rho$ and $h$

$$L_f, L_\rho, L_h \in \mathbb{R}, \quad \forall (x, x') \in X^2, \ (u, u') \in U^2,$$

$$\|f(x, u) - f(x', u')\| \le L_f \left( \|x - x'\| + \|u - u'\| \right)$$

$$|\rho(x, u) - \rho(x', u')| \le L_\rho \left( \|x - x'\| + \|u - u'\| \right)$$

$$\|h(t, x) - h(t, x')\| \le L_h \|x - x'\|, \quad 0 \le t \le T-1$$
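As a worked illustration of these assumptions, consider a hypothetical scalar linear system $f(x, u) = a x + b u$ with a linear policy $h(t, x) = k x$. The constants $a$, $b$, $k$ are assumed here for the example and do not come from the slides. Valid Lipschitz constants can then be read off directly:

```latex
\begin{align*}
|f(x,u) - f(x',u')| &= |a(x - x') + b(u - u')| \\
  &\le |a|\,|x - x'| + |b|\,|u - u'| \\
  &\le \max(|a|, |b|)\,\bigl(|x - x'| + |u - u'|\bigr), \\
|h(t,x) - h(t,x')| &= |k|\,|x - x'| .
\end{align*}
```

So $L_f = \max(|a|, |b|)$ and $L_h = |k|$ satisfy the conditions for this example.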

Approach
General idea
> Using a sequence of $T$ tuples to compute an exact lower bound on

$$J_T^h(x_0) = \sum_{t=0}^{T-1} \rho(x_t, h(t, x_t))$$

using the exact rewards and the exact dynamics given by the tuples
> Maximizing this lower bound by choosing the best sequence of tuples

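Approach
An illustration
[Figure: the trajectory $x_0$, $x_1 = f(x_0, h(0, x_0))$, $x_2$, ..., $x_{T-2}$, $x_{T-1}$, $x_T$ under policy $h$, with first reward $r_0 = \rho(x_0, h(0, x_0))$, shown against a sequence of tuples $(x^{l_0}, u^{l_0}, r^{l_0}, y^{l_0})$, $(x^{l_1}, u^{l_1}, r^{l_1}, y^{l_1})$, ..., $(x^{l_{T-2}}, u^{l_{T-2}}, r^{l_{T-2}}, y^{l_{T-2}})$, $(x^{l_{T-1}}, u^{l_{T-1}}, r^{l_{T-1}}, y^{l_{T-1}})$.]

Distance at step 0: $\|x^{l_0} - x_0\| + \|u^{l_0} - h(0, x_0)\|$
Distance at step 1: $\|y^{l_0} - x^{l_1}\| + \|u^{l_1} - h(1, y^{l_0})\|$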




Results
Theorem: Lower bound associated with a sequence of tuples $\tau$
> A computable lower bound on $J_T^h(x_0)$:

$$B^h(\tau, x_0) = \sum_{t=0}^{T-1} \left[ r^{l_t} - L_{Q_{T-t}} \left( \|x^{l_t} - y^{l_{t-1}}\| + \|u^{l_t} - h(t, y^{l_{t-1}})\| \right) \right]$$

with

$$y^{l_{-1}} = x_0$$

$$L_{Q_N} = L_\rho \sum_{t=0}^{N-1} \left[ L_f (1 + L_h) \right]^t$$
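The bound in the theorem can be evaluated directly from a sequence of tuples. Below is a minimal Python sketch, assuming scalar states and actions so that the norm is an absolute value; the function names are hypothetical and the tuple layout `(x, u, r, y)` follows the database definition.

```python
def lipschitz_q(N, L_f, L_rho, L_h):
    """L_{Q_N} = L_rho * sum_{t=0}^{N-1} [L_f * (1 + L_h)]^t."""
    return L_rho * sum((L_f * (1.0 + L_h)) ** t for t in range(N))


def lower_bound(tau, x0, h, L_f, L_rho, L_h):
    """B^h(tau, x0) for a sequence tau of T tuples (x, u, r, y); scalar case assumed."""
    T = len(tau)
    y_prev = x0  # convention y^{l_{-1}} = x0
    total = 0.0
    for t, (x, u, r, y) in enumerate(tau):
        # Distance between the tuple and the point reached from the previous tuple.
        dist = abs(x - y_prev) + abs(u - h(t, y_prev))
        total += r - lipschitz_q(T - t, L_f, L_rho, L_h) * dist
        y_prev = y
    return total
```

When every tuple lies exactly on the trajectory followed by $h$ from $x_0$, all distances are zero and the bound equals the return itself.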


Results
Maximizing the lower bound $B^h(\tau, x_0)$
> Maximizing over the set of all possible sequences of tuples $F^T$

$$B^{*}_{F^T}(x_0) = \max_{\tau \in F^T} B^h(\tau, x_0)$$

with

$$J_T^h(x_0) \ge B^{*}_{F^T}(x_0)$$

> Exhaustive search quickly becomes prohibitive
> Viterbi-like algorithm, $O(T n^2)$
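The slides do not spell out the Viterbi-like algorithm. One plausible reading is a dynamic program over (time step, last tuple): since each term of the bound depends only on two consecutive tuples, the best score ending in each tuple can be updated step by step in $O(T n^2)$ operations. The sketch below follows that reading, assuming scalar states and actions; the function name and return format are hypothetical.

```python
def max_lower_bound(F, x0, h, T, L_f, L_rho, L_h):
    """Maximise the bound over sequences of T tuples (x, u, r, y) from F; T >= 1, F non-empty."""
    n = len(F)
    # L_Q[N] = L_rho * sum_{k<N} [L_f (1 + L_h)]^k, precomputed for N = 0..T.
    L_Q = [L_rho * sum((L_f * (1.0 + L_h)) ** k for k in range(N)) for N in range(T + 1)]

    def gain(y_prev, t, tup):
        x, u, r, _ = tup
        return r - L_Q[T - t] * (abs(x - y_prev) + abs(u - h(t, y_prev)))

    best = [gain(x0, 0, tup) for tup in F]  # best score of a length-1 sequence ending in each tuple
    back = []                               # back-pointers for recovering the best sequence
    for t in range(1, T):
        new_best, ptr = [], []
        for tup in F:
            scores = [best[i] + gain(F[i][3], t, tup) for i in range(n)]
            i_star = max(range(n), key=scores.__getitem__)
            new_best.append(scores[i_star])
            ptr.append(i_star)
        best = new_best
        back.append(ptr)

    j = max(range(n), key=best.__getitem__)
    value, seq = best[j], [j]
    for ptr in reversed(back):
        j = ptr[j]
        seq.append(j)
    seq.reverse()
    return value, seq  # maximal bound and the indices (l_0, ..., l_{T-1}) achieving it
```

Tuples may be reused at several time steps, which matches maximising over $F^T$.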

Results
Theorem: Tightness of the computed lower bound $B^{*}_{F^T}(x_0)$

> Hypothesis on the density of the database (with $X$ and $U$ bounded)

$$\exists \alpha \in \mathbb{R} : \sup_{(x, u) \in X \times U} \left\{ \min_{l \in \{1, \ldots, n\}} \|x - x^l\| + \|u - u^l\| \right\} \le \alpha$$

"For each couple $(x, u)$, the nearest tuple is not farther than $\alpha$"

$$\exists C \in \mathbb{R} : J_T^h(x_0) - B^{*}_{F^T}(x_0) \le C \alpha$$
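To make the density hypothesis concrete, the supremum can be approximated by taking the maximum over a finite grid of state-action pairs. The sketch below does this for scalar states and actions; the function name and the grids are assumptions made for illustration, and a grid maximum only gives a lower estimate of the true $\alpha$.

```python
def dispersion_on_grid(F, x_grid, u_grid):
    """Largest distance from a grid point (x, u) to its nearest tuple (x^l, u^l, r^l, y^l) in F."""
    return max(
        min(abs(x - xl) + abs(u - ul) for (xl, ul, _, _) in F)
        for x in x_grid
        for u in u_grid
    )
```

A denser database lowers this value, and by the theorem the gap between the return and the bound shrinks at most linearly with it.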

Results
Toy example: 1-dimensional linear system
> from 100 to 40 000 tuples in $F$


Conclusion and Future Work
> An approach for computing an exact lower bound
> Simple algorithm
> Linear relationship with the database density
To improve
> Strong correlation with the Lipschitz constants
> The lower bound can be very low (!)
> Extension to a stochastic framework
> Developing new algorithms

Thank you!
