Inferring bounds on the performance of a control policy from a sample of trajectories
University of Michigan, December 18th, 2008

Raphaël Fonteneau, Susan Murphy, Louis Wehenkel, Damien Ernst

Université de Liège (Belgium) – University of Michigan


Outline
  > Introduction
  > Formalization
  > Approach
  > Results
  > Conclusion and Future Work

2009 IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL 2009), conference proceedings published annually by IEEE

Introduction 

1/4

General problem
  > System
  > Collected data: (parts of) trajectories $(x_0, u_0, r_0, x_1, u_1, r_1, \ldots)$
  > Dynamics of the system unknown

Introduction

2/4

General problem
  > Goal: finding a (near-optimal) control policy
  > How: many solutions
  > Approximation structures

Introduction

3/4

General problem
  > Approximation structures bring uncertainties
  > Guarantees on the performance of a policy?

Introduction

4/4

What we propose
  > Compute a lower bound on the return of a given policy, starting from a given initial state...
  > ... using the (assumed non-noisy) database without approximation

Formalization

1/3

  > Finite horizon T
  > Continuous state and action spaces X, U:
      $\forall t \in \{0, \ldots, T-1\}$, $x_t \in X$, $u_t \in U$
  > Deterministic system dynamics $f : X \times U \to X$:
      $\forall t \in \{0, \ldots, T-1\}$, $x_{t+1} = f(x_t, u_t)$
  > Deterministic reward function $\rho : X \times U \to \mathbb{R}$

Formalization

2/3

  > Data: a non-noisy set of n tuples
      $F = \{(x^l, u^l, r^l, y^l)\}_{l=1}^{n}$, with $y^l = f(x^l, u^l)$ and $r^l = \rho(x^l, u^l)$
  > We assume that a deterministic policy h has already been computed:
      $h : \{0, \ldots, T-1\} \times X \to U$
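A minimal sketch of how such a database could be represented, assuming a scalar state and action for brevity. The toy dynamics `f`, the toy reward `rho`, the `Transition` type and the `collect` helper are all hypothetical names chosen for illustration; they are not part of the talk.

```python
from typing import Callable, Iterable, List, NamedTuple, Tuple


class Transition(NamedTuple):
    """One tuple (x^l, u^l, r^l, y^l) of the database F."""
    x: float  # state
    u: float  # action
    r: float  # observed reward, r = rho(x, u)
    y: float  # observed successor state, y = f(x, u)


def collect(f: Callable[[float, float], float],
            rho: Callable[[float, float], float],
            pairs: Iterable[Tuple[float, float]]) -> List[Transition]:
    """Build a non-noisy database by querying the (toy) system once per pair."""
    return [Transition(x, u, rho(x, u), f(x, u)) for x, u in pairs]


# Hypothetical toy system, used only to generate example data.
f = lambda x, u: x + u
rho = lambda x, u: -abs(x)

pairs = [(i / 2, j / 2) for i in range(-2, 3) for j in range(-1, 2)]
F = collect(f, rho, pairs)
```

In the setting of the talk, f and ρ are unknown and F is simply what has been observed; the generator above only stands in for the real system.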

Formalization

3/3

Performance of h
  > T-stage return of the policy h, starting from $x_0$:
      $J_T^h(x_0) = \sum_{t=0}^{T-1} \rho(x_t, h(t, x_t))$
    with $\forall t \in \{0, \ldots, T-1\}$, $x_{t+1} = f(x_t, u_t)$ and $u_t = h(t, x_t)$
  > Knowledge of f and ρ is needed to compute $J_T^h(x_0)$ exactly
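When f and ρ are available, the T-stage return is a plain forward simulation. The sketch below illustrates this under the assumption of a scalar toy system; `f`, `rho`, `h` and `t_stage_return` are hypothetical names introduced here for illustration.

```python
from typing import Callable


def t_stage_return(f: Callable[[float, float], float],
                   rho: Callable[[float, float], float],
                   h: Callable[[int, float], float],
                   x0: float,
                   T: int) -> float:
    """J_T^h(x0): sum of rho(x_t, h(t, x_t)) for t = 0..T-1 along the trajectory."""
    x, total = x0, 0.0
    for t in range(T):
        u = h(t, x)          # action chosen by the policy at time t
        total += rho(x, u)   # reward collected at time t
        x = f(x, u)          # deterministic transition to x_{t+1}
    return total


# Hypothetical toy system and policy (assumptions, not from the talk).
f = lambda x, u: x + u
rho = lambda x, u: -abs(x)
h = lambda t, x: -0.5 * x

J = t_stage_return(f, rho, h, x0=1.0, T=3)  # rewards -1, -0.5, -0.25
```

This is exactly the computation that becomes unavailable once f and ρ are unknown, which is what motivates the lower bound.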

Approach

1/10

What is the point?
  > When f and ρ are known: OK
  > BUT most of the time, there is no (exact) information about f and ρ

Approach

2/10

What is exact?
  > Tuples are assumed to be non-noisy (e.g., deterministic)
  > They give exact values of
      - the system dynamics: $y^l = f(x^l, u^l)$
      - the reward signals: $r^l = \rho(x^l, u^l)$

Approach

3/10

General idea
  > Using a sequence of T tuples to compute an exact lower bound on
      $J_T^h(x_0) = \sum_{t=0}^{T-1} \rho(x_t, h(t, x_t))$
    using the exact rewards and the exact dynamics given by the tuples
  > Maximizing this lower bound by choosing the best sequence of tuples

Approach

4/10

Assumptions
  > (Deterministic framework)
  > Lipschitz continuity of f, ρ and h:
      $L_f, L_\rho, L_h \in \mathbb{R}$, $\forall (x, x') \in X^2$, $(u, u') \in U^2$, $0 \le t \le T-1$:
      $\|f(x, u) - f(x', u')\| \le L_f \left( \|x - x'\| + \|u - u'\| \right)$
      $|\rho(x, u) - \rho(x', u')| \le L_\rho \left( \|x - x'\| + \|u - u'\| \right)$
      $\|h(t, x) - h(t, x')\| \le L_h \|x - x'\|$

Approach

5/10

Definition
  > State-action value functions $Q_N^h : X \times U \to \mathbb{R}$:
      $Q_N^h(x, u) = \rho(x, u) + \sum_{t=T-N+1}^{T-1} \rho(x_t, h(t, x_t))$
    with $x_{T-N+1} = f(x, u)$
  > We have
      $J_T^h(x_0) = Q_T^h(x_0, h(0, x_0))$

Approach

6/10

Properties
  > Recursion:
      $Q_N^h(x, u) = \rho(x, u) + Q_{N-1}^h\left( f(x, u), h(T-N+1, f(x, u)) \right)$
  > Good news: all these functions are Lipschitz continuous, with constants
      $L_{Q_N} = L_\rho \sum_{t=0}^{N-1} \left[ L_f (1 + L_h) \right]^t$
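As a small illustration, the Lipschitz constants of the state-action value functions can be evaluated directly from the geometric sum. The function name `lipschitz_q` and the numerical values below are assumptions made for the example.

```python
def lipschitz_q(N: int, L_f: float, L_rho: float, L_h: float) -> float:
    """L_{Q_N} = L_rho * sum_{t=0}^{N-1} [L_f * (1 + L_h)]^t."""
    ratio = L_f * (1.0 + L_h)
    return L_rho * sum(ratio ** t for t in range(N))


# Example with hypothetical constants L_f = 1, L_rho = 1, L_h = 0.5.
constants = [lipschitz_q(N, 1.0, 1.0, 0.5) for N in (1, 2, 3)]
```

Note that the constant grows geometrically with N whenever $L_f (1 + L_h) > 1$, so the penalty attached to early time steps (large N) dominates.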

Approach

7/10

Computing a lower bound from a sequence of tuples
  > Sequence of tuples:
      $\left[ (x^{l_t}, u^{l_t}, r^{l_t}, y^{l_t}) \right]_{t=0}^{T-1}$
  > Using recursion (t = T-N):
      $Q_{T-t}^h(x^{l_t}, u^{l_t}) = \rho(x^{l_t}, u^{l_t}) + Q_{T-t-1}^h\left( f(x^{l_t}, u^{l_t}), h(t+1, f(x^{l_t}, u^{l_t})) \right)$
  > Non-noisy database:
      $f(x^{l_t}, u^{l_t}) = y^{l_t}$
      $\rho(x^{l_t}, u^{l_t}) = r^{l_t}$

Approach

8/10

  > Thus
      $Q_{T-t}^h(x^{l_t}, u^{l_t}) = r^{l_t} + Q_{T-t-1}^h\left( y^{l_t}, h(t+1, y^{l_t}) \right)$
  > Connection between tuples using Lipschitz continuity:
      $Q_{T-t-1}^h\left( y^{l_t}, h(t+1, y^{l_t}) \right) \ge Q_{T-t-1}^h(x^{l_{t+1}}, u^{l_{t+1}}) - L_{Q_{T-t-1}} \delta_{t+1}$
    with
      $\delta_{t+1} = \|x^{l_{t+1}} - y^{l_t}\| + \|u^{l_{t+1}} - h(t+1, y^{l_t})\|$

Approach

9/10

Link between two tuples (with N = T-t)
      $Q_N^h(x^{l_t}, u^{l_t}) \ge r^{l_t} + Q_{N-1}^h(x^{l_{t+1}}, u^{l_{t+1}}) - L_{Q_{T-t-1}} \delta_{t+1}$
    with
      $\delta_{t+1} = \|x^{l_{t+1}} - y^{l_t}\| + \|u^{l_{t+1}} - h(t+1, y^{l_t})\|$
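The link between two tuples can be chained over the whole sequence. The derivation below is a sketch of that chaining; the symbol $\delta_t$ denotes the gap between tuple $l_t$ and the point reached from tuple $l_{t-1}$, with the convention $y^{l_{-1}} = x_0$, and the name $\delta$ is an assumption of this sketch.

```latex
\begin{align*}
\delta_t &= \|x^{l_t} - y^{l_{t-1}}\| + \|u^{l_t} - h(t, y^{l_{t-1}})\|,
  \qquad y^{l_{-1}} = x_0 \\
J_T^h(x_0) &= Q_T^h\bigl(x_0, h(0, x_0)\bigr) \\
  &\ge Q_T^h(x^{l_0}, u^{l_0}) - L_{Q_T}\,\delta_0
     && \text{(Lipschitz continuity of } Q_T^h\text{)} \\
  &\ge r^{l_0} - L_{Q_T}\,\delta_0
     + Q_{T-1}^h(x^{l_1}, u^{l_1}) - L_{Q_{T-1}}\,\delta_1
     && \text{(link between tuples } l_0, l_1\text{)} \\
  &\;\;\vdots \\
  &\ge \sum_{t=0}^{T-1} \bigl[\, r^{l_t} - L_{Q_{T-t}}\,\delta_t \,\bigr]
     && \text{(since } Q_1^h(x^{l_{T-1}}, u^{l_{T-1}}) = r^{l_{T-1}}\text{)}
\end{align*}
```

Each step trades one unknown value of $Q^h$ for a known reward minus a distance penalty, which is why the final expression depends only on the database, the policy h and the Lipschitz constants.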

Approach

10/10

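An illustration

[Figure: the trajectory $x_0$, $x_1 = f(x_0, h(0, x_0))$, $x_2$, ..., $x_{T-2}$, $x_{T-1}$, $x_T$ followed under the policy h, with first reward $r_0 = \rho(x_0, h(0, x_0))$, drawn next to the chained tuples $(x^{l_0}, u^{l_0}, r^{l_0}, y^{l_0})$, $(x^{l_1}, u^{l_1}, r^{l_1}, y^{l_1})$, ..., $(x^{l_{T-2}}, u^{l_{T-2}}, r^{l_{T-2}}, y^{l_{T-2}})$, $(x^{l_{T-1}}, u^{l_{T-1}}, r^{l_{T-1}}, y^{l_{T-1}})$. The gaps between consecutive elements are:]

      $\delta_0 = \|x^{l_0} - x_0\| + \|u^{l_0} - h(0, x_0)\|$
      $\delta_1 = \|x^{l_1} - y^{l_0}\| + \|u^{l_1} - h(1, y^{l_0})\|$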



Results 

1/5

Lower bound associated with a sequence of tuples τ
  > A computable lower bound on $J_T^h(x_0)$:
      $B^h(\tau, x_0) = \sum_{t=0}^{T-1} \left[ r^{l_t} - L_{Q_{T-t}} \left( \|x^{l_t} - y^{l_{t-1}}\| + \|u^{l_t} - h(t, y^{l_{t-1}})\| \right) \right]$
    with $y^{l_{-1}} = x_0$
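A minimal sketch of how the bound of one given sequence of tuples could be evaluated. States and actions are represented as tuples of floats and the Euclidean norm (`math.dist`) is used; the talk does not fix a norm, so that choice is an assumption, as are the names `sequence_bound` and `lipschitz_q` and the toy data.

```python
import math
from typing import Callable, Sequence, Tuple

Vec = Tuple[float, ...]
Transition = Tuple[Vec, Vec, float, Vec]  # (x, u, r, y)


def lipschitz_q(N: int, L_f: float, L_rho: float, L_h: float) -> float:
    """L_{Q_N} = L_rho * sum_{t=0}^{N-1} [L_f * (1 + L_h)]^t."""
    ratio = L_f * (1.0 + L_h)
    return L_rho * sum(ratio ** t for t in range(N))


def sequence_bound(seq: Sequence[Transition], x0: Vec,
                   h: Callable[[int, Vec], Vec],
                   L_f: float, L_rho: float, L_h: float) -> float:
    """B^h(tau, x0) for the sequence tau = seq of length T."""
    T = len(seq)
    y_prev = x0  # convention y^{l_{-1}} = x0
    total = 0.0
    for t, (x, u, r, y) in enumerate(seq):
        # Gap between this tuple and the point reached from the previous one.
        gap = math.dist(x, y_prev) + math.dist(u, h(t, y_prev))
        total += r - lipschitz_q(T - t, L_f, L_rho, L_h) * gap
        y_prev = y
    return total


# Hypothetical toy data: f(x, u) = x + u, rho(x, u) = -|x|, h(t, x) = -x / 2.
h = lambda t, x: (-0.5 * x[0],)
on_path = [((1.0,), (-0.5,), -1.0, (0.5,)),
           ((0.5,), (-0.25,), -0.5, (0.25,)),
           ((0.25,), (-0.125,), -0.25, (0.125,))]
off_path = [((1.5,), (-0.5,), -1.5, (1.0,))] + on_path[1:]

b_on = sequence_bound(on_path, (1.0,), h, 1.0, 1.0, 0.5)
b_off = sequence_bound(off_path, (1.0,), h, 1.0, 1.0, 0.5)
```

When the tuples lie exactly on the trajectory followed under h (`on_path`), every gap is zero and the bound coincides with the true return; any mismatch (`off_path`) is penalized by the corresponding Lipschitz constant.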

Results 

2/5

Maximizing the lower bound $B^h(\tau, x_0)$
  > Maximizing over the set of all possible sequences of tuples $F^T$:
      $B^{*}_{F^T}(x_0) = \max_{\tau \in F^T} B^h(\tau, x_0)$ and $J_T^h(x_0) \ge B^{*}_{F^T}(x_0)$
  > Viterbi-like algorithm
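Because the penalty at step t depends only on the tuples chosen at steps t-1 and t, the maximization over all sequences can be organized as a dynamic program over time steps, in the spirit of the Viterbi algorithm. The following is a sketch under stated assumptions, not the authors' implementation: vectors are tuples of floats, the norm is Euclidean (`math.dist`), and `max_lower_bound`, `lipschitz_q` and the toy data are hypothetical names. Its cost is on the order of T·n² distance evaluations for n tuples.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vec = Tuple[float, ...]
Transition = Tuple[Vec, Vec, float, Vec]  # (x, u, r, y)


def lipschitz_q(N: int, L_f: float, L_rho: float, L_h: float) -> float:
    """L_{Q_N} = L_rho * sum_{t=0}^{N-1} [L_f * (1 + L_h)]^t."""
    ratio = L_f * (1.0 + L_h)
    return L_rho * sum(ratio ** t for t in range(N))


def max_lower_bound(data: Sequence[Transition], x0: Vec,
                    h: Callable[[int, Vec], Vec], T: int,
                    L_f: float, L_rho: float, L_h: float
                    ) -> Tuple[float, List[int]]:
    """Return the maximal bound over sequences of T tuples and one maximizer."""
    n = len(data)
    # Stage 0: the previous "successor state" is x0 itself.
    L = lipschitz_q(T, L_f, L_rho, L_h)
    u0 = h(0, x0)
    values = [r - L * (math.dist(x, x0) + math.dist(u, u0))
              for x, u, r, _ in data]
    back: List[List[int]] = []
    for t in range(1, T):
        L = lipschitz_q(T - t, L_f, L_rho, L_h)
        # Point (y, h(t, y)) reached after each candidate predecessor tuple.
        targets = [(y, h(t, y)) for _, _, _, y in data]
        new_values, pointers = [], []
        for x, u, r, _ in data:
            scores = [values[i] - L * (math.dist(x, yi) + math.dist(u, ui))
                      for i, (yi, ui) in enumerate(targets)]
            best = max(range(n), key=scores.__getitem__)
            new_values.append(r + scores[best])
            pointers.append(best)
        values = new_values
        back.append(pointers)
    # Backtrack from the best final tuple to recover the sequence of indices.
    last = max(range(n), key=values.__getitem__)
    seq = [last]
    for pointers in reversed(back):
        seq.append(pointers[seq[-1]])
    seq.reverse()
    return values[last], seq


# Hypothetical toy data: f(x, u) = x + u, rho(x, u) = -|x|, h(t, x) = -x / 2.
h = lambda t, x: (-0.5 * x[0],)
data = [((1.0,), (-0.5,), -1.0, (0.5,)),
        ((0.5,), (-0.25,), -0.5, (0.25,)),
        ((0.25,), (-0.125,), -0.25, (0.125,)),
        ((2.0,), (0.0,), -2.0, (2.0,))]
bound, best_sequence = max_lower_bound(data, (1.0,), h, 3, 1.0, 1.0, 0.5)
```

In this toy case the first three tuples lie on the trajectory followed from x0 = 1, so the maximizing sequence selects them in order and the bound equals the true return.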

Results 

3/5

Algorithm  


Results

4/5

Tightness of the computed lower bound $B^{*}_{F^T}(x_0)$?

  > Hypothesis on the density of the database (with X and U bounded):
      $\exists \alpha \in \mathbb{R} : \sup_{(x, u) \in X \times U} \left\{ \min_{l \in \{1, \ldots, n\}} \left( \|x - x^l\| + \|u - u^l\| \right) \right\} \le \alpha$

"For each pair (x, u), the nearest tuple is no farther than α."
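The supremum in the density hypothesis ranges over the whole of X × U and cannot be evaluated exactly in general. As a rough illustration only, the sketch below computes the same quantity over a finite set of probe points, which yields an under-estimate of the smallest admissible α. The Euclidean norm, the function name `dispersion_estimate` and the toy points are assumptions.

```python
import math
from typing import Sequence, Tuple

Vec = Tuple[float, ...]
Pair = Tuple[Vec, Vec]  # a state-action pair (x, u)


def dispersion_estimate(data_xu: Sequence[Pair],
                        probes: Sequence[Pair]) -> float:
    """Largest distance from a probe point to its nearest database pair."""
    return max(
        min(math.dist(x, xl) + math.dist(u, ul) for xl, ul in data_xu)
        for x, u in probes
    )


# Hypothetical example: two database pairs, probes on a grid over X = [0, 1].
data_xu = [((0.0,), (0.0,)), ((1.0,), (0.0,))]
probes = [((i / 10,), (0.0,)) for i in range(11)]
alpha_hat = dispersion_estimate(data_xu, probes)
```

A finer probe grid tightens the estimate, at a cost proportional to the number of probes times the number of tuples.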

Results

5/5

Then
      $\exists C \in \mathbb{R} : J_T^h(x_0) - B^{*}_{F^T}(x_0) \le C \alpha$

Conclusion and Future Work

1/2

  > An approach for computing an exact lower bound
  > Simple algorithm
  > Linear relationship with the database density

To improve
  > Strong correlation with the Lipschitz constants
  > The lower bound can be very low (!)

Conclusion and Future Work

2/2

Using the same approach in a stochastic framework
  > Work in progress

Developing new algorithms based on this approach
  > An evaluation tool

Thank you!
