Inferring bounds on the performance of a control policy from a sample of trajectories

28th Benelux Meeting – Spa, March 16th, 2009

Raphaël Fonteneau, Susan Murphy, Louis Wehenkel, Damien Ernst

Université de Liège – University of Michigan, Ann Arbor


Outline

  > Introduction
  > Approach
  > Results
  > Conclusion and Future Work

2009 IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (IEEE ADPRL 2009), conference proceedings published annually by IEEE


Introduction: General problem

  > Discrete-time dynamical system: $x_{t+1} = f(x_t, u_t)$
  > Reward function: $r_t = \rho(x_t, u_t)$
  > Deterministic system dynamics $f$ and reward function $\rho$ unknown, replaced by ...
  > ... collected data: (parts of) trajectories

$$\mathcal{F} = \{(x^l, u^l, r^l = \rho(x^l, u^l), y^l = f(x^l, u^l))\}_{l=1}^{n}$$

  > Goal: finding a (near-optimal) control policy $h$ that maximises

$$J_T^h(x_0) = \sum_{t=0}^{T-1} \rho(x_t, h(t, x_t))$$
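To make the notation concrete, here is a minimal sketch of the return $J_T^h(x_0)$ and of the tuple format stored in the database. It assumes scalar states and actions. The function names `policy_return` and `collect_tuples`, as well as the example system `f`, `rho` and `h`, are hypothetical and not taken from the slides.

```python
from typing import Callable, Iterable, List, Tuple

# One-step tuple (x, u, r, y) with r = rho(x, u) and y = f(x, u).
Transition = Tuple[float, float, float, float]


def policy_return(
    f: Callable[[float, float], float],
    rho: Callable[[float, float], float],
    h: Callable[[int, float], float],
    x0: float,
    T: int,
) -> float:
    """Compute J_T^h(x0) = sum_{t=0}^{T-1} rho(x_t, h(t, x_t))."""
    x, total = x0, 0.0
    for t in range(T):
        u = h(t, x)          # action chosen by the policy at time t
        total += rho(x, u)   # reward collected at time t
        x = f(x, u)          # deterministic transition
    return total


def collect_tuples(
    f: Callable[[float, float], float],
    rho: Callable[[float, float], float],
    samples: Iterable[Tuple[float, float]],
) -> List[Transition]:
    """Build the database {(x, u, rho(x, u), f(x, u))} from sampled (x, u) pairs."""
    return [(x, u, rho(x, u), f(x, u)) for x, u in samples]


# Hypothetical scalar system, used only for illustration.
f = lambda x, u: 0.5 * x + u
rho = lambda x, u: -abs(x)
h = lambda t, x: -0.5 * x
```

In the setting of the talk, `f` and `rho` are unknown and only the output of something like `collect_tuples` is available; `policy_return` is what the lower bound tries to approach from below.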

Introduction: Existing solutions

  > Model Predictive Control, Dynamic Programming, Reinforcement Learning ...

But

  > Limited amount of data in $\mathcal{F} = \{(x^l, u^l, r^l, y^l)\}_{l=1}^{n}$
  > High dimensional / continuous spaces
  > Need for approximation structures
  > Guarantees on the performance of a computed policy $h$?

$$J_T^h(x_0) = \sum_{t=0}^{T-1} \rho(x_t, h(t, x_t))$$

Introduction: What we propose

  > A tool for evaluating the quality of a policy
  > Computation of a lower bound on the return of a given policy $h$, starting from a given initial state
  > Use of the (assumed non-noisy) database without approximation

Approach: Assumptions

  > Deterministic framework, continuous spaces and functions, non-noisy database
  > Lipschitz continuity of $f$, $\rho$ and $h$:

$$\exists L_f, L_\rho, L_h \in \mathbb{R},\ \forall (x, x') \in X^2,\ (u, u') \in U^2:$$

$$\|f(x, u) - f(x', u')\| \le L_f \left(\|x - x'\| + \|u - u'\|\right)$$

$$|\rho(x, u) - \rho(x', u')| \le L_\rho \left(\|x - x'\| + \|u - u'\|\right)$$

$$\|h(t, x) - h(t, x')\| \le L_h \|x - x'\|, \quad 0 \le t \le T-1$$

Approach: General idea

  > Using a sequence of $T$ tuples to compute an exact lower bound on

$$J_T^h(x_0) = \sum_{t=0}^{T-1} \rho(x_t, h(t, x_t))$$

    using the exact rewards and the exact dynamics given by the tuples

  > Maximizing this lower bound by choosing the best sequence of tuples

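Approach: An illustration

[Figure: the trajectory $x_0,\ x_1 = f(x_0, h(0, x_0)),\ x_2, \ldots, x_{T-2},\ x_{T-1},\ x_T$ followed under policy $h$, with first reward $r_0 = \rho(x_0, h(0, x_0))$, shown alongside a sequence of tuples $(x^{l_0}, u^{l_0}, r^{l_0}, y^{l_0}),\ (x^{l_1}, u^{l_1}, r^{l_1}, y^{l_1}), \ldots, (x^{l_{T-2}}, u^{l_{T-2}}, r^{l_{T-2}}, y^{l_{T-2}}),\ (x^{l_{T-1}}, u^{l_{T-1}}, r^{l_{T-1}}, y^{l_{T-1}})$. Distances marked on the figure: $\delta_0 = \|x^{l_0} - x_0\| + \|u^{l_0} - h(0, x_0)\|$, $\delta_1 = \|x^{l_1} - y^{l_0}\| + \|u^{l_1} - h(1, y^{l_0})\|$, ..., $\delta_{T-1}$.]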

Results: Theorem (lower bound associated with a sequence of tuples $\tau$)

  > A computable lower bound on $J_T^h(x_0)$:

$$B^h(\tau, x_0) = \sum_{t=0}^{T-1} \left[ r^{l_t} - L_{Q_{T-t}} \left( \|x^{l_t} - y^{l_{t-1}}\| + \|u^{l_t} - h(t, y^{l_{t-1}})\| \right) \right]$$

with

$$y^{l_{-1}} = x_0, \qquad L_{Q_N} = L_\rho \sum_{t=0}^{N-1} \left[ L_f (1 + L_h) \right]^t$$
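A minimal sketch of how this bound could be evaluated for one given sequence of tuples. It assumes scalar states and actions, so the norms reduce to absolute values, and that the Lipschitz constants are known. The names `lipschitz_q` and `lower_bound` and the tuple layout `(x, u, r, y)` are illustrative choices, not taken from the slides.

```python
from typing import Callable, Sequence, Tuple

Transition = Tuple[float, float, float, float]  # (x, u, r, y)


def lipschitz_q(N: int, L_f: float, L_rho: float, L_h: float) -> float:
    """L_{Q_N} = L_rho * sum_{t=0}^{N-1} [L_f (1 + L_h)]^t."""
    return L_rho * sum((L_f * (1.0 + L_h)) ** t for t in range(N))


def lower_bound(
    tau: Sequence[Transition],
    x0: float,
    h: Callable[[int, float], float],
    L_f: float,
    L_rho: float,
    L_h: float,
) -> float:
    """Evaluate the bound for the sequence tau = (tuple l_0, ..., tuple l_{T-1})."""
    T = len(tau)
    y_prev = x0  # convention y^{l_{-1}} = x0
    total = 0.0
    for t, (x, u, r, y) in enumerate(tau):
        # Distance between the tuple and the point the policy would visit from y_prev.
        dist = abs(x - y_prev) + abs(u - h(t, y_prev))
        total += r - lipschitz_q(T - t, L_f, L_rho, L_h) * dist
        y_prev = y
    return total


# Hypothetical example: tuples generated by f(x, u) = 0.5 x + u, rho(x, u) = -|x|,
# evaluated for the policy h(t, x) = -0.5 x (so L_f = 1, L_rho = 1, L_h = 0.5 are valid).
h = lambda t, x: -0.5 * x
on_trajectory = [(1.0, -0.5, -1.0, 0.0), (0.0, 0.0, 0.0, 0.0)]
off_trajectory = [(1.1, -0.5, -1.1, 0.05)]
```

When every tuple lies exactly on the trajectory of the policy, all distances vanish and the bound equals the true return; any mismatch is penalised in proportion to $L_{Q_{T-t}}$.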

Results: Maximizing the lower bound $B^h(\tau, x_0)$

  > Maximizing over the set of all possible sequences of tuples $\mathcal{F}^T$:

$$B^{*}_{\mathcal{F}^T}(x_0) = \max_{\tau \in \mathcal{F}^T} B^h(\tau, x_0)$$

with

$$J_T^h(x_0) \ge B^{*}_{\mathcal{F}^T}(x_0)$$

  > Exhaustive search quickly becomes prohibitive
  > Viterbi-like algorithm: $O(T n^2)$
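One way such a Viterbi-like maximization could look is sketched below as a forward dynamic program over tuple indices with back-pointers. The slides do not give the algorithm itself, so this is an assumed reconstruction for scalar states and actions; `max_lower_bound` and `lipschitz_q` are hypothetical names.

```python
from typing import Callable, List, Sequence, Tuple

Transition = Tuple[float, float, float, float]  # (x, u, r, y)


def lipschitz_q(N: int, L_f: float, L_rho: float, L_h: float) -> float:
    """L_{Q_N} = L_rho * sum_{t=0}^{N-1} [L_f (1 + L_h)]^t."""
    return L_rho * sum((L_f * (1.0 + L_h)) ** t for t in range(N))


def max_lower_bound(
    F: Sequence[Transition],
    x0: float,
    h: Callable[[int, float], float],
    T: int,
    L_f: float,
    L_rho: float,
    L_h: float,
) -> Tuple[float, List[int]]:
    """Return the maximal bound and the indices (l_0, ..., l_{T-1}) achieving it."""
    n = len(F)
    lq = [lipschitz_q(T - t, L_f, L_rho, L_h) for t in range(T)]

    def gain(t: int, y_prev: float, l: int) -> float:
        # Contribution of tuple l at stage t when the previous tuple ended in y_prev.
        x, u, r, _ = F[l]
        return r - lq[t] * (abs(x - y_prev) + abs(u - h(t, y_prev)))

    # value[l]: best partial bound over sequences whose current last tuple is l.
    value = [gain(0, x0, l) for l in range(n)]
    back: List[List[int]] = []
    for t in range(1, T):
        new_value, pointers = [], []
        for l in range(n):
            scores = [value[p] + gain(t, F[p][3], l) for p in range(n)]
            p_best = max(range(n), key=scores.__getitem__)
            new_value.append(scores[p_best])
            pointers.append(p_best)
        value = new_value
        back.append(pointers)

    # Backtrack from the best final tuple to recover the whole sequence.
    l_last = max(range(n), key=value.__getitem__)
    seq = [l_last]
    for pointers in reversed(back):
        seq.append(pointers[seq[-1]])
    seq.reverse()
    return value[l_last], seq


# Hypothetical database from f(x, u) = 0.5 x + u, rho(x, u) = -|x|.
F = [(1.0, -0.5, -1.0, 0.0), (0.0, 0.0, 0.0, 0.0), (2.0, 0.0, -2.0, 1.0)]
h = lambda t, x: -0.5 * x
best, seq = max_lower_bound(F, 1.0, h, 2, 1.0, 1.0, 0.5)
```

Each of the $T$ stages compares every tuple with every possible predecessor, which gives the $O(T n^2)$ cost, against $n^T$ sequences for exhaustive search.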

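Results: Theorem (tightness of the computed lower bound $B^{*}_{\mathcal{F}^T}(x_0)$)

  > Hypothesis on the density of the database (with $X$ and $U$ bounded):

$$\exists \alpha \in \mathbb{R}: \quad \sup_{(x, u) \in X \times U} \left\{ \min_{l \in \{1, \ldots, n\}} \left( \|x - x^l\| + \|u - u^l\| \right) \right\} \le \alpha$$

"For each couple $(x, u)$, the nearest tuple is not farther than $\alpha$"

$$\exists C \in \mathbb{R}: \quad J_T^h(x_0) - B^{*}_{\mathcal{F}^T}(x_0) \le C \alpha$$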
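The density quantity in the hypothesis can be approximated numerically. The sketch below replaces the supremum over $X \times U$ with a maximum over a finite grid, so it only gives an estimate from below of the smallest admissible $\alpha$. It assumes scalar states and actions; the function name `empirical_density` and the grid are hypothetical.

```python
from itertools import product
from typing import Iterable, Sequence, Tuple

Transition = Tuple[float, float, float, float]  # (x, u, r, y)


def empirical_density(
    F: Sequence[Transition],
    grid: Iterable[Tuple[float, float]],
) -> float:
    """Max over grid points (x, u) of the distance to the nearest tuple in F."""
    return max(
        min(abs(x - xl) + abs(u - ul) for (xl, ul, _, _) in F)
        for (x, u) in grid
    )


# Hypothetical example on X x U = [0, 1] x [0, 1] with an 11 x 11 grid.
axis = [i / 10 for i in range(11)]
grid = list(product(axis, axis))
F = [(0.0, 0.0, 0.0, 0.0), (1.0, 1.0, 0.0, 0.0)]
alpha_hat = empirical_density(F, grid)
```

Since the gap between the return and the bound is at most $C \alpha$, adding tuples that reduce this quantity tightens the guarantee linearly.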

Results: Toy example (1-dimensional linear system)

  > From 100 to 40 000 tuples in $\mathcal{F}$

Conclusion and Future Work

  > An approach for computing an exact lower bound
  > Simple algorithm
  > Linear relationship with the database density

To improve

  > Strong correlation with the Lipschitz constants
  > The lower bound can be very low (!)
  > Extension to a stochastic framework
  > Developing new algorithms

Thank you!

