Model-free Monte Carlo-like Policy Evaluation

Raphael Fonteneau (a), Susan A. Murphy (b)

(a) University of Liège, Belgium

Abstract

We propose an algorithm for estimating the finite-horizon expected return of a closed-loop control policy from an a priori given (off-policy) sample of one-step transitions. It averages cumulated rewards along a set of "broken trajectories" made of one-step transitions selected from the sample on the basis of the control policy. Under some Lipschitz continuity assumptions on the system dynamics, reward function and control policy, we provide bounds on the bias and variance of the estimator that depend only on the Lipschitz constants, on the number of broken trajectories used in the estimator, and on the sparsity of the sample of one-step transitions.

The Monte Carlo estimator

 

When the system is accessible to experimentation, such an oracle can be based on a Monte Carlo (MC) approach.

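[Figure: p Monte Carlo trajectories of length T, all starting from $x_0$ and simulated under the policy h. Trajectory i visits the states $x_1^i, \dots, x_T^i$ under the disturbances $w_0^i, \dots, w_{T-1}^i$ and collects the return $\sum_{t=0}^{T-1} r_t^i$.]

MC Estimator

$$\frac{1}{p} \sum_{i=1}^{p} \sum_{t=0}^{T-1} r_t^i$$

with

$$x_{t+1}^i = f\big(x_t^i, h(t, x_t^i), w_t^i\big), \qquad r_t^i = \rho\big(x_t^i, h(t, x_t^i), w_t^i\big), \qquad w_t^i \sim p_W(\cdot).$$

[Figure: illustration of the distance metric ∆ and of the k-sparsity. A point (x, u) of X × U is shown together with a sample element (x', u') and the distance $\Delta_k^{P_n}(x, u)$ to its k-th nearest neighbor.]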

Here, the MC approach is not feasible, since the system is unknown.

An instantaneous reward $r_t = \rho(x_t, u_t, w_t)$ is associated with the action $u_t$ taken while being in state $x_t$.

Bias of the MFMC estimator



Theorem



Variance of the MFMC estimator



Theorem

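The only information available on the system is gathered in a sample of n one-step transitions $\{(x^l, u^l, r^l, y^l)\}_{l=1}^{n}$.

A policy $h: \{0, \dots, T-1\} \times X \to U$ is given, and we want to evaluate its performance. The expected return of the policy h when starting from an initial state $x_0 = x$ is given by $J^h(x) = \mathbb{E}\big[R^h(x)\big]$, where the expectation is taken over the disturbances $w_0, \dots, w_{T-1} \sim p_W(\cdot)$ and $R^h(x)$ is the sum of the T rewards collected along the trajectory.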



The Model-free Monte Carlo estimator

The k-sparsity can be seen as the smallest radius such that all ∆-balls in X × U contain at least k elements from the sample.

The bias and variance of the Monte Carlo estimator are 0 and $\sigma^2_{R^h}(x_0) / p$, respectively, where $\sigma^2_{R^h}(x_0)$ denotes the variance of $R^h(x_0)$.

Problem statement

We consider a discrete-time system whose dynamics over T stages is given by $x_{t+1} = f(x_t, u_t, w_t)$. All $x_t$ lie in a normed state space X, all $u_t$ lie in a normed action space U, and the $w_t$ are i.i.d. according to a probability distribution $p_W(\cdot)$.





$\Delta_k^{P_n}(x, u)$ denotes the distance of (x, u) to its k-th nearest neighbor (using the distance ∆) in the sample.
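As a rough illustration of the k-th nearest-neighbor distance and of the k-sparsity, the sketch below computes the distance from a query pair to its k-th nearest neighbor in a list of state-action pairs, then takes the largest such distance over a finite set of query points. The function names, the scalar states and actions, and the choice ∆((x, u), (x', u')) = |x − x'| + |u − u'| are assumptions made for this example only. A finite set of queries yields a lower bound on the supremum over X × U, not its exact value.

```python
def delta(a, b):
    """Assumed distance between two (x, u) pairs with scalar x and u."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def kth_nn_distance(pairs, query, k, dist=delta):
    """Distance from `query` to its k-th nearest neighbor in `pairs` (k >= 1)."""
    if not 1 <= k <= len(pairs):
        raise ValueError("k must be between 1 and len(pairs)")
    return sorted(dist(pair, query) for pair in pairs)[k - 1]


def k_sparsity_lower_bound(pairs, queries, k, dist=delta):
    """Largest k-th nearest-neighbor distance over a finite set of queries."""
    return max(kth_nn_distance(pairs, q, k, dist) for q in queries)
```

A denser set of query points tightens the lower bound at the cost of more distance evaluations.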

We assume that the random variable $R^h(x_0)$ admits a finite variance.



In this context, we propose a "Model-Free Monte Carlo (MFMC) estimator" of the performance of a given policy that mimics in some way the Monte Carlo estimator.

We assume that the functions f, ρ and h are Lipschitz continuous.
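One common way to write such an assumption is sketched below. The constants $L_f$, $L_\rho$, $L_h$ and the choice of summing the state and action norms are introduced here for illustration and are not taken from the poster.

```latex
\begin{align*}
\| f(x, u, w) - f(x', u', w) \|_X &\le L_f \big( \| x - x' \|_X + \| u - u' \|_U \big), \\
| \rho(x, u, w) - \rho(x', u', w) | &\le L_\rho \big( \| x - x' \|_X + \| u - u' \|_U \big), \\
\| h(t, x) - h(t, x') \|_U &\le L_h \, \| x - x' \|_X ,
\end{align*}
% for all x, x' in X, all u, u' in U, every disturbance w, and every t in {0, ..., T-1}
```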

In this paper, the only information is contained in a sample of one-step transitions of the system.



Discrete-time stochastic optimal control problems arise in many fields (finance, medicine, engineering, ...).

Many techniques for solving such problems use an oracle that evaluates the performance of any given policy in order to determine a (near-)optimal control policy.



Analysis of the MFMC estimator

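We define the Monte Carlo estimator of the expected return of h when starting from the initial state $x_0$ as the average, over p trajectories simulated independently from $x_0$ under h, of the sum of the T rewards collected along each trajectory.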
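A minimal sketch of this Monte Carlo estimator follows, assuming the system can be simulated. The function name `mc_estimate`, the callable `draw_w` standing in for a draw from p_W, and the toy system at the bottom are hypothetical and serve only to show the structure of the computation.

```python
import random


def mc_estimate(f, rho, h, draw_w, x0, p, T):
    """Average the returns of p simulated trajectories of length T."""
    total = 0.0
    for _ in range(p):
        x = x0
        for t in range(T):
            u = h(t, x)            # action prescribed by the policy
            w = draw_w()           # disturbance drawn from p_W
            total += rho(x, u, w)  # accumulate the instantaneous reward
            x = f(x, u, w)         # move to the next state
    return total / p


# Hypothetical toy system, used only to exercise the function.
rng = random.Random(0)
estimate = mc_estimate(
    f=lambda x, u, w: x + u + w,
    rho=lambda x, u, w: -abs(x),
    h=lambda t, x: -0.5 * x,
    draw_w=lambda: rng.gauss(0.0, 0.1),
    x0=1.0, p=1000, T=5,
)
```

Each call needs p × T evaluations of f and ρ, which is exactly what is unavailable in the model-free setting.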



Introduction

Louis Wehenkel (a), Damien Ernst (a)

(b) University of Michigan, USA

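We define a random set of one-step transitions as follows: the set of pairs $\{(x^l, u^l)\}_{l=1}^{n}$ is arbitrarily chosen, whereas the pairs $(r^l, y^l)$ are determined by $\big(\rho(x^l, u^l, \cdot), f(x^l, u^l, \cdot)\big)$, with the disturbance drawn according to $p_W(\cdot)$.

The sample of one-step transitions is a realization of this random set.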
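The way such a sample comes about can be sketched as follows, assuming a simulator is available at data-collection time. The name `generate_sample` and the callable `draw_w` (one draw from p_W) are hypothetical; the state-action pairs are supplied by the caller, reflecting that they are chosen arbitrarily.

```python
def generate_sample(f, rho, draw_w, pairs):
    """Turn arbitrarily chosen (x, u) pairs into one-step transitions (x, u, r, y)."""
    sample = []
    for x, u in pairs:
        # One disturbance per transition, shared by the reward and the next state.
        w = draw_w()
        sample.append((x, u, rho(x, u, w), f(x, u, w)))
    return sample
```

The disturbance itself is not stored: only the resulting reward and next state are kept in the sample.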



We introduce the Model-Free Monte Carlo estimator.

From the sample of transitions, we build p sequences of different transitions of length T called "broken trajectories".



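[Figure: a trajectory of the system under the policy h. Starting from $x_0$, the system visits $x_1, x_2, \dots, x_T$ under the disturbances $w_0, w_1, \dots, w_{T-1}$ and collects the rewards $r_0, r_1, \dots, r_{T-1}$.]

$$R^h(x_0) = \sum_{t=0}^{T-1} r_t$$

Problem: the functions f, ρ and $p_W(\cdot)$ are unknown.

They are replaced by a sample of n system transitions.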

These broken trajectories are built so as to minimize the discrepancy (using a distance metric ∆) with a classical MC sample that could be obtained by simulating the system with the policy h.

We average the cumulated returns over the p broken trajectories to compute an estimate of the expected return of h.

The algorithm has complexity O(npT).
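The construction can be sketched as follows. The poster does not spell out the selection rule, so this example assumes a greedy one: at each step, the unused transition closest (in ∆) to the current state and the action prescribed by h is selected, removed from the pool, and its end state becomes the current state. The function name, the tuple layout (x, u, r, y), and the caller-supplied `dist` are hypothetical.

```python
def mfmc_estimate(sample, h, x0, p, T, dist):
    """Average the returns of p broken trajectories built from `sample`.

    sample: list of one-step transitions (x, u, r, y)
    h:      policy, called as h(t, x) and returning an action
    dist:   discrepancy between two (x, u) pairs
    """
    if p * T > len(sample):
        raise ValueError("need at least p * T transitions")
    unused = set(range(len(sample)))  # each transition is used at most once
    total = 0.0
    for _ in range(p):
        x = x0
        for t in range(T):
            u = h(t, x)
            # Linear scan over the unused transitions: O(n) work per step.
            # Ties are broken by the smallest index to keep the result deterministic.
            best = min(
                unused,
                key=lambda l: (dist((sample[l][0], sample[l][1]), (x, u)), l),
            )
            unused.remove(best)
            _, _, r, y = sample[best]
            total += r
            x = y  # continue from the end state of the selected transition
    return total / p
```

With a linear scan at each of the p × T steps, this sketch is consistent with the stated O(npT) complexity.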



[Figure: construction of the p broken trajectories from the sample of one-step transitions, starting from $x_0$. Legend: transition generated under disturbance $w^{l_t^i}$; real trajectory under disturbances $w^{l_0^i}, \dots, w^{l_{T-1}^i}$.]

MFMC Estimator

$$\frac{1}{p} \sum_{i=1}^{p} \sum_{t=0}^{T-1} r^{l_t^i}$$

where $l_t^i$ is the index of the transition selected at step t of the i-th broken trajectory.

Acknowledgements. Raphael Fonteneau acknowledges the financial support of the FRIA. Damien Ernst is a research associate of the FRS-FNRS. This paper presents research results of the Belgian Network BIOMAGNET and the PASCAL2 European Network of Excellence. We also acknowledge financial support from NIH grants P50 DA10075 and R01 MH080015. The scientific responsibility rests with its authors.

[Figure: the sample of one-step transitions $(x^l, u^l, r^l, y^l)$, $l = 1, \dots, n$.]

→ How to evaluate $J^h(x_0)$ in this context?

Conclusions and Future work

● We have proposed in this paper an estimator of the expected return of a policy in a model-free setting, the MFMC estimator.
● We have provided bounds on the bias and variance of the MFMC estimator.
● The bias and variance of the MFMC estimator converge to the bias and variance of the MC estimator.
● The MFMC estimator could be used in a direct policy search framework.
● Possible extensions (conditional probability distributions, parameter estimation, etc.).
