Model-free Monte Carlo-like Policy Evaluation
Raphael Fonteneau (a), Susan A. Murphy (b); (a) University of Liège, Belgium
Abstract
We propose an algorithm for estimating the finite-horizon expected return of a closed-loop control policy from an a priori given (off-policy) sample of one-step transitions. It averages cumulated rewards along a set of "broken trajectories" made of one-step transitions selected from the sample on the basis of the control policy. Under some Lipschitz continuity assumptions on the system dynamics, reward function and control policy, we provide bounds on the bias and variance of the estimator that depend only on the Lipschitz constants, on the number of broken trajectories used in the estimator, and on the sparsity of the sample of one-step transitions.

The Monte Carlo estimator
● When the system is accessible to experimentation, such an oracle can be based on a Monte Carlo (MC) approach.

[Figure: p trajectories (x^i_0, w^i_0, r^i_0, x^i_1, ..., x^i_T), i = 1, ..., p, simulated under the policy h, each yielding a cumulated return ∑_{t=0}^{T−1} r^i_t.]
MC Estimator
M_p(x_0) = (1/p) ∑_{i=1}^{p} ∑_{t=0}^{T−1} r^i_t

with, for i = 1, ..., p and t = 0, ..., T−1:

x^i_{t+1} = f(x^i_t, h(t, x^i_t), w^i_t),
r^i_t = ρ(x^i_t, h(t, x^i_t), w^i_t),
w^i_t ~ p_W(·).
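As an illustration, the MC estimator can be sketched as follows. This is a minimal sketch, not the authors' code: it requires a simulator of the system, and the toy dynamics `f`, reward `rho`, policy `h` and disturbance sampler `sample_w` below are our own illustrative assumptions.

```python
import random

def mc_estimate(f, rho, sample_w, h, x0, T, p, seed=0):
    """Monte Carlo estimator M_p(x0): average of the cumulated rewards of
    p trajectories of length T simulated under the closed-loop policy h."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(p):             # p independent trajectories
        x = x0
        for t in range(T):         # T stages
            u = h(t, x)            # action chosen by the policy
            w = sample_w(rng)      # disturbance w_t ~ p_W(.)
            total += rho(x, u, w)  # instantaneous reward r_t
            x = f(x, u, w)         # next state x_{t+1}
    return total / p

# Toy 1-D example (f, rho, h, sample_w are illustrative, not from the paper):
f = lambda x, u, w: x + u + w
rho = lambda x, u, w: -abs(x)
h = lambda t, x: -0.5 * x
sample_w = lambda rng: rng.gauss(0.0, 0.1)
estimate = mc_estimate(f, rho, sample_w, h, x0=1.0, T=5, p=200)
```

Because the estimator only needs trajectory samples, it applies to any simulator exposing `f`, `rho` and `sample_w`.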
● Distance metric ∆: ∆((x,u), (x',u')) = ||x − x'||_X + ||u − u'||_U.
● Here, the MC approach is not feasible, since the system is unknown.
● An instantaneous reward r_t = ρ(x_t, u_t, w_t) is associated with the action u_t taken while being in state x_t.
Bias of the MFMC estimator
● Theorem: bound on the bias depending only on the Lipschitz constants, the number p of broken trajectories, and the k-sparsity of the sample.

Variance of the MFMC estimator
● Theorem: bound on the variance depending only on the Lipschitz constants, the number p of broken trajectories, and the k-sparsity of the sample.
● The only information available on the system is gathered in a sample of n one-step transitions.
● A policy h: {0,...,T−1} × X → U is given, and we want to evaluate its performance. The expected return of the policy h when starting from an initial state x_0 = x is given by J^h(x) = E[R^h(x)].
The Model-free Monte Carlo estimator
● We consider a discrete-time system whose dynamics over T stages is given by x_{t+1} = f(x_t, u_t, w_t).
● All x_t lie in a normed state space X, all u_t lie in a normed action space U, and the w_t are i.i.d. according to a probability distribution p_W(·).
● The k-sparsity can be seen as the smallest radius such that all ∆-balls in X×U contain at least k elements from the sample.
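Concretely, the k-th-nearest-neighbor distance and the k-sparsity can be computed as sketched below. This is our own illustrative sketch: it assumes 1-D state and action spaces with the sum of absolute differences as the distance ∆, and it evaluates the sparsity only at the sample points, which is a finite simplification of the supremum over X×U.

```python
def delta(p, q):
    # Distance Delta between two state-action pairs (1-D illustrative case).
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def knn_distance(point, sample, k):
    """Distance of the pair (x, u) = point to its k-th nearest neighbor
    (with respect to Delta) in the sample."""
    return sorted(delta(point, s) for s in sample)[k - 1]

def k_sparsity(sample, k):
    """Smallest radius such that the Delta-ball around each evaluated point
    contains at least k sample elements (here evaluated at sample points only)."""
    return max(knn_distance(point, sample, k) for point in sample)

# Three state-action pairs on a line; the isolated point (3, 0) drives the radius.
pairs = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
radius = k_sparsity(pairs, k=2)
```

The sparser the sample in some region of X×U, the larger the radius, which is exactly what the bias and variance bounds penalize.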
● The bias and variance of the Monte Carlo estimator are 0 and Var[R^h(x_0)]/p, respectively.
● ∆((x,u), P_n^k(x,u)) denotes the distance of (x,u) to its k-th nearest neighbor P_n^k(x,u) (using the distance ∆) in the sample.

Problem statement
● We assume that the random variable R^h(x_0) admits a finite variance.
● We assume that the functions f, ρ and h are Lipschitz continuous.
● In this context, we propose a ``Model-Free Monte Carlo (MFMC) estimator'' of the performance of a given policy that mimics in some way the Monte Carlo estimator.
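The Lipschitz continuity assumptions on f, ρ and h can be written out as follows (a standard formulation; the constant names L_f, L_ρ, L_h are ours):

```latex
\|f(x,u,w) - f(x',u',w)\|_X \le L_f \,\big( \|x - x'\|_X + \|u - u'\|_U \big)
|\rho(x,u,w) - \rho(x',u',w)| \le L_\rho \,\big( \|x - x'\|_X + \|u - u'\|_U \big)
\|h(t,x) - h(t,x')\|_U \le L_h \, \|x - x'\|_X
```

for all x, x' in X, u, u' in U, all disturbances w, and all t in {0, ..., T−1}.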
● In this paper, the only information on the system is contained in a sample of one-step transitions of the system.
● Discrete-time stochastic optimal control problems arise in many fields (finance, medicine, engineering, ...).
● Many techniques for solving such problems use an oracle that evaluates the performance of any given policy in order to determine a (near-)optimal control policy.
Analysis of the MFMC estimator
● We define the Monte Carlo estimator of the expected return of h when starting from the initial state x_0: M_p(x_0) = (1/p) ∑_{i=1}^{p} ∑_{t=0}^{T−1} r^i_t.
Louis Wehenkel (a), Damien Ernst (a); (b) University of Michigan, USA

Introduction
● We define the random variable R^h(x_0) = ∑_{t=0}^{T−1} r_t as the cumulated reward along a trajectory generated by h from x_0.
● The set of pairs (x^l, u^l) in the sample is arbitrarily chosen, whereas the pairs (r^l, y^l) are realizations of (ρ(x^l, u^l, ·), f(x^l, u^l, ·)) drawn according to p_W(·).
● We introduce the Model-Free Monte Carlo estimator: from the sample of transitions, we build p sequences of different transitions of length T called ``broken trajectories''.
[Figure: real trajectory of the system from x_0 under disturbances w_0, ..., w_{T−1}: states x_1, ..., x_T, rewards r_0, ..., r_{T−1}.]

R^h(x_0) = ∑_{t=0}^{T−1} r_t

● Problem: the functions f, ρ and p_W(·) are unknown.
● They are replaced by a sample of n system transitions.
● These broken trajectories are built so as to minimize the discrepancy (using a distance metric ∆) with a classical MC sample that could be obtained by simulating the system with the policy h.
● We average the cumulated returns over the p broken trajectories to compute an estimate of the expected return of h.
● The algorithm has complexity O(npT).
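The construction of the broken trajectories can be sketched as follows. This is a minimal sketch under our own simplifying assumptions (1-D states and actions, the sum of absolute differences as the distance ∆, and greedy nearest-transition selection); it requires n ≥ pT, since each transition is used at most once.

```python
def mfmc_estimate(sample, h, x0, T, p):
    """MFMC estimator: rebuild p broken trajectories of length T from a
    sample of one-step transitions (x, u, r, y) and average their returns."""
    used = set()                   # each transition appears in at most one place
    total = 0.0
    for _ in range(p):             # p broken trajectories
        x = x0
        for t in range(T):
            u = h(t, x)            # action the policy would take in x
            # unused transition minimizing Delta((x, u), (x^l, u^l))
            l = min((i for i in range(len(sample)) if i not in used),
                    key=lambda i: abs(sample[i][0] - x) + abs(sample[i][1] - u))
            used.add(l)
            total += sample[l][2]  # reward r^l of the selected transition
            x = sample[l][3]       # jump to its end state y^l
    return total / p               # average over the p broken trajectories

# Toy check: four identical deterministic transitions, reward 1 each.
transitions = [(0.0, 0.0, 1.0, 0.0)] * 4
value = mfmc_estimate(transitions, lambda t, x: 0.0, 0.0, T=2, p=2)
```

Each of the pT selection steps scans the n transitions once, which matches the O(npT) complexity stated above.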
[Figure: p broken trajectories rebuilt from the sample; each transition l^i_t was generated under its own disturbance w^{l^i_t}, whereas a real trajectory would be generated under disturbances w_0, ..., w_{T−1}.]

MFMC Estimator

M̂_p(x_0) = (1/p) ∑_{i=1}^{p} ∑_{t=0}^{T−1} r^{l^i_t}
Acknowledgements. Raphael Fonteneau acknowledges the financial support of the FRIA. Damien Ernst is a research associate of the FRS-FNRS. This paper presents research results of the Belgian Network BIOMAGNET and the PASCAL2 European Network of Excellence. We also acknowledge financial support from NIH grants P50 DA10075 and R01 MH080015. The scientific responsibility rests with its authors.
→ How to evaluate J^h(x_0) in this context?
Conclusions and Future work
● We have proposed in this paper an estimator of the expected return of a policy in a model-free setting: the MFMC estimator.
● We have provided bounds on the bias and variance of the MFMC estimator.
● The bias and variance of the MFMC estimator converge to the bias and variance of the MC estimator.
● The MFMC estimator could be used in a direct policy search framework.
● Possible extensions: conditional probability distributions, parameter estimation, etc.