Active Exploration by Searching for Experiments that Falsify the Computed Control Policy
Raphael Fonteneau (a), Susan A. Murphy (b), Louis Wehenkel (a), Damien Ernst (a)
(a) University of Liège, Belgium
(b) University of Michigan, USA
Abstract
We propose a strategy for experiment selection, in the context of reinforcement learning, based on the idea that the most interesting experiments to carry out at some stage are those most likely to falsify the current hypothesis about the optimal control policy. We cast this idea in a setting where a policy learning algorithm and a model identification method are given a priori. Experiments are selected if, using the learned environment model, they are predicted to yield a revision of the learned control policy. Algorithms and simulation results are provided for a deterministic system with a discrete action space; they show that the proposed approach is promising.
Introduction
● Discrete-time optimal control problems arise in many fields (engineering, finance, medicine, artificial intelligence, etc.).
● The performance of their solutions is related to the amount of information available about the system dynamics and the reward function of the optimal control problem.
● In this work, we assume that information about the system must be inferred from trajectories of the system, and that, due to time and cost constraints, only a limited number of trajectories can be generated.
Formalization
● We consider a deterministic discrete-time system whose dynamics over T stages is given by the time-invariant equation
    x_{t+1} = f(x_t, u_t),   t = 0, ..., T-1,
  where all x_t lie in a normed state space X, and all u_t in a finite action space U.
● The transition from time t to t+1 is associated with an instantaneous reward r_t = ρ(x_t, u_t).
● The return over T stages of a sequence of actions u = (u_0, ..., u_{T-1}) when starting from an initial state x_0 is given by
    J^u(x_0) = Σ_{t=0}^{T-1} ρ(x_t, u_t).
● Maximal return: J*(x_0) = max_u J^u(x_0).
● The goal is to find a sequence of actions whose return is as close as possible to the maximal return.
● The system dynamics f and the reward function ρ are unknown; they are replaced by a sample of n system transitions
    F_n = {(x^l, u^l, r^l, y^l)}_{l=1}^{n},
  where r^l = ρ(x^l, u^l) and y^l = f(x^l, u^l).
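For concreteness, here is a minimal Python sketch of this return computation; f and rho stand in for the (unknown) dynamics and reward function, and the function name is ours, not the authors'.

    def compute_return(f, rho, x0, actions):
        """Return of an action sequence over T = len(actions) stages,
        J^u(x0) = sum_t rho(x_t, u_t), for a deterministic system."""
        x, total = x0, 0.0
        for u in actions:
            total += rho(x, u)   # instantaneous reward r_t = rho(x_t, u_t)
            x = f(x, u)          # deterministic transition x_{t+1} = f(x_t, u_t)
        return total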
Problem
● How to generate an informative batch collection of data, so that high-performance control policies can be inferred from this collection?
● Given a sample of system transitions, how could one determine where to sample additional transitions?
● We propose a sequential strategy for choosing, given a batch collection of already sampled transitions, where to sample additional data.
Falsification-based sampling strategy
● We assume that we have access to a predictive model PM of the environment, and to a batch-mode RL algorithm BMRL.
● Using the sample of already collected transitions, we first compute a current control policy with BMRL.
● We uniformly draw a state-action point (x, u), and we compute a predicted transition using PM.
● We add the predicted transition to the current sample, and we compute a predicted control policy.
● If the predicted control policy falsifies the current control policy, then we sample a new (real) transition at (x, u); otherwise, we iterate with a new state-action point (x', u'). A sketch of this loop is given after this list.
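The following Python sketch illustrates one plausible reading of this loop; the helper names (pm, bmrl, sample_real_transition) and the equality test between policies are illustrative assumptions, not the authors' implementation.

    import random

    def falsification_based_sampling(sample, candidate_points, pm, bmrl,
                                     sample_real_transition, budget,
                                     max_tries=1000):
        # pm(x, u, sample) -> predicted (reward, next_state);
        # bmrl(sample) -> policy. Policy equality stands in for the
        # falsification test between current and predicted policies.
        current_policy = bmrl(sample)
        for _ in range(budget):
            for _ in range(max_tries):
                x, u = random.choice(candidate_points)      # uniform draw of (x, u)
                r_hat, y_hat = pm(x, u, sample)             # predicted transition
                predicted_policy = bmrl(sample + [(x, u, r_hat, y_hat)])
                if predicted_policy != current_policy:      # predicted policy revision:
                    sample.append(sample_real_transition(x, u))  # run the real experiment
                    current_policy = bmrl(sample)
                    break
        return sample, current_policy

A real experiment is only spent on a point whose predicted outcome would already change the learned policy, which is the falsification criterion in action.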
Benchmark
● The car-on-the-hill benchmark.
● PM: a nearest-neighbor algorithm (a minimal sketch is given below).
● BMRL: a nearest-neighbor, model-learning RL algorithm.
● We generate 50 databases of 1000 system transitions.
● We evaluate the performance of the inferred control policies on the real system.
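As an illustration of a nearest-neighbor predictive model, here is a minimal sketch, assuming states are real-valued vectors compared by Euclidean distance and actions are compared exactly; the function name is ours.

    import math

    def nearest_neighbor_pm(x, u, sample):
        # Predict the outcome of (x, u) by the reward and next state of the
        # closest already-observed transition with the same (discrete) action.
        best, best_dist = None, math.inf
        for (xl, ul, rl, yl) in sample:
            if ul == u:
                d = math.dist(x, xl)     # Euclidean distance in state space
                if d < best_dist:
                    best, best_dist = (rl, yl), d
        return best                      # predicted (reward, next_state), or None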
Experimental results
● Distribution of the returns of all control policies: uniform sampling strategy vs. falsification-based sampling strategy. [figure]
● Graphical representation of typical runs: uniform sampling strategy vs. falsification-based sampling strategy. [figure]
● Performance analysis: 50 runs of our strategy (blue) are compared with 50 uniform runs (red). [figure]
● Distribution of the returns of control policies at the end of the sampling process. [figure]
Conclusions and future work
Summary
● We have proposed a strategy for generating informative batch collections of data.
● This approach has been empirically validated.

Future works
● Extending the approach to more general frameworks.
● Investigating theoretical properties.
Acknowledgements
Raphael Fonteneau acknowledges the financial support of the FRIA. Damien Ernst is a research associate of the FRS-FNRS. This paper presents research results of the Belgian Networks BIOMAGNET and DYSCO and the PASCAL2 European Network of Excellence. We also acknowledge financial support from NIH grants P50 DA10075 and R01 MH080015. The scientific responsibility rests with its authors.
Reference
R. Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst. Active exploration by searching for experiments that falsify the computed control policy. IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL 2011), Paris, France, April 11-15, 2011, 8 pages.