Active Exploration by Searching for Experiments that Falsify the Computed Control Policy

Raphael Fonteneau (a), Susan A. Murphy (b)

(a) University of Liège, Belgium

Abstract

We propose a strategy for experiment selection in the context of reinforcement learning, based on the idea that the most interesting experiments to carry out at a given stage are those most liable to falsify the current hypothesis about the optimal control policy. We cast this idea in a setting where a policy learning algorithm and a model identification method are given a priori. An experiment is selected if, using the learned environment model, it is predicted to yield a revision of the learned control policy. Algorithms and simulation results are provided for a deterministic system with a discrete action space. They show that the proposed approach is promising.







The performance of solutions to an optimal control problem depends on the amount of information available on its system dynamics and reward function.



In this work, we assume that information on the system must be inferred from trajectories of the system and that, due to time and cost constraints, only a limited number of trajectories can be generated.



Figure: Distribution of the returns of all control policies. Panels: uniform sampling strategy; falsification-based sampling strategy.



We assume that we have access to a predictive model PM of the environment and to a batch mode RL algorithm BMRL.

Using the sample of already collected transitions, we first compute a control policy with BMRL.

We uniformly draw a state-action point (x,u) and use PM to compute a predicted transition from it.

We add the predicted transition to the current sample and compute a predicted control policy. If the predicted control policy falsifies the current control policy, we sample a new transition at (x,u); otherwise we iterate with a new state-action point (x',u').
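The selection loop described here can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the tuple layout `(state, action, reward, next_state)`, the callables `bmrl`, `pm`, `real_system`, `draw_point` and `falsifies`, and the `max_tries` budget (with its fallback to the last point drawn) are all hypothetical choices made for the example.

```python
from typing import Callable, List, Tuple

# Hypothetical transition layout: (state, action, reward, next_state).
Transition = Tuple[object, object, float, object]


def falsification_based_sampling(
    sample: List[Transition],
    bmrl: Callable[[List[Transition]], object],
    pm: Callable[[List[Transition], object, object], Tuple[float, object]],
    real_system: Callable[[object, object], Tuple[float, object]],
    draw_point: Callable[[], Tuple[object, object]],
    falsifies: Callable[[object, object], bool],
    n_new: int,
    max_tries: int = 100,
) -> List[Transition]:
    """Grow `sample` by `n_new` real transitions chosen by predicted falsification."""
    if max_tries < 1:
        raise ValueError("max_tries must be at least 1")
    sample = list(sample)  # work on a copy; the caller's list is left untouched
    for _ in range(n_new):
        # Policy computed from the transitions collected so far.
        current_policy = bmrl(sample)
        for _ in range(max_tries):
            # Uniformly drawn candidate experiment (x, u).
            x, u = draw_point()
            # Transition predicted by the model, not observed on the system.
            reward_hat, next_hat = pm(sample, x, u)
            predicted_policy = bmrl(sample + [(x, u, reward_hat, next_hat)])
            if falsifies(predicted_policy, current_policy):
                break
        # Assumption: if no candidate falsified the policy within max_tries,
        # the last candidate drawn is sampled anyway.
        reward, next_state = real_system(x, u)
        sample.append((x, u, reward, next_state))
    return sample
```

One simple choice for `falsifies`, again an assumption, is to declare the current policy falsified whenever the predicted policy differs from it, for example `lambda predicted, current: predicted != current`.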

Experimental results

Problem statement


Falsification-based sampling strategy

Introduction

Discrete-time optimal control problems arise in many fields (engineering, finance, medicine, artificial intelligence, etc.).

Louis Wehenkel (a), Damien Ernst (a)

(b) University of Michigan, USA

Sampling strategy






Figure: Graphical representation of typical runs. Panels: uniform sampling strategy; falsification-based sampling strategy.

Benchmark

The car-on-the-hill benchmark

Problem

How can we generate an informative batch collection of data from which high-performance control policies can be inferred?



We propose a sequential strategy for choosing, given a batch collection of already sampled transitions, where to sample additional data.

Formalization

We consider a deterministic discrete-time system whose dynamics over T stages is given by a time-invariant equation mapping the current state x_t and action u_t to the next state x_{t+1}, where all x_t lie in a normed state space X and all u_t in a finite action space U.



The transition from time t to t+1 is associated with an instantaneous reward.

PM: nearest neighbor algorithm.
BMRL: nearest neighbor model learning RL algorithm.
We generate 50 databases of 1000 system transitions.
We evaluate the performance of the inferred control policies on the real system.
Performance analysis: 50 runs of our strategy (blue) are compared with 50 uniform runs (red).
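A nearest neighbor predictive model of the kind named here could look like the sketch below. The details are assumptions made for illustration: states are tuples of floats compared with the Euclidean distance, the search is restricted to transitions that used the same action (falling back to all transitions when none did), and the neighbor's reward and next state are copied as the prediction. The function name and tuple layout are hypothetical.

```python
import math
from typing import List, Tuple

State = Tuple[float, ...]
# Hypothetical transition layout: (state, action, reward, next_state).
Transition = Tuple[State, int, float, State]


def nearest_neighbor_pm(
    sample: List[Transition], x: State, u: int
) -> Tuple[float, State]:
    """Predict (reward, next_state) at (x, u) from the closest sampled transition."""
    if not sample:
        raise ValueError("the sample must contain at least one transition")
    # Prefer transitions that used the same action; otherwise use all of them.
    same_action = [t for t in sample if t[1] == u]
    candidates = same_action or sample
    # Closest sampled state under the Euclidean distance.
    nearest = min(candidates, key=lambda t: math.dist(t[0], x))
    _, _, reward, next_state = nearest
    return reward, next_state
```

Copying the neighbor's outcome keeps the sketch short; other nearest neighbor variants (for example, shifting the predicted next state by the offset between the query and its neighbor) would fit the same interface.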











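The return over T stages of a sequence of actions u, when starting from an initial state x_0, accumulates the instantaneous rewards collected along the resulting trajectory. The maximal return is the largest return achievable over all sequences of actions.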
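The quantities of this formalization can be written compactly as follows. The symbols are assumed notation for illustration: $f$ for the dynamics, $\rho$ for the reward function, $J^{u}$ for the return and $J^{*}$ for the maximal return. The return is taken here to be an undiscounted sum of rewards, which is also an assumption.

```latex
\begin{align*}
x_{t+1} &= f(x_t, u_t), && t = 0, \ldots, T-1, \quad x_t \in \mathcal{X},\ u_t \in \mathcal{U}, \\
r_t &= \rho(x_t, u_t), \\
J^{u}(x_0) &= \sum_{t=0}^{T-1} \rho(x_t, u_t), && u = (u_0, \ldots, u_{T-1}), \\
J^{*}(x_0) &= \max_{u \in \mathcal{U}^{T}} J^{u}(x_0).
\end{align*}
```

Because the action space is finite, the maximum over the $|\mathcal{U}|^{T}$ action sequences is always attained.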

The goal is to find a sequence of actions whose return is as close as possible to the maximal return.

The system dynamics and the reward function are unknown. They are replaced by a sample of n system transitions.

Problem

Given a sample of system transitions, how can one determine where to sample additional transitions?

Figure: Distribution of the returns of control policies at the end of the sampling process.

Conclusions and future work

Summary

We have proposed a strategy for generating informative batch collections of data. This approach has been empirically validated.

Future work

Extending the approach to more general frameworks.
Investigating theoretical properties.

Acknowledgements

Raphael Fonteneau acknowledges the financial support of the FRIA. Damien Ernst is a research associate of the FRS-FNRS. This paper presents research results of the Belgian Networks BIOMAGNET and DYSCO and the PASCAL2 European Network of Excellence. We also acknowledge financial support from NIH grants P50 DA10075 and R01 MH080015. The scientific responsibility rests with its authors.

Reference

R. Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst. Active exploration by searching for experiments that falsify the computed control policy. IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL 2011), Paris, France, April 11-15, 2011, 8 pages.
