Active Exploration by Searching for Experiments that Falsify the Computed Control Policy
Raphael Fonteneau (a), Susan A. Murphy (b), Louis Wehenkel (a), Damien Ernst (a)
(a) University of Liège, Belgium
(b) University of Michigan, USA
Abstract
We propose a strategy for experiment selection, in the context of reinforcement learning, based on the idea that the most interesting experiments to carry out at some stage are those most likely to falsify the current hypothesis about the optimal control policy. We cast this idea in a setting where a policy learning algorithm and a model identification method are given a priori. Experiments are selected if, using the learned environment model, they are predicted to yield a revision of the learned control policy. Algorithms and simulation results are provided for a deterministic system with a discrete action space; they show that the proposed approach is promising.
Introduction
● Discrete-time optimal control problems arise in many fields (engineering, finance, medicine, artificial intelligence, etc.).
● The performance of their solutions is related to the amount of information available about the system dynamics and the reward function of the optimal control problem.
● In this work, we assume that information about the system must be inferred from trajectories of the system, and that, due to time and cost constraints, only a limited number of trajectories can be generated.
Formalization
● We consider a deterministic discrete-time system whose dynamics over T stages is given by the time-invariant equation
    x_{t+1} = f(x_t, u_t),   t = 0, ..., T-1,
  where all x_t lie in a normed state space X, and all u_t in a finite action space U.
● The transition from time t to t+1 is associated with an instantaneous reward r_t = ρ(x_t, u_t).
● The return over T stages of a sequence of actions u = (u_0, ..., u_{T-1}) when starting from an initial state x_0 is given by
    J^u(x_0) = Σ_{t=0}^{T-1} ρ(x_t, u_t).
● Maximal return: J*(x_0) = max_u J^u(x_0).
● The goal is to find a sequence of actions whose return is as close as possible to the maximal return.
● The system dynamics f and the reward function ρ are unknown; they are replaced by a sample of n system transitions
    F_n = {(x^l, u^l, r^l, y^l)}_{l=1}^{n},
  where r^l = ρ(x^l, u^l) and y^l = f(x^l, u^l).
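For concreteness, here is a minimal Python sketch of this return computation; f and rho stand in for the (unknown) dynamics and reward function, and the function name is ours, not the authors'.

    def compute_return(f, rho, x0, actions):
        """Return of an action sequence over T = len(actions) stages,
        J^u(x0) = sum_t rho(x_t, u_t), for a deterministic system."""
        x, total = x0, 0.0
        for u in actions:
            total += rho(x, u)   # instantaneous reward r_t = rho(x_t, u_t)
            x = f(x, u)          # deterministic transition x_{t+1} = f(x_t, u_t)
        return total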
Problem
● How to generate an informative batch collection of data, so that high-performance control policies can be inferred from this collection?
● Given a sample of system transitions, how could one determine where to sample additional transitions?
● We propose a sequential strategy for choosing, given a batch collection of already sampled transitions, where to sample additional data.
Falsification-based sampling strategy
● We assume that we have access to a predictive model PM of the environment, and to a batch-mode RL algorithm BMRL.
● Using the sample of already collected transitions, we first compute a current control policy with BMRL.
● We uniformly draw a state-action point (x, u), and we compute a predicted transition using PM.
● We add the predicted transition to the current sample, and we compute a predicted control policy.
● If the predicted control policy falsifies the current control policy, then we sample a new (real) transition at (x, u); otherwise, we iterate with a new state-action point (x', u'). A sketch of this loop is given after this list.
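The following Python sketch illustrates one plausible reading of this loop; the helper names (pm, bmrl, sample_real_transition) and the equality test between policies are illustrative assumptions, not the authors' implementation.

    import random

    def falsification_based_sampling(sample, candidate_points, pm, bmrl,
                                     sample_real_transition, budget,
                                     max_tries=1000):
        # pm(x, u, sample) -> predicted (reward, next_state);
        # bmrl(sample) -> policy. Policy equality stands in for the
        # falsification test between current and predicted policies.
        current_policy = bmrl(sample)
        for _ in range(budget):
            for _ in range(max_tries):
                x, u = random.choice(candidate_points)      # uniform draw of (x, u)
                r_hat, y_hat = pm(x, u, sample)             # predicted transition
                predicted_policy = bmrl(sample + [(x, u, r_hat, y_hat)])
                if predicted_policy != current_policy:      # predicted policy revision:
                    sample.append(sample_real_transition(x, u))  # run the real experiment
                    current_policy = bmrl(sample)
                    break
        return sample, current_policy

A real experiment is only spent on a point whose predicted outcome would already change the learned policy, which is the falsification criterion in action.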
Benchmark
● The car-on-the-hill benchmark.
● PM: a nearest-neighbor algorithm (a minimal sketch is given below).
● BMRL: a nearest-neighbor, model-learning RL algorithm.
● We generate 50 databases of 1000 system transitions.
● We evaluate the performance of the inferred control policies on the real system.
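As an illustration of a nearest-neighbor predictive model, here is a minimal sketch, assuming states are real-valued vectors compared by Euclidean distance and actions are compared exactly; the function name is ours.

    import math

    def nearest_neighbor_pm(x, u, sample):
        # Predict the outcome of (x, u) by the reward and next state of the
        # closest already-observed transition with the same (discrete) action.
        best, best_dist = None, math.inf
        for (xl, ul, rl, yl) in sample:
            if ul == u:
                d = math.dist(x, xl)     # Euclidean distance in state space
                if d < best_dist:
                    best, best_dist = (rl, yl), d
        return best                      # predicted (reward, next_state), or None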
Experimental results
● Distribution of the returns of all control policies: uniform sampling strategy vs. falsification-based sampling strategy. [figure]
● Graphical representation of typical runs: uniform sampling strategy vs. falsification-based sampling strategy. [figure]
● Performance analysis: 50 runs of our strategy (blue) are compared with 50 uniform runs (red). [figure]
● Distribution of the returns of control policies at the end of the sampling process. [figure]
Conclusions and future work
Summary
● We have proposed a strategy for generating informative batch collections of data.
● This approach has been empirically validated.

Future works
● Extending the approach to more general frameworks.
● Investigating theoretical properties.
Acknowledgements
Raphael Fonteneau acknowledges the financial support of the FRIA. Damien Ernst is a research associate of the FRS-FNRS. This paper presents research results of the Belgian Networks BIOMAGNET and DYSCO and the PASCAL2 European Network of Excellence. We also acknowledge financial support from NIH grants P50 DA10075 and R01 MH080015. The scientific responsibility rests with its authors.
Reference
R. Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst. Active exploration by searching for experiments that falsify the computed control policy. IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL 2011), Paris, France, April 11-15, 2011, 8 pages.