Active Exploration by Searching for Experiments that Falsify the Computed Control Policy
Raphael Fonteneau*, Susan A. Murphy**, Louis Wehenkel*, Damien Ernst*
* University of Liège, Belgium
** University of Michigan, USA
IEEE ADPRL 2011, Paris, France, April 12th, 2011
Introduction
Reinforcement Learning
[Diagram: the agent interacts with the environment by sending actions and receiving observations and rewards]
● Reinforcement Learning (RL) aims at finding a policy that maximizes the rewards received while interacting with the environment
Batch Mode Reinforcement Learning
● All the available information is contained in a batch collection of data
● Batch mode RL (BMRL) aims at computing a (near-)optimal policy from this collection of data
[Diagram: a finite collection of trajectories of the agent, gathered from the agent-environment interaction, is processed by a BMRL algorithm to produce a (near-)optimal policy]
Problem statement
Problem: How can one generate an informative batch collection of data, so that high-performance control policies can be inferred from this collection?
● We propose a sequential strategy for choosing, given a batch collection of already sampled transitions, where to sample additional data
Formalization
Formalization
● We consider a deterministic discrete-time system whose dynamics over $T$ stages is given by the time-invariant equation
$$x_{t+1} = f(x_t, u_t), \qquad t = 0, \ldots, T-1,$$
where all $x_t$ lie in a normed state space $X$, and all $u_t$ in a finite action space $U$.
● The transition from time $t$ to $t+1$ is associated with an instantaneous reward
$$r_t = \rho(x_t, u_t) \in \mathbb{R}.$$
● The return over $T$ stages of a sequence of actions $(u_0, \ldots, u_{T-1})$ when starting from an initial state $x_0$ is given by
$$J^{(u_0, \ldots, u_{T-1})}(x_0) = \sum_{t=0}^{T-1} \rho(x_t, u_t).$$
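To make the definition concrete, here is a minimal sketch that evaluates this return, assuming the dynamics f and the reward function rho are available as Python callables. In the batch-mode setting they are unknown; the sketch only illustrates the definition:

```python
def compute_return(f, rho, x0, actions):
    """Return over T stages of the action sequence (u_0, ..., u_{T-1})
    starting from x0. Purely illustrative: in batch mode RL, the
    dynamics f and reward rho are not available as callables."""
    x, total = x0, 0.0
    for u in actions:
        total += rho(x, u)  # instantaneous reward r_t = rho(x_t, u_t)
        x = f(x, u)         # deterministic transition x_{t+1} = f(x_t, u_t)
    return total
```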
Formalization
● Maximal return:
$$J^{*}(x_0) = \max_{(u_0, \ldots, u_{T-1}) \in U^{T}} J^{(u_0, \ldots, u_{T-1})}(x_0)$$
● The goal is to find a sequence of actions whose return is as close as possible to the maximal return
● The system dynamics $f$ and the reward function $\rho$ are unknown
● They are replaced by a sample of $n$ system transitions
$$\mathcal{F}_n = \left\{ (x^l, u^l, r^l, y^l) \right\}_{l=1}^{n},$$
where $r^l = \rho(x^l, u^l)$ and $y^l = f(x^l, u^l)$.
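In code, such a sample is simply a list of one-step transitions. A possible representation is sketched below; the names Transition and Sample are illustrative, not from the talk:

```python
from typing import List, NamedTuple, Tuple

class Transition(NamedTuple):
    x: Tuple[float, ...]  # state x^l
    u: int                # action u^l, from the finite action space U
    r: float              # observed reward r^l = rho(x^l, u^l)
    y: Tuple[float, ...]  # observed successor state y^l = f(x^l, u^l)

Sample = List[Transition]  # the sample F_n of n one-step transitions
```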
Formalization
Problem: Given a sample of system transitions $\mathcal{F}_n$, how could one determine where to sample additional transitions?
Sampling strategy: Falsification-based sampling strategy
● We assume that we have access to a predictive model PM of the environment, and to a batch mode RL algorithm BMRL
● Using the sample of already collected transitions, we first compute a control policy
● We uniformly draw a state-action point (x,u), and we compute a predicted transition using PM
● We add the predicted transition to the current sample, and we compute a predicted control policy
● If the predicted control policy falsifies the current control policy, then we sample a new transition at (x,u); otherwise, we iterate with a new state-action point (x',u')
A sketch of this loop is given below.
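The following Python sketch shows one sampling decision of this strategy. Everything here is a placeholder for components assumed in the talk, not the paper's exact definitions: bmrl maps a sample of transitions to a policy, pm maps (sample, x, u) to a predicted (reward, successor) pair, draw_state_action draws a state-action point uniformly, sample_true_transition queries the real system, and the disagreement-based falsification test is a hypothetical criterion.

```python
def falsifies(pi_new, pi_old, test_states):
    """Hypothetical falsification test: the predicted policy disagrees
    with the current policy on at least one test state. The exact
    criterion used in the paper may differ."""
    return any(pi_new(x) != pi_old(x) for x in test_states)


def falsification_step(sample, bmrl, pm, draw_state_action,
                       sample_true_transition, test_states,
                       max_proposals=1000):
    """One sampling decision of the falsification-based strategy
    (illustrative sketch; all arguments are assumed components)."""
    policy = bmrl(sample)                   # current control policy
    for _ in range(max_proposals):
        x, u = draw_state_action()          # candidate state-action point
        r_hat, y_hat = pm(sample, x, u)     # predicted transition
        predicted_policy = bmrl(sample + [(x, u, r_hat, y_hat)])
        if falsifies(predicted_policy, policy, test_states):
            # The prediction changes the computed policy, so the point is
            # informative: sample the true transition on the real system.
            r, y = sample_true_transition(x, u)
            return sample + [(x, u, r, y)]
    return sample  # no falsifying candidate found within the budget
```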
Experimental results
Experimental results: Illustration
● The car-on-the-hill benchmark
● PM: nearest neighbor algorithm (a sketch is given after this list)
● BMRL: a nearest-neighbor, model-learning RL algorithm
● We generate 50 databases of 1000 system transitions
● We evaluate the performance of the inferred control policies on the real system
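As an illustration of a nearest-neighbor predictive model, here is a minimal sketch. Restricting the search to transitions with a matching action and using the Euclidean norm are assumptions here, not details from the talk:

```python
import numpy as np

def nearest_neighbor_pm(sample, x, u):
    """Nearest-neighbor predictive model (illustrative sketch): among the
    collected transitions taken with action u, return the reward and
    successor state of the one whose state is closest to x."""
    candidates = [t for t in sample if t[1] == u]
    if not candidates:
        raise ValueError("no transition with action u in the sample")
    xl, ul, rl, yl = min(
        candidates,
        key=lambda t: np.linalg.norm(np.asarray(t[0]) - np.asarray(x)),
    )
    return rl, yl
```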
Experimental results: Illustration
● Performance analysis: 50 runs of our strategy (blue) are compared with 50 uniform-sampling runs (red)
Experimental results: Illustration
● Distribution of the returns of control policies at the end of the sampling process
Experimental results: Illustration
● Distribution of the returns of all control policies (red: uniform sampling; blue: our strategy)
Sampling strategy: Illustration
● Graphical representation of typical runs
[Figure: two panels of sampled transitions, one for the falsification-based sampling strategy and one for the uniform sampling strategy]
Conclusions & Future work
Conclusions & future work
Summary
● We have proposed a strategy for generating informative batch collections of data
● This approach has been empirically validated on the car-on-the-hill benchmark
Future work
● Extending the approach to more general frameworks
● Investigating the theoretical properties of the strategy