Active Exploration by Searching for Experiments that Falsify the Computed Control Policy
Raphael Fonteneau*, Susan A. Murphy**, Louis Wehenkel*, Damien Ernst*
* University of Liège, Belgium
** University of Michigan, USA
IEEE ADPRL 2011, Paris, France, April 12th, 2011
Introduction
Reinforcement Learning
[Diagram: the agent interacts with the environment by sending actions and receiving observations and rewards]
● Reinforcement Learning (RL) aims at finding a policy that maximizes the rewards received while interacting with the environment
Batch Mode Reinforcement Learning
● All the available information is contained in a batch collection of data
● Batch mode RL (BMRL) aims at computing a (near-)optimal policy from this collection of data
[Diagram: a finite collection of trajectories of the agent, gathered from the agent-environment interaction, is processed by a BMRL algorithm to produce a (near-)optimal policy]
Problem statement
Problem: How can one generate an informative batch collection of data, so that high-performance control policies can be inferred from this collection?
● We propose a sequential strategy for choosing, given a batch collection of already sampled transitions, where to sample additional data
Formalization
Formalization
● We consider a deterministic discrete-time system whose dynamics over $T$ stages is given by the time-invariant equation
$$x_{t+1} = f(x_t, u_t), \qquad t = 0, \ldots, T-1,$$
where all $x_t$ lie in a normed state space $X$, and all $u_t$ in a finite action space $U$.
● The transition from time $t$ to $t+1$ is associated with an instantaneous reward
$$r_t = \rho(x_t, u_t) \in \mathbb{R}.$$
● The return over $T$ stages of a sequence of actions $(u_0, \ldots, u_{T-1})$ when starting from an initial state $x_0$ is given by
$$J^{(u_0, \ldots, u_{T-1})}(x_0) = \sum_{t=0}^{T-1} \rho(x_t, u_t).$$
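To make the definition concrete, here is a minimal sketch that evaluates this return, assuming the dynamics f and the reward function rho are available as Python callables. In the batch-mode setting they are unknown; the sketch only illustrates the definition:

```python
def compute_return(f, rho, x0, actions):
    """Return over T stages of the action sequence (u_0, ..., u_{T-1})
    starting from x0. Purely illustrative: in batch mode RL, the
    dynamics f and reward rho are not available as callables."""
    x, total = x0, 0.0
    for u in actions:
        total += rho(x, u)  # instantaneous reward r_t = rho(x_t, u_t)
        x = f(x, u)         # deterministic transition x_{t+1} = f(x_t, u_t)
    return total
```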
Formalization
● Maximal return:
$$J^{*}(x_0) = \max_{(u_0, \ldots, u_{T-1}) \in U^{T}} J^{(u_0, \ldots, u_{T-1})}(x_0)$$
● The goal is to find a sequence of actions whose return is as close as possible to the maximal return
● The system dynamics $f$ and the reward function $\rho$ are unknown
● They are replaced by a sample of $n$ system transitions
$$\mathcal{F}_n = \left\{ (x^l, u^l, r^l, y^l) \right\}_{l=1}^{n},$$
where $r^l = \rho(x^l, u^l)$ and $y^l = f(x^l, u^l)$.
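In code, such a sample is simply a list of one-step transitions. A possible representation is sketched below; the names Transition and Sample are illustrative, not from the talk:

```python
from typing import List, NamedTuple, Tuple

class Transition(NamedTuple):
    x: Tuple[float, ...]  # state x^l
    u: int                # action u^l, from the finite action space U
    r: float              # observed reward r^l = rho(x^l, u^l)
    y: Tuple[float, ...]  # observed successor state y^l = f(x^l, u^l)

Sample = List[Transition]  # the sample F_n of n one-step transitions
```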
Formalization
Problem: Given a sample of system transitions $\mathcal{F}_n$, how could one determine where to sample additional transitions?
Sampling strategy: Falsification-based sampling strategy
● We assume that we have access to a predictive model PM of the environment, and to a batch mode RL algorithm BMRL
● Using the sample of already collected transitions, we first compute a control policy
● We uniformly draw a state-action point (x,u), and we compute a predicted transition using PM
● We add the predicted transition to the current sample, and we compute a predicted control policy
● If the predicted control policy falsifies the current control policy, then we sample a new transition at (x,u); otherwise, we iterate with a new state-action point (x',u')
A sketch of this loop is given below.
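The following Python sketch shows one sampling decision of this strategy. Everything here is a placeholder for components assumed in the talk, not the paper's exact definitions: bmrl maps a sample of transitions to a policy, pm maps (sample, x, u) to a predicted (reward, successor) pair, draw_state_action draws a state-action point uniformly, sample_true_transition queries the real system, and the disagreement-based falsification test is a hypothetical criterion.

```python
def falsifies(pi_new, pi_old, test_states):
    """Hypothetical falsification test: the predicted policy disagrees
    with the current policy on at least one test state. The exact
    criterion used in the paper may differ."""
    return any(pi_new(x) != pi_old(x) for x in test_states)


def falsification_step(sample, bmrl, pm, draw_state_action,
                       sample_true_transition, test_states,
                       max_proposals=1000):
    """One sampling decision of the falsification-based strategy
    (illustrative sketch; all arguments are assumed components)."""
    policy = bmrl(sample)                   # current control policy
    for _ in range(max_proposals):
        x, u = draw_state_action()          # candidate state-action point
        r_hat, y_hat = pm(sample, x, u)     # predicted transition
        predicted_policy = bmrl(sample + [(x, u, r_hat, y_hat)])
        if falsifies(predicted_policy, policy, test_states):
            # The prediction changes the computed policy, so the point is
            # informative: sample the true transition on the real system.
            r, y = sample_true_transition(x, u)
            return sample + [(x, u, r, y)]
    return sample  # no falsifying candidate found within the budget
```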
Experimental results
Experimental results: Illustration
● The car-on-the-hill benchmark
● PM: nearest neighbor algorithm (a sketch is given after this list)
● BMRL: a nearest-neighbor, model-learning RL algorithm
● We generate 50 databases of 1000 system transitions
● We evaluate the performance of the inferred control policies on the real system
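As an illustration of a nearest-neighbor predictive model, here is a minimal sketch. Restricting the search to transitions with a matching action and using the Euclidean norm are assumptions here, not details from the talk:

```python
import numpy as np

def nearest_neighbor_pm(sample, x, u):
    """Nearest-neighbor predictive model (illustrative sketch): among the
    collected transitions taken with action u, return the reward and
    successor state of the one whose state is closest to x."""
    candidates = [t for t in sample if t[1] == u]
    if not candidates:
        raise ValueError("no transition with action u in the sample")
    xl, ul, rl, yl = min(
        candidates,
        key=lambda t: np.linalg.norm(np.asarray(t[0]) - np.asarray(x)),
    )
    return rl, yl
```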
Experimental results: Illustration
● Performance analysis: 50 runs of our strategy (blue) are compared with 50 uniform-sampling runs (red)
Experimental results: Illustration
● Distribution of the returns of control policies at the end of the sampling process
Experimental results: Illustration
● Distribution of the returns of all control policies (red: uniform sampling; blue: our strategy)
Sampling strategy: Illustration
● Graphical representation of typical runs
[Figure: two panels of sampled transitions, one for the falsification-based sampling strategy and one for the uniform sampling strategy]
Conclusions & Future work
Conclusions & future work
Summary
● We have proposed a strategy for generating informative batch collections of data
● This approach has been empirically validated on the car-on-the-hill benchmark
Future work
● Extending the approach to more general frameworks
● Investigating the theoretical properties of the strategy