Active Exploration by Searching for Experiments that Falsify the Computed Control Policy

Raphael Fonteneau *, Susan A. Murphy **, Louis Wehenkel *, Damien Ernst *

* University of Liège, Belgium

** University of Michigan, USA

IEEE ADPRL 2011, Paris, France

Raphael Fonteneau, April 12th, 2011

Introduction

Reinforcement Learning

[Figure: agent-environment interaction loop (actions; observations, rewards), with examples of rewards]

● Reinforcement Learning (RL) aims at finding a policy maximizing received rewards by interacting with the environment

Batch Mode Reinforcement Learning

● All the available information is contained in a batch collection of data
● Batch mode RL (BMRL) aims at computing a (near-)optimal policy from this collection of data

[Figure: the agent interacts with the environment (actions; observations, rewards); the resulting finite collection of trajectories of the agent is given to BMRL, which outputs a (near-)optimal policy]

Problem statement

● Problem: how can we generate an informative batch collection of data, so that high-performance control policies can be inferred from this collection?
● We propose a sequential strategy for choosing, given a batch collection of already sampled transitions, where to sample additional data

Formalization

● We consider a deterministic discrete-time system whose dynamics over T stages is given by a time-invariant equation, where all states x_t lie in a normed state space X and all actions u_t in a finite action space U
● The transition from time t to t+1 is associated with an instantaneous reward
● The return over T stages of a sequence of actions u, when starting from an initial state x_0, is the sum of the instantaneous rewards collected along the resulting trajectory
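Written out under assumed notation, the quantities described above could read as follows. The symbols $f$ (system dynamics), $\rho$ (reward function) and $J$ (return) are assumptions introduced here for illustration, since the slide's own equations are not available in this transcript.

```latex
\begin{align*}
x_{t+1} &= f(x_t, u_t), \qquad t = 0, 1, \ldots, T-1, \\
r_t &= \rho(x_t, u_t) \in \mathbb{R}, \\
J^{u}(x_0) &= \sum_{t=0}^{T-1} \rho(x_t, u_t),
  \qquad u = (u_0, \ldots, u_{T-1}) \in U^T .
\end{align*}
```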

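● Maximal return: the largest return achievable from the initial state over all sequences of T actions
● The goal is to find a sequence of actions whose return is as close as possible to the maximal return
● The system dynamics and the reward function are unknown
● They are replaced by a sample of n system transitions, each recording a state, an action, the resulting reward and the resulting next state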
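To make the notions of return and maximal return concrete, here is a minimal sketch on a toy one-dimensional system. The functions `f` and `rho`, the toy dynamics and the brute-force enumeration over all action sequences are illustrative assumptions. In the setting of the talk the dynamics and the reward function are unknown, so this computation would only be possible on a model built from the sample of transitions.

```python
from itertools import product


def f(x, u):
    # Hypothetical toy dynamics: move left or right on the real line.
    return x + u


def rho(x, u):
    # Hypothetical toy reward: the closer the next state is to 3, the better.
    return -abs(f(x, u) - 3.0)


def compute_return(x0, actions):
    """Sum of the rewards collected along the trajectory induced by `actions`."""
    x, total = x0, 0.0
    for u in actions:
        total += rho(x, u)
        x = f(x, u)
    return total


def maximal_return(x0, action_space, T):
    """Brute-force maximum over all |U|^T action sequences (small T only)."""
    return max(
        (compute_return(x0, seq), seq) for seq in product(action_space, repeat=T)
    )


best_value, best_sequence = maximal_return(0.0, (-1.0, 1.0), T=3)
```

The enumeration grows exponentially with the horizon T, so it is only meant to illustrate the definition, not to suggest a practical solver.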

Problem

● Given a sample of system transitions, how can one determine where to sample additional transitions?

Sampling strategy

Falsification-based sampling strategy

● We assume that we have access to a predictive model PM of the environment, and to a batch mode RL algorithm BMRL
● Using the sample of already collected transitions, we first compute a control policy
● We uniformly draw a state-action point (x,u), and we compute a predicted transition
● We add the predicted transition to the current sample, and we compute a predicted control policy
● If the predicted control policy falsifies the current control policy, then we sample a new transition at (x,u); otherwise we iterate with a new state-action point (x',u')
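The steps above can be sketched as a loop. This is a minimal illustration under several assumptions that are not stated on the slides: the toy `real_system`, the one-dimensional state space, the constants `ACTIONS`, `X0` and `T`, a policy represented as an open-loop sequence of actions, "falsifies" read as "differs from", and the `max_tries` fallback are all hypothetical. The `predictive_model` and `bmrl` functions are simple nearest-neighbor stand-ins, not the authors' implementations.

```python
import random
from itertools import product

ACTIONS = (-1.0, 1.0)  # hypothetical finite action space U
X0, T = 0.0, 3         # hypothetical initial state and horizon


def real_system(x, u):
    """Toy stand-in for the real environment: returns (reward, next state)."""
    y = max(-5.0, min(5.0, x + u))
    return -abs(y - 3.0), y


def predictive_model(sample, x, u):
    """PM stand-in: 1-nearest-neighbor prediction of (reward, next state)."""
    same_action = [t for t in sample if t[1] == u] or sample
    _, _, r, y = min(same_action, key=lambda t: abs(t[0] - x))
    return r, y


def bmrl(sample):
    """BMRL stand-in: best open-loop action sequence on the learned model."""
    def model_return(seq):
        x, total = X0, 0.0
        for u in seq:
            r, x = predictive_model(sample, x, u)
            total += r
        return total
    return max(product(ACTIONS, repeat=T), key=model_return)


def falsification_sampling(sample, n_new, rng, max_tries=1000):
    """Grow `sample` by n_new real transitions chosen by the falsification test."""
    sample = list(sample)
    for _ in range(n_new):
        policy = bmrl(sample)  # current control policy
        for _ in range(max_tries):
            # Uniformly draw a state-action point and predict its transition.
            x, u = rng.uniform(-5.0, 5.0), rng.choice(ACTIONS)
            r_hat, y_hat = predictive_model(sample, x, u)
            predicted_policy = bmrl(sample + [(x, u, r_hat, y_hat)])
            if predicted_policy != policy:  # current policy is falsified
                break
        # Assumption: if no falsifying point was found, keep the last draw.
        r, y = real_system(x, u)
        sample.append((x, u, r, y))
    return sample


rng = random.Random(0)
initial = [(x, u) + real_system(x, u) for x in (-4.0, 0.0, 4.0) for u in ACTIONS]
grown = falsification_sampling(initial, n_new=5, rng=rng, max_tries=50)
```

Only the transitions appended to the sample come from the real system; predicted transitions are used solely to decide where to sample and are then discarded.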

Experimental results

Illustration

● The car-on-the-hill benchmark
● PM: nearest neighbor algorithm
● BMRL: nearest neighbor model learning RL algorithm
● We generate 50 databases of 1000 system transitions
● We evaluate the performance of the inferred control policies on the real system

● Performance analysis: 50 runs of our strategy (blue) are compared with 50 uniform runs (red)
● Distribution of the returns of control policies at the end of the sampling process
● Distribution of the returns of all control policies (red: uniform; blue: our strategy)
● Graphical representation of typical runs: falsification-based sampling strategy and uniform sampling strategy

Conclusions & Future Work

Summary

● We have proposed a strategy for generating informative batch collections of data
● This approach has been empirically validated on the car-on-the-hill benchmark

Future work

● Extending the approach to more general frameworks
● Investigating theoretical properties
