Dynamic Treatment Regimes using Reinforcement Learning: A Cautious Generalization Approach

Raphael Fonteneau ∗, Susan Murphy †, Louis Wehenkel ∗, Damien Ernst ∗

∗ Dept. of Electrical Engineering and Computer Science, University of Liège, Belgium
† Dept. of Statistics, University of Michigan, USA

ABSTRACT

The treatment of chronic-like illnesses such as HIV infection, cancer or chronic depression implies long-lasting treatments that can be associated with low-quality outcomes, painful side effects and high costs. To enhance these treatments, clinicians often adopt what we call Dynamic Treatment Regimes (DTRs). DTRs are sets of sequential decision rules defining what actions should be taken at a specific instant to treat a patient based on information observed up to that instant. Over the past few years, a growing research community has been working on the development of formal methods (mainly drawn from mathematics, statistics and control theory) for inferring high-quality DTRs from clinical data. Within this framework, we propose a consistent algorithm of quadratic complexity [3] that infers from clinical data a sequence of treatment actions by maximizing a recently proposed lower bound on the return depending on the initial state [2]. The algorithm (called CGRL for Cautious Generalization for Reinforcement Learning) has cautious generalization properties, i.e. it avoids taking treatment actions for which the sample of clinical data is too sparse to generalize safely.

3 LOWER BOUND ON THE RETURN OF A GIVEN SEQUENCE OF ACTIONS

Lemma 3.1 Let $u_0, \ldots, u_{T-1}$ be a sequence of actions. Let $\tau = [(x^{l_t}, u^{l_t}, r^{l_t}, y^{l_t})]_{t=0}^{T-1} \in \mathcal{F}^T_{u_0,\ldots,u_{T-1}}$, where $\mathcal{F}^T_{u_0,\ldots,u_{T-1}}$ is the set of all sequences of one-step system transitions $[(x^{l_0}, u^{l_0}, r^{l_0}, y^{l_0}), \ldots, (x^{l_{T-1}}, u^{l_{T-1}}, r^{l_{T-1}}, y^{l_{T-1}})]$ for which $u^{l_t} = u_t$, $\forall t \in \{0, \ldots, T-1\}$. Then

$$J^{u_0,\ldots,u_{T-1}}(x) \ge B(\tau, x) ,$$

with

$$B(\tau, x) = \sum_{t=0}^{T-1} \left[ r^{l_t} - L_{Q_{T-t}} \, \| y^{l_{t-1}} - x^{l_t} \| \right] , \qquad y^{l_{-1}} = x , \qquad L_{Q_{T-t}} = L_\rho \sum_{i=0}^{T-t-1} (L_f)^i .$$
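The bound of Lemma 3.1 can be evaluated directly from a chosen sequence of transitions. The following is a minimal sketch, not the authors' implementation: the function names `lipschitz_q_constant` and `lower_bound`, the tuple layout `(x, u, r, y)`, and the use of the Euclidean norm are assumptions made for illustration (the poster only requires a normed state space).

```python
import math
from typing import List, Sequence, Tuple

# One-step transition (x, u, r, y): state, action, reward, successor state.
# The tuple layout is an assumption made for this sketch.
Transition = Tuple[Sequence[float], int, float, Sequence[float]]


def lipschitz_q_constant(n: int, l_f: float, l_rho: float) -> float:
    """Return L_{Q_n} = L_rho * sum_{i=0}^{n-1} (L_f)^i."""
    return l_rho * sum(l_f ** i for i in range(n))


def lower_bound(tau: List[Transition], x0: Sequence[float],
                l_f: float, l_rho: float) -> float:
    """Return B(tau, x0) as defined in Lemma 3.1.

    Each term is the observed reward minus a penalty proportional to the
    gap between the previous successor state and the current start state.
    The Euclidean norm is assumed here.
    """
    horizon = len(tau)
    bound = 0.0
    previous_y = x0  # y^{l_{-1}} = x by convention
    for t, (x, _u, r, y) in enumerate(tau):
        gap = math.dist(previous_y, x)
        bound += r - lipschitz_q_constant(horizon - t, l_f, l_rho) * gap
        previous_y = y
    return bound


if __name__ == "__main__":
    # Hypothetical 1-D transitions with L_f = L_rho = 1 and T = 2.
    tau = [((0.1,), 0, 1.0, (0.5,)), ((0.6,), 1, 2.0, (1.0,))]
    print(lower_bound(tau, (0.0,), l_f=1.0, l_rho=1.0))
```

When the transitions chain exactly (each start state equals the previous successor state, and the first start state equals $x$), every penalty term vanishes and the bound reduces to the sum of observed rewards.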

Fig. 1: A graphical interpretation of the different terms composing the bound on $J^{u_0,\ldots,u_{T-1}}(x)$ computed from a sequence of one-step transitions.

Definition 3.2 (Highest lower bound for $u_0, \ldots, u_{T-1}$)

$$B^{u_0,\ldots,u_{T-1}}(x) = \max_{\tau \in \mathcal{F}^T_{u_0,\ldots,u_{T-1}}} B(\tau, x) .$$

Definition 3.3 (Sample sparsity of $\mathcal{F}$) For $\mathcal{X}$ bounded, let $\mathcal{F}_a = \{(x^l, u^l, r^l, y^l) \in \mathcal{F} \mid u^l = a\}$. Then

$$\exists\, \alpha \in \mathbb{R}^+ : \forall a \in \mathcal{U} , \quad \sup_{x' \in \mathcal{X}} \; \min_{(x^l, u^l, r^l, y^l) \in \mathcal{F}_a} \| x^l - x' \| \le \alpha . \tag{1}$$

The smallest $\alpha$ which satisfies equation (1) is named the sample sparsity and is denoted by $\alpha^*$.

Theorem 3.4 (Tightness of highest lower bound)

$$\exists\, C > 0 : \forall (u_0, \ldots, u_{T-1}) \in \mathcal{U}^T , \quad J^{u_0,\ldots,u_{T-1}}(x) - B^{u_0,\ldots,u_{T-1}}(x) \le C \alpha^* .$$

1 PROBLEM STATEMENT

• Discrete-time system dynamics over $T$ stages:

$$x_{t+1} = f(x_t, u_t) , \qquad t = 0, 1, \ldots, T-1 ,$$

where for all $t$, the state $x_t$ is an element of the normed vector state space $\mathcal{X}$ and $u_t$ is an element of the finite (discrete) action space $\mathcal{U}$,

• An instantaneous reward $r_t = \rho(x_t, u_t) \in \mathbb{R}$ is associated with the action $u_t$ taken while being in state $x_t$,

• The system dynamics $f$ and the reward function $\rho$ are unknown,

• The system dynamics $f$ and the reward function $\rho$ are Lipschitz continuous, i.e. there exist finite constants $L_f, L_\rho \in \mathbb{R}$ such that: $\forall x, x' \in \mathcal{X}$, $\forall u \in \mathcal{U}$,

$$\| f(x, u) - f(x', u) \| \le L_f \| x - x' \| , \qquad | \rho(x, u) - \rho(x', u) | \le L_\rho \| x - x' \| ,$$

• Two constants $L_f$ and $L_\rho$ satisfying the above-written inequalities are known,

• Data: a set of one-step transitions $\mathcal{F} = \{(x^l, u^l, r^l, y^l)\}_{l=1}^{|\mathcal{F}|}$ where each one-step transition is such that $y^l = f(x^l, u^l)$ and $r^l = \rho(x^l, u^l)$,

• Each action $a \in \mathcal{U}$ appears at least once in $\mathcal{F}$: $\forall a \in \mathcal{U}$, $\exists (x, u, r, y) \in \mathcal{F} : u = a$,

• For every initial state $x$, the return over $T$ stages of a sequence of actions $(u_0, \ldots, u_{T-1}) \in \mathcal{U}^T$ is defined as

$$J^{u_0,\ldots,u_{T-1}}(x) = \sum_{t=0}^{T-1} \rho(x_t, u_t) , \qquad x_0 = x ,$$

• An optimal sequence of actions $u^*_0(x), \ldots, u^*_{T-1}(x)$ is such that

$$J^{u^*_0(x),\ldots,u^*_{T-1}(x)}(x) = J^*(x) = \max_{(u_0,\ldots,u_{T-1}) \in \mathcal{U}^T} J^{u_0,\ldots,u_{T-1}}(x) .$$

2 OBJECTIVE

• The goal is to compute, for any initial state $x \in \mathcal{X}$, a sequence of actions $(\hat{u}^*_0(x), \ldots, \hat{u}^*_{T-1}(x)) \in \mathcal{U}^T$ such that $J^{\hat{u}^*_0(x),\ldots,\hat{u}^*_{T-1}(x)}(x)$ is as close as possible to $J^*(x)$.

4 THE CGRL ALGORITHM

• The CGRL algorithm computes for each initial state $x$ a sequence of actions $\hat{u}^*_0(x), \ldots, \hat{u}^*_{T-1}(x)$ that belongs to $\mathcal{B}^*(x)$, where

$$\mathcal{B}^*(x) = \Big\{ (u_0, \ldots, u_{T-1}) \in \mathcal{U}^T \;\Big|\; B^{u_0,\ldots,u_{T-1}}(x) = \max_{(u'_0,\ldots,u'_{T-1}) \in \mathcal{U}^T} B^{u'_0,\ldots,u'_{T-1}}(x) \Big\} .$$

• Finding an element of $\mathcal{B}^*(x)$ can be reformulated as a shortest path problem (see Figure 2).

Fig. 2: A graphical interpretation of the CGRL algorithm (notice that $n = |\mathcal{F}|$)

5 CONSISTENCY

Theorem 5.1 (Consistency of CGRL algorithm) Let

$$\mathcal{J}^*(x) = \{ (u_0, \ldots, u_{T-1}) \in \mathcal{U}^T \mid J^{u_0,\ldots,u_{T-1}}(x) = J^*(x) \} ,$$

and let us suppose that $\mathcal{J}^*(x) \ne \mathcal{U}^T$ (if $\mathcal{J}^*(x) = \mathcal{U}^T$, the search for an optimal sequence of actions is indeed trivial). We define

$$\epsilon(x) = \min_{(u_0,\ldots,u_{T-1}) \in \mathcal{U}^T \setminus \mathcal{J}^*(x)} \{ J^*(x) - J^{u_0,\ldots,u_{T-1}}(x) \} .$$

Then

$$C \alpha^* < \epsilon(x) \implies (\hat{u}^*_0(x), \ldots, \hat{u}^*_{T-1}(x)) \in \mathcal{J}^*(x) .$$

6 PRELIMINARY VALIDATION

The puddle world benchmark

The CGRL algorithm is compared with the Fitted Q Iteration (FQI) algorithm [1] on two samples F1 (“normal” sample) and F2 (no information about the puddle).

Fig. 3: CGRL with F1.

Fig. 4: FQI with F1.

Fig. 5: CGRL with F2.

Fig. 6: FQI with F2.

HIV infection

Database generation: On average, a patient fails to take his antiretroviral therapy once every eight days. CGRL is run on the trajectory generated by this patient.

Fig. 7: Treatment evolution for generating the database

Fig. 8: Treatment evolution computed by the CGRL algorithm

7 FUTURE WORK

• Derivation of the CGRL algorithm to address the exploitation / exploration tradeoff,

• Extension of the CGRL algorithm to a stochastic framework / on-line learning framework,

• Selecting concise sets of transitions.

Acknowledgement This paper presents research results of the Belgian Network BIOMAGNET (Bioinformatics and Modeling: from Genomes to Networks), funded by the Interuniversity Attraction Poles Programme, initiated by the Belgian State, Science Policy Office. RF acknowledges the financial support of the FRIA. DE is a research associate of the FRS-FNRS. We also acknowledge financial support from NIH grants P50 DA10075 and R01 MH080015. The scientific responsibility rests with its authors.

References

[1] D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503–556, 2005.

[2] R. Fonteneau, S. Murphy, L. Wehenkel, and D. Ernst. Inferring bounds on the performance of a control policy from a sample of trajectories. In Proceedings of the 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (IEEE ADPRL 09), Nashville, TN, USA, 2009.

[3] R. Fonteneau, S.A. Murphy, L. Wehenkel, and D. Ernst. A cautious approach to generalization in reinforcement learning. In Proceedings of the Second International Conference on Agents and Artificial Intelligence (ICAART 2010), Valencia, Spain, 2010.
