University of Liège – Montefiore Institute
Variable selection for Dynamic Treatment Regimes (DTR): a Reinforcement Learning approach
Raphael Fonteneau, Louis Wehenkel and Damien Ernst
Department of Electrical Engineering and Computer Science University of Liège
EWRL 2008, Villeneuve d'Ascq, July 1st, 2008
Outline
> Dynamic Treatment Regimes
  ● Introduction
  ● An example: the Nefazodone CBASP trial
  ● Many difficulties
> Mathematical approach
  ● Problem formulation
  ● Dynamic programming
  ● Approach for solving the inference problem
  ● Algorithm
  ● Validation: the ''Car on the hill'' problem
> Conclusion and future work
Introduction
> Chronic diseases need long-term treatments
> Dynamic Treatment Regimes (DTR): treatments are operationalized as series of decisions specifying how treatment level and type should vary over time
> Nowadays, DTR are based on clinical judgment and medical instinct rather than on formal, systematic data-driven processes
> Over the last ten years, however, a research field has emerged that specifically addresses the problem of inferring DTR from clinical data
> We propose an approach to address the problem of variable selection.
An example: the Nefazodone CBASP trial
> A randomized controlled trial set up to determine an optimal DTR for chronic depression
> More than 60 variables: gender, racial category, marital status, body mass index, current medication, depression, number of depressive episodes, alcohol, drug, ...
  1. Gender
  2. Racial category
  3-4. Marital status
  5. Body mass index
  6. Age in years at screening
  7. Treated current depression
  8. Medication current depression
  9. Psychotherapy current depression
  10. Treated past depression
  11. Medication past depression
  12. Psychotherapy past depression
  ...
> 681 patients, 12 weeks of treatment, 3 types of treatments:
  ● Nefazodone (200, 300, 400, 500, then 600 mg per day until the end)
  ● Cognitive behavioral-analysis system of psychotherapy (16 to 20 sessions): twice-weekly sessions (weeks 1 to 4, or 8 if problems arise), weekly sessions (weeks 5 to 12)
  ● Both
> Tests are performed at time t to evaluate the state of the patient, yielding a reward rt
Many difficulties
> Preference elicitation: finding a criterion that assesses the ''well-being'' of patients
> Confounding issues: in the Nefazodone CBASP trial, experiments are highly sensitive to the environment
> Inference problem
> Selecting a concise set of variables for representing the Dynamic Treatment Regime, since a policy defined on more than 60 variables is not convenient.
Problem formulation (I)
> This problem can be seen as a discrete-time control problem:

  x_{t+1} = f(x_t, u_t, w_t, t)

> State: x_t ∈ X (the state of the patient)
> Actions: u_t ∈ U
> The transition from t to t+1 is associated with an instantaneous reward signal r_t = r(x_t, u_t, w_t, t), where r is the (real-valued) reward function, bounded by B_r
> Disturbances: w_t ∈ W (the disturbance space), where w_t is generated by the probability distribution P_w(w | x, u, t).
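As a sketch, the discrete-time formulation above can be written as a minimal Python interface. The dynamics and reward below are placeholders chosen for illustration only (they are not the clinical model); only the signatures mirror the formulation:

```python
import random

def f(x, u, w, t):
    """Placeholder system dynamics: next state from state x,
    action (treatment) u and disturbance w at time t."""
    return [xi + u + w for xi in x]

def r(x, u, w, t):
    """Placeholder reward signal, clipped so it stays bounded (B_r = 1)."""
    return max(-1.0, min(1.0, -abs(x[0]) + u * w))

def step(x, u, t):
    """One transition of the system x_{t+1} = f(x_t, u_t, w_t, t)."""
    w = random.uniform(-0.1, 0.1)   # w_t drawn from P_w(. | x, u, t)
    return f(x, u, w, t), r(x, u, w, t)
```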
Problem formulation (II)
> The goal is to find a policy π_T(t, x) : {0, ..., T−1} × X → U that maximises the expected rewards obtained over a time horizon T:

  J^{π_T}(x) = E_w [ ∑_{t=0}^{T−1} r(x_t, π_T(t, x_t), w_t, t) | x_0 = x ]
> The ''system dynamics'' f is unknown and replaced by an ensemble F of trajectories:

  (x_0^1, u_0^1, r_0^1, x_1^1, ..., x_{T−1}^1, u_{T−1}^1, r_{T−1}^1, x_T^1),
  (x_0^2, u_0^2, r_0^2, x_1^2, ..., x_{T−1}^2, u_{T−1}^2, r_{T−1}^2, x_T^2),
  ...
  (x_0^p, u_0^p, r_0^p, x_1^p, ..., x_{T−1}^p, u_{T−1}^p, r_{T−1}^p, x_T^p)
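Fitted Q iteration, introduced later, consumes this ensemble as a set of four-tuples (x_t, u_t, r_t, x_{t+1}). A minimal helper for flattening the trajectories (the data layout assumed here, one list of (x, u, r) steps followed by the terminal state, is illustrative):

```python
def trajectories_to_fourtuples(trajectories):
    """Flatten p trajectories (x_0, u_0, r_0, x_1, ..., x_T) into
    the set F of four-tuples (x_t, u_t, r_t, x_{t+1})."""
    F = []
    for traj in trajectories:
        # traj = [(x_0, u_0, r_0), ..., (x_{T-1}, u_{T-1}, r_{T-1}), x_T]
        *steps, x_T = traj
        states = [x for (x, _, _) in steps] + [x_T]
        for t, (x, u, rew) in enumerate(steps):
            F.append((x, u, rew, states[t + 1]))
    return F
```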
Dynamic programming
> Let us define recursively the sequence of Q_N functions:

  Q_N(x, u) = E_w [ r(x, u, w, t) + max_{u' ∈ U} Q_{N−1}(f(x, u, w, t), u') ],  with Q_0 ≡ 0

> The policy π_T* defined by:

  ∀ t ∈ {0, 1, ..., T−1}, ∀ x ∈ X :  π_T*(t, x) = argmax_{u ∈ U} Q_{T−t}(x, u)

is a T-step optimal policy.
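For finite state, action and disturbance spaces (and dropping the t argument of f and r for brevity), the Q_N recursion can be sketched directly, assuming a uniform distribution over disturbances:

```python
def q_iteration(states, actions, disturbances, f, r, T):
    """Compute Q_T by the recursion
    Q_N(x, u) = E_w[ r(x, u, w) + max_u' Q_{N-1}(f(x, u, w), u') ],
    starting from Q_0 = 0, for finite spaces and uniform disturbances."""
    Q_prev = {(x, u): 0.0 for x in states for u in actions}   # Q_0
    for _ in range(T):
        Q = {}
        for x in states:
            for u in actions:
                total = 0.0
                for w in disturbances:   # uniform expectation over w
                    x_next = f(x, u, w)
                    total += r(x, u, w) + max(Q_prev[(x_next, a)]
                                              for a in actions)
                Q[(x, u)] = total / len(disturbances)
        Q_prev = Q
    return Q_prev   # the greedy policy takes argmax_u Q_{T-t}(x, u)
```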
Approach for solving the inference problem
> An algorithm called the fitted Q iteration algorithm computes the successive Q_N functions from the ensemble of trajectories
> This algorithm performs particularly well when the Q_N functions are approximated using tree-based supervised learning methods
> Problem: how can we obtain a good policy defined on a small subset of variables?
> Approach:
  (1) Run the fitted Q iteration algorithm with trees
  (2) Compute the variance reduction associated with each variable
  (3) Rerun the fitted Q iteration algorithm, considering states made only of the components leading to the highest variance reduction.
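A minimal sketch of fitted Q iteration on four-tuples (x, u, r, x'), assuming scikit-learn's ExtraTreesRegressor as the tree-based learner; the helper name and the input encoding (state concatenated with action) are illustrative choices, not the authors' implementation:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(F, actions, T, **tree_kwargs):
    """Fitted Q iteration on a set F of four-tuples (x, u, r, x_next),
    approximating each Q_N by a tree ensemble fit on (x, u) -> target.
    Returns the list of fitted models [Q_1, ..., Q_T]."""
    X = np.array([np.append(x, u) for (x, u, _, _) in F])   # (x, u) inputs
    rewards = np.array([rew for (_, _, rew, _) in F])
    next_states = np.array([x_next for (_, _, _, x_next) in F])
    models, y = [], rewards.copy()                          # Q_1 targets: r
    for _ in range(T):
        model = ExtraTreesRegressor(**tree_kwargs).fit(X, y)
        models.append(model)
        # next targets: r + max_u' Q_N(x_next, u')
        q_next = np.column_stack([
            model.predict(np.column_stack([next_states,
                                           np.full(len(F), u)]))
            for u in actions])
        y = rewards + q_next.max(axis=1)
    return models
```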
Algorithm
(1) Compute the Q_N functions (from N = 1 to N = T) by running the fitted Q iteration algorithm on the set of four-tuples extracted from F
(2) Compute the relevance of the different attributes x^i using the score:

  S(x^i) = [ ∑_{N=1}^{T} ∑_{τ ∈ Q_N} ∑_{ν ∈ τ} δ(x^i, ν) · Δvar(ν) · |ν| ] / [ ∑_{N=1}^{T} ∑_{τ ∈ Q_N} ∑_{ν ∈ τ} Δvar(ν) · |ν| ]

where:
  Δvar(ν) = var(ν) − (|ν_L| / |ν|) · var(ν_L) − (|ν_R| / |ν|) · var(ν_R) is the variance reduction when splitting the node ν of tree τ
  |ν| is the size of the subset before the splitting of ν
  δ(x^i, ν) = 1 if x^i is used to split ν, and 0 otherwise
(3) Rerun the fitted Q iteration algorithm on the ''best attributes''.
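One possible way to compute such a score from fitted scikit-learn tree ensembles, reading the per-node impurities (variances) and sample counts from their internal `tree_` arrays; a sketch under that assumption, not necessarily the paper's own implementation:

```python
import numpy as np

def variable_scores(models, n_features):
    """Score for each state feature i: total (variance reduction x node
    size) credited to splits on feature i, summed over all trees of all
    Q_N ensembles, normalised by the same total over all splits."""
    num = np.zeros(n_features)
    den = 0.0
    for ensemble in models:                    # one ensemble per Q_N
        for est in ensemble.estimators_:
            tree = est.tree_
            for v in range(tree.node_count):
                left, right = tree.children_left[v], tree.children_right[v]
                if left == -1:                 # leaf node: no split
                    continue
                n, nl, nr = (tree.n_node_samples[v],
                             tree.n_node_samples[left],
                             tree.n_node_samples[right])
                dvar = (tree.impurity[v]       # MSE impurity = variance
                        - nl / n * tree.impurity[left]
                        - nr / n * tree.impurity[right])
                contrib = dvar * n             # dvar(v) * |v|
                den += contrib
                feat = tree.feature[v]
                if feat < n_features:          # skip splits on the action column
                    num[feat] += contrib
    return num / den
```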
Validation: ''Car on the hill'' problem (I)
> A car, represented by a point, is riding on a slope
> Problem: starting from the lowest point, the car has to reach the top of the hill using only the values −4 and 4 for u, in a minimum number of iterations, and without going too fast

[Figure: profile of the hill, with the car subject to the reaction force R, the gravity force mg and the control force u]

> x = (position, speed) = (x1, x2)
> Originally, the problem is deterministic
> We have added to the original state some non-informative components x3, x4, x5 to set up an experimental protocol.
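The experimental protocol above, appending non-informative components drawn uniformly from [−2, 2] to the informative state, can be sketched as (the helper name is illustrative):

```python
import random

def augment_state(x, n_noise=3):
    """Append n_noise non-informative components, each drawn uniformly
    from [-2, 2], to the informative state (position, speed)."""
    return list(x) + [random.uniform(-2.0, 2.0) for _ in range(n_noise)]
```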
Validation: ''Car on the hill'' problem (II)
> Results: variable relevance scores

  Subset size | nb of irrelevant variables | position (x1) | speed (x2) | Rand[-2,2] (x3) | Rand[-2,2] (x4) | Rand[-2,2] (x5)
     5000     |             0              |      0.24     |    0.35    |        /        |        /        |        /
     5000     |             0              |      0.27     |    0.30    |       0.08      |        /        |        /
     5000     |             0              |      0.16     |    0.26    |       0.12      |       0.06      |        /
     5000     |             0              |      0.15     |    0.18    |       0.07      |       0.07      |       0.09
    10000     |             1              |      0.16     |    0.34    |       0.09      |        /        |        /
    10000     |             1              |      0.20     |    0.19    |       0.08      |       0.12      |        /
    10000     |             1              |      0.15     |    0.31    |       0.05      |       0.05      |       0.06
    20000     |             2              |      0.18     |    0.27    |       0.10      |        /        |        /
    20000     |             2              |      0.15     |    0.24    |       0.08      |       0.10      |        /
    20000     |             2              |      0.15     |    0.21    |       0.08      |       0.08      |       0.07
Conclusion and future work
> A simple method of variable selection for reinforcement learning problems
> Incorporation of variable selection into the fitted Q iteration algorithm: the possibility to compute a policy depending only on the most informative variables
> Application to clinical data
> Could this process also help in designing algorithms with better inference capabilities?