University of Liège – Montefiore Institute
Variable selection for Dynamic Treatment Regimes (DTR): a Reinforcement Learning approach
Raphael Fonteneau, Louis Wehenkel and Damien Ernst
Department of Electrical Engineering and Computer Science University of Liège
EWRL 2008, Villeneuve d'Ascq, July 1st, 2008
Outline
> Dynamic Treatment Regimes
  ● Introduction
  ● An example: the Nefazodone CBASP trial
  ● Many difficulties
> Mathematical approach
  ● Problem formulation
  ● Dynamic programming
  ● Approach for solving the inference problem
  ● Algorithm
  ● Validation: the ''Car on the hill'' problem
> Conclusion and future work
Introduction
> Chronic diseases need long-term treatments
> Dynamic Treatment Regimes (DTR): treatments are operationalized as series of decisions specifying how treatment level and type should vary over time
> Nowadays, DTR are based on clinical judgment and medical instinct rather than on formal, systematic data-driven processes
> Over the last ten years, however, a research field has emerged that specifically addresses the problem of inferring DTR from clinical data
> We propose an approach to address the problem of variable selection.
An example: the Nefazodone CBASP trial
> A randomized controlled trial set up to determine an optimal DTR for chronic depression
> More than 60 variables: gender, racial category, marital status, body mass index, current medication, depression, number of depressive episodes, alcohol, drug, ...
  1. Gender
  2. Racial category
  3-4. Marital status
  5. Body mass index
  6. Age in years at screening
  7. Treated current depression
  8. Medication current depression
  9. Psychotherapy current depression
  10. Treated past depression
  11. Medication past depression
  12. Psychotherapy past depression
  ...
> 681 patients, 12 weeks of treatment, 3 types of treatments:
  ● Nefazodone (200, 300, 400, 500, then 600 mg per day until the end)
  ● Cognitive behavioral-analysis system of psychotherapy (16 to 20 sessions): twice-weekly sessions (weeks 1 to 4, or 8 if problems arise), weekly sessions (weeks 5 to 12)
  ● Both
> Tests are performed at time t to evaluate the state of the patient, yielding a reward rt
Many difficulties
> Preference elicitation: finding a criterion that assesses the ''well-being'' of patients
> Confounding issues: in the Nefazodone CBASP trial, experiments are highly sensitive to the environment
> Inference problem
> Selecting a concise set of variables for representing the Dynamic Treatment Regime, since a policy defined on more than 60 variables is not convenient.
Problem formulation (I)
> This problem can be seen as a discrete-time control problem:

  x_{t+1} = f(x_t, u_t, w_t, t)

> State: x_t ∈ X (the state of the patient)
> Actions: u_t ∈ U
> The transition from t to t+1 is associated with an instantaneous reward signal r_t = r(x_t, u_t, w_t, t), where r is the (real-valued) reward function, bounded by B_r
> Disturbances: w_t ∈ W (the disturbance space), where w_t is generated by the probability distribution P_w(w | x, u, t).
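As a sketch, the discrete-time formulation above can be written as a minimal Python interface. The dynamics and reward below are placeholders chosen for illustration only (they are not the clinical model); only the signatures mirror the formulation:

```python
import random

def f(x, u, w, t):
    """Placeholder system dynamics: next state from state x,
    action (treatment) u and disturbance w at time t."""
    return [xi + u + w for xi in x]

def r(x, u, w, t):
    """Placeholder reward signal, clipped so it stays bounded (B_r = 1)."""
    return max(-1.0, min(1.0, -abs(x[0]) + u * w))

def step(x, u, t):
    """One transition of the system x_{t+1} = f(x_t, u_t, w_t, t)."""
    w = random.uniform(-0.1, 0.1)   # w_t drawn from P_w(. | x, u, t)
    return f(x, u, w, t), r(x, u, w, t)
```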
Problem formulation (II)
> The goal is to find a policy π_T(t, x) : {0, ..., T−1} × X → U that maximises the expected rewards obtained over a time horizon T:

  J^{π_T}(x) = E_w [ ∑_{t=0}^{T−1} r(x_t, π_T(t, x_t), w_t, t) | x_0 = x ]
> The ''system dynamics'' f is unknown and replaced by an ensemble F of trajectories:

  (x_0^1, u_0^1, r_0^1, x_1^1, ..., x_{T−1}^1, u_{T−1}^1, r_{T−1}^1, x_T^1),
  (x_0^2, u_0^2, r_0^2, x_1^2, ..., x_{T−1}^2, u_{T−1}^2, r_{T−1}^2, x_T^2),
  ...
  (x_0^p, u_0^p, r_0^p, x_1^p, ..., x_{T−1}^p, u_{T−1}^p, r_{T−1}^p, x_T^p)
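Fitted Q iteration, introduced later, consumes this ensemble as a set of four-tuples (x_t, u_t, r_t, x_{t+1}). A minimal helper for flattening the trajectories (the data layout assumed here, one list of (x, u, r) steps followed by the terminal state, is illustrative):

```python
def trajectories_to_fourtuples(trajectories):
    """Flatten p trajectories (x_0, u_0, r_0, x_1, ..., x_T) into
    the set F of four-tuples (x_t, u_t, r_t, x_{t+1})."""
    F = []
    for traj in trajectories:
        # traj = [(x_0, u_0, r_0), ..., (x_{T-1}, u_{T-1}, r_{T-1}), x_T]
        *steps, x_T = traj
        states = [x for (x, _, _) in steps] + [x_T]
        for t, (x, u, rew) in enumerate(steps):
            F.append((x, u, rew, states[t + 1]))
    return F
```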
Dynamic programming
> Let us define recursively the sequence of Q_N functions:

  Q_N(x, u) = E_w [ r(x, u, w, t) + max_{u' ∈ U} Q_{N−1}(f(x, u, w, t), u') ],  with Q_0 ≡ 0

> The policy π_T* defined by:

  ∀ t ∈ {0, 1, ..., T−1}, ∀ x ∈ X :  π_T*(t, x) = argmax_{u ∈ U} Q_{T−t}(x, u)

is a T-step optimal policy.
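For finite state, action and disturbance spaces (and dropping the t argument of f and r for brevity), the Q_N recursion can be sketched directly, assuming a uniform distribution over disturbances:

```python
def q_iteration(states, actions, disturbances, f, r, T):
    """Compute Q_T by the recursion
    Q_N(x, u) = E_w[ r(x, u, w) + max_u' Q_{N-1}(f(x, u, w), u') ],
    starting from Q_0 = 0, for finite spaces and uniform disturbances."""
    Q_prev = {(x, u): 0.0 for x in states for u in actions}   # Q_0
    for _ in range(T):
        Q = {}
        for x in states:
            for u in actions:
                total = 0.0
                for w in disturbances:   # uniform expectation over w
                    x_next = f(x, u, w)
                    total += r(x, u, w) + max(Q_prev[(x_next, a)]
                                              for a in actions)
                Q[(x, u)] = total / len(disturbances)
        Q_prev = Q
    return Q_prev   # the greedy policy takes argmax_u Q_{T-t}(x, u)
```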
Approach for solving the inference problem
> An algorithm called the fitted Q iteration algorithm computes the successive Q_N functions from the ensemble of trajectories
> This algorithm performs particularly well when the Q_N functions are approximated using tree-based supervised learning methods
> Problem: how can we obtain a good policy defined on a small subset of variables?
> Approach:
  (1) Run the fitted Q iteration algorithm with trees
  (2) Compute the variance reduction associated with each variable
  (3) Rerun the fitted Q iteration algorithm, considering states made only of the components leading to the highest variance reduction.
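A minimal sketch of fitted Q iteration on four-tuples (x, u, r, x'), assuming scikit-learn's ExtraTreesRegressor as the tree-based learner; the helper name and the input encoding (state concatenated with action) are illustrative choices, not the authors' implementation:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(F, actions, T, **tree_kwargs):
    """Fitted Q iteration on a set F of four-tuples (x, u, r, x_next),
    approximating each Q_N by a tree ensemble fit on (x, u) -> target.
    Returns the list of fitted models [Q_1, ..., Q_T]."""
    X = np.array([np.append(x, u) for (x, u, _, _) in F])   # (x, u) inputs
    rewards = np.array([rew for (_, _, rew, _) in F])
    next_states = np.array([x_next for (_, _, _, x_next) in F])
    models, y = [], rewards.copy()                          # Q_1 targets: r
    for _ in range(T):
        model = ExtraTreesRegressor(**tree_kwargs).fit(X, y)
        models.append(model)
        # next targets: r + max_u' Q_N(x_next, u')
        q_next = np.column_stack([
            model.predict(np.column_stack([next_states,
                                           np.full(len(F), u)]))
            for u in actions])
        y = rewards + q_next.max(axis=1)
    return models
```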
Algorithm
(1) Compute the Q_N functions (from N = 1 to N = T) by running the fitted Q iteration algorithm on the set of four-tuples extracted from F
(2) Compute the relevance of the different attributes x^i using the score:

  S(x^i) = [ ∑_{N=1}^{T} ∑_{τ ∈ Q_N} ∑_{ν ∈ τ} δ(x^i, ν) · Δvar(ν) · |ν| ] / [ ∑_{N=1}^{T} ∑_{τ ∈ Q_N} ∑_{ν ∈ τ} Δvar(ν) · |ν| ]

where:
  Δvar(ν) = var(ν) − (|ν_L| / |ν|) · var(ν_L) − (|ν_R| / |ν|) · var(ν_R) is the variance reduction when splitting the node ν of tree τ
  |ν| is the size of the subset before the splitting of ν
  δ(x^i, ν) = 1 if x^i is used to split ν, and 0 otherwise
(3) Rerun the fitted Q iteration algorithm on the ''best attributes''.
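One possible way to compute such a score from fitted scikit-learn tree ensembles, reading the per-node impurities (variances) and sample counts from their internal `tree_` arrays; a sketch under that assumption, not necessarily the paper's own implementation:

```python
import numpy as np

def variable_scores(models, n_features):
    """Score for each state feature i: total (variance reduction x node
    size) credited to splits on feature i, summed over all trees of all
    Q_N ensembles, normalised by the same total over all splits."""
    num = np.zeros(n_features)
    den = 0.0
    for ensemble in models:                    # one ensemble per Q_N
        for est in ensemble.estimators_:
            tree = est.tree_
            for v in range(tree.node_count):
                left, right = tree.children_left[v], tree.children_right[v]
                if left == -1:                 # leaf node: no split
                    continue
                n, nl, nr = (tree.n_node_samples[v],
                             tree.n_node_samples[left],
                             tree.n_node_samples[right])
                dvar = (tree.impurity[v]       # MSE impurity = variance
                        - nl / n * tree.impurity[left]
                        - nr / n * tree.impurity[right])
                contrib = dvar * n             # dvar(v) * |v|
                den += contrib
                feat = tree.feature[v]
                if feat < n_features:          # skip splits on the action column
                    num[feat] += contrib
    return num / den
```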
Validation: ''Car on the hill'' problem (I)
> A car, represented by a point, is riding on a slope
> Problem: starting from the lowest point, the car has to reach the top of the hill using only the values −4 and 4 for u, in a minimum number of iterations, and without going too fast

[Figure: profile of the hill, with the car subject to the reaction force R, the gravity force mg and the control force u]

> x = (position, speed) = (x1, x2)
> Originally, the problem is deterministic
> We have added to the original state some non-informative components x3, x4, x5 to set up an experimental protocol.
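The experimental protocol above, appending non-informative components drawn uniformly from [−2, 2] to the informative state, can be sketched as (the helper name is illustrative):

```python
import random

def augment_state(x, n_noise=3):
    """Append n_noise non-informative components, each drawn uniformly
    from [-2, 2], to the informative state (position, speed)."""
    return list(x) + [random.uniform(-2.0, 2.0) for _ in range(n_noise)]
```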
Validation: ''Car on the hill'' problem (II)
> Results: variable relevance scores

  Subset size | nb of irrelevant variables | position (x1) | speed (x2) | Rand[-2,2] (x3) | Rand[-2,2] (x4) | Rand[-2,2] (x5)
     5000     |             0              |      0.24     |    0.35    |        /        |        /        |        /
     5000     |             0              |      0.27     |    0.30    |       0.08      |        /        |        /
     5000     |             0              |      0.16     |    0.26    |       0.12      |       0.06      |        /
     5000     |             0              |      0.15     |    0.18    |       0.07      |       0.07      |       0.09
    10000     |             1              |      0.16     |    0.34    |       0.09      |        /        |        /
    10000     |             1              |      0.20     |    0.19    |       0.08      |       0.12      |        /
    10000     |             1              |      0.15     |    0.31    |       0.05      |       0.05      |       0.06
    20000     |             2              |      0.18     |    0.27    |       0.10      |        /        |        /
    20000     |             2              |      0.15     |    0.24    |       0.08      |       0.10      |        /
    20000     |             2              |      0.15     |    0.21    |       0.08      |       0.08      |       0.07
Conclusion and future work
> A simple method of variable selection for reinforcement learning problems
> Incorporation of variable selection into the fitted Q iteration algorithm: the possibility to compute a policy depending only on the most informative variables
> Application to clinical data
> Could this process also help in designing algorithms with better inference capabilities?