University of Liège – Montefiore Institute
Variable selection for Dynamic Treatment Regimes (DTR)
Raphael Fonteneau, Louis Wehenkel and Damien Ernst Department of Electrical Engineering and Computer Science University of Liège
27th Benelux Meeting on Systems and Control, Heeze, The Netherlands, March 18-20, 2008
Outline
● Introduction
● An example: the Nefazodone CBASP trial
● Many difficulties
● Problem formulation
● Dynamic programming
● Approach for solving the inference problem
● Algorithm
● Validation: 'Car on the hill' problem
● Conclusion and future work
Introduction
● Chronic diseases need long-term treatments
● Dynamic Treatment Regimes (DTR): treatments operationalized as series of decisions specifying how treatment level and type should vary over time
● Nowadays, DTR are based on clinical judgment and medical instinct rather than on a formal and systematic data-driven process
● Over the last ten years, however, a research field has emerged that specifically addresses the problem of inferring DTR from clinical data
An example: the Nefazodone CBASP trial
● A clinical trial set up to determine optimal DTR for chronic depression
● More than 60 variables: gender, racial category, marital status, body mass index, current medication, depression, number of depressive episodes, alcohol, drug use, ...
  1 Gender; 2 Racial category; 3-4 Marital status; 5 Body mass index; 6 Age in years at screening; 7 Treated current depression; 8 Medication current depression; 9 Psychotherapy current depression; 10 Treated past depression; 11 Medication past depression; 12 Psychotherapy past depression; ...
● 681 patients, 12 weeks of treatment, 3 types of treatment:
  – Nefazodone (200, 300, 400, 500, then 600 mg per day until the end)
  – Cognitive behavioral-analysis system of psychotherapy (16 to 20 sessions): twice-weekly sessions (weeks 1 to 4, or to 8 if problems arise), weekly sessions (weeks 5 to 12)
  – Both
● Tests are performed at each time t to evaluate the state of the patient, with a reward r_t
Many difficulties
● Preference elicitation: defining a criterion that assesses the 'well-being' of patients
● Confounding issues: in the Nefazodone CBASP trial, experiments are highly sensitive to the environment
● Inference problem: selecting a concise set of variables for representing the Dynamic Treatment Regime, since a policy defined on more than 60 variables is not convenient
Problem formulation (I)
● This problem can be seen as a discrete-time control problem:

      x_{t+1} = f(x_t, u_t, w_t, t)

● State: x_t \in X (assimilated to the state of the patient)
● Actions: u_t \in U
● The transition from t to t+1 is associated with an instantaneous reward signal r_t = r(x_t, u_t, w_t, t), where r is the (real-valued) reward function, bounded by B_r
● Disturbances: w_t \in W (disturbance space), where w_t is generated by the probability distribution P_w(w \mid x, u, t)
Problem formulation (II)
● The goal is to find a policy \pi_T : \{0, \ldots, T-1\} \times X \to U that maximises the rewards obtained over a time horizon T:

      J^{\pi_T}_T(x) = E_{w_t,\, t=0,1,\ldots,T-1} \left[ \sum_{t=0}^{T-1} r(x_t, \pi_T(t, x_t), w_t, t) \;\middle|\; x_0 = x \right]

● The system dynamics f is unknown and is replaced by an ensemble F of trajectories:

      (x_0^1, u_0^1, r_0^1, x_1^1, \ldots, x_{T-1}^1, u_{T-1}^1, r_{T-1}^1, x_T^1),
      (x_0^2, u_0^2, r_0^2, x_1^2, \ldots, x_{T-1}^2, u_{T-1}^2, r_{T-1}^2, x_T^2),
      ...
      (x_0^p, u_0^p, r_0^p, x_1^p, \ldots, x_{T-1}^p, u_{T-1}^p, r_{T-1}^p, x_T^p)
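As a minimal sketch (names and the trajectory container are illustrative, not from the slides), the ensemble F of trajectories can be flattened into the set of four-tuples (x_t, u_t, r_t, x_{t+1}) that batch-mode reinforcement learning algorithms consume:

```python
def trajectories_to_fourtuples(trajectories):
    """Flatten trajectories into four-tuples (x_t, u_t, r_t, x_{t+1}).

    Each trajectory is assumed stored as a pair:
    (list of (x_t, u_t, r_t) steps for t = 0..T-1, final state x_T).
    """
    F = []
    for steps, x_T in trajectories:
        for t, (x, u, r) in enumerate(steps):
            # the successor state is either the next step's state or x_T
            x_next = steps[t + 1][0] if t + 1 < len(steps) else x_T
            F.append((x, u, r, x_next))
    return F
```

Each four-tuple is one observed transition; the time ordering inside a trajectory is only needed to pair each step with its successor state.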
Dynamic programming
● Let us define recursively the sequence of Q_N functions:

      Q_N(x, u) = E_w\left[ r(x, u, w, t) + \max_{u' \in U} Q_{N-1}(f(x, u, w, t), u') \right],  with Q_0 \equiv 0

● The policy defined by:

      \forall t \in \{0, 1, \ldots, T-1\}, \forall x \in X:  \pi_T^*(t, x) = \arg\max_{u \in U} Q_{T-t}(x, u)

  is a T-step optimal policy.
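When the dynamics are known and the state and action spaces are finite, this recursion can be computed exactly. A minimal sketch (the finite-MDP encoding with arrays P and R is an assumption for illustration, not the clinical setting of the slides):

```python
import numpy as np

def q_iteration(P, R, T):
    """Exact Q_1, ..., Q_T recursion for a finite MDP.

    P[x, u, x'] : transition probability from x to x' under action u
    R[x, u]     : expected instantaneous reward
    Recursion:  Q_N(x, u) = R(x, u) + E_{x'}[ max_{u'} Q_{N-1}(x', u') ]
    """
    n_x, n_u = R.shape
    Q = np.zeros((n_x, n_u))       # Q_0 ≡ 0
    for _ in range(T):
        V = Q.max(axis=1)          # max_{u'} Q_{N-1}(x', u'), per state
        Q = R + P @ V              # expectation over successor states
    return Q                       # Q_T; argmax over u gives pi*_T(0, x)
```

The T-step optimal policy at time t is then recovered as the argmax over actions of Q_{T-t}(x, ·).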
Approach for solving the inference problem
● An algorithm called fitted-Q iteration computes the successive Q_N functions from the ensemble of trajectories
● This algorithm performs particularly well when the Q_N functions are approximated using tree-based supervised learning methods
● Problem: how to obtain a good policy defined on a small subset of variables?
● Approach:
  (1) Run fitted-Q iteration with trees
  (2) Compute the variance reduction associated with each variable
  (3) Rerun the algorithm considering that states are only made of the components leading to the highest variance reduction
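Step (1) can be sketched as follows, using scikit-learn's ExtraTreesRegressor as a stand-in for the tree-based supervised learning method (the function name and the sklearn dependency are illustrative assumptions, not the authors' implementation):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(F, actions, n_iter, **tree_kwargs):
    """Fitted-Q iteration on a set F of four-tuples (x, u, r, x_next).

    Each Q_N is approximated by a tree ensemble trained on inputs (x, u)
    and targets r + max_{u'} Q_{N-1}(x_next, u'); Q_0 ≡ 0.
    Returns the list of fitted models [Q_1, ..., Q_n_iter].
    """
    X = np.array([np.append(x, u) for x, u, _, _ in F])
    r = np.array([r for _, _, r, _ in F])
    x_next = np.array([xn for _, _, _, xn in F])
    models, targets = [], r.copy()             # Q_1 targets: r + max Q_0 = r
    for _ in range(n_iter):
        model = ExtraTreesRegressor(**tree_kwargs).fit(X, targets)
        models.append(model)
        # max over the finite action set, evaluated at every next state
        q_next = np.max(
            [model.predict(np.column_stack([x_next, np.full(len(F), u)]))
             for u in actions], axis=0)
        targets = r + q_next
    return models
```

Because U is finite, the max over u' is computed by evaluating the fitted regressor once per candidate action.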
Algorithm
(1) Compute the Q_N functions (from N = 1 to N = T) by running the fitted-Q iteration algorithm on the four-tuple set F
(2) Compute the relevance of the different attributes a using the score:

      score_a = \frac{ \sum_{N=1}^{T} \sum_{tree \in Q_N} \sum_{node \in tree} \delta_{a,node} \cdot redv(node) \cdot |node| }{ \sum_{N=1}^{T} \sum_{tree \in Q_N} \sum_{node \in tree} redv(node) \cdot |node| }

   where:
   – redv(node) is the variance reduction obtained when splitting node
   – |node| is the subset size before the splitting of node
   – \delta_{a,node} = 1 if a is used to split node, else 0
(3) Rerun the fitted-Q iteration algorithm on the 'best attributes'
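Step (2) can be sketched by walking the internal nodes of the fitted trees. This is an illustrative implementation on top of scikit-learn's `tree_` arrays (an assumption about the tooling, not the authors' code); for each non-leaf node, redv(node)·|node| equals the node's size-weighted variance minus that of its two children:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def attribute_scores(ensembles, n_features):
    """Relevance score of each attribute: the variance reduction
    redv(node), weighted by the node size |node|, summed over every node
    split on that attribute in every tree of every Q_N ensemble, then
    normalised by the same sum taken over all split nodes.
    """
    num = np.zeros(n_features)
    for ensemble in ensembles:                  # one tree ensemble per Q_N
        for est in ensemble.estimators_:
            t = est.tree_
            for n in range(t.node_count):
                left, right = t.children_left[n], t.children_right[n]
                if left == -1:                  # leaf: no split, no reduction
                    continue
                # redv(node) * |node| = |node|*var(node)
                #   - |left|*var(left) - |right|*var(right)
                contribution = (t.n_node_samples[n] * t.impurity[n]
                                - t.n_node_samples[left] * t.impurity[left]
                                - t.n_node_samples[right] * t.impurity[right])
                num[t.feature[n]] += contribution
    return num / num.sum()
```

The scores sum to one, so attributes can be ranked directly and the lowest-scoring ones discarded before rerunning fitted-Q iteration.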
Validation: 'Car on the hill' problem (I)
● A car, represented by a point, rides on a slope
  [Figure: car on the hill, showing gravity mg, reaction force R and control force u]
● Problem: starting from the lowest point, the car has to reach the top of the hill, using only the values -4 or +4 for u, in a minimum number of iterations, and without going too fast
● State: x = (position, speed)
● Originally, the problem is deterministic
● We have added some non-informative components to the original state in order to set up an experimental protocol
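The experimental protocol of appending non-informative components can be sketched as follows (a minimal illustration; the function name and the use of numpy's random generator are assumptions, and the slides only specify that the added components are irrelevant and drawn in [-2, 2]):

```python
import numpy as np

def augment_state(x, n_irrelevant, rng):
    """Append n_irrelevant non-informative components, drawn uniformly
    in [-2, 2], to the informative state (position, speed).

    A variable-selection method should rank these added components
    below position, speed and u.
    """
    noise = rng.uniform(-2.0, 2.0, size=n_irrelevant)
    return np.concatenate([np.asarray(x, dtype=float), noise])
```

Since the added components carry no information about the dynamics or the reward, any score they receive measures the noise floor of the relevance estimator.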
Validation: 'Car on the hill' problem (II)
● Results: variable relevance scores

  Subset size | k | nb of trees | nmin | nb of irrelevant variables | nb of iterations | position | speed | u    | rand[-2,2] | rand[-2,2] | rand[-2,2]
  ----------- | - | ----------- | ---- | -------------------------- | ---------------- | -------- | ----- | ---- | ---------- | ---------- | ----------
  10000       | 3 | 15          | 2    | 0                          | 50               | 0.30     | 0.36  | 0.34 | /          | /          | /
  5000        | 3 | 15          | 2    | 0                          | 50               | 0.24     | 0.35  | 0.41 | /          | /          | /
  4000        | 3 | 15          | 2    | 0                          | 50               | 0.23     | 0.33  | 0.44 | /          | /          | /
  3000        | 3 | 15          | 2    | 0                          | 50               | 0.28     | 0.28  | 0.44 | /          | /          | /
  2000        | 3 | 15          | 2    | 0                          | 50               | 0.23     | 0.34  | 0.43 | /          | /          | /
  50000       | 4 | 50          | 2    | 1                          | 50               | 0.21     | 0.24  | 0.44 | 0.11       | /          | /
  40000       | 4 | 15          | 2    | 1                          | 30               | 0.24     | 0.25  | 0.39 | 0.12       | /          | /
  20000       | 4 | 15          | 8    | 1                          | 30               | 0.30     | 0.20  | 0.48 | 0.10       | /          | /
  10000       | 4 | 15          | 4    | 1                          | 30               | 0.16     | 0.34  | 0.41 | 0.09       | /          | /
  8000        | 4 | 15          | 2    | 1                          | 30               | 0.21     | 0.24  | 0.44 | 0.11       | /          | /
  5000        | 4 | 15          | 2    | 1                          | 30               | 0.23     | 0.18  | 0.47 | 0.12       | /          | /
  5000        | 4 | 15          | 4    | 1                          | 30               | 0.27     | 0.30  | 0.35 | 0.08       | /          | /
  20000       | 5 | 15          | 2    | 2                          | 30               | 0.11     | 0.26  | 0.46 | 0.08       | 0.09       | /
  10000       | 5 | 15          | 4    | 2                          | 30               | 0.15     | 0.24  | 0.43 | 0.08       | 0.10       | /
  20000       | 6 | 50          | 4    | 3                          | 50               | 0.15     | 0.21  | 0.41 | 0.08       | 0.08       | 0.07
  10000       | 6 | 15          | 2    | 3                          | 30               | 0.10     | 0.28  | 0.42 | 0.08       | 0.06       | 0.06
Conclusion and future work
● A simple method of variable selection for reinforcement learning problems
● Incorporation of variable selection into the fitted-Q algorithm: possibility to compute a policy depending only on the most informative variables
● Application to the Nefazodone CBASP trial
● Could this process also help in designing algorithms with better inference capabilities?