Scalable Joint Models for Reliable Event Prediction

Hossein Soleimani Johns Hopkins University Baltimore, MD 21218 [email protected]

James Hensman prowler.io Cambridge, UK [email protected]

Suchi Saria Johns Hopkins University Baltimore, MD 21218 [email protected]

Abstract

Missing data and noisy observations pose significant challenges for reliable event prediction from irregularly sampled multivariate time series data. Typically, imputation methods are used to compute missing features, which are then used for event prediction. However, the estimated features may be unreliable in regions with very few observations, leading to a high false alarm rate. We propose a probabilistic approach that jointly models the time series and event data and estimates the uncertainty associated with event predictions. We then derive an optimal alerting policy that uses the estimated uncertainty to reduce false alerts. The policy trades off the cost of a delayed versus a false detection and abstains from making decisions when the prediction does not satisfy the derived confidence criteria. Experiments show that the proposed framework has a significantly lower false alarm rate than state-of-the-art time series classification techniques.

1 Introduction

We consider the problem of predicting adverse events from multivariate, noisy, sparse, and irregularly sampled time series data. This is an important task in many applications such as healthcare, where early and reliable detection of adverse events such as cardiac arrest or septic shock could reduce mortality and morbidity (Rivers et al., 2001; Kumar et al., 2006). In principle, this task is a time series classification problem, where, typically, some features are extracted from the time series data and a classifier is learned to predict occurrence of the event given the features (Marlin, 2008; Wu et al., 2016). However, irregular measurement patterns pose significant challenges here: in healthcare, for instance, many vital signals and laboratory measurements that are useful for predicting adverse events are not recorded at regular intervals, leading to missing data. One standard approach to address the missingness is to use binning or other imputation techniques to discretize the data, but in many applications, especially in healthcare, signals evolve at different rates and there is no natural discretization time step. For instance, some lab signals are measured once a day, whereas vital signals such as heart rate and blood pressure may be recorded multiple times per hour. Alternatively, more sophisticated imputation techniques such as Gaussian processes (GPs), which do not require discretization, can be used to estimate the missing features (Ghassemi et al., 2015; Liu and Hauskrecht, 2016; Li and Marlin, 2016). However, these estimates can be very unreliable in regions with very few measurements and may lead to a high false detection rate. To address these challenges, we propose a Bayesian framework that 1) jointly models the time series and event data and 2) uses the uncertainty due to missing longitudinal data to improve the reliability of event predictions. Figure 1 illustrates our proposed approach.
At any given time $t$, we use the observed samples ($y_{0:t}$, shaded red region in Fig. 1) to model the multivariate time series data and compute the probability of occurrence of the event, $H(\Delta|y_{0:t}, t)$, within any given horizon $(t, t+\Delta]$. Rather than computing a point estimate of the event probability, we provide a distribution on this probability (top right panel of Fig. 1), and use this distribution to quantify the uncertainty of the predictions and abstain from making a decision when the estimate does not satisfy a desired confidence level. Specifically, we take a decision-theoretic approach and derive an optimal alerting policy which takes into account the uncertainty associated with $H$ and trades off the cost of a delayed detection versus the cost of making a false alert. Using this policy, the model may choose to wait and gather more observations when the estimated event probability is unreliable.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Figure 1: Illustration of the proposed approach for predicting occurrence of an event using sparse and irregularly sampled longitudinal data. (a) Estimated event probability and its distribution over prediction horizons $\Delta$; (b) longitudinal signals, the latent deterioration state, and the desired versus detector output.

In addition to providing a principled estimate of the uncertainty for the event probabilities, using Gaussian processes also allows us to estimate non-trivial correlations within and across signals and learn complicated patterns exhibited in many time series data such as clinical signals. One challenge in using GPs for high-dimensional and large data is expensive inference due to matrix inversion. To address this challenge, we develop a stochastic variational inference algorithm based on sparse-GP techniques, which reduces the computational complexity of our approach to linear in the number of signals, $D$, and the number of observations per signal, $N$, and quadratic in the number of inducing points, $M$ ($\ll N$), which are pseudo-inputs treated as additional variational parameters.
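To illustrate why inducing points reduce the cost of GP inference, the following sketch computes a sparse-GP posterior mean in plain NumPy. It is a minimal single-signal illustration, not the paper's method: the RBF kernel, lengthscale, noise level, and synthetic data are all assumptions, and the full model uses a multi-output LMC kernel with stochastic variational optimization of the inducing points.

```python
import numpy as np

def rbf(a, b, ls=1.0):
    # Squared-exponential kernel matrix between input sets a and b.
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

rng = np.random.default_rng(0)
N, M = 500, 25                        # N observations, M inducing points (M << N)
t = np.sort(rng.uniform(0, 10, N))    # irregular measurement times
y = np.sin(t) + 0.1 * rng.standard_normal(N)
z = np.linspace(0, 10, M)             # inducing inputs (fixed here, not optimized)
noise = 0.1 ** 2

# Titsias-style approximate posterior: every solve involves only M x M
# matrices, so the cost is O(N M^2) per signal instead of O(N^3).
Kzz = rbf(z, z) + 1e-6 * np.eye(M)
Kzt = rbf(z, t)
Sigma = Kzz + Kzt @ Kzt.T / noise                     # M x M
mu_u = Kzz @ np.linalg.solve(Sigma, Kzt @ y) / noise  # posterior mean at z
t_new = np.linspace(0.5, 9.5, 50)
f_mean = rbf(t_new, z) @ np.linalg.solve(Kzz, mu_u)   # predictive mean
```

Even though only 25 pseudo-inputs summarize the 500 noisy observations, the sparse predictive mean tracks the underlying sine closely where the data are dense.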

Our proposed approach significantly improves upon joint longitudinal and time-to-event models in statistics (Rizopoulos, 2010, 2011; Proust-Lima et al., 2014) in scalability, flexibility of the time series model, and reliability of predictions.

2 Proposed Approach

Model Our proposed model consists of two sub-components: a multi-output GP to model the time series data and an event model. Our time series model is developed within the framework of linear models of coregionalization (LMC) (Journel and Huijbregts, 1978; Seeger et al., 2005; Álvarez and Lawrence, 2009). We use LMCs since they can naturally handle multivariate irregularly sampled signals and can learn correlations within and across multiple signals. We also construct our time series model to learn any possible shared structure between signals. Specifically, we express each signal $d$ for every individual as a sum of two components, shared and signal-specific; i.e., we write $y_d(t) = f_d(t) + \epsilon_d(t)$, where $\epsilon_d$ is a noise term and $f_d(t) = \sum_{r=1}^{R} w_{dr}\, g_r(t) + \kappa_d v_d(t)$, where $g_r \sim \mathcal{GP}$, $\forall r$, and $v_d \sim \mathcal{GP}$ are, respectively, shared and signal-specific latent functions. Also, $w$ and $\kappa$ (in addition to the kernel hyper-parameters of the GPs) are model parameters.

We construct the event sub-component of our model using survival analysis techniques (Kalbfleisch and Prentice, 2011). Compared to standard classification likelihoods used in machine learning, a time-to-event model allows us to estimate the probability of the event as a function of the prediction horizon ($\Delta$) without requiring a pre-specified $\Delta$ for training and testing. A time-to-event distribution is fully characterized by a hazard function ($\lambda(s;t)$, $\forall s > t$), which is the instantaneous probability that the event happens given that the individual has survived up to time $t$. Here, we let $\lambda(s;t) = \exp(b + a(s-t) + \gamma^\top x(t) + \bar{f}(t))$, where $x$ are observed covariates and $\bar{f}(t)$ are the features estimated using the time series model: $\bar{f}(t) = \alpha^\top \int_0^t \rho_c(t';t) f(t')\,dt'$ with $\rho_c(t';t) = \frac{c\,\exp(-c(t-t'))}{1-\exp(-ct)}$. Here, $a, b, c, \alpha, \gamma$ are model parameters.
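The generative structure above can be sketched end to end: sample shared and signal-specific latent functions, mix them into signals, form the recency-weighted feature $\bar{f}(t)$, and push it through the log-linear hazard to obtain an event probability. All parameter values below ($W$, $\kappa$, $\alpha$, $a$, $b$, $c$, and the covariate term) are invented for illustration; in the model they are learned from data.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_gp(t, ls):
    # One sample path from a zero-mean GP with an RBF kernel.
    d = t[:, None] - t[None, :]
    K = np.exp(-0.5 * (d / ls) ** 2) + 1e-6 * np.eye(len(t))
    return np.linalg.cholesky(K) @ rng.standard_normal(len(t))

def trapz(y, x):
    # Trapezoidal rule over the last axis (avoids NumPy version differences).
    return np.sum((y[..., 1:] + y[..., :-1]) * np.diff(x) / 2.0, axis=-1)

# --- LMC time series: y_d(t) = sum_r w_dr g_r(t) + kappa_d v_d(t) + eps_d(t)
t = np.linspace(0.0, 10.0, 200)
D, R = 3, 2
W = rng.standard_normal((D, R))                          # mixing weights w_dr
kappa = rng.uniform(0.5, 1.0, D)                         # signal-specific scales
g = np.stack([sample_gp(t, ls=2.0) for _ in range(R)])   # shared latents
v = np.stack([sample_gp(t, ls=0.5) for _ in range(D)])   # specific latents
f = W @ g + kappa[:, None] * v
y = f + 0.1 * rng.standard_normal(f.shape)

# --- Feature fbar(t) = alpha^T int_0^t rho_c(t'; t) f(t') dt': a recency-
# weighted average of the latent functions, approximated by the trapezoid rule.
alpha = rng.uniform(0.1, 0.3, D)
c, t_now = 1.0, 10.0
rho = c * np.exp(-c * (t_now - t)) / (1 - np.exp(-c * t_now))
fbar = alpha @ trapz(rho * f, t)

# --- Hazard lambda(s; t) = exp(b + a (s - t) + gamma^T x(t) + fbar(t)), and
# event probability H(Delta) = 1 - exp(-int_t^{t+Delta} lambda(s; t) ds).
a_, b_ = 0.1, -3.0
gx = 0.2                                # stands in for gamma^T x(t)
delta = 3.0
s = np.linspace(t_now, t_now + delta, 100)
lam = np.exp(b_ + a_ * (s - t_now) + gx + fbar)
H = 1.0 - np.exp(-trapz(lam, s))
```

Note that $\rho_c(\cdot\,; t)$ is a normalized weight (it integrates to one over $[0, t]$), so $\bar{f}(t)$ is an exponentially decaying average that emphasizes recent values of the latent functions.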
Given this hazard function, we write the probability density function of the event as $p(t) = \lambda(t)\exp(-\int_0^t \lambda(s)\,ds)$ and compute the probability of occurrence of the event within any horizon $\Delta$ as $H(\Delta|f_{0:t}, t) = 1 - \exp(-\int_t^{t+\Delta} \lambda(s;t)\,ds)$.

Learning and Inference To improve the scalability of our approach, we develop our learning algorithm based on the sparse variational approach (Titsias, 2009; Hensman et al., 2013). Specifically, we integrate out each latent GP function and posit a variational distribution to estimate its posterior. Each variational distribution is parametrized using $M$ inducing input-output pairs, which we optimize to obtain a lower bound on the joint log-likelihood. We also maximize the variational lower bound to estimate the model parameters. This technique yields a variational lower bound that is additive over all individuals, allowing us to use stochastic gradient descent and further scale up training. A detailed description of our learning and inference algorithm is available in Soleimani et al. (2017a).

Optimal Uncertainty-Aware Alerting Policy Using the estimated probability $H(\Delta|f_{0:t}, t)$, at any given time we can take one of three possible actions ($\hat{\psi} \in \{0, 1, a\}$): make a positive prediction ($\hat{\psi} = 1$, i.e., predict the event will happen within the next $\Delta$ hours), a negative prediction ($\hat{\psi} = 0$,

Figure 2: Optimal Alerting Policy
1: Input: the $1-2q$ confidence interval $[h^{(1-q)}, h^{(q)}]$ of the event probability $H$. Let $c_q = h^{(q)} - h^{(1-q)}$, $q > 0.5$. Also, $L_1 \triangleq L_{01}/L_{10}$ and $L_2 \triangleq L_a/L_{10}$, where $L_{01}$, $L_{10}$, and $L_a$ are the cost of false positive, false negative, and abstention, respectively.
2: Output: $\hat{\psi} \in \{0, 1, a\}$.
3: If $c_q \geq L_2\frac{1+L_1}{L_1} - 1$ (large confidence interval, high uncertainty):
    Set $\hat{\psi} = 0$ if $h^{(q)} \leq L_2$.
    Set $\hat{\psi} = 1$ if $h^{(1-q)} \geq 1 - \frac{L_2}{L_1}$.
    Set $\hat{\psi} = a$ otherwise.
4: If $c_q < L_2\frac{1+L_1}{L_1} - 1$ (small confidence interval, low uncertainty):
    Set $\hat{\psi} = 0$ if $h^{(q)} + L_1 h^{(1-q)} < L_1$.
    Set $\hat{\psi} = 1$ if $h^{(q)} + L_1 h^{(1-q)} \geq L_1$.

Figure 3: Three example decisions made using the policy described in Fig. 2 with $L_1 = 1$ and $L_2 = 0.4$. The shaded area is the confidence interval $[h^{(1-q)}, h^{(q)}]$ for some choice of $q$ for the three distributions, (a), (b), and (c). The arrows at 0.4 and 0.6 are $L_2$ and $1 - \frac{L_2}{L_1}$, respectively. All cases satisfy $c_q \geq L_2\frac{1+L_1}{L_1} - 1$. The optimal decisions are $\hat{\psi} = 1$ for (a), $\hat{\psi} = 0$ for (b), and $\hat{\psi} = a$ for (c).

i.e., predict the event will not happen during the next $\Delta$ hours), or abstain ($\hat{\psi} = a$, i.e., make no prediction). To determine the optimal choice among these actions, we minimize a cost (risk) function defined by specifying the relative cost associated with each type of prediction error or abstention (waiting in case of high uncertainty). Specifically, we define $L_{01}$ and $L_{10}$, respectively, as the cost of a false positive (the event does not happen ($\psi = 0$) but $\hat{\psi} = 1$) and a false negative (the event does happen ($\psi = 1$) but $\hat{\psi} = 0$), and let $L_a$ be the cost of abstention ($\hat{\psi} = a$). We also let $\psi$ be an unobserved Bernoulli random variable with $p(\psi = 1) = H(\Delta|f_{0:t}, t)$. The overall risk function is $R(\hat{\psi}; \psi) = \mathbb{1}(\hat{\psi} = 0)\,\psi L_{10} + \mathbb{1}(\hat{\psi} = 1)(1 - \psi)L_{01} + \mathbb{1}(\hat{\psi} = a)L_a$. Typical decision-theoretic approaches minimize the expectation of the risk function with respect to $\psi$ to obtain the alerting policy. Note that our joint model provides a distribution on $H$, since $H$ is a function of the latent functions $f$, which are generated from GPs; i.e., we have a distribution on the distribution of the event, $p_H(h)$, and hence a distribution on the cost function. This distribution provides valuable information about the uncertainty associated with $H$, which we use to improve the reliability of our predictions. Here, instead of minimizing the expected risk function, we obtain the alerting policy by minimizing quantiles of the risk distribution. Intuitively, this means that we minimize the maximum cost that could occur with a certain probability. The $q$-quantile of the risk function is $R^{(q)}(\hat{\psi}) = \mathbb{1}(\hat{\psi} = 0)\,h^{(q)} L_{10} + \mathbb{1}(\hat{\psi} = 1)(1 - h^{(1-q)})L_{01} + \mathbb{1}(\hat{\psi} = a)L_a$, where $h^{(q)}$ is the $q$-quantile of $p_H(h)$. Minimizing $R^{(q)}$, we obtain the optimal policy summarized in Fig. 2; see Soleimani et al. (2017a) for detailed derivations. We also provide three example decisions using this policy in Fig. 3.
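Choosing the action by minimizing $R^{(q)}$ directly is a one-liner. The sketch below reproduces the three decisions in Fig. 3; the quantile intervals are invented to match the figure qualitatively, with $L_{10} = L_{01} = 1$ and $L_a = 0.4$ (i.e., $L_1 = 1$, $L_2 = 0.4$).

```python
def quantile_risk_decision(h_lo, h_hi, L01=1.0, L10=1.0, La=0.4):
    """Pick the action minimizing the q-quantile risk R^(q):
    predict 0 -> h^(q) L10, predict 1 -> (1 - h^(1-q)) L01, abstain -> La."""
    risks = {0: h_hi * L10, 1: (1.0 - h_lo) * L01, 'a': La}
    return min(risks, key=risks.get)

# Illustrative intervals [h^(1-q), h^(q)] for the three cases in Fig. 3:
print(quantile_risk_decision(0.65, 0.90))  # (a): confidently high -> 1
print(quantile_risk_decision(0.10, 0.35))  # (b): confidently low  -> 0
print(quantile_risk_decision(0.35, 0.95))  # (c): wide interval    -> 'a'
```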
If instead of the full distribution we take a point estimate ($h_0$) of the event probability (i.e., $p_H(h) = \delta(h - h_0)$), the alerting policy reduces to a policy similar to that proposed for classification with abstention by Chow (1957): i.e., set $\hat{\psi} = 0$ if $h_0 \leq \min\{L_2, \frac{L_1}{1+L_1}\}$, $\hat{\psi} = 1$ if $h_0 \geq \max\{1 - \frac{L_2}{L_1}, \frac{L_1}{1+L_1}\}$, and abstain otherwise (see Fig. 2 for the definitions of $L_1$ and $L_2$). As an example, when $L_1 = 1$ and $L_2 < 0.5$, the abstention interval is $[L_2, 1 - L_2]$ and the classifier abstains when $L_2 < h_0 < 1 - L_2$ (i.e., when $h_0$ is close to the decision boundary). We refer to this policy as the "vanilla" policy and the one described in Fig. 2 as the "robust" policy. Under the vanilla policy, the abstention region depends only on $L_1$ and $L_2$, which are the same for all individuals; under the robust policy, this interval changes depending on the confidence level of each individual's estimated probability: the effective abstention region is larger in cases where the classifier is uncertain about the estimate of $H$. To see how this helps to prevent incorrect predictions, consider example (c) in Fig. 3. The vanilla policy predicts this example as positive (since $h_0 > 1 - \frac{L_2}{L_1}$), but the confidence interval (shaded box) is relatively large. The vanilla policy incurs a false positive cost if this is a negative sample. In order to abstain on this individual (and avoid a false positive cost) under the vanilla policy, the abstention interval would have to be very large, but this may lead to abstaining on many other individuals on whom the classifier may be correct. The effective abstention region under the robust policy, however, adjusts to individual event probabilities based on their confidence intervals. In this case the optimal robust decision is abstention.
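The two closed-form policies can be compared directly. The sketch below implements the robust rule from Fig. 2 and the vanilla (Chow-style) rule; the interval for example (c) is invented so that its midpoint exceeds $1 - L_2/L_1$ while the interval itself is wide, reproducing the divergence described above.

```python
def robust_policy(h_lo, h_hi, L1, L2):
    # Fig. 2: h_lo = h^(1-q), h_hi = h^(q); L1 = L01/L10, L2 = La/L10.
    cq = h_hi - h_lo
    if cq >= L2 * (1 + L1) / L1 - 1:      # wide interval: high uncertainty
        if h_hi <= L2:
            return 0
        if h_lo >= 1 - L2 / L1:
            return 1
        return 'a'
    # narrow interval: low uncertainty, never abstain
    return 1 if h_hi + L1 * h_lo >= L1 else 0

def vanilla_policy(h0, L1, L2):
    # Chow-style rule on a point estimate h0 of the event probability.
    if h0 <= min(L2, L1 / (1 + L1)):
        return 0
    if h0 >= max(1 - L2 / L1, L1 / (1 + L1)):
        return 1
    return 'a'

# Example (c): a wide interval whose midpoint lies above 1 - L2/L1 = 0.6.
h_lo, h_hi, L1, L2 = 0.35, 0.95, 1.0, 0.4
print(vanilla_policy((h_lo + h_hi) / 2, L1, L2))  # -> 1 (risks a false alarm)
print(robust_policy(h_lo, h_hi, L1, L2))          # -> 'a' (abstains)
```

The same cost parameters thus yield different decisions: the vanilla policy alerts on the point estimate, while the robust policy recognizes the wide interval and waits for more data.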

3 Experimental Results

We evaluate our proposed approach (hereon called J-LTM) on the task of predicting septic shock, a life-threatening adverse event. We use the MIMIC-II Clinical Database (Goldberger et al., 2000) and the definitions described by Henry et al. (2015) for septic shock to annotate the data. We model 10 time series signals such as heart rate, systolic blood pressure, creatinine, and white blood cell count; see Soleimani et al. (2017a) for the full list of features. Our dataset includes 3151 patients, randomly divided into train (75%) and test (25%) sets. We train on 2363 patients (287 with observed septic shock) and evaluate on 788 patients (101 with observed shock). To mimic the practical settings where monitoring tools are used, for each test patient we make predictions at 5 evaluation points, equally spaced over the two-day interval leading to septic shock or discharge. We also bootstrap the test set with sample size 10 to test the significance of the results.

Baselines: We compare our proposed model against 5 strong baselines: MoGP, a variant of our approach in which the time series and time-to-event models are trained separately and which uses the vanilla prediction policy; JM, which uses a B-spline regression model with 20 knots to impute each signal independently and estimate features for the time-to-event model; and three classification methods trained after binning the data into 4-hour windows: Logistic Regression (LR), SVM, and RNN. All baselines provide point estimates of the event probability and thus use the vanilla alerting policy.

Evaluation: We performed grid search on the relative cost terms $L_1$ and $L_2$ and on $q$ (for the robust policy), and computed ROC-AUC on the obtained FPR and TPR pairs. Our approach (J-LTM) achieves an AUC (std. error) of 0.84 (0.005) and outperforms MoGP, JM, LR, SVM, and RNN, with AUCs 0.79 (0.006), 0.78 (0.008), 0.80 (0.005), 0.79 (0.007), and 0.80 (0.006), respectively. We also report positive predictive value (PPV), the ratio of true positives to the total number of alarms. In Fig. 4a, we plot the maximum TPR achieved at each PPV level for J-LTM and the baselines; at any TPR, the PPV for J-LTM is greater than that of all baselines. In particular, in the range of TPR from 0.4-0.6, J-LTM shows 13%-23% improvement in PPV over MoGP, the next best baseline, and 31%-36% over LR. From a practical standpoint, each evaluation leads to a context switch and can cost the caregiver 30-40 minutes; a 13%-36% improvement in PPV can amount to many hours saved daily.

Figure 4: (a) Maximum TPR obtained at each PPV level. (b) Best TPR achieved at any decision rate, fixing PPV > 0.5.

Another important metric is the number of times the alerting policy chooses to abstain. Event predictors should ideally make more accurate predictions with fewer abstentions (i.e., a higher decision rate). In Fig. 4b, we report the best TPR that each model achieves as a function of decision rate with PPV greater than 0.5 (at decision rate 1, all models make a decision for every instance). J-LTM achieves significantly higher TPR than the baselines at all decision rates. In other words, at any given decision rate, J-LTM more accurately identifies the subset of instances on whom it can make predictions. The maximum TPR with PPV > 0.5 for J-LTM over all decision rates is 0.68 (std. error 0.01), a significant improvement over the best TPR at the same PPV level for MoGP, 0.51 (0.008), JM, 0.40 (0.02), LR, 0.18 (0.04), SVM, 0.21 (0.01), and RNN, 0.12 (0.038).
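Metrics under abstention can be computed as follows. This is a hypothetical helper, not the paper's evaluation code; one modeling choice here, marked in the comments, is that abstained positives still count in the TPR denominator, so abstaining lowers TPR rather than hiding missed cases.

```python
import numpy as np

def abstention_metrics(y_true, decisions):
    """TPR, PPV, and decision rate when each decision is 0, 1, or 'a'."""
    y_true = np.asarray(y_true)
    decisions = np.asarray(decisions, dtype=object)
    decided = decisions != 'a'
    d = decisions[decided].astype(int)
    y = y_true[decided]
    tp = int(np.sum((d == 1) & (y == 1)))
    n_pos = int(np.sum(y_true == 1))      # abstained positives count as missed
    n_alarm = int(np.sum(d == 1))         # PPV = true positives / all alarms
    tpr = tp / n_pos if n_pos else 0.0
    ppv = tp / n_alarm if n_alarm else 0.0
    return tpr, ppv, float(np.mean(decided))

# Five cases: one abstention, one false alarm, two correct alarms, one true negative.
tpr, ppv, rate = abstention_metrics([1, 1, 0, 0, 1], [1, 'a', 0, 1, 1])
```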

4 Conclusion

We proposed a probabilistic approach for reliable event prediction from irregularly sampled multivariate time series data, and showed significant improvement over state-of-the-art baselines on a challenging clinical task. The proposed approach opens up new avenues for future investigation. To train our model, we used observational data from patients who were already receiving treatment in the hospital. Several studies have shown that risk scores estimated from observational data can be sensitive to provider practice patterns (Dyagilev and Saria, 2016) or may yield counterintuitive predictions (Caruana et al., 2015). To account for treatment in the current study, we used censoring (Kalbfleisch and Prentice, 2011) and treated patients who experienced septic shock despite receiving treatment as interval-censored and those who did not develop septic shock as right-censored. A more principled approach is possible by using causal inference and counterfactual reasoning to learn the effect of treatments (Soleimani et al., 2017b; Schulam and Saria, 2017).

References

E. Rivers, B. Nguyen, S. Havstad, J. Ressler, A. Muzzin, B. Knoblich, E. Peterson, and M. Tomlanovich, "Early goal-directed therapy in the treatment of severe sepsis and septic shock," New England Journal of Medicine, vol. 345, no. 19, pp. 1368-1377, 2001.
A. Kumar, D. Roberts, K. E. Wood, B. Light et al., "Duration of hypotension before initiation of effective antimicrobial therapy is the critical determinant of survival in human septic shock," Critical Care Medicine, vol. 34, no. 6, pp. 1589-1596, 2006.
B. Marlin, "Missing data problems in machine learning," Ph.D. dissertation, 2008.
M. Wu, M. Ghassemi, M. Feng, L. A. Celi, P. Szolovits, and F. Doshi-Velez, "Understanding vasopressor intervention and weaning: Risk prediction in a public heterogeneous clinical time series database," Journal of the American Medical Informatics Association, 2016.
M. Ghassemi, T. Naumann, T. Brennan, D. A. Clifton, and P. Szolovits, "A multivariate timeseries modeling approach to severity of illness assessment and forecasting in ICU with sparse, heterogeneous clinical data," in AAAI, 2015, pp. 446-453.
Z. Liu and M. Hauskrecht, "Learning adaptive forecasting models from irregularly sampled multivariate clinical data," in AAAI, 2016.
S. C.-X. Li and B. M. Marlin, "A scalable end-to-end Gaussian process adapter for irregularly sampled time series classification," in NIPS, 2016, pp. 1804-1812.
D. Rizopoulos, "Dynamic predictions and prospective accuracy in joint models for longitudinal and time-to-event data," Biometrics, vol. 67, no. 3, pp. 819-829, 2011.
D. Rizopoulos, "JM: An R package for the joint modelling of longitudinal and time-to-event data," Journal of Statistical Software, vol. 35, no. 9, pp. 1-33, 2010.
C. Proust-Lima, M. Séne, J. M. G. Taylor, and H. Jacqmin-Gadda, "Joint latent class models for longitudinal and time-to-event data: a review," Statistical Methods in Medical Research, vol. 23, no. 1, pp. 74-90, 2014.
A. G. Journel and C. J. Huijbregts, Mining Geostatistics. Academic Press, 1978.
M. Seeger, Y.-W. Teh, and M. Jordan, "Semiparametric latent factor models," Tech. Rep., 2005.
M. A. Álvarez and N. D. Lawrence, "Computationally efficient convolved multiple output Gaussian processes," JMLR, vol. 12, pp. 1459-1500, 2009.
J. Kalbfleisch and R. Prentice, The Statistical Analysis of Failure Time Data. John Wiley & Sons, 2011.
M. K. Titsias, "Variational model selection for sparse Gaussian process regression," in AISTATS, 2009, pp. 1-20.
J. Hensman, N. Fusi, and N. D. Lawrence, "Gaussian processes for big data," in UAI, 2013, pp. 282-290.
H. Soleimani, J. Hensman, and S. Saria, "Scalable joint models for reliable uncertainty-aware event prediction," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
C. K. Chow, "An optimum character recognition system using decision functions," IRE Transactions on Electronic Computers, no. 4, pp. 247-254, 1957.
A. L. Goldberger, L. A. N. Amaral, L. Glass, J. M. Hausdorff, P. C. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.-K. Peng, and H. E. Stanley, "PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals," Circulation, vol. 101, no. 23, pp. 215-220, 2000.
K. E. Henry, D. N. Hager, P. J. Pronovost, and S. Saria, "A targeted real-time early warning score (TREWScore) for septic shock," Science Translational Medicine, vol. 7, no. 299, p. 299ra122, 2015.
K. Dyagilev and S. Saria, "Learning (predictive) risk scores in the presence of censoring due to interventions," Machine Learning, vol. 102, no. 3, pp. 323-348, 2016.
R. Caruana, Y. Lou, J. Gehrke, P. Koch, M. Sturm, and N. Elhadad, "Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission," in KDD, 2015, pp. 1721-1730.
H. Soleimani, A. Subbaswamy, and S. Saria, "Treatment-response models for counterfactual reasoning with continuous-time, continuous-valued interventions," in UAI, 2017.
P. Schulam and S. Saria, "What-if reasoning with counterfactual Gaussian processes," arXiv preprint arXiv:1703.10651, 2017.
