SDM-DMMH 2015

4th Workshop on Data Mining for Medicine and Healthcare, May 2, 2015, Vancouver, BC, Canada

15th SIAM International Conference on Data Mining (SDM 2015)

Honorary Chair
Zoran Obradovic, Temple University

Workshop Chairs
Nitesh Chawla, University of Notre Dame
Gregor Stiglic, University of Maribor
Fei Wang, University of Connecticut

Overview

In virtually every country, the cost of healthcare is increasing more rapidly than the willingness and the ability to pay for it. At the same time, more and more data is being captured around healthcare processes in the form of Electronic Health Records (EHR), health insurance claims, medical imaging databases, disease registries, spontaneous reporting sites, and clinical trials. As a result, data mining has become critical to the healthcare world. On the one hand, EHR data offers the volume and richness that excites data miners; on the other hand, it comes with challenges such as 1) the unavailability of large sources of data to academic researchers and 2) limited access to data-mining experts. Healthcare entities are reluctant to release their internal data to academic researchers, and in most cases there is limited interaction between industry practitioners and academic researchers working on related problems. The objectives of this workshop are to:

1. Bring together researchers (from both academia and industry) as well as practitioners to present their latest problems and ideas.
2. Attract healthcare providers who have access to interesting sources of data and problems but lack the expertise in data mining to use the data effectively.
3. Enhance interactions between the data mining, text mining and visual analytics communities working on problems from medicine and healthcare.

Program Committee
Riccardo Bellazzi, University of Pavia
Andreas Holzinger, Medical University Graz
Zoran Obradovic, Temple University
Mykola Pechenizky, Technical University Eindhoven
Chandan Reddy, Wayne State University
Niels Peek, University of Manchester
Igor Pernek, Research Studios Austria
Ping Zhang, IBM T.J. Watson Research Center
Jiayu Zhou, Samsung Research America

Workshop Schedule: May 2, Saturday – Afternoon

13:30 – 14:00  Workshop Opening
               Invited Talk I: Big Data for Personalized Medicine: a case study of Biomarker Discovery
               Speaker: Raymond Ng

14:00 – 14:50  Paper presentations:
               - Improving confidence while predicting trends in temporal disease networks
                 Djordje Gligorijevic, Jelena Stojanovic and Zoran Obradovic
               - A simple and effective method for anomaly detection in healthcare
                 Luiz Fernando Carvalho, Carlos Teixeira, Ester Dias, Wagner Meira Jr. and Osvaldo Carvalho
               - *Data Analytics Framework for A Game-based Rehabilitation System
                 Jiongqian Liang, David Fuhry, David Maung, Roger Crawfis, Lynne Gauthier, Arnab Nandi and Srinivasan Parthasarathy
               - *Modelling Trajectories for Diabetes Complications
                 Pranjul Yadav, Lisiane Pruinelli, Andrew Hangsleben, Sanjoy Dey, Katherine Hauwiller, Bonnie Westra, Connie Delaney, Vipin Kumar, Michael Steinbach and Gyorgy Simon

14:50 – 15:00  Coffee Break

15:00 – 15:30  Invited Talk II: Semantic Data Mining and Deep Learning in Health Datasets
               Speaker: Dejing Dou

15:30 – 16:00  Invited Talk III: Efficient and Effective Algorithms for Analyzing Large Scale Genetic Interactions
               Speaker: Xiang Zhang

16:00 – 16:25  Paper presentations:
               - Mining Strongly Correlated Intervals with Hypergraphs on Health Social Network
                 Hao Wang, Dejing Dou, Yue Fang and Yongli Zhang
               - *Hospital Corners and Wrapping Patients in Markov Blankets
                 Dusan Ramljak, Adam Davey, Alexey Uversky, Shoumik Roychoudhury and Zoran Obradovic

16:25 – 16:55  Invited Talk IV: TBA
               Speaker: Zenglin Xu

16:55 – 17:00  Closing Remarks

*Short papers have 10 minutes for presentation and questions; long papers have 15 minutes.

Table of Contents

Improving confidence while predicting trends in temporal disease networks
  Djordje Gligorijevic, Jelena Stojanovic and Zoran Obradovic ................ 1-10

Data Analytics Framework for A Game-based Rehabilitation System
  Jiongqian Liang, David Fuhry, David Maung, Roger Crawfis, Lynne Gauthier, Arnab Nandi and Srinivasan Parthasarathy ................ 10-15

A simple and effective method for anomaly detection in healthcare
  Luiz Fernando Carvalho, Carlos Teixeira, Ester Dias, Wagner Meira Jr. and Osvaldo Carvalho ................ 16-24

Mining Strongly Correlated Intervals with Hypergraphs on Health Social Network
  Hao Wang, Dejing Dou, Yue Fang and Yongli Zhang ................ 25-33

Modelling Trajectories for Diabetes Complications
  Pranjul Yadav, Lisiane Pruinelli, Andrew Hangsleben, Sanjoy Dey, Katherine Hauwiller, Bonnie Westra, Connie Delaney, Vipin Kumar, Michael Steinbach and Gyorgy Simon ................ 34-39

Hospital Corners and Wrapping Patients in Markov Blankets
  Dusan Ramljak, Adam Davey, Alexey Uversky, Shoumik Roychoudhury and Zoran Obradovic ................ 40-45

Improving confidence while predicting trends in temporal disease networks

Djordje Gligorijevic∗, Jelena Stojanovic∗, Zoran Obradovic∗

∗ Center for Data Analytics and Biomedical Informatics, Temple University. {gligorijevic, jelena.stojanovic, zoran.obradovic}@temple.edu

Abstract

For highly sensitive real-world predictive analytic applications such as healthcare and medicine, having good prediction accuracy alone is often not enough. These kinds of applications require a decision making process which uses uncertainty estimation as input whenever possible. The quality of uncertainty estimation suffers when predictions are over- or under-confident, which is often not addressed in many models. In this paper we show several extensions of the Gaussian Conditional Random Fields model which aim to provide higher quality uncertainty estimation. These extensions are applied to the temporal disease graph built from the State Inpatient Database (SID) of California, acquired from the HCUP. Our experiments demonstrate the benefits of using graph information in modeling temporal disease properties, as well as improvements in uncertainty estimation provided by the given extensions of the Gaussian Conditional Random Fields method.

1 Introduction

The increased availability of Electronic Health Record (EHR) data has created suitable conditions for various studies aimed at improving the quality of healthcare. Data mining studies on EHR data have shown the potential for improvement of healthcare systems [4, 26]. These methods aim at building stable frameworks for estimating different states in healthcare systems and providing significant knowledge to healthcare practitioners. Applying predictive modeling to health outcomes related to medical data in terms of diseases, procedures, mortality, and other measures may have a huge impact on quality of treatment, improve detection of high-risk groups of patients, or detect important effects not taken into consideration in prior medical treatments. These are some of the problems that can be addressed by inspecting and analyzing medical data. However, one of the problems these studies encounter is data locality, for example in applications where data records are specific to particular hospitals (e.g., only general hospitals) or groups of patients (e.g., Medicare patients). Therefore, it is very difficult to create models that provide good generalization performance for chosen applications [22]. An additional problem in the analysis is posed by the heterogeneity of the data due to different data sources, often having different quality. This data heterogeneity can potentially compromise the quality of conclusions derived from the given analysis.

For these reasons, we focus on analysis of the Healthcare Cost and Utilization Project (HCUP) database [1], particularly the SID database from the hospitals in the state of California. This database contains more than 35 million inpatient discharge records over 9 years, and it is not specific to a group of hospitals but contains data for the entire state. The data can therefore be regarded as non-local for California, and any analysis can be generalized to that state. For each patient in the database there is demographic information (like age, birth year, sex, race), diagnoses (primary and up to 25 others), procedures (up to 25), and information about hospital stays (like length of stay, total charges, type of payment and payer, discharge month, survival information). Using this data could potentially give us insight into many healthcare problems.

Among many difficult tasks, capturing global trends of diseases is one that intrigues many healthcare practitioners [9, 11]. Being able to confidently predict future trends of diseases may lead to better anticipation in healthcare systems and allow better decision making, which should consequently provide higher quality service to those who are in need of it. As many diseases are related, graphical modeling of disease data may be beneficial for predicting disease trends. As such, several types of graphs may be built and several prediction tasks may be imposed on these graphs. In the literature, several networks have been constructed to study the connections between human diseases. The nodes of the networks are disease codes, while the links are derived based on common genes [10], shared metabolic pathways [7], connection to a common miRNA molecule [13], or observed comorbidity [8]. These networks have been used for the study of obesity [2], illness progression [8], association of diseases with cellular interactions [14], properties of the genetic origin of diseases [10], etc.

Since the HCUP data are EHR data, we are primarily interested in modeling phenotypic networks. These kinds of networks were introduced in [3, 8]. In [3], a novel multi-relational link prediction method is proposed and it is shown that disease comorbidity can enhance the current knowledge of genetic association. In our study, we concentrate on disease-based graphs (we have opted to model the 253 diseases coded with the CCS schema rather than diseases coded with ICD-9 codes, because CCS is an empirically built schema interpretable by a wider audience rather than only medical experts). Specifically, disease-based graphs can be built as comorbidity graphs based on phenotypic patient data (as described in Section 3.1), but also from other disease links based on common genes [3], ontologies, common patient profile, or common history. As mentioned before, many graphs were proposed in previous work on disease data, and we now aim to utilize such constructed graphs to improve the predictive power of unstructured models, which to our knowledge has not been done before.

For this task we use the Gaussian Conditional Random Fields (GCRF) model. The GCRF model [18] has been successfully applied to many real-world problems, providing improvements over state-of-the-art models while maintaining reasonable scalability. Some of its applications are in the climatology domain [6, 18, 23], modeling patients' response to acute inflammation treatment [17], traffic estimation [5], image de-noising problems [21], etc.

When models provide predictions for real-world problems, it can be very important to report the uncertainty estimation of the prediction. This is especially true for domains where predictions are used for important decision making, such as health. The objective of this paper is to improve the quality of prediction uncertainty estimation in the GCRF model for evolving graphs, in order to more accurately represent the confidence of the prediction for healthcare applications. For example, instead of predicting that the number of admissions for a disease is going to be 15,026±10,000, a different estimate of 15,000±150 can be more useful for the decision making process. Therefore, we aim to address this important topic in this paper. We experimentally demonstrate improvement of predictive uncertainty by removing bias of the GCRF framework via representing its parameters as functions. We use two approaches: one that models parameters as a function of the uncertainty estimation of unstructured predictors [17] (where we extend the initial approach as described in Section 2.3.1), and another that models parameters as neural networks [19] (described in Section 2.3.2).

In the following section, we first describe how several unstructured models (Section 2.1)

and the GCRF method (Section 2.2.2) allow modeling uncertainty, as well as how the uncertainty estimation can be improved by modeling parameters of the GCRF model as functions (Sections 2.3.1 and 2.3.2). Further, we describe in more detail the data we used in our experiments and the way the graphs are constructed from this data (Section 3). Experiments are described in Section 4, and results in terms of accuracy and uncertainty estimation quality are given in the same section. Finally, we conclude this study and give some further goals in Section 5.

2 Models

We aim to model complex trends of diseases in an autoregressive manner utilizing linear and non-linear models. These models will be called unstructured predictors, as they have no information on the graph structure that might prove beneficial for predictive performance. Therefore, the structured GCRF framework will be used for introducing graph structure to these models.

2.1 Unstructured Models

For modeling disease trends we use several unstructured predictors: Linear regression models and Gaussian Processes regression models, each learned with several lags. These methods are evaluated and the ones with the best performance are chosen as inputs for the GCRF framework, which then introduces graph structure information to the prediction, providing a certain level of correction. As the uncertainty of the unstructured predictors is additional information we use in some of our experiments, we provide formulas for retrieving uncertainty from the unstructured predictors.

2.1.1 Linear AR Model. The linear regression form of the auto-regressive (AR) representation is

(2.1)  $y_i = w^T x_i + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma_y^2),$

where w is an unknown set of weights. The weights are estimated by

(2.2)  $\hat{w} = (X^T X)^{-1} X^T y.$

The variance of the linear predictor is given by

(2.3)  $\sigma_y^2 = \frac{(y - \hat{w}^T X)^T (y - \hat{w}^T X)}{N - k - 1},$

where X is the matrix representation of all data available for training, N is the number of training examples and k is the number of attributes. The prediction y_* and uncertainty estimation σ_*^2 for test data x_* are found using

(2.4)  $y_* = \hat{w}^T x_*,$

(2.5)  $\sigma_*^2 = \sigma_y^2 \left(1 + x_* (X^T X)^{-1} x_*^T\right).$
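As a concrete illustration, the following is a small sketch of the linear AR predictor of (2.1)-(2.5). It assumes NumPy arrays of lagged admission rates and is not the paper's code; names such as fit_linear_ar are ours.

```python
# Minimal sketch of the linear AR predictor: weights (2.2), noise variance (2.3),
# prediction (2.4) and predictive uncertainty (2.5).
import numpy as np

def fit_linear_ar(X, y):
    """X: (N, k) matrix of lagged inputs, y: (N,) targets."""
    w_hat = np.linalg.solve(X.T @ X, X.T @ y)                    # (2.2)
    resid = y - X @ w_hat
    N, k = X.shape
    sigma2_y = float(resid @ resid) / (N - k - 1)                # (2.3)
    return w_hat, sigma2_y

def predict_linear_ar(X, w_hat, sigma2_y, x_star):
    y_star = x_star @ w_hat                                      # (2.4)
    XtX_inv = np.linalg.inv(X.T @ X)
    sigma2_star = sigma2_y * (1.0 + x_star @ XtX_inv @ x_star)   # (2.5)
    return y_star, sigma2_star

# toy usage: predict the next value of a series from the previous one (lag 1)
rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=120))
X, y = series[:-1].reshape(-1, 1), series[1:]
w_hat, s2 = fit_linear_ar(X, y)
print(predict_linear_ar(X, w_hat, s2, np.array([series[-1]])))
```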

2.1.2 Gaussian Processes Regression Model. A Gaussian process (GP) is a generalization of a multivariate Gaussian distribution over a finite vector space to a function space of infinite dimension [20]. A Gaussian prior is assumed over the functions that map x to y:

(2.6)  $y_i = f(x_i) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma_y^2).$

Gaussian processes are defined with

(2.7)  $f(x) \sim \mathcal{GP}(m(x), k(x, x')),$

where m(x) is the mean function and k(x, x') is the covariance function in the form of a kernel that is required to be positive definite. In our implementation we are using a Gaussian kernel:

(2.8)  $k(x_i, x_j) = \sigma_y^2 \exp\left(-\frac{1}{2}\sum_{d=1}^{D}\frac{(x_i^d - x_j^d)^2}{w_d^2}\right).$

Here, σ_y^2 is the white noise variance parameter of the Gaussian prior. If we denote the covariance of the training part as C = K + σ_y^2 I_N, with K_ij = k(x_i, x_j), the joint density of the observed outputs y and test output y_* is

(2.9)  $\begin{pmatrix} y \\ y_* \end{pmatrix} \sim \mathcal{N}\left(0, \begin{pmatrix} C & k_* \\ k_*^T & c_* \end{pmatrix}\right),$

where k_* is the covariance vector for the new test point x_*. The posterior predictive density p(y_*|x_*, X, y) is Gaussian with

(2.10)  $\mu_* = k_*^T C^{-1} y,$

(2.11)  $\sigma_*^2 = c_* - k_*^T C^{-1} k_*.$

The σ_*^2 is the predictive variance, or uncertainty, at test point x_* [20].
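A minimal GP-regression sketch of (2.8)-(2.11) is given below. It is a hypothetical stand-in for the GP predictor used here, with kernel parameters fixed rather than learned, and with array layouts of our choosing.

```python
# GP prediction with the Gaussian kernel (2.8) and the posterior mean/variance
# of (2.10)-(2.11); illustrative only.
import numpy as np

def gaussian_kernel(A, B, sigma2_y, w):
    # (2.8): squared-exponential kernel with per-dimension length scales w
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2 / w ** 2).sum(-1)
    return sigma2_y * np.exp(-0.5 * d2)

def gp_predict(X, y, x_star, sigma2_y=1.0, w=None):
    w = np.ones(X.shape[1]) if w is None else w
    C = gaussian_kernel(X, X, sigma2_y, w) + sigma2_y * np.eye(len(X))
    k_star = gaussian_kernel(X, x_star[None, :], sigma2_y, w).ravel()
    c_star = gaussian_kernel(x_star[None, :], x_star[None, :], sigma2_y, w)[0, 0]
    mu_star = k_star @ np.linalg.solve(C, y)                      # (2.10)
    sigma2_star = c_star - k_star @ np.linalg.solve(C, k_star)    # (2.11)
    return mu_star, sigma2_star
```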



2.2 Structured Models

The benefits of using structured methods have emerged in the previous decade, and there is much evidence of the advantage of structured methods over unstructured ones [18]. As many datasets can be presented as graphs, these methods have gained popularity and are considered state-of-the-art in many domains.

2.2.1 Continuous Conditional Random Fields. Continuous Conditional Random Fields were proposed by [15], and model the conditional distribution

(2.12)  $P(y|X) = \frac{1}{Z(X, \alpha, \beta)} \exp(\phi(y, X, \alpha, \beta)).$

The term in the exponent φ(y, X, α, β) and the normalization constant Z(X, α, β) are defined as

(2.13)  $\phi(y, X, \alpha, \beta) = \sum_{i=1}^{N} A(\alpha, y_i, X) + \sum_{i \sim j} I(\beta, y_i, y_j, X),$

(2.14)  $Z(X, \alpha, \beta) = \int_y \exp(\phi(y, X, \alpha, \beta))\, dy.$

Function A is called the association function, and it represents any function that handles the mapping X → y with respect to input variables X. Function I is called the interaction potential, and it handles any relation that the two data instances y_i and y_j have. In order to efficiently define the CRF form, association and interaction potentials are defined as linear combinations of feature functions f and g [12]:

(2.15)  $A(\alpha, y_i, X) = \sum_{k=1}^{K} \alpha_k f_k(y_i, X),$

(2.16)  $I(\beta, y_i, y_j, X) = \sum_{l=1}^{L} \beta_l g_l(y_i, y_j, X).$

2.2.2 Gaussian Conditional Random Fields. Feature function f_k can be represented as the residual error of any unstructured predictor R_k,

(2.17)  $f_k(y_i, X) = -(y_i - R_k(X))^2, \quad k = 1, \dots, K,$

and feature function g_l as the residual error of a given network similarity S_ij^(l),

(2.18)  $g_l(y_i, y_j, X) = -S_{ij}^{(l)}(y_i - y_j)^2, \quad l = 1, \dots, L.$

Then, the final GCRF takes the following log-linear form:

$P(y|X) = \frac{1}{Z} \exp\left(-\sum_{i=1}^{N}\sum_{k=1}^{K} \alpha_k (y_i - R_k(X))^2 - \sum_{i \sim j}\sum_{l=1}^{L} \beta_l S_{ij}^{(l)} (y_i - y_j)^2\right),$

where α and β are parameters of the feature functions, which model the association of each y_i and X, and the interaction between different y_i and y_j in the graph, respectively. Here the R_k(X) functions are any unstructured predictors that map X → y_i independently, and might also be used to incorporate domain-specific models. The similarity matrix S is used to define the weighted undirected graph structure between labels. This choice of feature functions enables us to represent this distribution as a multivariate Gaussian [18] to ensure efficient and convex optimization:

(2.19)  $P(y|X) = \frac{1}{(2\pi)^{N/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(y - \mu)^T Q (y - \mu)\right),$

where Q represents the inverse covariance matrix:

(2.20)  $Q_{ij} = \begin{cases} 2\sum_{k=1}^{K} \alpha_k + 2\sum_{l=1}^{L} \beta_l e_{ij}^{(l)} S_{ij}^{(l)}(x), & i = j \\ -2\sum_{l=1}^{L} \beta_l e_{ij}^{(l)} S_{ij}^{(l)}(x), & i \neq j \end{cases}$

The posterior mean is given by

(2.21)  $\mu = Q^{-1} b,$

where b is defined as

(2.22)  $b_i = 2\sum_{k=1}^{K} \alpha_k R_k(x).$

This specific way of modeling allows efficient inference and learning. It should be noted that the GCRF model does not depend on the input variables X directly, but via its inputs R_k(X) and S_ij^(l), which are learned prior to GCRF learning. The parameters α and β serve as confidence indicators; however, they are not sensitive to any change in the distribution of input variables, which is common in real-life datasets, thus making them biased towards the unstructured predictors, whose performance may vary on the dataset. In this paper we aim to address this bias problem of the GCRF model; the extensions are described in Sections 2.3.1 and 2.3.2.

Learning and inference. The learning task is to optimize parameters α and β by maximizing the conditional log-likelihood,

(2.23)  $(\hat{\alpha}, \hat{\beta}) = \arg\max_{\alpha, \beta} \log P(y|X; \alpha, \beta),$

which is a convex objective and can be optimized using standard quasi-Newton optimization techniques. Note that there is one constraint needed to assure the distribution is Gaussian, which is to make the Q matrix positive semi-definite. To ensure this, we use the exponential transformation of parameters, α_k = e^{u_k} and β_l = e^{v_l}, as suggested in [15], to make the optimization unconstrained.

2.3 Parameters of GCRF as functions

The GCRF intrinsically possesses uncertainty estimation. This uncertainty estimation is highly biased towards the data the model was trained on, as GCRF does not depend on input variables directly. Once the parameters of the GCRF model are introduced as functions, these functions impact both parameter values and scale (note that Σ_k α_k + Σ_l β_l = I, where I is a unit vector). Parameters in the GCRF represent the degree of belief toward each unstructured predictor and similarity measure. If they are modeled to be dependent on input variables, the bias problem will be solved and thus uncertainty estimation should be improved. The uncertainty estimation improvement is achieved both by better fitting parameters and by altering the unit scale to better fit the data. We experimentally evaluate our assumptions on modeling parameters of the GCRF model as functions in terms of uncertainty estimation and prediction accuracy (Section 4). In this section we summarize potential ways of handling the bias in the GCRF model caused by the parameters α_k.

2.3.1 Parameters α_k as functions of unstructured predictor's uncertainty. First, we address the problem of model bias by modeling the overall model uncertainty as a function of the uncertainty of each unstructured prediction. The principal assumption is that the chosen unstructured predictors can output uncertainty for their predictions. Initial GCRF uncertainty estimation improvements were made in modeling time series of patients' response to acute inflammation treatment [17], showing that the uncertainty estimation of GCRF modeled in this way provides higher quality uncertainty than this measure for the utilized unstructured predictors. The new parameters of the GCRF model are modeled such that

(2.24)  $\alpha_{k,p} = \frac{e^{u_{k,p}}}{\sigma_{k,1}^2}, \quad \beta = e^{v},$

where σ_{k,1}^2 represents the uncertainty estimation of unstructured predictor k for the first time step (p = 1), while α and β are coefficients for the feature functions.

We have extended this work by relaxing a few assumptions made in [17], as well as by applying uncertainty estimation on evolving graph data (rather than applying it on linear-chain data). The first assumption of the previous work was that the log-likelihood of GCRF would be optimized with respect to the uncertainty of each unstructured predictor, but only considering the first time step of the respective model. This follows the homoscedasticity assumption that the variance of the model will not change significantly through time. This strong assumption of homoscedasticity is dropped in our study, and the parameters of the GCRF model are optimized with respect to the uncertainty estimation of each unstructured predictor at each time step. This allows the model to optimize different parameters for different prediction horizons. Thus, if predictions of the unstructured predictor increase uncertainty further in the future, GCRF will also adjust accordingly.

We have further improved uncertainty estimation by penalizing the predictors whose uncertainty estimation was not good enough during validation. This could be done by any quality measure of uncertainty estimation on the training data. We use the percentage of nodes that fall into the 95% confidence interval (ci_{k,p}) as a quality index to augment our approach:

(2.25)  $\alpha_{k,p} = \frac{e^{u_{k,p}}}{\sigma_{k,p}^2}\, ci_{k,p}.$
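To make the inference step concrete, here is a minimal sketch of GCRF inference following the canonical form of (2.19)-(2.22), together with the uncertainty-scaled parameters of (2.25). It is not the authors' implementation; array shapes, the neighbor-summed diagonal, and all names are our assumptions for illustration.

```python
import numpy as np

def gcrf_inference(R, alpha, S, beta):
    """R: (K, N) unstructured predictions, alpha: (K,) or (K, N) positive weights,
    S: (L, N, N) symmetric similarity matrices, beta: (L,) positive weights."""
    K, N = R.shape
    alpha = np.broadcast_to(np.asarray(alpha, float).reshape(K, -1), (K, N))
    B = np.tensordot(beta, S, axes=1)                 # combined weighted similarity
    Q = -2.0 * B                                      # off-diagonal part of (2.20)
    diag = 2.0 * alpha.sum(0) + 2.0 * (B.sum(1) - np.diag(B))
    np.fill_diagonal(Q, diag)
    b = 2.0 * (alpha * R).sum(0)                      # (2.22)
    mu = np.linalg.solve(Q, b)                        # (2.21), posterior mean
    var = np.diag(np.linalg.inv(Q))                   # per-node uncertainty
    return mu, var

def scaled_alpha(u, sigma2_k, ci_k):
    """(2.25): confidence in predictor k scaled by its uncertainty sigma2_k and
    the fraction ci_k of validation nodes inside its 95% interval."""
    return np.exp(u) / sigma2_k * ci_k
```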

2.3.2 Parameters α_k as functions of input variables. The previous approach can be further generalized by observing the parameters α as functions of input variables, as proposed in [19]. However, [19] reported no experimental results on uncertainty improvement and did not discuss this aspect of the extension. We can observe the parameter α as a parameterized function of input variables, α(θ_k, x), where θ_k are the parameters and x are the input variables of the function α. Due to the positive semi-definiteness constraint on the precision matrix, the function α(θ_k, x) becomes

(2.26)  $\alpha_k = e^{u_k(\theta_k, x)},$

where u_k(θ_k, x) is, in [19], a feed-forward neural network. This method is easily optimized using gradient descent methods. These functions are titled uncertainty functions, since they provide significant improvement to the covariance matrix in terms of bias correction.

3 Medical Data

The data source we used in this study, as mentioned previously, is the State Inpatient Databases (SID), an archive that stores the universe of inpatient discharge abstracts from data organizations. It is provided by the Agency for Healthcare Research and Quality and is included in the Healthcare Cost and Utilization Project (HCUP) [1]. In particular, we have access to the SID California database, which contains 35,844,800 inpatient discharge records over 9 years (from January 2003 to December 2011) for 19,319,350 distinct patients in 474 hospitals (436 AHAID identified; about 400 per year). For each patient there are up to 25 diagnosis codes in both CCS and ICD-9 coding schemas, together with up to 25 procedures applied during the particular admission of the patient. There is also some demographic information about each patient, like age, birth year, sex, race, etc., as well as information about hospital stays: length of stay, total charges, type of payment, the payer, discharge month, and survival information.

The availability of this information made it possible to build several graphs and to explore the usefulness of attributes and links of the graph for a chosen target attribute. In this study, we focused on disease-based graphs and prediction of the number of admissions for each disease. Graphs are constructed for these 9 years at monthly resolution. In our experiments, we used 231 out of 259 diseases that we were able to follow throughout all 9 years. For each node, we include temporal information, meaning that there are one, two or three previous values of the target variable (we refer to these attributes as lag1, lag2, and lag3) as attributes of a node in the current time step (more details about the utilization of those attributes are given in Section 4). When it comes to the structure, we explored several cases.

3.1 Comorbidity graph. Disease-based graphs can be built as comorbidity graphs based on phenotypic patient data [3]. However, this type of graph did not provide satisfactory results, since the variogram [25] for the selected target attributes was not appropriate (Figure 1). Variograms were inspected for two behaviors: displaying a decreasing trend in variance at higher values of similarity, and falling below the line of the overall variance of the data. By [24], this similarity measure would be characterized as a "bad" similarity measure. Therefore, we researched several other disease links, like those based on common patient profile or common history.

Figure 1: Comorbidity graph variogram. Blue line represents disease comorbidity similarity and red line represents overall variance.

3.2 Jensen-Shannon divergence graph. For the first case we measured the Jensen-Shannon divergence between the distributions of two diseases based on several attributes:

(3.27)  $JSD(P\,\|\,Q) = \frac{1}{2}\left(KLD(P\,\|\,M) + KLD(Q\,\|\,M)\right),$

where KLD is the Kullback-Leibler divergence, P and Q are the distributions of the selected disease attribute for two observed diseases, and M = ½(P + Q). The similarity obtained from this divergence is

(3.28)  $S(y_p, y_q) = \frac{1}{JSD(x_p\,\|\,x_q)}.$

Utilizing the variogram technique showed that using the distribution of white people admitted for each disease gave the best performance among all other attributes of each disease (Figure 2).
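As an illustration of how such a similarity could be computed, the sketch below implements (3.27)-(3.28) for two histogram-shaped attribute distributions. It is not the authors' pipeline; the epsilon guard and the toy histograms are our additions.

```python
# Jensen-Shannon-divergence similarity between two diseases' attribute
# distributions, following (3.27)-(3.28).
import numpy as np

def kld(p, m):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / m[mask])))

def js_similarity(p, q, eps=1e-12):
    p, q = np.asarray(p, float), np.asarray(q, float)
    p, q = p / p.sum(), q / q.sum()            # normalize to distributions
    m = 0.5 * (p + q)
    jsd = 0.5 * (kld(p, m) + kld(q, m))        # (3.27)
    return 1.0 / (jsd + eps)                   # (3.28); eps guards jsd == 0

# toy usage: admission histograms of one attribute for two diseases
print(js_similarity([10, 40, 30, 20], [12, 35, 33, 20]))
```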

Figure 2: JS divergence graph variogram. Blue line represents disease similarity according to JS divergence and red line represents overall variance.

3.3 Common history graph. The best performance so far was obtained by the structure built on common history links, as can be seen in Figure 3, calculated using the formula

(3.29)  $S(y_i, y_j) = \exp\left(-\operatorname{mean}\left(\left|x_i^h - x_j^h\right|\right)\right).$

Here, x_i^h represents the vector of length h of the given attribute of node x_i. For example, if we were observing the previous 2 time steps, the similarity would be

(3.30)  $S(y_i, y_j) = \exp\left(-\frac{\left|(x_i^{t_1} - x_j^{t_1}) + (x_i^{t_2} - x_j^{t_2})\right|}{2}\right).$

The variogram was obtained using historical information on admitted whites for each observed disease in the previous three timesteps. As this variogram is the best, we have decided to use this measure in our experiments.

Figure 3: Similarity history variogram. Blue line represents disease history similarity and red line is overall variance.

4 Experiments

We have characterized the described methodology on the HCUP evolving disease graph. Our goal is to predict the number of people admitted for a disease as the primary one in the twelve months of the year 2011 in the state of California, which is the last year in our database. We have normalized the admission count to a 0-1 scale, making admission rate the target variable. Predicted values can easily be converted back to counts of people admitted for that disease. Between nodes in each timestep we have defined the similarity values based on the exploration in Section 3.

For the unstructured predictors we have used a linear and a non-linear model. Both models were trained in an autoregressive fashion for each disease separately, observing several previous timesteps to infer one timestep ahead. For each of the two models we have used target variables of up to three timesteps in history (lags) as features of the models. Unstructured predictors were optimized with a sliding window of 12 months. Among the unstructured predictors, the best linear and non-linear predictors were chosen as inputs into the GCRF model. Separate GCRF models were optimized with linear and non-linear predictors so the effect of each on the results could be characterized separately.

We train and compare three different GCRF models. The first is the original GCRF model (described in Section 2.2.2), with the slight difference that an α parameter was optimized for each node in the graph separately, for fair comparison with the other GCRF models. We will denote this model simply as GCRF. The second GCRF model is the one described in Section 2.3.1, which uses the uncertainty estimation of the unstructured predictor to rescale the parameter α. In our results this method is called uGCRF. Finally, the third GCRF model is the one described in Section 2.3.2, where the parameters α are modeled as feed-forward neural networks. This last GCRF model, with uncertainty functions, is denoted ufGCRF in our results.

We characterize all methods for their predictive power using the root mean squared error (RMSE). As mentioned before, an important goal of the described approaches is improved uncertainty estimation, which is evaluated using negative log predictive density as described below.

Uncertainty estimation quality measure. The quality of uncertainty estimates is evaluated by measuring the average Negative Log Predictive Density (NLPD), a measure that incorporates both the accuracy and the uncertainty estimate of the model. The NLPD metric penalizes both over- and under-confident predictions. This measure has also been used in data analysis competitions [16] where uncertainty estimation was of major importance. Smaller values of NLPD correspond to better quality of the estimates:

(4.31)  $NLPD = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{(y_i - y_{i*})^2}{2\sigma_{i*}^2} + \log \sigma_{i*}\right),$

where σ_{i*}^2 is the predicted uncertainty. For a given prediction y_{i*}, NLPD reaches its minimum for σ_{i*}^2 = (y_i - y_{i*})^2.
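For clarity, the NLPD of (4.31) can be computed in a few lines; the toy values below are placeholders that only illustrate how badly calibrated variances are penalized.

```python
# Average Negative Log Predictive Density, equation (4.31).
import numpy as np

def nlpd(y_true, y_pred, sigma2_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    sigma2 = np.asarray(sigma2_pred, float)
    # 0.5 * log(sigma2) equals log(sigma), matching the formula above
    return float(np.mean((y_true - y_pred) ** 2 / (2.0 * sigma2)
                         + 0.5 * np.log(sigma2)))

print(nlpd([0.5, 0.7], [0.6, 0.65], [1e-4, 1e-4]))   # overconfident: large NLPD
print(nlpd([0.5, 0.7], [0.6, 0.65], [0.01, 0.01]))   # variance near error scale
```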

Experimental results. In our experiments, unstructured predictors provide higher predictive accuracy when using only the previous timestep as input (lag 1). As such, we have used Linear Regression and Gaussian Processes regression with lag 1 as the unstructured models for the structured GCRF model. Results are shown in Table 1; the best results are bolded, the runner-up results are underlined, and the third best results are in italic. From Table 1 it can be observed that the GCRF model using GP as the unstructured predictor gives the most accurate result, and that the two other GCRF models introduce slightly more error than the original GCRF model. The main objective of uncertainty functions is to scale belief towards unstructured predictors with respect to changes of their respective input variables. Rescaling variance values (from which uncertainty is expressed) is another, more robust effect of uncertainty functions that significantly benefits the model's uncertainty estimation, as described in Section 2.3. Following the original GCRF results, the ufGCRF model introduces less error than the uGCRF method, deeming it superior among the two extensions in this experiment. It should be noted that all displayed GCRF methods outperform all of the unstructured predictors. This evidently comes from the structure information included in the prediction. As a reminder, we are using the common history graph described in Section 3.3.

The other aspect of characterizing the methods is evaluation of uncertainty quality, which is an important concept in this study. Uncertainty estimation is evaluated using the NLPD metric and the results are shown in Table 2. The original GCRF model provides the lowest quality of uncertainty estimation among all of the methods. This can be attributed to the underconfident predictions of the GCRF model, which can also be observed in Figure 4. By underconfidence we refer to the estimated confidence interval being too wide (the large gray area in Figure 4 in the case of the original GCRF). For instance, in Figure 4 we observe that GCRF gives an estimate for admissions of Hepatitis of about 0.61±0.78, making the predictions of the GCRF model less useful. The two extensions narrow down the unnecessarily large confidence interval of the original GCRF, with predictions of 0.81±0.16 for uGCRF and 0.71±0.25 for ufGCRF, which can be more helpful for the decision making process. Since the NLPD measure takes into account both accuracy of the model and uncertainty estimation quality, this ability of the two extensions to narrow the confidence interval is also reflected in the results shown in Table 2. The fact that ufGCRF has better predictive power gives it better NLPD results than uGCRF, which shows good uncertainty estimation but introduces a bit more error than ufGCRF. The best performance in terms of the NLPD measure is given by ufGCRF using LR as the unstructured predictor, followed by ufGCRF with GP, so we can conclude that this GCRF extension gives the best uncertainty estimation in our experiments among all models.

Figure 4: Confidence interval plots for the three GCRF methods. Blue lines represent true values of admission rate for Hepatitis disease (24 months for years 2010 and 2011), while red lines represent prediction of the respective GCRF.

The comparison of uncertainty estimation quality of the three GCRF methods is also given in Figure 5 (this figure shows results using LR as the unstructured predictor; results using GP are similar and are omitted for lack of space). Each bar represents the mean NLPD measure for each disease averaged over the 12-month testing period. Red bars represent the minimal NLPD value for the given predictions (the best possible uncertainty estimation), while blue bars represent the actual uncertainty quality obtained for the NLPD metric in the experiments. Values are sorted in descending order to show the ratio of the blue and red surface, where the model that has a more dominant blue surface is better. We can see that GCRF fails to reach the minimal NLPD for its predictions, while uGCRF does a much better job. However, the best performance is provided by ufGCRF, where we see that the obtained NLPD is very close to the minimal NLPD for the given predictions in most cases. Both uGCRF and ufGCRF appear to estimate uncertainty much closer to the true error variance, with ufGCRF having superior performance in both uncertainty estimation and predictive power among the two extensions. Also, note that both uGCRF and ufGCRF outperform the GCRF model for each disease in terms of uncertainty estimation.

Figure 5: Min and obtained NLPD (y-axis; smaller is better) for all 231 diseases (x-axis) for three GCRF models, (a) GCRF, (b) uGCRF, (c) ufGCRF, with Linear Regression as the unstructured predictor. Red bars represent the average minimal NLPD value for the given prediction of the node over 12 months and blue bars are the average obtained NLPD value for that node over 12 months.

5 Conclusions and Future Work

In this study, the GCRF model is successfully applied to a challenging problem of admission rate prediction, based on a temporal graph built from HCUP (SID) data. For sensitive healthcare and medical applications, one should consider aspects of uncertainty and use this information in the decision making process. Thus, it was important to address this aspect of the methods and compare their quality of uncertainty estimation. In the experiments we characterize several unstructured (Linear Regression and Gaussian Processes with lag 1, lag 2 and lag 3) and structured predictors (original GCRF, uGCRF and ufGCRF) for their predictive error and quality of uncertainty estimation. All three structured models outperformed the unstructured ones in terms of predictive error, showing that structure brings useful information to this prediction task. In terms of quality of uncertainty estimation, however, the unstructured predictors did a better job than the original GCRF, but underperformed compared to the two extensions of the GCRF model. Even though the original GCRF model showed the best performance in predictions, it had the lowest quality of uncertainty estimation. While introducing a small predictive error, the uGCRF and ufGCRF models gained large improvements in uncertainty estimation, especially the ufGCRF model, which had the better predictive performance of the two GCRF extensions. As the next task in future work, we plan to introduce uncertainty propagation in the GCRF model as the predictive horizon increases.

6 Acknowledgments

This research was supported by DARPA Grant FA9550-12-1-0406 negotiated by AFOSR and by the National Science Foundation through major research instrumentation grant number CNS-09-58854. The Healthcare Cost and Utilization Project (HCUP), Agency for Healthcare Research and Quality, provided the data used in this study.

References

[1] Agency for Healthcare Research and Quality, HCUP State Inpatient Databases (SID), Healthcare Cost and Utilization Project (HCUP), Rockville, MD, 2005-2009.
[2] A.L. Barabasi, N. Gulbahce, and J. Loscalzo, Network medicine: a network-based approach to human disease, Nature Reviews Gen., (2011).
[3] D.A. Davis and N. Chawla, Exploring and exploiting disease interactions from multi-relational gene and phenotype networks, PLoS ONE, (2011).
[4] S. Dey, J.S. Gyorgy, B.L. Westra, M. Steinbach, and V. Kumar, Mining interpretable and predictive diagnosis codes from multi-source electronic health records, in SDM, 2014.
[5] N. Djuric, V. Radosavljevic, V. Coric, and S. Vucetic, Travel speed forecasting by means of continuous conditional random fields, Journal of the Transportation Research, (2011).
[6] N. Djuric, V. Radosavljevic, Z. Obradovic, and S. Vucetic, Gaussian conditional random fields for aggregation of operational aerosol retrievals, Geosc. and Rem. Sens. Let., (2015).
[7] D.S. Lee, J. Park, K.A. Kay, N.A. Christakis, Z.N. Oltvai, and A.L. Barabasi, The implications of human metabolic network topology for disease comorbidity, Proc. of the Nat. Acad. of Sc. of the U.S.A., PNAS, (2008).
[8] C.A. Hidalgo, N. Blumm, A.L. Barabasi, and N. Christakis, A dynamic network approach for the study of human phenotypes, PLoS Comput. Biol., (2009).
[9] K. Jones, N. Patel, M. Levy, A. Storeygard, D. Balk, J. Gittleman, and P. Daszak, Global trends in emerging infectious diseases, Nature 451, (2008).
[10] K.I. Goh, M.E. Cusick, D. Valle, B. Childs, M. Vidal, and A.L. Barabasi, The human disease network, Proc. of the Nat. Acad. of Sc. of the USA, PNAS, (2007).
[11] Chun-Chi L., Yu-Ting T., Wenyuan L., Chia-Yu W., Ilya M., Andrey R., Fengzhu S., Michael S.W., Jeremy C., Preet C., Joseph L., Edward C., and Xianghong Z., DiseaseConnect: a comprehensive web server for mechanism-based disease-disease connections, Nucl. Acids Res., (2014).


Table 1: RMSE results of admission rate prediction. The first row contains averaged results over the 12-month prediction period and the following rows contain prediction results for each month of the year 2011. The best results are bolded, second best are underlined and third best are in italic.

RMSE       LR lag1  LR lag2  LR lag3  GP lag1  GP lag2  GP lag3  GCRF+LR  uGCRF+LR  ufGCRF+LR  GCRF+GP  uGCRF+GP  ufGCRF+GP
Average    0.3168   0.3296   0.3536   0.3098   0.3129   0.3158   0.2486   0.3072    0.2713     0.2447   0.3013    0.2685
January    0.3027   0.3236   0.3483   0.3013   0.3073   0.3092   0.2372   0.2938    0.2540     0.2365   0.2923    0.2569
February   0.3267   0.3262   0.3476   0.3306   0.3281   0.3299   0.2769   0.3193    0.2785     0.2686   0.3211    0.2839
March      0.3504   0.3644   0.3967   0.3309   0.3338   0.3354   0.2795   0.3395    0.3115     0.2717   0.3229    0.2943
April      0.3083   0.3191   0.3509   0.2933   0.2940   0.2961   0.2328   0.2956    0.2523     0.2298   0.2840    0.2516
May        0.3004   0.3051   0.3312   0.2935   0.2952   0.2949   0.2343   0.2900    0.2526     0.2325   0.2850    0.2532
June       0.3153   0.3228   0.3553   0.3097   0.3132   0.3182   0.2486   0.3073    0.2709     0.2472   0.3027    0.2718
July       0.3119   0.3230   0.3465   0.3063   0.3127   0.3172   0.2464   0.3025    0.2662     0.2460   0.2983    0.2669
August     0.3196   0.3331   0.3487   0.3176   0.3221   0.3239   0.2509   0.3106    0.2764     0.2532   0.3100    0.2762
September  0.3154   0.3341   0.3588   0.3124   0.3141   0.3160   0.2449   0.3061    0.2745     0.2415   0.3038    0.2697
October    0.3132   0.3320   0.3490   0.3035   0.3071   0.3092   0.2434   0.3041    0.2727     0.2362   0.2953    0.2635
November   0.3142   0.3326   0.3535   0.2988   0.3029   0.3061   0.2435   0.3049    0.2746     0.2320   0.2903    0.2610
December   0.3230   0.3393   0.3566   0.3200   0.3236   0.3330   0.2445   0.3123    0.2716     0.2410   0.3102    0.2734

Table 2: NLPD results of admission rate prediction. The first row contains averaged results over the 12-month prediction period and the following rows contain prediction results for each month of the year 2011. The best results are bolded, second best are underlined and third best are in italic.

NLPD       LR lag0  LR lag1  LR lag2  GP lag0  GP lag1  GP lag2  GCRF+LR  uGCRF+LR  ufGCRF+LR  GCRF+GP  uGCRF+GP  ufGCRF+GP
Average    -0.0082  -0.0080  -0.0075  -0.0079  -0.0079  -0.0079  -0.0044  -0.0085   -0.0090    -0.0046  -0.0083   -0.0089
January    -0.0085  -0.0086  -0.0085  -0.0081  -0.0084  -0.0086  -0.0047  -0.0088   -0.0094    -0.0048  -0.0087   -0.0092
February   -0.0082  -0.0082  -0.0077  -0.0079  -0.0080  -0.0080  -0.0044  -0.0084   -0.0086    -0.0046  -0.0083   -0.0083
March      -0.0082  -0.0078  -0.0071  -0.0080  -0.0079  -0.0078  -0.0046  -0.0086   -0.0090    -0.0047  -0.0084   -0.0089
April      -0.0083  -0.0075  -0.0069  -0.0080  -0.0079  -0.0079  -0.0046  -0.0087   -0.0093    -0.0047  -0.0086   -0.0091
May        -0.0084  -0.0080  -0.0071  -0.0080  -0.0080  -0.0080  -0.0046  -0.0088   -0.0093    -0.0048  -0.0085   -0.0091
June       -0.0083  -0.0082  -0.0076  -0.0080  -0.0080  -0.0080  -0.0046  -0.0087   -0.0095    -0.0047  -0.0085   -0.0093
July       -0.0083  -0.0082  -0.0077  -0.0080  -0.0080  -0.0080  -0.0045  -0.0087   -0.0092    -0.0046  -0.0084   -0.0091
August     -0.0083  -0.0080  -0.0076  -0.0079  -0.0079  -0.0079  -0.0044  -0.0087   -0.0092    -0.0045  -0.0084   -0.0090
September  -0.0083  -0.0081  -0.0076  -0.0080  -0.0079  -0.0079  -0.0043  -0.0087   -0.0095    -0.0045  -0.0084   -0.0093
October    -0.0083  -0.0080  -0.0077  -0.0080  -0.0079  -0.0079  -0.0043  -0.0086   -0.0093    -0.0045  -0.0084   -0.0092
November   -0.0083  -0.0081  -0.0077  -0.0080  -0.0079  -0.0079  -0.0043  -0.0086   -0.0092    -0.0045  -0.0084   -0.0092
December   -0.0068  -0.0068  -0.0064  -0.0066  -0.0067  -0.0066  -0.0038  -0.0070   -0.0066    -0.0040  -0.0070   -0.0067

[12] J. Lafferty, A. McCallum, and F. Pereira, Conditional random fields: Probabilistic models for segmenting and labeling sequence data, in ICML, 2001.
[13] M. Lu, Q. Zhang, M. Deng, J. Miao, Y. Guo, W. Gao, and Q. Cui, An analysis of human microRNA and disease associations, PLoS ONE, (2008).
[14] J. Park, D.S. Lee, N.A. Christakis, and A.L. Barabasi, The impact of cellular networks on disease comorbidity, Mol. Syst. Biol., (2009).
[15] T. Qin, T.Y. Liu, X.D. Zhang, D.S. Wang, and H. Li, Global ranking using continuous conditional random fields, in NIPS, 2008.
[16] J. Quinonero-Candela, C.E. Rasmussen, F. Sinz, and B. Schoelkopf, Evaluating predictive uncertainty challenge, in Machine Learning Challenges, Springer, 2006.
[17] V. Radosavljevic, K. Ristovski, and Z. Obradovic, Gaussian conditional random fields for modeling patient's response in acute inflammation treatment, in ICML 2013 Workshop on Machine Learning for System Identification, 2013.
[18] V. Radosavljevic, S. Vucetic, and Z. Obradovic, Continuous conditional random fields for regression in remote sensing, in ECAI, 2010.
[19] V. Radosavljevic, S. Vucetic, and Z. Obradovic, Neural Gaussian conditional random fields, in ECML, 2014, pp. 614-629.
[20] C.E. Rasmussen and C.K.I. Williams, Gaussian Processes for Machine Learning, MIT Press, 2006.
[21] K. Ristovski, V. Radosavljevic, S. Vucetic, and Z. Obradovic, Continuous conditional random fields for efficient regression in large fully connected graphs, in AAAI, 2013.
[22] G. Stiglic, F. Wang, A. Davey, and Z. Obradovic, Readmission classification using stacked regularized logistic regression models, in AMIA, 2014.
[23] J. Stojanovic, M. Jovanovic, Dj. Gligorijevic, and Z. Obradovic, Semi-supervised learning for structured regression on partially observed attributed graphs, in SDM, 2015.
[24] A. Uversky, D. Ramljak, V. Radosavljevic, K. Ristovski, and Z. Obradovic, Which links should I use?: a variogram-based selection of relationship measures for prediction of node attributes in temporal multigraphs, in ASONAM, 2013.
[25] A. Uversky, D. Ramljak, V. Radosavljevic, K. Ristovski, and Z. Obradovic, Panning for gold: using variograms to select useful connections in a temporal multigraph setting, Soc. Netw. Analy. and Mining, (2014).
[26] J. Zhou, J. Sun, Y. Liu, J. Hu, and J. Ye, Patient risk prediction model via top-k stability selection, in SDM, 2013.


Data Analytics Framework for A Game-based Rehabilitation System

Jiongqian Liang∗, David Fuhry∗, David Maung∗, Roger Crawfis∗, Lynne Gauthier†, Arnab Nandi∗, Srinivasan Parthasarathy∗

∗ Department of Computer Science and Engineering, The Ohio State University. {liangji, fuhry, maung, crawfis, arnab, srini}@cse.ohio-state.edu
† Wexner Medical Center, The Ohio State University. [email protected]

Abstract

Stroke is a major cause of hemiparesis in the United States. Considering that traditional Constraint-Induced Movement therapy (CI therapy), though empirically proven effective, is inaccessible to most patients, we have developed a game-based rehabilitation system borrowing the major principles of CI therapy. In order to help doctors and patients better understand and further improve the effect of the rehabilitation, we introduce a data analytics framework for our existing rehabilitation system in this paper. We design techniques for processing the collected data and propose a series of kinematic variables to evaluate the efficacy of the therapy. We apply our data analytics framework to data from recruited patients. Our analysis shows that patients' motor movement generally improves over the period of treatment, but in different ways: either movement speed, efficiency or range of motion. Our fine-grained analysis also reveals the presence of fatigue, which can be used to further improve the rehabilitation system.

1 Introduction

It is estimated that there are around 1.2 million chronic stroke patients with upper extremity disability in the US, and 26% of them need others' assistance in daily life [2, 7, 5, 1]. The expense of caring for these stroke survivors and the accompanying loss of productivity cost the US economy more than $34 billion annually [9]. Moreover, only 30.7% receive outpatient rehabilitation services, the majority of whom do not receive treatments with strong empirical evidence of efficacy [3]. Though there are some more advanced and effective rehabilitation approaches, such as Constraint-Induced Movement therapy (CI therapy) [11], they are costly and difficult for patients to access.

In an attempt to address this issue, we collaborated with the stroke community and have designed a home-based rehabilitation game system [8]. The system uses advanced and commercially available gaming technology, and consists of a computer, a Microsoft KinectOne camera, the rehabilitation software, and a padded mitt restraint for the less affected arm. The interactive computer game adopts the critical principles of CI therapy. Participants are asked to play the game for a period

of time during which their game and movement data is collected. Similar game-based rehabilitation systems have been developed in previous works such as [6, 10]. However, to the best of our knowledge, few of them have been deployed outside of lab experiments and put into practical use for real-world participants, and none of them contain a data analytics component to evaluate the effect of rehabilitation.

In this paper, we propose a data analytics framework to quantitatively study the collected data and analyze the efficacy of our rehabilitation system on participants with greater sensitivity than can be obtained from standard clinical motor metrics, such as the Wolf Motor Function test [4]. Specifically, our contributions are: 1) We present simple but effective preprocessing steps to correct for noise and anomalies in collected sensor data. 2) We propose a series of kinematic variables to measure participants' motor function continuously throughout the course of the intervention. 3) We jointly analyze the skeleton data from the Kinect sensors and the game contextual data from the software log. By doing this, we are able to analyze participants' improvement separately for each movement. 4) We conduct a fine-grained movement data analysis, where we find the presence of fatigue in gameplay.

The paper is organized as follows. Section 2 provides an overview of the rehabilitation system. Section 3 presents our data analytics framework in detail. Section 4 shows our analysis results on real-world data collected from participants. We conclude in Section 5.

2 System Overview

Here we briefly introduce the structure of the rehabilitation system, termed Recovery Rapids, and discuss how the data analytics framework interacts with other components. Figure 1 shows the overview of the rehabilitation system. The patient plays the game interactively in front of the computer, and the Kinect sensors continuously record the positions of 20 skeletal joints. In the game, the user plays the role of an avatar and navigates a kayak down a river canyon that winds through various terrains (mountains, tropical, desert, underground caves).

Figure 2: Data Analytics Framework (inputs from the Kinect-based rehabilitation game: skeleton data and game log; stages: data preprocessing with temporal smoothing and anomaly filtering, gesture segmentation, and rehabilitation analysis of speed, range of motion and efficiency; output: evaluation of the rehabilitation and feedback).

Figure 1: Rehabilitation System Overview.

The user is required to collect hidden treasure, fish for food, capture essential supplies, gather food along the river banks, and pull objects from the river (for more information about the game scene, refer to the video at http://youtu.be/uAysIGueN9U). Before playing the game, the user can customize the relative frequency of different movements to obtain more practice on motor tasks that he/she has more difficulty performing. The game shapes the difficulty of each movement based on the patient's performance. If a better movement is consistently recognized, the game will no longer accept weaker attempts. Therefore, the user is led to perform better over time, which motivates a user to continually perform to his/her maximal ability. For more details about the design of the game itself, see [8].

The game collects fine-grained skeleton data of the user using the Kinect sensors at a frequency of 30 Hz. The Kinect sensors track 20 joints of the body, including the shoulder, elbow, wrist, hand, and hip center, which are most relevant to this application. Each entry in the skeleton data, called a snapshot, contains the 3-D coordinates of these 20 positions with a timestamp. The game also compares the skeletal trajectory to a library of coded gestures in real time. If the movement being performed meets prespecified criteria, a game mechanic is triggered and a timestamped record of the therapeutic movement (e.g., elbow flexion) is kept in the game log. Therefore, as shown in the upper right corner of Figure 1, the system outputs two categories of data: skeleton data and the game log. Note that both the skeleton data and the game log contain timestamps, so it is possible to jointly analyze them.

As shown in Figure 1, the data analytics framework takes the skeleton data and game log as input, and outputs the evaluation of the rehabilitation. The evaluation result is then forwarded to users and clinical staff as feedback from the system and provides them with an objective measure of motor performance over time. We next describe the data analytics framework in detail.


3 Rehabilitation Data Analysis Framework

In this section, we describe the data analytics framework of our rehabilitation system. As shown in Figure 2, our data analytics framework consists of data preprocessing, gesture segmentation and rehabilitation analysis.

3.1 Data Preprocessing

Before we analyze the collected data, we need to preprocess it to correct several problems which occur particularly with the skeleton data. The Kinect sensors collect data at a frequency as high as 30 Hz. Though this provides excellent temporal granularity, two main problems remain with the raw data, which we correct with the operations below.

Temporal smoothing: One issue is that a pair of contiguous snapshots may be so close in time that most of the change in their skeleton coordinates might not be due to the user's movement, but instead to random errors (i.e., "jitter") resulting from limited sensor spatial resolution. Though the errors can be as small as a few millimeters, the sensors collect 30 snapshots every second, and accumulating these errors over time can adversely affect the analytics result. To tackle this problem, we smooth the skeleton data using a sliding time window. Specifically, we manually set the width of a time window to T and then average the skeleton coordinates inside the time window. The time window keeps moving forward with a stride of T, and the skeleton data in each time window is averaged. In this paper, we set T to 200 ms, which we empirically found to work well.

Anomaly filtering: We also found that occasional erroneous snapshots are reported, whose coordinates are jarringly different from previous and subsequent snapshots. We removed these unrealistic anomalous snapshots with a simple statistical method. For each time window, we computed the average snapshot, then calculated the distance of each snapshot to the averaged snapshot, and assume the distances follow the normal

distribution. The snapshots with a distance more than 3σ to the average snapshot were removed, where σ is the standard deviation of the normal distribution. After applying these two operations to the data, all movements are smooth and continuous. Note that this data preprocessing need only be conducted on skeleton data for the trained (hemiparetic) body parts. For example, if the left arm is being trained, we only smooth the left hand, wrist, elbow, and shoulder joints. After the above preprocessing, we obtain a less fine-grained but more smooth and clean skeleton dataset.
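A sketch of this preprocessing, assuming timestamps in milliseconds and an (n, J, 3) array of joint coordinates (our assumed layout, not the system's actual data format), could look as follows.

```python
# Window averaging (temporal smoothing) combined with the 3-sigma anomaly
# filter described above; illustrative only.
import numpy as np

def smooth_and_filter(timestamps_ms, joints, window_ms=200.0):
    """timestamps_ms: (n,) snapshot times; joints: (n, J, 3) coordinates of the
    trained joints. Returns one averaged snapshot per time window."""
    t = np.asarray(timestamps_ms, float)
    joints = np.asarray(joints, float)
    smoothed, start = [], t[0]
    while start <= t[-1]:
        in_win = (t >= start) & (t < start + window_ms)
        win = joints[in_win]
        if len(win) > 0:
            mean_snap = win.mean(axis=0)
            dist = np.linalg.norm((win - mean_snap).reshape(len(win), -1), axis=1)
            keep = dist <= 3.0 * dist.std()      # drop snapshots beyond 3 sigma
            smoothed.append(win[keep].mean(axis=0))
        start += window_ms
    return np.stack(smoothed)
```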

3.2 Gesture Segmentation

While we can study the skeleton data across all the gestures, it is better if we can segment the data by gestures and analyze the user's movement in each gesture. Our rehabilitation system contains 12 coded gestures (rowing, holding hand out, raising arm side, reaching low across, etc.), which are used to match the user's movement. Different gestures focus on different parts of the body and require different strength of movement. Because of these differences, we intend to segment the skeleton data by the matched gestures in the game software. We expect that analytics on the segmented skeleton data will be more accurate provided the data is from the same gesture. The game system can recognize gestures of the patient and match them to different states of the coded gestures [8], which are stored in the game log. By syncing the timestamps of the skeleton data and the game log, the skeleton data for each type of therapeutic movement (e.g., rowing) can be isolated.

3.3 Rehabilitation Efficacy Analysis

We seek to study the patient's movement in the game throughout the rehabilitation period and to evaluate the rehabilitation efficacy. To this end, we propose a series of kinematic variables that help to measure the patient's stroke seriousness from different perspectives.

Movement Speed: People with hemiparesis often perform movements more slowly. Motor speed was identified by the stroke community as a priority clinical outcome because it indicates the extent to which tasks of daily living can be performed quickly and efficiently. Movement speed can be categorized into distance speed and angle speed. To calculate the distance speed of a certain body part (e.g., the hand), we accumulate the distance that the body part moves in every two contiguous snapshots, and divide the distance by the accumulated time. Let p_i = (x_i, y_i, z_i) denote the coordinate of a certain body position in snapshot i with timestamp t_i. Then, given a series of snapshots at t_1, t_2, ..., t_n, we can get the distance speed by using

(3.1)  $v_d = \frac{\sum_{i=2}^{n} \|p_i - p_{i-1}\|}{\sum_{i=2}^{n} (t_i - t_{i-1})}.$

Angle speed can be calculated similarly, except that to calculate the size of the angle it requires three points from each snapshot: a joint point and two other end points. For example, to calculate the angle of the elbow, we can use the elbow as the joint point, and the wrist and shoulder as end points. Let r_i = (x1_i, y1_i, z1_i) denote the coordinate of the joint point, and s_i = (x2_i, y2_i, z2_i) and t_i = (x3_i, y3_i, z3_i) denote the coordinates of the two other end points at time t_i. Denoting the joint angle at t_i as α_i, then

(3.2)  $\alpha_i = \arccos \frac{(s_i - r_i) \cdot (t_i - r_i)}{\|s_i - r_i\|\,\|t_i - r_i\|}.$

And therefore, the angle speed can be figured out using

(3.3)  $v_a = \frac{\sum_{i=2}^{n} (\alpha_i - \alpha_{i-1})}{\sum_{i=2}^{n} (t_i - t_{i-1})}.$

Range of Motion: A patient's range of motion is defined as the difference between the maximum and minimum joint angles recorded during the rehabilitation. Range of motion measures the active range of motion of the body joint. To figure out the range of motion of a joint, we calculate the joint angle of each snapshot during a game session using Equation 3.2 and sort those values in ascending order, denoted as α_(1), ..., α_(n). To correct for outliers at the extremes of the joint angles, we increase the robustness of the range of motion value R by subtracting the k-th smallest from the k-th largest joint angle, i.e.

(3.4)  $R = \alpha_{(n-k+1)} - \alpha_{(k)},$

where k is set to 10 in this paper, which we empirically found to work well.


the initial state and the last state for the triggered gesture, the time spent on the gesture can be simply calculated as the difference of the two timestamps.
While the above kinematic variables could be applicable to all skeletal joints, we have restricted our analyses to the hemiparetic (weaker) extremity of our stroke participants. By looking at these kinematic variables throughout the rehabilitation period, we are provided with a continuous, objective measure of rehabilitation progress.

4 Experiment and Analysis In this section, we apply our data analytics framework to real-world data collected from participants and study how the rehabilitation system works for them. 80 participants were recruited from the local demographic of individuals with mild to moderate hemiparesis following a stroke at least 3 months earlier. Participants were loaned a gaming system and other hardware equipment (such as an instrumented restraint mitt) and agreed to play the game for 1.5 hours on at least 10 (not necessarily consecutive) days. Participants received an initial consultation session, which involves learning how to play the game and customizing the game to the participant. Participants' rehabilitation data is electronically captured by the software during live gameplay and stored on each participant's machine. The data acquired consists of skeleton data from the Kinect camera tracking system and the game log from the software. Before and after the participants use the rehabilitation system, they are asked to complete the Wolf Motor Function Test [4] in hospital. The tests contain a series of tasks, such as extending the elbow, lifting a pencil and so on, each of which is rated based on the amount of time consumed. The ratings of these pre-tests and post-tests of the patients serve as a comparison for our data analytics framework. The game software was continually updated throughout the study; one such update added the capability to collect the game log data necessary for our analytics tasks. In addition, data from some participants is still under collection. Of the patients in the study who finished using the rehabilitation system and for whom the data collected is valid and complete, we select four participants for a detailed analysis. The basic information of these four patients is shown in Table 1. As the table shows, two have right upper extremity hemiparesis and two have left upper extremity hemiparesis. We preprocess the data following the techniques mentioned in Section 3.1. We then employ the proposed kinematic variables to study the efficacy of our rehabilitation system.


In order to analyze the effect of rehabilitation, we adopt a two-half comparison approach. We divide the rehabilitation period into halves, i.e. the first half and the second half. We calculate the kinematic variables in these two divisions respectively and analyze their differences. We also conduct a t-test for the two halves in order to test whether their average values are identical.

Table 1: Basic Information of Patients

Patient ID   Weak Arm   Days   Total Mins   Avg Mins/Day
P1           Right      10     992.2        99.2
P2           Right      9      380.0        42.2
P3           Left       11     1029.7       93.6
P4           Left       13     613.5        47.2
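To make the procedure concrete, the following is a rough sketch (in Python, with made-up variable names and data layout; it is not the authors' implementation) of how the hand distance speed of Equation 3.1 could be computed per session and compared between the two halves with a t-test:

import numpy as np
from scipy import stats

def distance_speed(positions, timestamps):
    # Equation 3.1: accumulated displacement divided by accumulated time.
    p = np.asarray(positions, dtype=float)   # shape (n, 3): x, y, z of the hand per snapshot
    t = np.asarray(timestamps, dtype=float)  # shape (n,)
    return np.linalg.norm(np.diff(p, axis=0), axis=1).sum() / np.diff(t).sum()

def two_half_comparison(per_session_speeds):
    # Split the per-session values into a first and a second half and compare their means.
    mid = len(per_session_speeds) // 2
    first, second = per_session_speeds[:mid], per_session_speeds[mid:]
    # Welch's t-test; the paper does not state which t-test variant was used.
    t_stat, p_value = stats.ttest_ind(first, second, equal_var=False)
    return float(np.mean(first)), float(np.mean(second)), float(p_value)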

4.1 Movement Speed As we pointed out, movement speed comprises distance speed and angle speed. Here, we study the distance speed of the hand and the angle speed of the elbow. We use the two-half comparison approach for both hand distance speed and elbow angle speed (see Figure 3a and Figure 3c, where the p-values of the t-tests are shown). Furthermore, as a comparison, we adopt the rating results from the pre-test and post-test of participants in the hospital. We extract the ratings for the hand and elbow from the pre-test and post-test, shown in Figure 3b and Figure 3d. From Figure 3a and Figure 3c, we can observe that the majority of the patients show obvious improvement in hand distance speed and elbow angle speed during the rehabilitation. Moreover, it is interesting to see that the improvement trend and scale in the rehabilitation period (shown in Figure 3a and 3c) are quite consistent with the pre/post-testing results from the hospital, shown in Figure 3b and Figure 3d, which supports the validity of the proposed movement speed and demonstrates the efficacy of the rehabilitation system.

Figure 3: Movement Speed Compared to Pre/Post-test. (a) Hand Speed (m/s), first half vs. second half with t-test p-values; (b) Hand Movement Rating in pre/post-test; (c) Elbow Angle Speed (degree/s), first half vs. second half with t-test p-values; (d) Elbow Movement Rating in pre/post-test.

4.2 Range of Motion We also study the range of motion of the joints in the patient's weak arm, which is measured by Equation 3.4. We mainly focus on elbow and shoulder range of motion here. Figure 4 shows the comparison of the range of motion in the first and second half. We can observe that Figure 4a and Figure 4b follow almost identical patterns. From Figure 4, we can observe that Patient 1 does show significant improvement in elbow and shoulder range of motion between these two halves, while Patients 3 and 4 do not change much. It is interesting to see that Patient 1, the only patient for whom we observe a decrease in hand speed in Figure 3a and Figure 3b, does show significant improvement in both elbow and shoulder range of motion. Our analysis results here show that the massed practice provided through the rehabilitation system may benefit different individuals in different ways. This is further evidence of the need to capture recovery throughout an intervention using kinematic variables that can more specifically isolate various patterns of motor recovery. Existing clinical motor metrics are unable to isolate the independent contributions of motor speed and active range of motion to overall performance.

Figure 4: Range of Motion Comparison of First/Second Half. (a) Elbow Range of Motion (degrees); (b) Shoulder Range of Motion (degrees); first half vs. second half with t-test p-values.

4.3 Gesture Segmentation and Movement Efficiency We can segment gestures as mentioned in Section 3.2. We conducted a similar analysis of movement speed and range of motion under different gestures, and found the results to be analogous to those above.
However, gesture segmentation becomes crucial when we analyze movement efficiency. As discussed in Section 3.3, movement efficiency is characterized by the amount of time the patient consumes to execute a gesture. Considering the difference in difficulty of the gestures, movement efficiency should be evaluated based on each gesture. Here, we mainly study two gestures, rowing and raise-arm-side, which together make up around 60% of all the recognized gestures. Figure 5a and Figure 5b show the average time consumed to conduct these two gestures in the first and second half of the rehabilitation period. More time consumed indicates lower efficiency of executing the gesture. From the figures, we can observe that Patient 1 and Patient 2 show significantly improved efficiency in executing the raise-arm-side and rowing gestures, while the other two do not show much improvement (in fact, Patient 3 declines in executing the rowing gesture). We pointed out the difference between movement speed and movement efficiency in Section 3.3, and here it is interesting to observe that Patient 1, though having no obvious improvement in hand movement speed (shown in Figure 3a), has become more efficient in executing gestures. Therefore, we think that movement efficiency is a complementary aspect for showing the efficacy of the rehabilitation system.

Figure 5: Movement Efficiency Comparison Between Two Halves. (a) Efficiency of Raise-arm-side (average time in ms); (b) Efficiency of Rowing (average time in ms); first half vs. second half with t-test p-values.

4.4 Fine-grained Analysis We now conduct a fine-grained analysis. We study the hand movement speed of the patient in every ten minutes on each day so that we can understand how the patient's performance changes within each day and across the whole period. Figure 6 shows the trend of hand speed for all the patients, where each dot shows the average speed during ten minutes and each curve segment represents the trend of speed within that day. We can observe that the overall trend is quite consistent with Figure 3a, with Patient 1 trending slightly downward and the rest upward. Moreover, notice that for all the patients, hand movement speed more frequently decreases at the end of a day. This decreasing phenomenon is more evident for Patients 1, 3 and 4, who play more time each day compared to Patient 2. Similar patterns occur in the analysis of range of motion and movement efficiency (figures are omitted due to space limitations). These decreasing trends, especially the inverted "U" shaped ones, may be evidence of fatigue [13]. Patients get tired after playing the game


for a period beyond their tolerance and thereafter are likely to perform worse for the rest of the day. It is desirable for patients to avoid fatigue during the rehabilitation period. Fatigue detection provides a way to further improve rehabilitation, and we plan to add a fatigue alert to our system.

Figure 6: Hand Speed Trend in Each Day. (a) P1; (b) P2; (c) P3; (d) P4. Each panel plots hand speed (m/s) over every ten minutes of each day.

5 Conclusion In this paper, we propose a data analytics framework for our existing game-based rehabilitation system. Data processing techniques and several kinematic variables for efficacy analysis are discussed. We also conduct experiments that reveal the efficacy of the therapy from different aspects. In the future, we will apply our data analytics framework to the remaining participants after they finish using the system, which will further support the efficacy of the rehabilitation system. We also intend to study the efficacy of our system on stroke patients over a longer period (one or two years). In addition, we want to adopt other strategies to evaluate the efficacy of the rehabilitation more comprehensively.

Acknowledgements: This work was supported in part by a Patient-Centered Outcomes Research Institute (PCORI) award and UL1TR001070 from the National Center For Advancing Translational Sciences.

References

[1] F. de NAP Shelton and M. J. Reding. Effect of lesion location on upper limb motor recovery after stroke. Stroke, 32(1):107–112, 2001.
[2] Centers for Disease Control and Prevention (CDC). Prevalence of disabilities and associated health conditions among adults–United States, 1999. MMWR. Morbidity and Mortality Weekly Report, 50(7):120, 2001.
[3] Heart and Stroke Foundation of Ontario. A report from the consensus panel on stroke rehabilitation system. 2001.
[4] T. M. Hodics, K. Nakatsuka, B. Upreti, A. Alex, P. S. Smith, and J. C. Pezzullo. Wolf Motor Function Test for characterizing moderate to severe hemiparesis in stroke patients. Archives of Physical Medicine and Rehabilitation, 93(11):1963–1967, 2012.
[5] M. Kelly-Hayes, A. Beiser, C. S. Kase, A. Scaramucci, R. B. D'Agostino, and P. A. Wolf. The influence of gender and age on disability following ischemic stroke: the Framingham study. Journal of Stroke and Cerebrovascular Diseases, 12(3):119–126, 2003.
[6] B. Lange, S. Koenig, E. McConnell, C. Chang, R. Juang, E. Suma, M. Bolas, and A. Rizzo. Interactive game-based rehabilitation using the Microsoft Kinect. In Virtual Reality Short Papers and Posters (VRW), 2012 IEEE, pages 171–172. IEEE, 2012.
[7] D. Lloyd-Jones, R. Adams, M. Carnethon, G. De Simone, T. B. Ferguson, K. Flegal, E. Ford, K. Furie, A. Go, K. Greenlund, et al. Heart disease and stroke statistics 2009 update: a report from the American Heart Association Statistics Committee and Stroke Statistics Subcommittee. Circulation, 119(3):e21–e181, 2009.
[8] D. Maung, R. Crawfis, L. V. Gauthier, L. Worthen-Chaudhari, L. P. Lowes, A. Borstad, R. J. McPherson, J. Grealy, and J. Adams. Development of Recovery Rapids - a game for cost effective stroke therapy. In Proceedings of the International Conference on the Foundations of Digital Games, 2014.
[9] V. L. Roger, A. S. Go, D. M. Lloyd-Jones, E. J. Benjamin, J. D. Berry, W. B. Borden, D. M. Bravata, S. Dai, E. S. Ford, C. S. Fox, et al. Executive summary: heart disease and stroke statistics 2012 update: a report from the American Heart Association. Circulation, 125(1):188, 2012.
[10] C.-J. Su et al. Personal rehabilitation exercise assistant with Kinect and dynamic time warping. 2013.
[11] S. L. Wolf, C. J. Winstein, J. P. Miller, E. Taub, G. Uswatte, D. Morris, C. Giuliani, K. E. Light, D. Nichols-Larsen, et al. Effect of constraint-induced movement therapy on upper extremity function 3 to 9 months after stroke: the EXCITE randomized clinical trial. JAMA, 296(17):2095–2104, 2006.
[12] C.-Y. Wu, K.-C. Lin, H.-C. Chen, I.-H. Chen, and W.-H. Hong. Effects of modified constraint-induced movement therapy on movement kinematics and daily function in patients with stroke: a kinematic study of motor control mechanisms. Neurorehabilitation and Neural Repair, 21(5):460–466, 2007.
[13] N. Yozbatıran, F. Baskurt, Z. Baskurt, S. Ozakbas, and E. Idiman. Motor assessment of upper extremity function and its relation with fatigue, cognitive function and quality of life in multiple sclerosis patients. Journal of the Neurological Sciences, 246(1):117–122, 2006.

A simple and effective method for anomaly detection in healthcare
Luiz F. M. Carvalho∗ Carlos H. C. Teixeira∗ Ester C. Dias† Wagner Meira Jr.∗ Osvaldo F. S. Carvalho∗

Abstract
Anomaly detection in public health records is a challenging task that can reveal frauds, errors and other significant events. In this paper we propose a simple, intuitive and generic method for anomaly detection based on score transference from consumers to providers, for scenarios in which the analysis of the consumers enables the identification of anomalous providers or in which no data about the providers is available. Our method consists of two steps: consumer score assignment and score transference to providers. Besides discussing the details of the method and their implications, we create a model in which the cities represent the consumers and the hospitals the providers, and apply it to a real database of the Brazilian public health system. The three distinct types of procedures considered in our analysis earned the hospitals about 3.5 billion dollars from 2008 to 2012. We also show that our method can be implemented with multiple algorithms. We performed an analysis of the implementation and a case study showing that we were able to identify anomalous providers and potential frauds.
Keywords: Anomaly Detection, Unsupervised Learning, Public Health Data Analysis, Fraud Detection, Data Mining

1 Introduction Anomaly detection in records of public healthcare systems is an important task in healthcare management that can reveal logistic problems, overloads, regional lack of professionals and services, disease outbreaks, errors in the data and suspicious activities. Hence, it is a key task to support cost reduction, improve records documentation, support investment planning and, especially, to reduce fraud occurrence. This last task is crucial and can avoid the loss of huge amounts of money. For example, according to the FBI, from 3% to 10% of the activities of the US Medicaid insurance program are fraudulent [1]. In order to detect frauds, the primary premise is that a fraudulent entity presents anomalous patterns in one or more aspects. However, despite its importance, detecting anomalies in public healthcare systems is also a challenge for many reasons, such as the poor quality of data, the lack of data, the complexity and dynamism of the field, the restricted access to data due to privacy issues and

∗ Computer Science Department, Universidade Federal de Minas Gerais, Brazil.
† Doctor and public healthcare analyst at the Secretariat of Health of the city of Belo Horizonte, Brazil.


the lack of global standardization in healthcare system organizations. However, even if a complete and reliable database is available, it is not feasible to check all values and records declared by the healthcare providers, since auditing processes are usually expensive and complex. So, it is necessary to automate the process of selecting or ranking the entities to be audited while reducing the rate of false positives. Exploring a predictive model is a popular solution for this purpose, but to the best of our knowledge there is no labelled dataset for anomaly detection in healthcare, and building one is not trivial due to the aforementioned complexity and dynamism. As it is also hard to define the expected pattern for all entities, a solution is the use of unsupervised learning.
In order to detect anomalous healthcare providers, existing works propose methods that explore the providers' records. However, in many scenarios there is no dataset to support the analysis of the providers' patterns. Furthermore, some anomalous patterns of the providers can only be revealed if we analyze the entities affected by their activities. In these cases, although the providers are responsible for the actions, a solution for identifying the anomalous providers (or ranking them) is to identify unexpected patterns in the consumers associated with them and then infer which providers are likely to be anomalous.

Figure 1: Example of the method: the left plot shows the consumers' distribution according to two dimensions. The consumers' colors indicate the associated provider. From the premise that anomalous providers are related to anomalous consumers, it is possible to infer the providers' anomaly, as shown in the right plot.

Figure 1 illustrates our method. In the left plot we show the consumers' distribution considering two dimensions of analysis. In this example each consumer is associated with one provider and the color of each

consumer indicates which provider is related to it. The right plot shows the anomaly degree of the providers. Hence, from the premise that anomalous consumers are associated with anomalous providers, the anomaly of the providers can be estimated. In the analysis that we performed over a real dataset, the consumer entities are the cities and the providers are the hospitals. The anomaly projection from consumers to providers is justified by the fact that we do not have information about the hospitals' structure to identify anomalies and by the fact that the cities' patterns enable an anomaly analysis from a perspective that is not possible with only the providers' data. The goal of this work is a solution for identifying anomalous providers from the consumers related to them. The main contributions of this paper are:
• A simple, effective and intuitive method for anomalous provider identification. Our method is based on score transference, in which the anomaly degree of the providers is inferred from the consumers. To the best of our knowledge, this is the first method that applies the concept of score transference between entities of different types for anomaly detection.
• Our method is also generic and can be implemented with many combinations of algorithms for anomaly detection and functions for score transference.
• We applied the method to a real database of the Brazilian public healthcare system considering three distinct types of procedures that cost about 3.5 billion dollars1 between Jan/2008 and Dec/2012. We performed both an analysis of the results generated with multiple implementations and a case study.
This paper is organized as follows. Section 2 presents the related work. Section 3 describes the method. Section 4 shows the modeling and how the method was applied for healthcare anomaly detection. Section 5 describes the database. Section 6 presents the results and Section 7 concludes the work.

2 Related Work Anomaly detection is a core problem in data mining that has been the subject of many works recently. It has been applied for several purposes, as detailed in [2]. Most existing works deal with just one type of entity to detect anomalies, such as fraud detection in specific domains [3] and temporal data outlier detection

1 The original values were expressed in the Brazilian currency (Real). The Real value ranged from approximately 0.40 to 0.65 US dollars from 2008 to 2012. In this paper we adopt an exchange rate of 0.525.


[4]. The majority of the existing works for anomaly detection that deal with multiple entities focus on finding anomalous links, such as [5]. Our method deals with multiple entity types, but the goal is to detect anomalies in the entities of one type based on the anomalies of the other type.
The main techniques for anomaly detection are detailed and discussed in [6]. They can be divided into supervised and unsupervised methods. The supervised techniques require labeled data as a training set; among these techniques, a popular and effective one is the One-Class SVM [7]. However, as labelled data for anomaly detection is rarely available and it is hard to cover all the possible anomalies that can occur in a domain, unsupervised techniques are more widely applicable to the problem of anomaly detection.
Unsupervised learning: According to [8], the algorithms for unsupervised anomaly detection can be divided into: i) distribution-based, ii) depth-based, iii) distance-based, iv) clustering-based and v) density-based. A simple unsupervised approach is the use of a clustering algorithm to identify either objects in small clusters or objects not assigned to any cluster. From the premise that anomalies occur in sparse neighbourhoods, some unsupervised techniques are based on the distance of the instances from their nearest neighbours, such as the algorithm in [9]. Other popular approaches consist of statistical and probabilistic analysis of the data, such as the method in [10] that verifies the variance in the data when an object is removed. Further unsupervised methods for outlier detection can be found in [2, 11, 12]. In order to implement our method, we applied three proximity-based algorithms: a KNN approach [13], Reverse KNN [14] and Local Outlier Factor [15], as detailed in Section 4. These methods are relatively non-parametric approaches and they explore the well-known nearest-neighbour principle. Efficient algorithms for that purpose are discussed in [16].
Healthcare analysis: Among the anomaly detection applications, medical and public health management is a core one. With the improvements in management systems for healthcare observed in the past years, many works have been developed to deal with the produced data and extract knowledge from it. A survey on the main Electronic Healthcare Record standards is available in [17]. The work in [18] presents a solution for temporal data mining in healthcare informatics. The work in [19] aims to detect anomalies in medical time series. In [20] a Bayesian network is applied to detect disease outbreaks through anomaly detection. The identification of frauds is also an important problem in healthcare anomaly detection. In [21] the framework MCI HCFAD is introduced for healthcare fraud and abuse detection. A complete description of the popular types of healthcare frauds is given in [22].

To the best of our knowledge, our method is the first one to detect anomalous healthcare providers from the analysis of the consumers.

3 Method Our method for anomaly detection can be divided into two steps: score assignment through outlier analysis and score transference. It deals with two types of entity: consumers and providers. The consumers are the objects about which we have data to measure the anomaly degree, so we use them to infer the providers' anomaly. Figure 2 illustrates the score transference between consumers and providers. The score of a provider is defined by both the anomaly degree of the consumers and the link weight, which indicates the magnitude of the relation between a pair of a consumer and a provider. In this example, the color intensity indicates the anomaly degree and the link width indicates its weight.

Figure 2: Score transference from consumers to providers. The greater the consumer anomaly and the link weight, the more intense the anomaly in the providers.

3.1 Outlier analysis The first step is the outlier analysis and its goal is to measure the degree of anomaly of the consumers. In order to assign a score to each consumer, an unsupervised algorithm for outlier detection should be used; the input is the algorithm parameters and the database containing information about the consumers. At the end of this step, each consumer object C has an anomaly score S(C).

3.2 Score transference The score transference is the last step and consists of assigning a score S(P) to each provider P. The inputs are the anomaly degree S(C) of each consumer C and the weight W(C_i, P) of the relation between each pair of consumer and provider. The score transference from a consumer C to a provider P is controlled by a function f which may be defined in multiple ways:

S(P) = \sum_{C_i \in Consumers} f(S(C_i), W(C_i, P))

In the next section we show how we modeled and implemented the method to identify anomalies in healthcare data.

4 Anomaly Detection in Healthcare Our method can be instantiated in multiple ways to deal with many combinations of healthcare entities. The provider may be any healthcare provider, such as hospitals, doctors, companies, managers and others. The consumers may be cities, regions, states, people, families or any entity that demands healthcare services. The anomaly in the consumers can be assigned considering the rate of contamination, the rate of procedures, the amount spent in healthcare or several other criteria which distinguish anomalous from regular behaviour.
In our analysis the goal is to find hospitals that declared and charged for more procedures than expected. We apply the algorithm to measure the anomaly degree of the cities, then we verify the relation between cities and hospitals and identify which hospitals are more likely to be anomalous. So, cities represent the consumers and hospitals the providers. Given a medical procedure, the rate of occurrence in each city defines the consumer score, and the number of procedures performed by each hospital on the population of each city defines the weight of their relationship. If a hospital is strongly connected to a city in which the rate of a procedure is anomalous, it is likely that the hospital is charging for more procedures than the true amount.
For example, suppose three cities: A, B and C. Their populations are related to four hospitals: 1, 2, 3 and 4. Given the number of occurrences of procedures in each city by each hospital, we want to know the anomalous hospitals. Table 1 shows the score of each city after the step of consumer score assignment.

Table 1: Scores for the regions A, B and C according to their rate of occurrence of a procedure.

City A    City B    City C
0.41      1         0.82

Thus, city A is not anomalous, while cities B and C present high anomaly degrees. The fraction of people from each region treated in each hospital is shown in Figure 3.

Figure 3: Amount of the population of each city treated in each hospital. As cities B and C present high scores, we conclude that providers 3 and 4 are anomalous.

It is possible to conclude that hospitals 3 and 4 are more likely to be fraudulent because they are the healthcare providers most related to the cities with high anomaly degrees.
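As an illustrative sketch of the score-transference step (in Python; the link weights below are invented for illustration and are not the values behind Figure 3, and the product form f(S(C), W(C, P)) = S(C) * W(C, P) is only one possible choice of f):

# City scores taken from Table 1 of this toy example.
consumer_scores = {"A": 0.41, "B": 1.0, "C": 0.82}

# Hypothetical link weights W(C, P): how strongly each city is tied to each hospital.
link_weights = {
    ("A", 1): 0.9, ("A", 2): 0.1,
    ("B", 3): 0.8, ("B", 4): 0.2,
    ("C", 3): 0.3, ("C", 4): 0.7,
}

def provider_scores(consumer_scores, link_weights, f=lambda s, w: s * w):
    # S(P) = sum over consumers C_i of f(S(C_i), W(C_i, P)).
    scores = {}
    for (city, hospital), w in link_weights.items():
        scores[hospital] = scores.get(hospital, 0.0) + f(consumer_scores[city], w)
    return scores

# Hospitals 3 and 4, tied mostly to the high-score cities B and C, end up with the
# highest scores, matching the conclusion drawn from Figure 3.
print(provider_scores(consumer_scores, link_weights))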


We believe that the amount of people from each city cared for by each hospital is trivial information that is included in most healthcare systems, because each patient usually has to inform the hospital of his or her address before the care service. Moreover, as most countries or cities perform demographic surveys, the population size of each city is usually known. Thus, we conclude that this analysis can be performed on a vast number of healthcare databases.
Although the method is able to find anomalous providers that cause relevant changes in the cities' rates, detecting providers related to many low-score consumers is out of its scope. For example, in our analysis, if a hospital keeps small fraudulent activities in many cities, the hospital is going to receive a low score from each city and its final score is also going to be low. To detect hospitals with such behaviour, an effective approach is verifying the number of cities related to the hospital and their geographical distances from it. If the hospital is related to many cities or they are too far from the hospital, it is possible that the hospital is performing many small frauds.

4.1 Implementation In this section we present the algorithms and the implementation decisions. Our goal is to keep the method simple, intuitive and effective. As mentioned in the previous section, the step of outlier analysis aims to assign an anomaly score to each consumer. To show the generality and versatility of the method, we applied three algorithms: KNN, Reverse KNN and Local Outlier Factor. The choice of these algorithms is due to their simplicity, effectiveness and low cost. In addition, as all of them are proximity-based algorithms, they take the same parameters and their implementation and evaluation are related. In the future we aim to apply other approaches.
Each algorithm requires three parameters: K, W and S. The parameter K indicates the number of neighbours to be considered in the analysis. As the values are grouped by months, the analysis is performed over intervals called windows; the parameter W defines the number of months in each window w and S defines the sliding distance between two adjacent windows. For example, suppose that each month is represented by U. If the values of W and S are respectively 6 and 3, each window has size 6U and two adjacent windows share three months, as illustrated below.
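As a rough sketch (in Python; the function name and layout are ours, not the authors'), months could be grouped into sliding windows of W months with sliding distance S as follows:

def sliding_windows(n_months, W, S):
    # Return the list of month indices covered by each window of W months,
    # where consecutive windows start S months apart.
    windows = []
    start = 0
    while start + W <= n_months:
        windows.append(list(range(start, start + W)))
        start += S
    return windows

# With W = 6 and S = 3, adjacent windows share three months:
# [[0, 1, 2, 3, 4, 5], [3, 4, 5, 6, 7, 8], [6, 7, 8, 9, 10, 11]]
print(sliding_windows(12, W=6, S=3))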

Each algorithm assigns a score S(C)w for each consumer C in each window w. After this step, the final score S(C) of each consumer is computed through the sum of the product T (C)w by S(C)w for each window


in the set M of all windows. The value of T(C)w indicates the number of procedures provided to consumer C in window w:

S(C) = \sum_{w \in M} S(C)_w \cdot T(C)_w
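A minimal sketch of this aggregation (in Python; variable names are ours):

def final_score(window_scores, window_counts):
    # S(C) = sum over windows w of S(C)_w * T(C)_w, where T(C)_w is the number
    # of procedures provided to the consumer in window w.
    return sum(s * t for s, t in zip(window_scores, window_counts))

# Example: three windows with per-window anomaly scores and procedure counts.
print(final_score([0.2, 0.9, 0.4], [120, 300, 80]))  # 0.2*120 + 0.9*300 + 0.4*80 = 326.0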

The values of K and W can impact the results. If these values are too low, the result becomes too sensitive to variations, and it loses its reliability. On the other hand, if they are too large, the score tends to lose its significance and anomalous behaviours can be smoothed out. Thus, the best values of K and W enable us to distinguish the isolated objects from non-anomalous objects in a reliable way. The value of S defines the precision and the execution cost. If S is large, the algorithm executes faster and each month is considered in just one or a few windows. On the other hand, if S is small, the execution is slower but each month is analyzed in different windows, so the analysis becomes more reliable. We compared results generated by three values of K (3, 5 and 8) and three values of W (1, 3 and 6). We used only the value 1 for the sliding parameter S in order to keep our analysis robust, since we then do not skip months between two windows. As all methods demand a distance metric, we employed the Euclidean distance because it is a simple and well-known metric, but other metrics can be employed, such as the Manhattan and Chebyshev distances. Next, we present each algorithm applied and the score transference approach.
KNN: this algorithm is based on the work in [13] and computes the anomaly score based on the distance from each object to its neighbours. In each window w, for each consumer C, the distance from C to its K nearest objects (its neighbourhood N(C)w) is computed. We observe that the neighbourhood of each consumer can be different in different windows. Given an object C, its anomaly score S(C)w is defined as the sum of the distances between C and its K neighbours in window w, as shown below:

S(C)_w = \sum_{n_i \in N(C)_w} dist(C, n_i)_w
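A sketch of this per-window KNN score (in Python with NumPy; it assumes the consumers' values within a window are stacked as the rows of X, which is our own framing):

import numpy as np

def knn_scores(X, k):
    # Anomaly score of each object: the sum of the Euclidean distances to its
    # k nearest neighbours within the window.
    X = np.asarray(X, dtype=float)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    np.fill_diagonal(d, np.inf)                 # an object is not its own neighbour
    return np.sort(d, axis=1)[:, :k].sum(axis=1)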

Reverse Nearest Neighbour approach: The main assumption of this algorithm, proposed in [14], is that if an object is not among the closest neighbours of its neighbours, it is isolated and therefore an anomaly. The main step of the algorithm is the construction of a directed graph in which the nodes represent the objects and the edges link an object to its neighbours in each window. Thus, each object has outdegree equal to K. The anomaly score S(C)w of each consumer C in each window w is assigned as

S(C)_w = \frac{1}{IN(C)_w}

where IN(C)_w is the indegree of its node in window w.
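A corresponding sketch of the reverse nearest-neighbour score (same assumptions as the previous snippet; the handling of zero indegree is our own choice, as the text does not discuss it):

import numpy as np

def reverse_knn_scores(X, k):
    # Build the directed k-nearest-neighbour graph and score each object as
    # 1 / indegree; objects that no other object lists among its k neighbours
    # get an infinite (maximally anomalous) score.
    X = np.asarray(X, dtype=float)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    neighbours = np.argsort(d, axis=1)[:, :k]   # k outgoing edges per node
    indegree = np.zeros(len(X))
    for targets in neighbours:
        indegree[targets] += 1
    return np.where(indegree > 0, 1.0 / np.maximum(indegree, 1), np.inf)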

Table 2: Data description for each type of procedure in the five-year period from 2008 to 2012: number of hospitals, procedures performed, money generated, percentage of months with null occurrence, and monthly rate of occurrence considering the non-null months.

Type             Number of hospitals   Number of procedures   Money (millions)   Nulls   Average per month
Cardiov. Surg.   368                   661821                 $3054.96           57%     14.17
Scintigraphy     518                   1832862                $270.20            56%     26.67
Glauc. Surg.     299                   5126166                $168.45            70%     276.06

Local outlier factor: the LOF is a density-based algorithm proposed in [15] that finds isolated objects even when there are groups with different densities. The neighbourhood of a consumer object C in window w is composed of those objects located at a distance smaller than or equal to the distance D_K(C)_w from C to its K-th nearest object. Hence, the size |N(C)_w| of the neighbourhood N(C)_w can be greater than K in case of ties. The algorithm is based on three concepts: Reachability, Average Reachability and the LOF value. For each object C, the reachability between C and each one of its neighbours n_i in window w is defined as:

Reachability(C, n_i)_w = max(dist(C, n_i)_w, D_K(n_i)_w)

The Average Reachability of C is the average reachability of C to all of its neighbours:

AR(C)_w = \frac{\sum_{n_i \in N(C)_w} Reachability(C, n_i)_w}{|N(C)_w|}

The value of LOF(C)_w is the anomaly degree S(C)_w of C in window w and is defined as:

LOF(C)_w = \frac{AR(C)_w}{\frac{1}{|N(C)_w|} \sum_{n_i \in N(C)_w} AR(n_i)_w}

If an object is not isolated, we expect its Average Reachability to be similar to the AR of its neighbours, so the expected value of the LOF in this case is 1. If the LOF of an object is much larger than 1, the object is likely to be anomalous.
Score transference: The score transference was performed through a simple function that, for each pair of provider P and consumer C, multiplies the score of consumer C by the number B(P, C) of procedures performed by P on the population of C:

S(P) = \sum_{C_i \in Consumers} S(C_i) \cdot B(P, C_i)

In order to keep the score values between 0 and 1, we apply a Gaussian normalization, as described in [23], to the scores of the cities before transferring them to the hospitals. After the score transference process, we also apply the same normalization to the hospitals' scores.
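Putting the transference and the normalization together, a rough sketch could look as follows (in Python; the Gaussian scaling shown is one common variant of the normalization idea in [23], and the exact formula used by the authors may differ):

import math

def gaussian_normalize(scores):
    # Squash raw scores into [0, 1] by passing their z-scores through the Gaussian
    # error function, clipping negative values to 0.
    n = len(scores)
    mean = sum(scores) / n
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / n) or 1.0
    return [max(0.0, math.erf((s - mean) / (std * math.sqrt(2)))) for s in scores]

def transfer_scores(city_scores, procedures):
    # S(P) = sum over cities C of S(C) * B(P, C), where procedures[(hospital, city)]
    # holds B(P, C), the number of procedures P performed on the population of C.
    hospital_scores = {}
    for (hospital, city), b in procedures.items():
        hospital_scores[hospital] = hospital_scores.get(hospital, 0.0) + city_scores[city] * b
    return hospital_scores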



5 Data In our analysis, we used a real database containing information about the unified Brazilian public healthcare system. To the best of our knowledge, this is one of the largest and most complete public healthcare databases in the world. Although it is a public database, available at [24], we omit the identification of the hospitals and cities in this paper due to ethical issues. To show that our method can be applied in different scenarios, we considered three ambulatory procedures in our analysis: Cardiovascular Surgery, Scintigraphy and Glaucoma Surgery. For each month and each type of procedure, we have the number of procedures performed by each hospital on patients from each city from January 2008 to December 2012. We also have a dataset containing the population size of each city in each month.
Table 2 shows some information about the data in the period 2008-2012. The table shows, for each type of procedure, the number of hospitals that performed it at least once, the total number of procedures and the total amount of money paid, in millions of dollars. Considering all months in the period and all hospitals that performed the procedure at least once, we also show the percentage of months with null occurrence. Considering the months with nonzero occurrence, the average rate of procedures per month is shown in the last column.
We performed a preprocessing step to deal with cities of different sizes, as follows. Consumer unbalancing issue: as we compare rates of procedures in the populations of all cities, dealing with unbalanced city sizes is a problem. The problem occurs for both small and big cities. In small cities, a few procedures might represent large variations in the rate, so the confidence in such cities is small. To tackle this problem, for each month, after computing the rate of procedures, we applied an Empirical Bayes estimator, as described in [25], to smooth the variations towards the average. The lower the confidence in a city's rate, the greater the approximation to the average. Therefore, if the population of a city is too small, the method approximates the rate to the average in order to reduce the impact of the variation. There is also a problem with big cities: if the population is too big, the rate of procedures performed in the city is hardly affected by a few anomalous hospitals.

Figure 4: Distribution of the nonzero scores of cities and hospitals produced by the three algorithms for Cardiovascular Surgery. (a) KNN-Based; (b) Reverse KNN; (c) Local Outlier Factor.

So, even if a hospital is extremely fraudulent, the results of its activities are smoothed by the big population, and thus the rate of the city does not indicate the occurrence of an anomaly. However, the solution to this problem is not in the scope of this work; as future work, we plan to divide big cities into districts according to the census division to avoid the analysis over huge populations. Therefore, it is very unlikely that we are able to detect outliers from big cities.

6 Results In this section we show the impact of the parameters and algorithms on the results, and a case study over the results. As there is neither a labelled set nor a generic metric to evaluate the results, the decisions were taken and the cases were selected based on visual analysis of graphs.
6.1 Implementation analysis In this section we present a comparison between the three algorithms applied, the three values of the number of neighbours K and the three values of the window size W. In order to compare the results for each of these implementation choices, we vary its values while the values of the other choices are kept fixed at the values that, according to our beliefs, generate the best results. For the algorithm comparison, we set the parameters K and W to 8 and 3, respectively. For analyzing the number of neighbours K, we used the KNN algorithm and the window size W was fixed at 3. The results of the window size comparison were also generated with the KNN algorithm and K equal to 8.
We begin the analysis by showing in Figure 4 the distribution of the city and hospital scores produced by each algorithm for the procedure Cardiovascular Surgery. As previously stated, the score values range from 0 to 1 due to the normalization step. As we use an unsupervised approach, the anomalies are those instances presenting isolated behaviour. According to Figure 4(a), we consider that the KNN algorithm produced the best score assignment for the cities, since just a few cities received high scores. The scores of the cities assigned by the LOF, represented


in Figure 4(c), are not good for the mentioned purpose. In relation to the provider scores, all algorithms produced desirable distributions: few anomalies, with most of the instances considered regular. However, although the provider scores produced by the LOF are in accordance with the desired distribution shape, it is very unlikely that the results are good, since its city scores are not good and the transference occurs from cities to providers.
For each pair of algorithms and of values of K and W, we compared the two rankings of hospitals produced. For each pair, we considered only the hospitals with nonzero anomaly score in common. The metric applied is the Kendall Tau distance, which computes the rate of pairs of items with different relative positions in two rankings. Thus, the higher the distance, the more pairs of hospitals (A, B) exist such that A is more anomalous than B in one ranking whereas B is more anomalous than A in the other ranking. The results are shown in Table 3. For each set of algorithms and of values of K and W, the number of hospitals with nonzero anomaly score in common is shown in the Venn diagrams of Figure 5.
The results show that all algorithms agree about most of the anomalies for Cardiovascular Surgery. Considering the other two procedures, it is possible to note, according to Table 3(a) and Figure 5(a), that the results of the Reverse KNN and LOF algorithms are more related to each other than to KNN. However, the same cannot be verified in the results of the Glaucoma Surgery analysis. In the analysis of the score distribution, the quality of the Reverse KNN results was not so clear according to Figure 4(b), but its correlation with the LOF results indicates that its results might be worse than the KNN results. Table 3(b) and Figure 5(b) show that the parameter K has little impact on the results for all procedures. For the window size parameter, Table 3(c) and Figure 5(c) show that there is little difference between the results for W = 3 and W = 6 for all three types of procedures. On the other hand, the difference is larger when these values are compared with W = 1; hence the smoothing effect difference between W equal to 1 and W equal to 3 is bigger than that between 3 and 6.
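A sketch of the ranking comparison used above (in Python; rank_a and rank_b are assumed to map each hospital to its position in the respective ranking, which is our own framing):

from itertools import combinations

def kendall_tau_distance(rank_a, rank_b):
    # Fraction of hospital pairs whose relative order differs between two rankings;
    # only hospitals present in both rankings are compared.
    common = sorted(set(rank_a) & set(rank_b))
    discordant = total = 0
    for x, y in combinations(common, 2):
        total += 1
        if (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y]) < 0:
            discordant += 1
    return discordant / total if total else 0.0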

Table 3: Kendall Tau distances between the rankings generated with different algorithms, numbers of neighbours and window sizes. For each pair, only hospitals that presented a nonzero score were considered in the ranking comparison. Values near zero mean that the two rankings have few pairs of hospitals with inverted relative positions.

(a) Algorithms
          KNN x RKNN   KNN x LOF   RKNN x LOF
Card.     0.22         0.29        0.10
Scint.    0.37         0.41        0.10
Glau.     0.23         0.26        0.24

(b) Number of neighbours K
          3x5          3x8         5x8
Card.     0.02         0.03        0.01
Scint.    0.02         0.04        0.02
Glau.     0.02         0.01        0.02

(c) Window size W
          1x3          1x6         3x6
Card.     0.24         0.29        0.06
Scint.    0.20         0.22        0.04
Glau.     0.14         0.19        0.08

Figure 5: Venn diagrams showing the number of hospitals with nonzero score in common for each type of procedure in the analysis of the algorithms and the values of K and W. (a) Algorithms; (b) Number of neighbours; (c) Window size.

6.2 Case study In this section we present a case study of the anomalous hospitals found. The cases shown here were selected for three reasons: we verified their evident anomalous patterns through visual analysis, they perform a representative amount of procedures, and they received the maximum score in most of the experiments performed. We emphasize that our goal is to detect anomalies, so auditing steps are necessary either to show that the providers are fraudulent or to find events and causes that explain their patterns in case of genuine variations.
We show three hospitals, H1, H2 and H3, that presented anomalous patterns for the procedures of Cardiovascular Surgery, Scintigraphy and Glaucoma Surgery, respectively. In order to show why such providers are evident anomalies, we also show the patterns of three cities, C1, C2 and C3, which contributed the most to the scores of H1, H2 and H3, respectively. Hence, these cities were anomalous due to the providers' activities. During the analysis period of 5 years, H1 earned more than $68 million with Cardiovascular Surgery, H2 earned more than $13 million with Scintigraphy and H3 earned about $5.75 million with Glaucoma Surgery. The results are shown in Figure 6.
The graphs in Figures 6(a), 6(d) and 6(g) show the contribution of C1, C2 and C3 to the anomaly


score of H1, H2 and H3 in each month, that is, the magnitude of the score transferred from these cities to these hospitals. For this analysis we use the results of the KNN implementation with W equal to 1 and K equal to 8. These plots confirm that these cities are the main explanation for the hospitals' anomaly in most months. Figures 6(b), 6(e) and 6(h) present the rate of procedures in the cities compared to the Brazilian average rate. It is clear that the rates of all three cities are higher than the average in the country. Finally, we show that H1, H2 and H3 are the main parties responsible for the high rates of procedures of C1, C2 and C3. In Figures 6(c), 6(f) and 6(i) the green curves represent the total amount of procedures in each city and the gray lines mark the amount of procedures provided in each city that was performed by the related hospital. In all plots, the gray lines are on top of the green curves, so it is possible to conclude that H1, H2 and H3 performed almost all the procedures in these cities.
We also estimated how much H1, H2 and H3 would have earned if they were regular providers. Suppose an ideal scenario in which there is no unbalance or shortage of services in the Brazilian cities, no disease outbreaks, and no active health campaign that could increase the amount of procedures in a period. Under this assumption, the proper amount of procedures ideal(C)_m in each city C in each month m

Figure 6: Case study: for each procedure, we selected a hospital with an anomalous pattern. For each one, we selected a related city and show that the city's anomaly is due to the hospital's activities. Figures 6(a), 6(d) and 6(g) show the contribution of the cities to the hospitals' scores. Figures 6(b), 6(e) and 6(h) present the rate of procedures in the cities in comparison to the country average, to show that the cities are anomalous. Figures 6(c), 6(f) and 6(i) show that the hospitals are the main healthcare providers in these anomalous cities: the green curves show the amount of procedures provided in the cities and the gray lines show the amount provided by the hospital; the gray line is on top of the green curve in most of the months in the three plots. Panels: (a) Score transference proportion to H1 from C1; (b) Rate of Cardiovascular Surgery procedures in city C1 compared to the Brazilian average; (c) Number of occurrences of Cardiovascular Surgery in city C1 and the amount performed by H1; (d) Score transference proportion to H2 from C2; (e) Rate of Scintigraphy procedures in city C2 compared to the Brazilian average; (f) Number of occurrences of Scintigraphy in city C2 and the amount performed by H2; (g) Score transference proportion to H3 from C3; (h) Rate of Glaucoma Surgery procedures in city C3 compared to the Brazilian average; (i) Number of occurrences of Glaucoma Surgery in city C3 and the amount performed by H3.

was computed as the product of the population size and the average rate of procedures over all cities in the country. For each city C we also computed the fraction of procedures fraction(C, H)_m provided by each hospital H in each month. As the estimated number of procedures may be a non-integer number, we round it up to the nearest whole number. Finally, we computed the average price average(H)_m of the procedures in each hospital in each month. With these values we estimated the amount of money that would be earned by each hospital H in each month m providing services to each city C as fraction(C, H)_m * ideal(C)_m * average(H)_m.
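As a sketch of this estimate for one hospital, one city and one month (in Python; the names are ours, and applying the rounding to the estimated procedure count is our own assumption):

import math

def expected_monthly_earnings(population, national_rate, fraction, avg_price):
    # ideal(C)_m: procedures the city would have if it followed the national average rate.
    ideal_procedures = population * national_rate
    # fraction(C, H)_m * ideal(C)_m, rounded up to a whole number of procedures,
    # times the hospital's average price average(H)_m.
    return math.ceil(fraction * ideal_procedures) * avg_price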


From this estimate, we conclude that the amounts received by H1, H2 and H3 due to their activities in C1, C2 and C3 could be reduced by about 74%, 88% and 56% of the real amounts, respectively.

7 Conclusion and Future Work In this work we proposed a simple and generic method for anomaly identification. The central idea of our method is to identify anomalous providers based on their consumers by transferring scores from the consumers to the related providers. We performed a case study over a real and representative dataset, where the providers are the hospitals and the consumers are the cities. The implementation analysis shows that the

KNN implementation generated the best results, and the case study shows that we were able to find anomalous hospitals which are potentially fraudulent providers. In the future, we aim to divide big cities into smaller regions and evaluate the method again so that we will be able to detect anomalies in big cities. We also aim to compare the cities according to economic and geographic features. In addition, we want to implement and evaluate the method with other unsupervised algorithms and ensemble techniques, as presented in [23]. In order to better evaluate the method, we intend to label the database of the Brazilian public healthcare system with the help of experts and to generate a synthetic labeled dataset from its characterization, in order to evaluate the accuracy of the method in different scenarios.
Acknowledgements This work is partially supported by CNPq, Fapemig, InWeb and the Infosas project. We would like to thank the Infosas team for the insights, data, and comments on the method and the results.

References

[1] U.S. Federal Bureau of Investigation. Financial crime report 2010-2011. www.fbi.gov/statsservices/publications/financial-crimes-report-2. Online.
[2] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM Comput. Surv., 41(3):15:1–15:58, July 2009.
[3] Clifton Phua, Damminda Alahakoon, and Vincent Lee. Minority report in fraud detection: Classification of skewed data. SIGKDD Explor. Newsl., 6(1):50–59, June 2004.
[4] Manish Gupta, Jing Gao, Charu C. Aggarwal, and Jiawei Han. Outlier detection for temporal data: A survey.
[5] Jimeng Sun, Huiming Qu, Deepayan Chakrabarti, and Christos Faloutsos. Relevance search and anomaly detection in bipartite graphs. SIGKDD Explor. Newsl., 7(2):48–55, December 2005.
[6] Charu C. Aggarwal. Outlier analysis. Springer, 2013.
[7] Bernhard Schölkopf, Robert C. Williamson, Alex J. Smola, John Shawe-Taylor, and John C. Platt. Support vector method for novelty detection. In NIPS, volume 12, pages 582–588, 1999.
[8] Wen Jin, Anthony K. H. Tung, and Jiawei Han. Mining top-n local outliers in large databases. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '01, pages 293–298, New York, NY, USA, 2001. ACM.
[9] Ji Zhang and Hai Wang. Detecting outlying subspaces for high-dimensional data: the new task, algorithms, and performance. Knowledge and Information Systems, 10(3):333–355, 2006.
[10] Andreas Arning, Rakesh Agrawal, and Prabhakar Raghavan. A linear method for deviation detection in large databases. In KDD, pages 164–169, 1996.
[11] Victoria J. Hodge and Jim Austin. A survey of outlier detection methodologies. Artificial Intelligence Review, 22(2):85–126, 2004.
[12] Arthur Zimek, Erich Schubert, and Hans-Peter Kriegel. A survey on unsupervised outlier detection in high-dimensional numerical data. Statistical Analysis and Data Mining, 5(5):363–387, 2012.
[13] Sridhar Ramaswamy, Rajeev Rastogi, and Kyuseok Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, SIGMOD '00, pages 427–438, New York, NY, USA, 2000. ACM.
[14] Ville Hautamäki, Ismo Kärkkäinen, and Pasi Fränti. Outlier detection using k-nearest neighbour graph. In ICPR (3), pages 430–433, 2004.
[15] Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander. LOF: Identifying density-based local outliers. SIGMOD Rec., 29(2):93–104, May 2000.
[16] Gustavo H. Orair, Carlos H. C. Teixeira, Wagner Meira, Jr., Ye Wang, and Srinivasan Parthasarathy. Distance-based outlier detection: Consolidation and renewed bearing. Proc. VLDB Endow., 3(1-2):1469–1480, September 2010.
[17] Marco Eichelberg, Thomas Aden, Jörg Riesmeier, Asuman Dogac, and Gokce B. Laleci. A survey and analysis of electronic healthcare record standards. ACM Comput. Surv., 37(4):277–315, December 2005.
[18] Rui Henriques, S. Pina, and Cláudia Antunes. Temporal mining of integrated healthcare data: Methods, revealings and implications. SDM IW on Data Mining for Medicine and Healthcare, pages 52–60, 2013.
[19] J. Lin, E. Keogh, A. Fu, and H. Van Herle. Approximations to magic: finding unusual medical time series. In Computer-Based Medical Systems, 2005. Proceedings. 18th IEEE Symposium on, pages 329–334, June 2005.
[20] Weng-Keen Wong, Andrew Moore, Gregory Cooper, and Michael Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In ICML, pages 808–815, 2003.
[21] Wan-Shiou Yang and San-Yih Hwang. A process-mining framework for the detection of healthcare fraud and abuse. Expert Systems with Applications, 31(1):56–68, 2006.
[22] Robert Fabrikant, Paul E. Kalb, Pamela H. Bucy, and Mark D. Hopson. Health care fraud: enforcement and compliance. Law Journal Press, 2014.
[23] Hans-Peter Kriegel, Peer Kroger, Erich Schubert, and Arthur Zimek. Interpreting and Unifying Outlier Scores, chapter 2, pages 13–24.
[24] DATASUS, Informatics department of the Brazilian public health system. http://www2.datasus.gov.br/DATASUS/index.php. Online.
[25] Roger J. Marshall. Mapping disease and mortality rates using empirical bayes estimators. Applied Statistics, pages 283–294, 1991.

Mining Strongly Correlated Intervals with Hypergraphs

Hao Wang∗    Dejing Dou†    Yue Fang‡    Yongli Zhang§

∗ Computer and Information Science, University of Oregon
† Computer and Information Science, University of Oregon
‡ Department of Decision Science, University of Oregon
§ Department of Decision Science, University of Oregon

Abstract
Correlation is an important statistical measure for dependency estimation between numerical attributes in multivariate data sets. Previous research related to correlation discovery has mostly been dedicated to finding attribute sets with strong piecewise correlations. Other research, such as correlation preserving discretization, builds discretization schemas that preserve the underlying correlation. In this paper, we consider the research problem of discovering strongly correlated interval sets from numerical data. We propose a hypergraph model to capture the underlying correlation structure in multivariate numerical data and an algorithm to discover the strongly correlated interval sets. Strongly correlated interval sets are interesting because the data could be strongly correlated within certain interval sets while the two attributes are generally less correlated or not correlated at all. Normalized correlation is also proposed as the optimization goal, which helps generate the intervals without user-defined thresholds. Our experimental results on synthetic, neuroscience and social health network data validate our algorithm.

1 Introduction Correlation is a widely used statistic measure for dependency estimation between numerical attributes in multivariate data sets. Its value typically reflexes the degree of covariance and contra-variance relationship between the attributes of the data. The data in strongly correlated attributes not only happen together, but also associate with unanimous trends of raising and falling. Previous data mining algorithms focus on discovering attribute sets with high piecewise correlations between attributes. For example, from the meteorology data one might discover that humidity and precipitation has strong positive correlation while air pressure and precipitation has a negative correlation. These correlations indicate that, whenever the humidity (or air pressure) is increasing (or decreasing), the precipitation has a high probability to be increasing as well. The discovered strongly correlated attribute sets present a high level picture of the dependency relationship in the data. However, these correlations only reveal the dependency between the attributes for the data in the full ranges. Numerical data is usually a set of samples from continuous variables. It should contain richer information than just the high level piecewise correlations between attributes. The pattern discovered from the numerical data should be in a more detailed form in order to represent the underlying intricate dependency structure. For example, in the meteorology ∗ Computer




data, the precipitation is usually involved in a very complicated dependency relationship with other attributes, which makes weather forecasting a hard problem. In general, the rate of precipitation is more positively (or negatively) correlated with humidity (or air pressure) especially when the humidity is large enough. The common way to define such more detailed patterns on numerical data is through a set of intervals with corresponding properties. In this paper, we aim to discover interval sets with strong correlations. The patterns discovered will be in the form of, for example, "Humidity[20%, 30%], Precipitation[70%, 90%], Correlation 0.81". The correlation of the interval set, 0.81 in this example, is defined as the correlation of the data that falls inside all the intervals in the set. When intervals from more than two attributes are involved, the correlation of the interval set is defined as the average pairwise correlation over all intervals in the interval set. Compared with the correlations between whole numerical attributes, strongly correlated interval sets have the potential to represent much more detailed patterns. Therefore, the discovered strongly correlated interval sets provide valuable insight into the more complicated dependency relationships encoded in the data. For example, in the stock market, the demand for stocks and bonds generally rises as the price falls. This market principle of price and demand is the cornerstone of a stable financial market. However, under certain circumstances, such as a potential economic crisis, when the price falls below certain thresholds and a crash in the stock market is triggered, the demand decreases together with the decline of the price over a certain price range. In this example, association rules from historical transaction data with either positive or negative correlation on certain price ranges provide investors with useful insight on when and how to reduce the risk of their financial investments.

The problem of discovering interesting interval sets has been addressed by research from many perspectives. Discretization [7] is one of the intuitive ways to generate intervals on numerical attributes. Different discretization algorithms make various trade-offs among information loss, compression rate, and computational complexity. Mehta et al. [10] proposed a PCA based unsupervised discretization method which preserves the underlying correlation structure during the discretization process. Correlation preserving discretization makes segmentations of the numerical data while taking the underlying

correlation structure into consideration. Most of the time, correlation preserving discretization serves as a preprocessing step for various other algorithms such as classification and frequent itemset mining. While it is able to discover strongly correlated interval sets, the drawbacks of discretization still hold in this method. The crisp boundary problem [6] in discretization forces the boundary optimization to make trade-offs across all attributes, so that none of the discovered interval sets is globally optimal. The number of segmentations is another dilemma that discretization based methods face. More segmentations mean less information loss during the discretization process, while discretization with fewer segmentations results in large intervals, so that correlations within small intervals cannot be discovered.

In this paper, we propose an algorithm with a hypergraph random walk model to discover strongly correlated interval sets in numerical data. We propose to model the numerical data with a hypergraph representation and use average commute time distances to capture the underlying correlation structure. An algorithm is proposed to discover the interval sets with high pairwise correlations. In our algorithm, interval set generation and correlation optimization work as a single-step process. The boundaries of the intervals in each interval set are optimized independently. Therefore, they do not suffer from the crisp boundary problem of the discretization methods.

The correlation in this paper is defined by the Spearman's rank correlation coefficient [5]. The correlation coefficient is an estimate of the correlation; it is a random variable that depends on both the underlying correlation and the number of data samples available. Statistically, the estimation deviation has a negative relation with the number of instances [5]. For example, for a data set with only 2 data instances the estimated correlation is always ±1 with infinite deviation, while for a data set with an infinite number of samples, the estimate converges to the true correlation value with 0 deviation. For the reasons stated above, in this paper we introduce the normalized correlation ρ_norm [12] to discover the interval sets without a user-predefined threshold. The normalized correlation is defined as ρ_norm = ρ·√n, in which ρ is the correlation estimate, such as the Spearman's rank correlation, and n is the number of data samples used for the estimation. The normalized correlation ρ_norm has the following advantages:

• It makes a trade-off between the estimated value and the estimation accuracy. The patterns we find will not contain patterns with huge deviation, such as patterns supported by only two data instances.

• The effort to search for and optimize a user-defined threshold is discarded. A rule with too few data


instances results in a low ρ_norm due to the large deviation, and so does a rule with a very large number of data instances, such as the full range rules, due to its lower correlation value.

• It gives us the ability to find interesting interval-set based patterns that used to fall outside the threshold.

Our main contributions in this paper are:

• We propose a hypergraph random walk model to capture the underlying correlations of the numerical data. This model constructs and measures the average commute time distance between numerical data values and is thus able to capture the correlation relationships in the data well.

• We propose a strongly correlated interval set discovery method based on the hypergraph random walk model. This algorithm is able to generate strongly correlated interval sets efficiently with high accuracy.

• We propose the normalized correlation to generate strongly correlated interval sets without a user-predefined threshold.

The rest of this paper is organized as follows: We give a brief introduction of related work in Section 2. We present the problem definition in Section 3. We give a detailed description of our method in Section 4. We report experimental results in Section 5. We conclude the paper in Section 6.

2 Related work

Previous research has proposed various methods for the discovery of interesting interval sets. Discretization and quantitative association mining (QAM) are the two most common methods. Discretization is one of the most straightforward ways to generate intervals for various optimization goals. Kotsiantis and Kanellopoulos [7] made a thorough survey of discretization techniques. These discretization methods target different optimization goals, such as minimizing the information loss, maximizing the entropy, and so forth. Mehta et al. [10] proposed a PCA based unsupervised discretization method which preserves the underlying correlation structure during the discretization process. The discretization serves as an independent process whose output can be fed into many other data mining tasks such as association mining and clustering. Quantitative association mining is another technique related to interval set discovery. It is an extension of traditional association mining that generates interval sets on numerical data instead of categorical data. Srikant and Agrawal [11] first proposed an algorithm that dealt with numerical attributes by discretizing numerical data into categorical data. Fukuda et al. [13][14] proposed several

methods that either maximize the support with a predefined confidence or maximize the confidence with a predefined support by merging adjacent instance buckets. Among the algorithms above, those that use discretization suffer from information loss during the discretization, the catch-22 problem [11], and the crisp boundary problem [3] as well.

Figure 2: Correlation Measure

3 Problem Definition

Before introducing our novel method for discovering strongly correlated intervals, we present the formal definition of the problem.

3.1 Data Representation. Figure 1 shows an example of numerical data with values sorted in ascending order on each attribute. Let A = {Att_1, Att_2, ..., Att_M} be the set of attributes, in which M is the number of attributes. Let I = {Inst_1, Inst_2, ..., Inst_N} be the set of data instances, in which N is the number of data instances. The value of the ith attribute Att_i on the jth data instance Inst_j is denoted as D_ij. The cutting points c_i1, c_i2, ..., c_i(N−1) are the averages of each two adjacent values on the corresponding attribute Att_i. The interval boundaries of the patterns are defined on these cutting points. The sets of attributes, instances, and intervals for an interval set are denoted as S_Att, S_Inst, and S_Inter respectively. A pattern R is of the form R = ∪_{Att_i ∈ S_Att} Att_i[c_il, c_ih], [correlation c].

Figure 1: Numerical Data Representation

3.2 Correlation Measure. The correlation used in this paper is the Spearman's rank correlation coefficient [5], defined as

    ρ = Σ_i (x_i − x̄)(y_i − ȳ) / sqrt( Σ_i (x_i − x̄)² · Σ_i (y_i − ȳ)² ),

in which x̄ and ȳ stand for the average order ranks of attributes x and y respectively. The value of the Spearman's rank correlation coefficient lies in [−1, 1], in which +1 and −1 indicate the maximum positive and negative correlations respectively. As shown in Figure 2, the values of attribute 2 in itemset 1 rise strictly along with the values in attribute 1, so the correlation measure is +1. In itemset 2, the values of attribute 2 fall when the values of attribute 1 rise, so the correlation is −1. Interval sets with correlation 1 or −1 are strongly correlated, and uncorrelated interval sets have correlation close to 0.

4 Method

Discovering strongly correlated interval sets is a complex process, as the combination of correlations is not linear and the correlation must be recalculated each time intervals are combined. For an interval set {S_Att, S_Inter}, the correlation is defined as the average correlation between each pair of intervals in the set. The computation of this correlation requires a complexity of C²_{n_att}·n²_{inst} = O(N²M²), in which Cᵐ_n stands for the number of combinations. Using a bottom-up merging method, for example, we initialize N intervals on each attribute and merge the adjacent intervals with the minimum correlations; a correlation computation is required for each pair of intervals that may be merged. This computation requires a complexity of O(N·C²_N·C²_{n_att}·N²) = O(N⁵M⁴). As the number of combinations of attribute sets is O(M!), the complexity of bottom-up mining applied to all combinations of attribute sets is then O((M + 4)!·N⁴). Therefore, using the straightforward merging algorithm to discover strongly correlated interval sets is intractable. In this section, we present our hypergraph based method, which solves the problem efficiently.
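Before describing the hypergraph model, here is a small illustrative sketch of the correlation measure of Section 3.2 and the normalized correlation of Section 1 (our own code, not the authors'; the function and variable names are hypothetical). It restricts the data to a candidate interval set, computes the average pairwise Spearman correlation with scipy, and returns ρ_norm = ρ·√n.

```python
import numpy as np
from scipy.stats import spearmanr

def interval_set_correlation(data, intervals):
    """Average pairwise Spearman correlation of the rows inside every interval.

    data      : (N, M) array of numerical values
    intervals : dict {attribute_index: (low, high)} describing the interval set
    Returns (rho, rho_norm, n) with rho_norm = rho * sqrt(n).
    """
    mask = np.ones(len(data), dtype=bool)
    for att, (low, high) in intervals.items():
        mask &= (data[:, att] >= low) & (data[:, att] <= high)
    subset = data[mask]
    n = len(subset)
    if n < 3:                          # too few points for a meaningful estimate
        return 0.0, 0.0, n
    atts = sorted(intervals)
    rhos = []
    for i, a in enumerate(atts):       # average over all attribute pairs in the set
        for b in atts[i + 1:]:
            rho_ab, _ = spearmanr(subset[:, a], subset[:, b])
            rhos.append(rho_ab)
    rho = float(np.mean(rhos))
    return rho, rho * np.sqrt(n), n

# toy usage with two positively related attributes
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 500)
toy = np.column_stack([x, x + 0.05 * rng.normal(size=500)])
print(interval_set_correlation(toy, {0: (0.2, 0.3), 1: (0.1, 0.4)}))
```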


We propose a hypergraph based model to represent the numerical data. The correlation measure is captured by the average commute time distance between vertices in the hypergraph model.

4.1 Hypergraph representation of numerical data. A hypergraph is a generalization of a regular graph in which each edge can be incident with more than two vertices. It is usually represented by G = (V, E, W), in which V, E, and W are the set of vertices, the set of edges, and the weights assigned to the corresponding edges respectively. The incidence matrix H of a hypergraph G is defined by

    (4.1)    H(v, e) = 1 if v ∈ e, and H(v, e) = 0 otherwise.

As shown in Figure 3, each data value D_ij corresponds to a vertex in the hypergraph. Each pair of vertices with adjacent values, D_ij and D_i(j+1), is connected through a hyperedge with a weight proportional to the inverse of the distance between them. The vertices belonging to the same data instance (one from each attribute) are connected by a hyperedge as well, with a user-defined weight.

Figure 3: Hypergraph Representation of Numerical Data

Zhou et al. [15] generalized the random walk model to hypergraphs and defined the average commute time similarity S_ct and the Laplacian similarity S_L+. The average commute time similarity n(i, j) is defined by

    (4.2)    n(i, j) = V_G (l⁺_ii + l⁺_jj − 2 l⁺_ij),

in which l⁺_ij is the (i, j) element of the matrix L⁺, L is the hypergraph Laplacian defined by

    (4.3)    L = D_v − H W D_e^(-1) H^T,

and {·}⁺ stands for the Moore-Penrose pseudoinverse. D_v and D_e denote the diagonal matrices containing the degrees of the vertices and edges respectively, and V_G = tr(D_v) is the volume of the hypergraph. The average commute time distance is defined by the inverse of the normalized average commute time similarity [8]. As mentioned in [9], the commute time distance n(i, j) between two nodes i and j has the desirable property of decreasing when the number of paths connecting the two nodes increases. This intuitively matches the behavior of the effective resistance of the equivalent electrical network [4].

Figure 4: Relationship Between Correlation and Average Commute Time Distance
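The following sketch is a minimal numpy illustration of the definitions in equations 4.1 to 4.3 (not the authors' implementation; the toy hypergraph is hypothetical). It builds the Laplacian of equation 4.3 and evaluates the commute-time quantity of equation 4.2 via the Moore-Penrose pseudoinverse.

```python
import numpy as np

def hypergraph_commute_times(H, w):
    """Commute-time matrix n(i, j) = V_G * (l+_ii + l+_jj - 2 l+_ij).

    H : (num_vertices, num_edges) 0/1 incidence matrix
    w : (num_edges,) hyperedge weights
    """
    W = np.diag(w)
    De = np.diag(H.sum(axis=0))                  # edge degrees delta(e) = |e|
    Dv = np.diag(H @ w)                          # vertex degrees d(v) = sum of incident edge weights
    L = Dv - H @ W @ np.linalg.inv(De) @ H.T     # equation (4.3)
    Lp = np.linalg.pinv(L)                       # Moore-Penrose pseudoinverse L+
    vol = np.trace(Dv)                           # V_G = tr(D_v)
    d = np.diag(Lp)
    return vol * (d[:, None] + d[None, :] - 2.0 * Lp)   # equation (4.2)

# toy hypergraph: 4 vertices, 3 weighted hyperedges
H = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [0, 1, 0]], dtype=float)
w = np.array([1.0, 0.5, 2.0])
print(hypergraph_commute_times(H, w))
```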


4.2 Correlation and Average Commute Time Distance. The average commute time distance is a random walk distance that represents the average distance traversed when travelling from one node in a graph to another over all possible routes. The relationship between the random walk average commute time distance and correlation is as follows. As in Figure 4, the data instances t1, t2, and t3 are strongly negatively correlated and the data instances t4, t5, and t6 are not strongly correlated. As mentioned in the last section, the weight of the hyperedge between each two adjacent values is the inverse of their distance in the original data. An indirect random walk path on strongly correlated data instances, for example N11 → N21 → N22 → N12, is shorter than a corresponding path on loosely correlated data instances, because the corresponding data values on the other correlated attributes are closer as well. In this case, the average commute time distance between strongly correlated vertices is relatively shorter than between vertices that are not strongly correlated. For this reason, the average commute time distance from the random walk model is capable of capturing the correlation measure for our strongly correlated interval set discovery problem.

Function: Interval Set Discovery
  Input: L, H
  Initialize: C ∈ R^(M×M)
  for i, j = 1 to M
      C_ij = V_G (l⁺_ii + l⁺_jj − 2 l⁺_ij)
  while mergeable
      for i, j = 1 to M
          Merge Interval

Function: Merge Interval
  Input: C
  Initialize: IS = {I_1, I_2, I_3, ..., I_M} with n_k ∈ i_k, D = C
  while intervals remain to be merged do:
      k = min_index(D_{k,k+1}, k ∈ 1...M−1)
      i_k = Merge(i_k, i_{k+1})
      for k = 1 to M−1
          D_{i,k} = D_{i,k} + D_{i+1,k}
          D_{k,i} = D_{k,i} + D_{k,i+1}
      delete D_i
      delete D_j

Table 1: Algorithm Description

Figure 5: Algorithm Illustration

4.3 Algorithm Description. Based on the observations in the last section, we propose an algorithm for discovering strongly correlated interval sets in numerical data. The pseudo code of our algorithm is shown in Table 1. The algorithm first builds the hypergraph random walk model as described in Section 4.1; the incidence matrix H is built for the hypergraph. In the function Interval Set Discovery, the Laplacian matrix is computed from equation 4.3. The average commute time distance matrix is generated with entry (i, j) given by formula 4.2. The distance between adjacent data instances is the inverse of the average commute time similarity. In the function Merge Interval, a bottom-up mining process is applied simultaneously on all attributes. For each attribute, an interval i_k is initialized with boundary [c_i, c_(i+1)] for node n_k, i.e., exactly one node per interval. A distance matrix C_i is maintained for each attribute; the distances between intervals are initialized as the average commute time distances of the nodes corresponding to those intervals. In each iteration, for each attribute, the algorithm looks up the minimum distance between adjacent intervals i_k and i_(k+1), then merges these two intervals. The distance matrix is updated accordingly. Every few iterations, for each generated interval, we scan the corresponding intervals on the remaining attributes. The normalized correlation is calculated for these pairs of intervals. The interval sets are used to update the final result, and only the top k interval sets with the best normalized correlations are kept in the result set. If the distance metric is above a user-predefined threshold, such as for Interval_a and Interval_b in Figure 5, then the two intervals are combined into one attribute interval set {S_Att, S_Inter} and an interval set is generated. The mining process continues to generate interval sets until the intervals on all attributes merge into the full range.
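A minimal runnable sketch of the per-attribute bottom-up merging step follows (our paraphrase of the Merge Interval function in Table 1; the function names and toy inputs are hypothetical). Adjacent intervals with the smallest distance are merged and the distance matrix rows and columns are collapsed, as in the pseudo code.

```python
import numpy as np

def merge_intervals(cut_points, dist):
    """Bottom-up merging of adjacent intervals on a single attribute.

    cut_points : sorted cutting points c_1 < ... < c_k defining k-1 initial intervals
    dist       : (k-1, k-1) distance matrix between the initial intervals,
                 e.g. average commute-time distances between their vertices
    Yields the interval list after every merge until one full-range interval remains.
    """
    intervals = [[cut_points[i], cut_points[i + 1]] for i in range(len(cut_points) - 1)]
    D = np.array(dist, dtype=float)
    while len(intervals) > 1:
        # adjacent pair with the minimum inter-interval distance
        adj = [D[i, i + 1] for i in range(len(intervals) - 1)]
        k = int(np.argmin(adj))
        intervals[k][1] = intervals[k + 1][1]        # merge interval k with k+1
        del intervals[k + 1]
        D[k, :] += D[k + 1, :]                       # collapse row/column k+1 into k
        D[:, k] += D[:, k + 1]
        D = np.delete(np.delete(D, k + 1, axis=0), k + 1, axis=1)
        yield [tuple(iv) for iv in intervals]

# toy attribute with 5 cutting points (4 initial intervals)
cuts = [0.0, 1.0, 2.0, 3.0, 4.0]
D0 = [[0, 1, 5, 6],
      [1, 0, 4, 5],
      [5, 4, 0, 1],
      [6, 5, 1, 0]]
for step in merge_intervals(cuts, D0):
    print(step)
```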


5 Experimental Results

In this section, we present the experimental results of our algorithm. We first evaluate our algorithm on synthetic data sets to validate the accuracy of the interval sets it finds. We perform the evaluation on two random distributions: a Gaussian distribution and a Mixed Gaussian distribution. We further evaluate our algorithm on two real-life data sets: a neuroscience data set, NEMO [1], and the SMASH [2] data set, which is a health social network data set. The experimental results on the synthetic data, the NEMO data, and the SMASH data are presented in Sections 5.1, 5.2, and 5.3 respectively.

Quality            Weight Setting A    Weight Setting B
Gaussian           90%                 91%
Mixed Gaussian     81%                 76%

Table 2: Experiment on the Synthetic Data Sets

5.1 Experiment on the synthetic data. Since real-life data sets with correlation benchmarks are rare, we first evaluate our algorithm with synthetic data whose parameters are controlled. We define the quality of an interval set as the proportion of its intersection with the benchmark rule generated by brute-force search. We perform this evaluation on randomly generated Gaussian and Mixed Gaussian distributions, 1000 times each. The experimental results are shown in Table 2. As mentioned in Section 4, there are two kinds of hyperedges in our hypergraph model: hyperedges between adjacent data values on the same attribute, and hyperedges across attributes, which represent the data instances. In Table 2, weight setting A and weight setting B put a higher weight on the first and the second kind of hyperedges respectively. The rules generated by our algorithm reach roughly 90% and 80% accuracy on the Gaussian and Mixed Gaussian distributions respectively. This shows that our algorithm is indeed capable of finding the most strongly correlated interval sets efficiently.

Notice that the experiment on the Mixed Gaussian distribution shows a roughly 10% performance drop compared with the Gaussian distribution. This is because, in the process of building up interval sets from lower dimensions to higher dimensions, the algorithm relies on the projections onto the lower dimensions in the first few iterations. When two Gaussian sources have mixed projections on all related dimensions, a deviated interval might be generated. Figure 6 shows a Mixed Gaussian distribution with mixed projections on both dimensions and the interval set generated accordingly. The blue and red rectangles indicate the areas found by the hypergraph method and by brute-force search respectively. In this case we can clearly see how the deviation happens: the projections of the two Gaussian sources interfere on each of the lower dimensions where the intervals are first built.

Figure 6: Interval Set Discovery from Mixed Gaussian Data

The ability to counteract this projection interference is adjustable in our algorithm. As mentioned earlier, the weight setting between the two kinds of hyperedges adjusts the weighting of the random walk. Intuitively, the higher the weight assigned to the hyperedges between adjacent data values, the higher the probability that the random walk travels locally instead of across attributes. This gives the algorithm a better ability to distinguish interference from other data sources. The results in Table 2 validate this intuition. In this table, weight setting A has a higher weight on the adjacent-data-value hyperedges and weight setting B has a higher weight on the instance hyperedges. Weight setting A results in a better performance on the Mixed Gaussian data.
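Since the quality measure above is the proportion of intersection with the benchmark rule, it can be computed directly from the interval boundaries. The sketch below is illustrative only and assumes the proportion is taken relative to the benchmark's volume, which is one plausible reading of the definition; it handles the axis-aligned case of Figure 6.

```python
def interval_overlap(found, benchmark):
    """Proportion of the benchmark interval set covered by the discovered one.

    found, benchmark : dict {attribute: (low, high)} over the same attributes.
    Returns intersection volume divided by benchmark volume.
    """
    inter_vol, bench_vol = 1.0, 1.0
    for att, (b_lo, b_hi) in benchmark.items():
        f_lo, f_hi = found[att]
        inter_vol *= max(0.0, min(f_hi, b_hi) - max(f_lo, b_lo))
        bench_vol *= (b_hi - b_lo)
    return inter_vol / bench_vol if bench_vol > 0 else 0.0

# e.g. a discovered rectangle vs. a brute-force benchmark rectangle (toy numbers)
print(interval_overlap({0: (0.2, 0.9), 1: (0.1, 0.8)},
                       {0: (0.3, 1.0), 1: (0.0, 0.7)}))
```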


5.2 Experiment on the NEMO data. NEMO [1] is the abbreviation of the Neural ElectroMagnetic Ontologies system. It is designed for the purpose of neuroscience data sharing and analysis. The data used from the NEMO system are event-related potential (ERP), or "brainwave", data. In the past decades, studies of ERP data have revealed many complex neural patterns that can be used to predict human behavior, cognition, and neural function. ERP data are measured through electroencephalography, i.e., the measurement of electrical activity on the scalp using electrodes. The values of ERP data are therefore all numerical. Each tuple in the data represents the activities in various parts of the cortical network at some specific time. The study of correlations in the ERP data has the potential to reveal connection and function patterns in the cortical network.

In Table 3(a) and Table 3(b) we list the patterns found by the correlation preserving discretization algorithm [10] and by our hypergraph based algorithm respectively. The attributes in these tables reflect the activity measured in a certain part of the cortical network at a specific time. For example, the attribute IN-LOCC stands for the left occipital lobe and IN-ROCC stands for the right occipital lobe. The patterns discovered from the NEMO data can reveal interesting information regarding the connection and function patterns in the cortical network. For example, the rule in Table 3(a), "IN-LOCC[-1.87, 0.10], IN-ROCC[-1.90, -0.09]", reveals a correlation between the left occipital lobe and the right occipital lobe. The left occipital lobe in the human brain is responsible for vision and eye control functions, while the right occipital lobe is in charge of reading and writing skills. The relation between IN-LOCC and IN-ROCC implies a potential connection between the left and right lobes in the cortical network. It also suggests a possible affiliation between the vision function and the reading and writing skills.

We evaluate the correlated interval sets by both correlation and correlation gain. The correlation gain is defined as the ratio between the correlation of the interval set and the correlation of the attributes involved. Notice that even when the correlation of a rule in Table 3 is high, its correlation gain can be relatively low.
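Correlation gain is straightforward to compute once the interval-set correlation is available. The sketch below is our own illustration for the two-attribute case, with hypothetical argument names; it compares the Spearman correlation inside the interval pair with the full-range correlation of the same attributes.

```python
from scipy.stats import spearmanr

def correlation_gain(data, att_a, att_b, iv_a, iv_b):
    """Ratio of the correlation inside an interval pair to the full-range correlation.

    data         : (N, M) numpy array of numerical values
    att_a, att_b : attribute (column) indices
    iv_a, iv_b   : (low, high) intervals on the two attributes
    """
    rho_full, _ = spearmanr(data[:, att_a], data[:, att_b])
    mask = ((data[:, att_a] >= iv_a[0]) & (data[:, att_a] <= iv_a[1]) &
            (data[:, att_b] >= iv_b[0]) & (data[:, att_b] <= iv_b[1]))
    rho_set, _ = spearmanr(data[mask, att_a], data[mask, att_b])
    return rho_set / rho_full if rho_full != 0 else float("inf")
```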

(a) Top Five Rules from the Correlation Preserving Discretization

Attribute Set                                                            Correlation   Correlation Gain   Normalized Correlation
IN-LATEM[2.79, 3.31], IN-LFRON[-3.42, -1.24], IN-RFRON[-3.42, -1.24]     -0.87         21.40              0.93
IN-LPTEM[-0.91, -0.42], IN-RATEM[-0.91, -0.40]                           0.89          6.43               0.43
IN-LPTEM[-0.42, -0.33], IN-RFRON[-1.24, -0.38]                           0.94          1.13               0.26
IN-LOCC[-1.87, 0.10], IN-ROCC[-1.90, -0.09]                              1.00          1.00               0.66
IN-RPTEM[-0.88, -0.40], IN-LPTEM[-0.91, -0.42]                           0.99          0.99               0.48

(b) Top Five Rules Ranked by Correlation Gain

Attribute Set                                        Correlation   Correlation Gain   Normalized Correlation
TI-max[236.00, 308.00], IN-RFRON[-39.26, -0.34]      0.90          66.89              6.61
IN-LFRON[0.99, 7.11], IN-RATEM[2.41, 17.28]          1.00          32.95              9.53
IN-LATEM[0.45, 1.31], IN-LFRON[-5.48, -1.89]         -1.00         27.21              1.75
IN-LFRON[0.99, 7.11], IN-LATEM[2.43, 17.38]          1.00          27.21              9.55
IN-RORB[-1.05, -0.07], IN-LPTEM[0.22, 3.34]          -1.00         12.25              1.74

(c) Top Five Rules Ranked by Normalized Correlation

Attribute Set                                        Correlation   Correlation Gain   Normalized Correlation
IN-LATEM[-7.47, 24.30], IN-RATEM[-7.58, 24.81]       0.98          1.00               13.63
IN-LOCC[-11.80, 11.32], IN-LPAR[-9.72, 9.69]         0.98          1.21               13.63
IN-LPTEM[-0.87, 19.82], IN-RPTEM[-0.95, 20.13]       0.98          1.00               13.55
IN-LPAR[-9.77, 9.69], IN-RPAR[-9.76, 9.43]           0.96          1.04               13.29
IN-LATEM[-7.47, 18.95], IN-RORB[-9.08, 17.83]        0.95          3.13               13.11
Table 3: Experiment Results from the Neuroscience Data

A high correlation gain implies that even though two attributes are largely uncorrelated most of the time, a strong correlation might still be found inside some interval sets. A rule with high correlation gain is valuable because it can reveal interval sets that do not show up in the overall picture, i.e., strongly correlated interval sets hidden under less correlated attributes. On the other hand, for rules with a high correlation but a low correlation gain, most of the correlation comes from the attributes and not from the intervals themselves. For example, for the last rule in Table 3(a), "IN-RPTEM[-0.88, -0.40], IN-LPTEM[-0.91, -0.42]", even though it has a near-perfect correlation of 0.99, the correlation gain is 0.99, which means the correlation for these two intervals is even lower than the overall correlation between these two attributes. Such rules are therefore not useful, since the correlation between the two attributes already gives us all the information that we need.

Comparing the experimental results in Table 3(a) and Table 3(b), our rules show not only better correlation but better correlation gain as well. The interval sets returned by our algorithm have an average correlation of 0.98 and an average correlation gain of 33.2, while the rules returned by correlation preserving discretization have an average correlation of 0.94 and an average correlation gain of 6.2. After a deeper look into the NEMO data, we found that the NEMO attributes have relatively high correlations between attributes. The correlation between left-right brain counterparts, for example IN-LOCC and IN-ROCC, is usually greater than 0.9. In this case, the ability to find interval sets with higher correlation gain is more useful, since the correlation between the correlated attributes could easily be discovered by previous methods.


In Table 3(c) we list the results from our algorithm with the normalized correlation as the correlation measure. The results from our algorithm in Tables 3(b) and 3(c) also have a higher normalized correlation than the results from the correlation preserving discretization in Table 3(a). As we mentioned in Section 1, the normalized correlation makes a trade-off between the correlation estimate and the estimation accuracy. The two factors that affect the normalized correlation are the correlation estimate and the number of data samples. With regard to these two factors, correlation preserving discretization has to make a trade-off in its choice of the number of segmentations. Discretization with more segmentations results in smaller intervals and statistically higher correlation estimates. However, with smaller intervals, fewer data samples fall into each interval, which reduces the correlation estimation accuracy and hence the normalized correlation value as well. It is usually hard for a discretization method to choose a number of segmentations that yields both a good correlation estimate and good estimation accuracy.

5.3 Experiment on the SMASH data. SMASH [2] is the abbreviation of the Semantic Mining of Activity, Social, and Health Data project. It was designed for the purpose of learning the key factors that spread healthy behaviors in a social network. It consists of distributed personal devices and a web-based platform that collect data on both social and physical activity. The data collected in this project include social connections and relations, physical activities, and biometric measures from the subjects. After preprocessing, the input data in our experiment contain the following attributes for the physical activities and biomedical measures. The physical activity indicator Ratio No.Steps is the ratio of the number of steps that

(a) Top Five Rules from the Correlation Preserving Discretization

Attribute Set                                            Correlation   Correlation Gain   Normalized Correlation
Ratio LDL[0.74, 0.85], Ratio No.Steps[0.32, 0.89]        0.62          13.70              1.68
Ratio BMI[0.93, 1.01], Ratio No.Steps[0.89, 1.21]        0.82          5.13               1.31
Ratio LDL[0.99, 1.09], Ratio BMI[0.93, 1.01]             0.81          2.86               0.73
Ratio BMI[1.01, 1.12], Ratio LDL[0.99, 1.09]             0.78          2.75               1.16
Ratio BMI[1.01, 1.12], Ratio LDL[1.09, 1.31]             0.64          2.25               1.34

(b) Top Five Rules from the Hypergraph Based Method

Attribute Set                                            Correlation   Correlation Gain   Normalized Correlation
Ratio HDL[0.97, 1.26], Ratio BMI[0.89, 1.00]             0.94          45.22              1.67
Ratio HDL[0.79, 0.91], Ratio No.Steps[0.78, 3.94]        0.57          28.15              2.51
Ratio LDL[1.00, 1.02], Ratio No.Steps[0.39, 3.82]        1.00          22.11              2.01
Ratio HDL[0.78, 1.25], Ratio No.Steps[0.27, 3.96]        0.37          18.41              3.85
Ratio LDL[0.88, 1.02], Ratio No.Steps[0.01, 3.82]        0.72          16.12              1.65

(c) Top Five Rules Ranked by Normalized Correlation

Attribute Set                                            Correlation   Correlation Gain   Normalized Correlation
Ratio LDL[0.70, 1.00], Ratio HDL[0.86, 1.00]             0.55          2.01               5.75
Ratio LDL[1.00, 1.18], Ratio HDL[1.00, 1.14]             0.53          1.93               4.99
Ratio LDL[0.69, 0.93], Ratio BMI[0.97, 0.99]             0.63          21.01              4.14
Ratio LDL[0.92, 1.05], Ratio No.Steps[1.17, 2.31]        0.56          10.68              3.87
Ratio HDL[1.02, 1.08], Ratio No.Steps[1.20, 1.91]        0.79          194.41             3.71

Table 4: Experiment Results from the SMASH Data

the human subjects walked in two consecutive periods of time. Three biomedical measures, HDL, LDL, and BMI, are used as health condition indicators. HDL and LDL stand for high-density and low-density lipoprotein respectively. A higher rate of HDL is usually associated with a decreasing rate of heart-related disease, and the reverse holds for LDL. BMI stands for body mass index, which is a common indicator of obesity level. The study of this data set is dedicated to discovering the relations between physical activity, obesity, and the health condition of the cardiovascular system.

In Table 4 we list our experimental results on the SMASH data. The interval sets in Table 4(a) and 4(b) are results from the correlation preserving discretization [10] and from our hypergraph based method respectively. The second and third interval sets in Table 4(b) demonstrate two origins of high correlation gain. Although the correlation for the second rule is only 0.57, it results in an even higher correlation gain than the third rule. For the third rule, although the correlation is the maximum of 1.00, part of its high correlation comes more from the two attributes than from the interval set itself. Comparing the experimental results in the two tables, our algorithm not only returns interval sets with higher correlations, but also with higher correlation gains. The first interval set in Table 4(b), "Ratio HDL[0.97, 1.26], Ratio BMI[0.89, 1.00], Correlation 0.94, Correlation Gain 45.22", has a fairly high correlation gain of 45.22, which means the data in this interval set is 45 times more correlated than the general correlation between these two attributes. Strongly correlated interval sets provide us interesting information regarding the relationship between the health condition


of the cardiovascular system and obesity. As we mentioned in the last paragraph, as a good health indicator, HDL is usually expected to have a strong correlation with other health condition factors such as BMI. However, the correlation between the two attributes HDL and BMI would mostly be ignored because it is not large enough to attract any attention. The rule discovered by our algorithm shows that when the Ratio BMI change is in the moderate range, say close to 1.00, the correlations are much larger than under the other conditions. It indicates that with regard to weight variations, no matter whether a subject is gaining or losing weight, the first few pounds might be the ones that affect the health condition most. On the other hand, it also indicates that drastic exercise with a rapid rate of weight loss is likely, on the contrary, to result in a deterioration of cardiovascular health.

Notice how the rules discovered by our algorithm are able to overlap with each other. For example, for the second and third rules in Table 4(b), "Ratio HDL[0.79, 0.91], Ratio No.Steps[0.78, 3.94]" and "Ratio LDL[1.00, 1.02], Ratio No.Steps[0.39, 3.82]", the intervals on Ratio No.Steps have a large overlap between 0.78 and 3.82. Compared with the correlation preserving discretization method, the optimization of interval boundaries in our algorithm is independent for each interval set. The rules found by correlation preserving discretization suffer heavily from the crisp boundary problem. Notice how the boundaries of the intervals on attribute Ratio BMI are decided for the last four interval sets in Table 4(a): the intervals for Ratio BMI in these rules have only two choices, Ratio BMI[0.93, 1.01] and Ratio BMI[1.01, 1.12]. Similar cases hold for the other attributes, Ratio LDL and Ratio No.Steps, as well.

The decision of the boundary on each attribute has to take into consideration the correlations with all other attributes. The trade-offs made in the boundary decisions of the discretization methods make it hard for them to optimize every interval set discovered in the data. Therefore, all the interval sets found by the correlation preserving discretization method are suboptimal.

In Table 4(c) we list the results from our algorithm with the normalized correlation as the correlation measure. Because the normalized correlation makes a trade-off between the correlation estimate and the estimation accuracy, the correlation estimates in this table are lower than the ones in the other two tables. However, the results in this table reveal interesting patterns as well. As we mentioned earlier, HDL and LDL are positive and negative indicators of health, respectively, especially for obesity-related diseases. People would usually expect that LDL and HDL always have a negative correlation. However, the first and second rules in Table 4(c) give us a positive correlation between HDL and LDL when the change ratio is close to 1, i.e., when the subjects do not incur drastic changes of activity.

The last rule in Table 4(c), which has an extremely high correlation gain, does not show up in Table 4(a) and 4(b). This rule does not appear in the first two tables because the sizes of its intervals are not above the user-defined threshold. Without the user-defined threshold, the normalized correlation gives us the potential to find rules that are valuable but not prominent enough to be noticed otherwise. The rule indicates that, even though the situation is rare, HDL and No.Steps have a positive correlation when the amount of exercise increases drastically. Although exercise generally does not make drastic changes to HDL, when the subject changes the amount of exercise drastically, such as at the beginning of a weight reduction program, it results in a greater change of the HDL index.

6 Conclusions

We present in this paper an algorithm for discovering strongly correlated interval sets from numerical data. Previous research either dedicates itself to discovering correlated attribute sets over the full ranges of the numerical data or uses discretization methods to transform and compress the numerical data before pattern discovery. These methods, however, have drawbacks from various perspectives. The method we propose in this paper can discover strongly correlated interval sets from less correlated or uncorrelated attributes in numerical data. The discovered interval sets are not only strongly correlated but also have independently optimized boundaries with regard to the correlation measures. Our experimental results show that we are indeed able to find better strongly correlated interval sets with higher correlation, correlation gain, and normalized correlation.

References

[1] AIMLAB, University of Oregon. Neural ElectroMagnetic Ontologies. http://aimlab.cs.uoregon.edu/NEMO/web/. Last accessed: Oct, 2014.
[2] AIMLAB, University of Oregon. Semantic Mining of Activity, Social, and Health data. http://aimlab.cs.uoregon.edu/NEMO/web/. Last accessed: Oct, 2014.
[3] Y. Aumann and Y. Lindell. A statistical theory for quantitative association rules. In ACM International Conference on Knowledge Discovery and Data Mining, pages 261–270, 1999.
[4] P. Doyle and L. Snell. Random walks and electric networks. Applied Mathematics and Computation, 10:12, 1984.
[5] D. Freedman, R. Pisani, and R. Purves. Statistics. W. W. Norton & Company, 2007.
[6] H. Ishibuchi, T. Yamamoto, and T. Nakashima. Fuzzy data mining: effect of fuzzy discretization. In IEEE International Conference on Data Mining, pages 241–248, 2001.
[7] S. Kotsiantis and D. Kanellopoulos. Discretization techniques: A recent survey. International Transactions on Computer Science and Engineering, 32(1):47–58, 2006.
[8] H. Liu, P. LePendu, R. Jin, and D. Dou. A hypergraph-based method for discovering semantically associated itemsets. In IEEE International Conference on Data Mining, pages 398–406, 2011.
[9] L. Lovász. Random walks on graphs: A survey. Combinatorics, Paul Erdős is Eighty, 2(1):1–46, 1993.
[10] S. Mehta, S. Parthasarathy, and H. Yang. Toward unsupervised correlation preserving discretization. IEEE Transactions on Knowledge and Data Engineering, 17(9):1174–1185, 2005.
[11] R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. In ACM International Conference on Management of Data, pages 1–12, 1996.
[12] V. Struc and N. Pavesic. The corrected normalized correlation coefficient: A novel way of matching score calculation for LDA-based face verification. In International Conference on Fuzzy Systems and Knowledge Discovery, pages 110–115, 2008.
[13] T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Mining optimized association rules for numeric attributes. In Symposium on Principles of Database Systems, pages 182–191, 1996.
[14] T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Data mining using two-dimensional optimized association rules: scheme, algorithms, and visualization. In ACM International Conference on Management of Data, pages 13–23, 1996.
[15] D. Zhou, J. Huang, and B. Schölkopf. Learning with hypergraphs: Clustering, classification, and embedding. In Advances in Neural Information Processing Systems, pages 1601–1608, 2007.

Modeling Trajectories for Diabetes Complications

Pranjul Yadav, Lisiane Pruinelli*, Andrew Hangsleben†‡, Sanjoy Dey†‡, Katherine Hauwiller, Bonnie L. Westra, Connie W. Delaney, Vipin Kumar, Michael Steinbach, Gyorgy J. Simon

* Department of Computer Science and Engineering, University of Minnesota
† School of Nursing, University of Minnesota
‡ Institute for Health Informatics, University of Minnesota




Abstract Diabetes mellitus (DM) is a prevalent and costly disease and if not managed effectively, it leads to complications in almost every body system. Evidence-based guidelines for prevention and management of DM exist, but they ignore the trajectory along which the disease developed. With the implementation of electronic health records (EHRs), sufficiently detailed data are available to elucidate diabetes trajectories and sequences of diabetes-related comorbidities across subpopulations. As a first step, we developed a Diabetes Mellitus Complication Index (DMCI) using EHR data to summarize the patients' condition pertinent to diabetes into a single score. Next, we modeled trajectories of developing diabetic complications based on groups of patients with varying diabetic complications at baseline. Such knowledge can form the basis for future trajectory-centered guidelines.

1 Introduction and Background

Diabetes mellitus (DM) affects 11.3% (25.6 million) of Americans age 20 or older and is the seventh leading cause of death in the United States [1]. There is considerable research on risk factors to predict and manage diabetic outcomes [1]. Without appropriate management of diabetes, patients are at risk for secondary diseases in almost every body system at later time points. Evidence-based practice (EBP) guidelines for management and prevention of diabetic complications synthesize the latest scientific evidence. While EBP guidelines have been shown to improve care, they consider neither the patient's trajectory nor the sequence of events that lead up to the patient's current conditions. In this work, we show that such information is invaluable; a patient's risk of developing further complications depends on their



trajectory thus far. Simple disease models describing a single typical diabetes trajectory as a sequence of successively worsening conditions exist [2]. However, these models were aimed more at patient education than at a physiologically accurate description of the evolution of the underlying disease pathology. Such simple models obviously cannot form the basis of evidence-based guidelines.

In heterogeneous diseases, analyzing the data on a per-subpopulation basis has been shown to elucidate more interesting patterns than analyzing the entire population [3, 4]. In this work, we hypothesize, with abundant supporting evidence [5], that diabetes and the underlying metabolic syndrome follow multiple trajectories. We aim to develop a methodology that is capable of elucidating scientifically accurate diabetes trajectories retrospectively from the extensive clinical data repository of a large Midwestern health system. Specifically, we study a diabetic population and track changes to their health over time in terms of diabetes-related comorbidities as documented in the electronic health record (EHR).

Diabetes, its severity, and the ensuing complications can be described most accurately through a large number of correlated EHR data elements, including associated diagnoses, laboratory results, and vitals. The relationships among these data elements, known as multicollinearity, render efforts to track patients' conditions across time fraught with data overfitting issues. To contain the collinearity problem we summarize the patients' condition into a single dimension (a single score), which we term the Diabetes Mellitus Complication Index (DMCI).

The development of severity indices from EHRs builds on a rich history. Even in the context of DM, several risk scores for diabetes from EHRs have been developed [6].




Most risk score models focus on predicting the risk of diabetes rather than the risk of the associated complications. Two risk scores have specifically focused on diabetes complications [7, 8] to predict outcomes; however, their diabetes complication indices were limited to complications based on International Classification of Diseases (ICD) codes alone [8] or to asking patients if they were ever informed that they had DM complications [5]. In a diabetic population like ours, good predictors of the complications do not necessarily coincide with good predictors of diabetes, given that the metabolic syndromes in our patients have already evolved past diabetes. The inclusion of additional variables, such as lab results and vital signs, may provide useful information for early prediction of complications. This necessitates the development of a new diabetes complication index to be used in our effort to study patient trajectories.

Our work makes the following novel contributions. First, we develop the DMCI, which summarizes a patient's health in terms of post-diabetic complications into a single score. Second, through the use of this score, we track a patient's health and show that distinct trajectories in diabetes can be identified, demonstrating the need and laying the foundation for future clinical EBP guidelines that take trajectories into account.

2 Data Preparation

After Institutional Review Board (IRB) approval, a de-identified data set was obtained from a Midwest University's clinical data repository (CDR). The CDR contains over 2 million patients from a single Midwest health system that has 8 hospitals and 40 clinics. Data elements included various EHR attributes, such as demographic information (age, gender); vital signs: systolic blood pressure (SBP), diastolic blood pressure (DBP), pulse, and body mass index (BMI); and laboratory test results: glomerular filtration rate (GFR), hemoglobin A1c, low-density lipoprotein cholesterol (LDL), high-density lipoprotein cholesterol (HDL), triglycerides, and total cholesterol. Further, ICD-9 codes related to both Type 1 and Type 2 DM and their accompanying complications, such as ischemic heart disease (IHD), cerebrovascular disease (CVD), chronic kidney disease (CKD), congestive heart failure (CHF), peripheral vascular disease (PVD), Diabetic Foot, and Ophthalmic complications, were used in this study.

3 Study Design and Cohort Selection

For our study, we used Jan. 1, 2009 as the baseline. The study cohort consists of patients with type 1 or type 2 DM at baseline, identified in billing transactions. Patients were included if they had at least two A1c results at least 6 months apart after baseline. Patients with no laboratory results or vitals before 2009 were excluded on the basis that they showed no indication of receiving primary care at the health system. The final cohort consists of 13,360 patients. Patients' initial DMCI was determined at baseline, and their health (in terms of the DMCI score) was followed until the last follow-up. The mean follow-up time was 1568 days with a standard deviation of 263 days.

4 Diabetes Risk Score Development

The novel DMCI was developed using Cox proportional hazards survival modeling techniques. Each of the 7 complications (CKD, CVD, CHF, PVD, IHD, Diabetic Foot, Ophthalmic) was modeled through a separate Cox regression model using patients who did not already present with the complication at baseline. Cox proportional hazards models [10] are survival models which estimate the hazard λ_j(t) for patient j at time t based on covariates Z_j and a baseline hazard λ_0(t). The hazard function has the form shown in equation 1:

    λ_j(t | Z_j) = λ_0(t) · exp(Z_j β)                         (1)

The coefficient vector β is estimated by maximizing the partial likelihood, which can be done using the Newton-Raphson algorithm [11]. The partial likelihood [12] has the form shown in equation 2:

    L(β) = Π_{i: C_i = 1} [ θ_i / Σ_{j: Y_j ≥ Y_i} θ_j ]        (2)

where θ_j = exp(Z_j β) and C_i is an indicator function: C_i = 1 if the event occurred and C_i = 0 if the event was censored. The baseline hazard is common to all patients. Besides the complications (except for the one being modeled), age, gender, obesity, hypertension and hyperlipidemia diagnoses, laboratory test results, and vitals were included as covariates. Backward elimination [13] was employed for variable selection.
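A hedged sketch of this modeling step is shown below, assuming a pandas DataFrame with hypothetical column names (baseline_X, time_X, event_X per complication plus shared covariate columns) and the lifelines implementation of the Cox model. Per-complication models are fit on patients free of that complication at baseline, and their linear predictors are later combined into the DMCI by a concordance-weighted sum, as described below in equations 3 and 4. This is our illustration, not the authors' code.

```python
import pandas as pd
from lifelines import CoxPHFitter

# hypothetical column names; the real covariate set is chosen by backward elimination
COMPLICATIONS = ["CHF", "IHD", "CVD", "PVD", "CKD", "FOOT"]
COVARIATES = ["age", "gender", "obesity", "hypertension", "hyperlipidemia",
              "a1c", "gfr", "ldl", "hdl", "sbp", "dbp", "bmi"]

def fit_complication_models(df):
    """One Cox proportional hazards model per complication,
    fit on patients who did not present with that complication at baseline."""
    models = {}
    for comp in COMPLICATIONS:
        at_risk = df[df[f"baseline_{comp}"] == 0]
        cph = CoxPHFitter()
        cph.fit(at_risk[COVARIATES + [f"time_{comp}", f"event_{comp}"]],
                duration_col=f"time_{comp}", event_col=f"event_{comp}")
        models[comp] = cph
    return models

def dmci_scores(df, models, weights):
    """Weighted sum of per-complication linear predictors (lifelines centers the
    covariates, which only shifts the score by a constant)."""
    risks = pd.DataFrame({comp: m.predict_log_partial_hazard(df[COVARIATES])
                          for comp, m in models.items()})
    w = pd.Series(weights)            # e.g. the concordance-based weights of Table 1
    return risks[w.index.tolist()].mul(w, axis=1).sum(axis=1)
```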


Each of the 7 regression models (one for each complication) provided an estimate of the coefficients, which can be interpreted as the relative risk of developing the complication in question. For example, the first regression model estimates the risk of a patient developing CHF; similarly, the second regression model estimates the risk of a patient developing IHD. In order to compute a patient's risk of developing diabetes-induced complications, we compute a weighted average of the patient's risks from the six regression models. Ophthalmic conditions are no longer considered, as they have insufficient patient coverage (fewer than 100 patients). A patient's risk from an individual regression model was computed using equation 3:

    r_ij = Z_i · β_j                                            (3)

where r_ij denotes the ith patient's risk for the jth complication, Z_i represents the covariates of the ith patient, and β_j are the coefficients estimated for the jth complication. Using this information, the ith patient's risk R_i for any diabetes-induced complication is then computed using equation 4:

    R_i = Σ_{j=1}^{6} w_j · r_ij                                (4)

This risk R_i is the risk of a patient developing diabetes-related complications. We named this risk the DMCI. The DMCI score is the weighted sum of the linear predictions from the six regression models. Concordance probability estimates [14] are used to determine the performance of the corresponding regression models; they are also used to weight the individual regression models. Table 1 presents the weights assigned to each model, with the respective complication as the outcome.

Table 1. Weights for the Individual Regression Models
Complication    Model Weight
CHF             0.787
IHD             0.569
CVD             0.694
PVD             0.688
CKD             0.758
FOOT            0.712

Therefore, the DMCI score can be thought of as approximately 6 times the relative risk a patient faces of developing a complication (any diabetic complication).

5 Subpopulation Trajectory Extraction

Using the DMCI score, the health status trajectory of every patient from 2009 onwards was calculated. Since individual patient trajectories might be susceptible to noise and outliers, we decided to group patients and their trajectories (time-stamped sequences of DMCI scores) by complications. First, we considered a single complication at a time, creating seven categories: patients presenting with CKD, CVD, etc. at baseline. A patient presenting with multiple complications falls into all applicable categories. Next, we considered pairs of complications: e.g., one possible category could consist of patients with IHD and diabetic foot problems. For every category (subpopulation of patients), the shape of the DMCI score trajectory was determined through segmented linear regression with 3 knots. One can think of these regression models as a straight line with one elbow (at x̂). These trajectories can be expressed in the form below:

    y = a·x + b,  if x < x̂
    y = c·x + d,  if x ≥ x̂

where a, b, c, d ∈ R. The residual sum of squares (RSS) was used as the objective function to obtain the coefficients of the segmented linear regression and the location of the elbow point. The RSS has the form shown in equation 5:

    RSS = Σ_{i=1}^{n} (y_i − ŷ_i)²                              (5)

In equation 5, y_i is the risk at the ith time stamp and ŷ_i is the corresponding risk computed using segmented linear regression.
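A small sketch of the one-elbow segmented fit follows (a brute-force search over candidate elbow locations; our illustration, not the authors' code). For each candidate x̂ the two half-lines are fit by least squares, and the elbow minimizing the RSS of equation 5 is kept.

```python
import numpy as np

def fit_one_elbow(x, y):
    """Fit y = a*x + b for x < xhat and y = c*x + d for x >= xhat,
    choosing the elbow xhat (among observed x values) that minimizes the RSS."""
    order = np.argsort(x)
    x, y = np.asarray(x, dtype=float)[order], np.asarray(y, dtype=float)[order]
    best = None
    for k in range(2, len(x) - 2):               # keep at least 2 points on each side
        a, b = np.polyfit(x[:k], y[:k], 1)
        c, d = np.polyfit(x[k:], y[k:], 1)
        pred = np.concatenate([a * x[:k] + b, c * x[k:] + d])
        rss = float(np.sum((y - pred) ** 2))     # equation (5)
        if best is None or rss < best["rss"]:
            best = {"rss": rss, "xhat": x[k], "a": a, "b": b, "c": c, "d": d}
    return best

# toy trajectory: risk rises for ~550 days and then flattens, similar to the CHF curve
t = np.linspace(0, 1500, 60)
risk = np.where(t < 550, 4.4 + 0.0005 * t, 4.675)
risk = risk + np.random.default_rng(1).normal(0, 0.01, t.size)
print(fit_one_elbow(t, risk))
```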

6 Results

Table 2 provides the count of patients in various cohorts. Table 3 provides the counts for various populations with comorbidities.

Table 2. Patient Counts for Single DM Comorbidities
Comorbidity    Count
IHD            4398
CHF            741
CVD            986
PVD            662
CKD            742
Foot           267

Table 3. Patient Counts for DM Comorbidities
Comorbidity    Count
IHD, CVD       457
IHD, CKD       361
IHD, CHF       478
IHD, PVD       379
IHD, Foot      662

Figure 1 presents the DMCI trajectories for the various subpopulations. The horizontal axis denotes time since baseline in days and the vertical axis corresponds to the DMCI score. Each curve in the graph represents a subpopulation defined by a single complication. For example, the bottommost curve corresponds to patients presenting with CHF at baseline. Their average risk of developing a complication (other than CHF, which they already have) is 4.4 at baseline. It increases steadily for approximately 550 days, at which point it reaches 4.7, and then it becomes flat (stops increasing materially going forward). As observed from the graph, the average risk associated with patients diagnosed with CKD is comparatively higher than that of patients diagnosed with CHF.

Figure 1 shows that (i) subpopulations defined by various complications at baseline have different average risks at baseline. This information is readily incorporated into existing indices and guidelines. The figure also shows that (ii) these patients have different patterns of risk moving forward. For example, the risk of developing a complication increases sharply for CHF patients for 550 days and then becomes flat. In contrast, the risk for IHD patients increases steadily (but at a lower rate) throughout the observation period, and the risk for CKD patients (topmost curve) increases at a much lower rate.

Table 4. Distribution of Scores for Different Subgroups
Complication     Min-Risk   Risk-25   Risk-50   Risk-75   Max-Risk
IHD              -6.02      1.86      3.92      5.98      22.68
CHF              -5.17      1.98      3.86      5.91      12.55
PVD              -5.23      2.07      3.90      6.20      14.37
CKD              -5.62      1.77      3.94      5.94      14.71
CVD              -6.72      2.16      4.00      6.04      14.37
Diabetic Foot    -4.64      2.08      3.95      6.08      14.37

Figure 1: Health status trajectory for varying subpopulations

Figure 2: Shape of the individual quartiles for patients diagnosed with diabetic foot


Figure 3: Trajectories of patients with IHD and an additional complication

Figure 1 presents the average risk for each population. To illustrate the distribution of the risk, Table 4 provides the interquartile range of the DMCI score in each subpopulation. Using this information, the risk trajectories of patients belonging to the top 25% (the highest-risk quartile) of their respective subgroups were analyzed. Figure 2 presents the average behavior for the highest-risk quartile. In order to investigate whether the shape of the health-risk trajectory for each quartile within a subgroup is similar, the patterns for each quartile were explored for multiple subpopulations. Figure 2 depicts the shape of the individual quartiles for patients diagnosed with diabetic foot. The figure shows that having a different risk at baseline tells only part of the story: these patients not only have different risks, but they also exhibit different progression patterns; their DMCI curves have different shapes. Figure 3 depicts the trajectories of patients with IHD and an additional complication. The results suggest that even in a subpopulation defined by a single complication, significant heterogeneity exists, as evidenced by the differing shapes of the trajectory curves.

7 Discussion and Conclusion

The purpose of this study was to model patients' progression towards diabetes complications through the use of a novel index, the DMCI, derived from EHR data. The DMCI was used to stage the patients' health in terms of diabetic complications. The results clearly demonstrated the existence of multiple trajectories in diabetes, thereby confirming the complex heterogeneity of the disease. Specifically, we divided patients into multiple (potentially overlapping) subpopulations based on their baseline complications and confirmed that patients with


different baseline complications have different risks of developing additional complications. Second, we have also shown that these patient subpopulations differ not only in their risk but also in the temporal behavior of their risk: patients in certain subpopulations accrue risk at a higher rate initially and at a slower rate later, while the DMCI score of patients in other subpopulations increases at a steady rate throughout the follow-up period. Third, we have also demonstrated that the trajectories differ even within the same patient subpopulation. Patients presenting with additional complications (e.g., a second complication on top of IHD) have different risks and different trajectories. Finally, we have also shown that when we stratify patients within the same subpopulation by their baseline risk, they exhibit different trajectories. This can naturally be a consequence of these patients suffering from additional complications, which explains their increased relative baseline risk.

These findings support a conclusion from a previous study that patient subgroups vary by level of severity. Dey et al. [4] used a national convenience sample of 581 Medicare-certified HHC agencies' EHRs for 270,634 patients to understand which patients are likely to improve in their mobility, and found that mobility status at admission was the single strongest predictor of mobility improvement [4]. However, very different patterns were apparent when conducting the analysis within each level of severity for mobility at admission.

An interesting finding in our study is that patients with diabetic foot problems have the highest severity at baseline, and more so when combined with IHD. This finding may be associated with the strict relationship between glycemic control and microvascular complications. Foot problems are associated with both nerve and vascular damage, creating a risk for infections. Uncontrolled glucose further exacerbates the potential for severe infections and potential amputations. Patients with diabetic

Patients with diabetic foot complications are likely to continue facing an increasing risk of additional problems, as foot problems are a leading cause of hospital admission, amputation, and mortality in diabetes patients [9]. Through our previous work [3] investigating diabetic subpopulations and their risk of mortality, we have already gained an appreciation of the immense heterogeneity of diabetes and the metabolic syndrome. Studying trajectories extends this heterogeneity along a new dimension. While the preliminary work presented in this study offers only a glimpse of the complexity of diabetes and its complications, it demonstrates the value of trajectories in understanding patient progression and, possibly, prognosis. Further research in this direction should lead to improvements in EBP guidelines by taking trajectories into account. Limitations of this study include the secondary use of EHR data and its associated challenges. The data in this study represent care provided in a single health system; the study needs replication in additional health settings and under different clinical conditions. The DMCI score was developed retrospectively from EHR data, and independent validation would be beneficial.

8 Acknowledgements This study is supported by National Science Foundation (NSF) grant IIS-1344135. Contents of this document are the sole responsibility of the authors and do not necessarily represent the official views of the NSF. This work was partially supported by Grant Number 1UL1RR033183 from the National Center for Research Resources (NCRR) of the National Institutes of Health (NIH) to the University of Minnesota Clinical and Translational Science Institute (CTSI).

References
[1] Centers for Disease Control and Prevention (CDC). National diabetes fact sheet. Available from: http://www.cdc.gov/diabetes/pubs/pdf/ndfs2011.pdf.


[2] B. Ramlo-Halsted, S. Edelman. The natural history of type 2 diabetes: practical points to consider in developing prevention and treatment strategies. Clinical Diabetes, 2000.
[3] V. Kumar, P. Yadav, et al. Progression and risk assessment of comorbid conditions in type 2 Diabetes Mellitus. Biomedical Informatics and Computational Biology (BICB) Symposium, Rochester, MN, 2014.
[4] S. Dey, J. Weed, et al. Data mining to predict mobility outcomes in home health care. AMIA, 2013.
[5] T. Tuomi, N. Santoro, S. Caprio, M. Cai, J. Weng, L. Groop. The many faces of diabetes: a disease with increasing heterogeneity. Lancet, 2014; 383: pp. 1084-94.
[6] G. Collins, S. Mallett, O. Omar, L. Yu. Developing risk prediction models for type 2 diabetes: a systematic review of methodology and reporting. BMC Med. 2011;9:103.
[7] B. Fincke, J. Clark, M. Linzer, et al. Assessment of long-term complications due to type 2 diabetes using patient self-report: the diabetes complications index. J Ambul Care Manage. 2005;28(3): pp. 262-273.
[8] B. Young, E. Lin, M. Von Korff, et al. Diabetes complications severity index and risk of mortality, hospitalization, and healthcare utilization. Am J Manag Care. 2008;14(1): pp. 15-23.
[9] N. Avitabile, A. Banka, V. Fonseca. Glucose control and cardiovascular outcomes in individuals with diabetes mellitus: lessons learned from the megatrials. Heart Failure Clinics. 2012;8(4): pp. 513-522.
[10] D. Cox. Regression Models and Life Tables. Journal of the Royal Statistical Society, Series B 34: pp. 187-220.
[11] I. Newton. Methodus fluxionum et serierum infinitarum, 1664-1671.
[12] P. Andersen, R. Gill. Cox's regression model for counting processes, a large sample study. Annals of Statistics 10(4): pp. 1100-1120.
[13] R. Hocking. The Analysis and Selection of Variables in Linear Regression. Biometrics, 32, 1976.
[14] G. Mithat, G. Heller. Concordance probability and discriminatory power in proportional hazards regression. Biometrika, 2005.

Hospital Corners and Wrapping Patients in Markov Blankets

Dusan Ramljak, Adam Davey, Alexey Uversky, Shoumik Roychoudhury, Zoran Obradovic∗

∗ Center for Data Analytics and Biomedical Informatics, Temple University, Philadelphia, PA 19122 USA
{dusan.ramljak, adavey, auversky, shoumik.rc, zoran.obradovic}@temple.edu

Abstract The American health care system is rife with perverse incentives: providers are reimbursed for providing more care rather than for preventive or higher-quality care. Starting in 2012, the Affordable Care Act (ACA) sets Medicare reimbursement rates based on hospital performance on 30-day preventable readmissions, relative to expectations, for three specific target diagnoses (AMI, CHF, and PN). This may have introduced an incentive for hospitals to under-diagnose these illnesses by substituting related diagnoses for which they will not be held accountable. We identify Markov blankets: diagnoses that can shield a target diagnosis from the rest of the disease network. Each target diagnosis can be accurately identified and inferred from a small subset of related diagnoses. This work suggests several important directions for evaluating the implementation of this component of the ACA. Specifically, the method can be used to identify true cases of the target diagnoses, to estimate the extent of gaming via substitute diagnoses, and to suggest related sets of diagnoses which, in combination, may provide more stable methods for setting reimbursement rates.

1 Introduction Under §3025 of the Affordable Care Act (ACA), as of October 1, 2012, hospital reimbursements have been based on performance relative to preventable 30-day Medicare hospital readmission rates observed in hospitals with similar predicted risk profiles. Three specific diagnoses are used to track reimbursement rates: acute myocardial infarction (AMI), congestive heart failure (CHF), and pneumonia (PN). As a direct result of this change in the structure of Medicare reimbursements, there is now more focus on problems such as the ability of health care providers to identify changing predictors of 30-day hospital readmissions [4, 5, 13], as well as to identify characteristics of individuals and providers associated with above-average levels of readmission risk. Hospitals that perform below expectations will see a reduction of up to 1% in Medicare-based reimbursements for services related to all diagnostic-related groups (DRGs).




Based on performance levels in 2010, these targets would have placed half of all hospitals in the under-performing group. In coming years, additional diagnoses will be added to the list used to determine reimbursement rates. Because hospitals cannot be penalized for diagnoses they do not make, physicians are incentivized to choose similar, but distinct, diagnoses in place of the criterion diagnoses. For example, a patient who is admitted to a hospital with AMI may initially be diagnosed as having chest pains or coronary atherosclerosis. If this patient is subsequently readmitted within the following 30 days, this diagnosis cannot be used to penalize the hospital for poor performance. Similarly, a patient with PN may initially be diagnosed as having acute bronchitis or an upper respiratory infection, and CHF may instead be diagnosed at first as chronic obstructive pulmonary disease. In practice, only those assessed as having the lowest risk of 30-day readmission may be likely to receive the target diagnosis. However, several specific diagnoses are likely to co-occur with the target diagnosis, and some procedures (e.g., angioplasty) may be strongly indicative of a specific underlying true diagnosis (e.g., AMI), serving as good proxy indicators of the true diagnosis. Evidence for changes to clinical practice, diagnoses, and associated procedures in response to changes in reimbursement has been well documented for more than 30 years (e.g., [11]), and there is reason to suspect that similar changes are already occurring due to the most recent changes enacted under the ACA. One way of estimating the extent of these changes, and of identifying cases that represent true instances of the criterion diagnoses, is to consider diagnoses as a set of connected nodes in a graph (connected by aspects such as co-occurrence). The Markov blanket (MB) of a node is the set of nodes that shields it from the rest of the network. Previous studies have shown that knowing the Markov blanket of a diagnosis node is all that is required to predict the value of the criterion, either by classification or by regression [10, 7, 8]. If the MB of a specific diagnosis can be identified prior to a policy change, it may be used to more accurately identify the set of criterion diagnoses

following the policy change, which can in turn be used to estimate true cases, as well as the extent of gaming of diagnoses that will occur due to the policy change. In this paper, we investigate these issues in a series of experiments aimed at answering two questions: 1) Which diagnoses can be used to approximate the MB of AMI, CHF, and PN? 2) Can a similar procedure be used to identify the MB of diagnoses for which similar policy changes have been enacted in the past, such as sepsis? Our experiments used discharge data from the California State Inpatient Databases (SID), obtained from the Healthcare Cost and Utilization Project (HCUP) provided by the Agency for Healthcare Research and Quality1. The SID is a component of the HCUP, a partnership between federal and state governments and industry, which tracks all hospital admissions at the individual level. We included all data from January 2003 through December 2011. Patients were excluded from the analysis if they did not have Medicare or Medicaid as the primary payer or if they were younger than 19 years of age. The final dataset included 16,736,927 discharge records, with the primary set of features used in our experiments being the Clinical Classifications Software (CCS) diagnoses for ICD-9-CM. CCS codes, developed as part of the HCUP, are designed to cluster patient diagnoses (hereinafter DX) and procedures (hereinafter PX) into a manageable number of clinically meaningful categories (272 diagnosis and 231 procedure codes). The rest of the paper is organized as follows. The model that we use in the experiments is described in Section 3. The data and experimental setup are described in Section 4, and the results are discussed in Section 5. We conclude the paper by outlining interesting areas for future study.

2 Related Work Some prior research has examined the role of comorbid conditions with the aim of identifying longer-term effects and mortality risk with a single target diagnosis in mind. Each of our target diagnoses has been considered in this fashion: AMI [12], CHF [1], PN [15], and sepsis [14]. To the best of our knowledge, ours is the first study concerned with identifying co-occurring diagnoses and procedures that can themselves serve as a proxy indicator of the target diagnosis, something necessary to identify potential instances of hospitals gaming the system to reduce risk exposure.

1 HCUP State Inpatient Databases (SID), Healthcare Cost and Utilization Project (HCUP), 2003–2011. Agency for Healthcare Research and Quality, Rockville, MD. www.hcup-us.ahrq.gov/sidoverview.jsp; http://www.ahrq.gov/research/data/hcup/index.html


3 Model Description In our quest to find a minimum subset of the most informative diagnoses associated with the diagnoses we analyze, we start by identifying the DXs and procedures that most frequently co-occur with our Target DX. Frequent co-occurrence by itself is necessary but not sufficient to establish an appropriate approximation of the Markov blanket. We therefore apply two different methodologies: a heuristic we define based on PageRank, and an established approximation of the Markov blanket using the Hilbert-Schmidt Independence Criterion (HSIC).

3.1 Approximating the Markov Blanket (MB) using PageRank We are not able to build a Bayesian network, since it is not clear which diagnoses and procedures are parents and which are children; we therefore opt for a Markov network in which the dependencies are defined by co-occurrences. However, every node that co-occurs with our Target DX is an element of its Markov blanket by definition, so we need to make a distinction between the nodes. Given edge weights defined by co-occurrence counts and the structure of the network of DXs and PXs that co-occur with our Target DX, we can use the PageRank value as a criterion to identify important nodes. We then take the subset of the most important nodes, those with the highest PageRank, as an approximate Markov blanket for our Target DX.
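As a minimal sketch of this heuristic (our simplified reading, not the authors' implementation), the snippet below builds a weighted co-occurrence graph over the CCS codes appearing in discharge records that contain the Target DX and returns the k codes with the highest PageRank as the approximate MB; the paper additionally builds networks of co-occurrences without the Target DX, which is omitted here.

import itertools
import networkx as nx

def pagerank_mb(discharges, target_dx, k=50):
    """discharges: iterable of sets of CCS DX/PX codes, one set per discharge record."""
    graph = nx.Graph()
    for codes in discharges:
        if target_dx not in codes:
            continue  # restrict to records containing the Target DX
        for a, b in itertools.combinations(sorted(codes), 2):
            # Edge weights accumulate co-occurrence counts.
            prev = graph.get_edge_data(a, b, default={"weight": 0})["weight"]
            graph.add_edge(a, b, weight=prev + 1)
    scores = nx.pagerank(graph, weight="weight")
    scores.pop(target_dx, None)  # rank only the codes co-occurring with the Target DX
    return sorted(scores, key=scores.get, reverse=True)[:k]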

3.2 Approximating the Markov Blanket (MB) using HSIC As an established machine learning approach, we adopt a feature selection method based on an efficient approximation of the Markov blanket (MB), which is the set of variables that can shield a given DX from the rest of the diagnoses and procedures [9]. MB-based feature selection has been shown to result in a theoretically optimal set of features [16]. However, its computational cost is prohibitive for application to high-dimensional Electronic Health Record (EHR) data. Therefore, instead of relying on conditional independence tests or network structure learning, we use HSIC as a measure of dependence among variables in a kernel-induced space. This allows an effective approximation of the MB that consists of multiple dependent features, rather than being limited to a single feature. The benefits of using HSIC include: 1) it can detect any dependence between two variable sets with a universal kernel in a high-dimensional kernel space; 2) it can measure the dependence between both discrete and continuous variables; and 3) it is easy to compute from the kernel matrices, without density estimation.

Given a set of features, we can check whether the set is the MB (MB_i) of feature F_i. However, evaluating all subsets of F for this property is prohibitively costly, so to reduce the search cost we evaluate only selected MB candidate subsets. Often there may be no exact MB for a given feature, but we can still identify an approximating MB which largely subsumes the information about this feature, so that the feature can be removed with little useful information lost. In this work we use a simple but effective method to find an approximating MB. The proposed method consists of three steps: 1) identifying MB candidates, 2) screening MB candidates, and 3) feature selection. In particular, we compute a Markov blanket candidate MB_i for a feature F_i such that each of its features F_{MB_i} satisfies

F_{MB_i} = argmax_{F_C} HSIC(K_{F_C}, K_i), where F_C ∈ (F − MB_i) ∪ {F_{MB_i}}.

In order to determine whether the candidate MB of feature F_i (referred to as MB_i) can be regarded as an actual approximation of the Markov blanket, we use the following screen test:

(3.1) HSIC(MB_i, C) > HSIC(MB_i ∪ F_i, C)

(3.2) HSIC(MB_i, C) > HSIC(F_i, C) ∧ HSIC(MB_i, F_i) > HSIC(F_i, C)

In this test, C is the target variable, and HSIC(X, Y) is the dependence measure between two variable sets X and Y. Condition 3.1 is satisfied by an irrelevant feature, while Condition 3.2 is satisfied by a redundant feature. By applying a Markov blanket to select the minimum subset of the most informative diagnoses and procedures associated with the diagnoses that we analyze, we hope to be able to estimate the true prevalence of various diseases. Additionally, we would like to provide more robust methods to estimate true hospital readmission rates where intentional under-diagnosis of such sentinel diagnoses is likely.
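A minimal sketch of the screen test, assuming RBF kernels and the standard biased HSIC estimator (the paper does not specify these choices), is given below; mb, f_i, and c are hypothetical NumPy arrays with one row per sampled record, and a feature is treated as subsumed if either condition holds.

import numpy as np

def rbf_kernel(X, gamma=1.0):
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def hsic(X, Y, gamma=1.0):
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n            # centering matrix
    K, L = rbf_kernel(X, gamma), rbf_kernel(Y, gamma)
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2  # biased HSIC estimate

def passes_screen(mb, f_i, c, gamma=1.0):
    """True if the candidate MB_i subsumes feature F_i with respect to target C."""
    irrelevant = hsic(mb, c, gamma) > hsic(np.hstack([mb, f_i]), c, gamma)   # condition (3.1)
    redundant = (hsic(mb, c, gamma) > hsic(f_i, c, gamma)
                 and hsic(mb, f_i, gamma) > hsic(f_i, c, gamma))             # condition (3.2)
    return irrelevant or redundant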

4 Data Description and Experimental Setup

4.1 Data Description The data that we use in our experiments come from the HCUP family of databases; the raw data consist of patient hospital visit records from California's SID in the period from January 2003 to December 2011. Each record consists of a number of attributes, which are explained in detail on the HCUP website1. This database contains more than 35 million (35,844,800) inpatient discharge records over the specified 9 years, for 19,319,350 distinct patients in 474 hospitals (436 AHAID identified; about 400 per year). The information is not specific to a group of hospitals, but rather represents the data for the entire state. The database also includes demographic information for each patient (such as age, birth year, sex, and race), diagnoses (primary and up to 24 secondary), procedures (up to 21), information about hospital stays, and other information (including length of stay, total charges, type of payment and payer, discharge month, and survival information). In addition to addressing the problem that we are considering, using this data could potentially give us insight into many other healthcare problems. For instance, we can explore and correct for global trends of the Target diseases, which intrigue many healthcare practitioners [3, 6]. After excluding patients younger than 19 years of age and patients that did not have Medicare or Medicaid as the primary payer, we were left with 16,736,927 discharge records. Table 1 shows the frequency of each Target DX in the SID data set, while Figure 1 displays this frequency over time.

Target DX    Sepsis       AMI        CHF          PN
Freq.        1,027,088    544,228    2,907,625    1,577,822

Table 1: Each Target DX frequency in the data set

1 http://www.hcup-us.ahrq.gov/db/state/siddist/sid_multivar.jsp
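As a rough illustration only (the column names and CCS code values below are assumptions, not the HCUP schema or the authors' code), the counts in Table 1 could be reproduced along the following lines once the SID discharge records are loaded into a data frame.

import pandas as pd

# Illustrative single-level CCS codes for the Target DX; verify against the HCUP CCS documentation.
TARGET_CCS = {"Sepsis": 2, "AMI": 100, "CHF": 108, "PN": 122}

def target_dx_frequencies(sid: pd.DataFrame) -> pd.Series:
    dx_cols = [c for c in sid.columns if c.startswith("DXCCS")]  # assumed CCS diagnosis columns
    # Exclusions from Section 4.1: age >= 19 and Medicare/Medicaid as primary payer.
    kept = sid[(sid["AGE"] >= 19) & (sid["PAYER"].isin(["Medicare", "Medicaid"]))]
    counts = {name: int((kept[dx_cols] == code).any(axis=1).sum())
              for name, code in TARGET_CCS.items()}
    return pd.Series(counts, name="Freq.")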


4.2 Experimental Setup We first identified subsets of records containing each Target DX (Sepsis, AMI, CHF, PN). Within each subset, diagnoses and procedures were ranked by their PageRank values in a network of co-occurrence with the Target DX; networks of co-occurrences without the Target DX were generated as well. As proof of concept for this method, exactly 50 DX and PR codes (roughly 10% of the codes, accounting for approximately 95% of the records containing each Target DX) were identified. In general, the PageRank value of the first selected DX is approximately 10 times higher than that of the last selected DX. Ten replicate experiments were performed using a randomly selected set of 1,000 examples containing each Target DX and a set of 1,000 examples in which the Target DX does not appear. The goal of these analyses was to identify the subset of DXs that can be used to accurately predict the presence of the Target DX in the records. To achieve this goal we identified the approximate MB for each Target DX. Performance was evaluated by comparing classification accuracy using only the DXs in the MB against using the entire set of DXs.
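The sketch below illustrates one plausible reading of this protocol (the paper does not name the classifier; logistic regression and 5-fold cross-validation are assumptions): X is a binary DX indicator matrix that excludes the Target DX column itself, y flags the presence of the Target DX, and mb_columns indexes the approximate-MB features.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def compare_feature_sets(X, y, mb_columns, n_replicates=10, n_per_class=1000, seed=0):
    rng = np.random.default_rng(seed)
    acc_all, acc_mb = [], []
    for _ in range(n_replicates):
        # Sample 1,000 records with and 1,000 without the Target DX.
        pos = rng.choice(np.flatnonzero(y == 1), n_per_class, replace=False)
        neg = rng.choice(np.flatnonzero(y == 0), n_per_class, replace=False)
        idx = np.concatenate([pos, neg])
        clf = LogisticRegression(max_iter=1000)
        acc_all.append(cross_val_score(clf, X[idx], y[idx], cv=5).mean())
        acc_mb.append(cross_val_score(clf, X[idx][:, mb_columns], y[idx], cv=5).mean())
    # Mean accuracy with the full DX set versus the approximate-MB subset.
    return float(np.mean(acc_all)), float(np.mean(acc_mb))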

Figure 1: Monthly admission rates from 2003 to 2011 for each Target DX (Sepsis, AMI, CHF, PN)

5 Results Each of the four Target DX we examined in this study was fairly prevalent. As shown in Table 1, CHF was the most common (17.37%), followed by PN (9.43%), sepsis (6.13%), and AMI (3.25%). There were also considerable seasonal and secular trends in these Target DX. The prevalence of sepsis increased steadily across the study period, while the three other Target DX showed gradual decreases in prevalence over time, along with very strong seasonal trends. Table 4 shows the performance of the MB and PageRank methods in terms of accuracy, precision, sensitivity, specificity, and F1. Both methods perform well on accuracy and precision for sepsis. For the other diagnoses, accuracy is generally higher via MB, but precision is generally higher for PageRank. Excepting sepsis, sensitivity and specificity are higher for PageRank than for MB, but F1 values are higher for MB than for PageRank. For consistency with the HSIC results (below), we provide, based on PageRank, the top 7 DX/PR identified for sepsis and the top 5 DX/PR identified for each of the other Target DX (see Table 2). For sepsis, the list included peritonitis, injuries, and shock among diagnoses, and tracheostomy, non-cardiac catheterization, ostomy, and intubation among procedures. For AMI, no diagnoses were identified, but CABG, angioplasty, coronary thrombolysis, cardiac catheterization, and urinary endoscopy were identified. For CHF, heart valve disorders, carditis, and pulmonary heart disease were the identified diagnoses, and heart valve procedures and Swan-Ganz catheterization were the procedures. Finally, for PN, respiratory failure and shock were the identified diagnoses, and tracheostomy, bronchoscopy, and other respiratory procedures were identified.


As shown in Table 3, the Markov blanket for each Target DX consisted of a small subset of common diagnoses and procedures. For sepsis, the MB included hemorrhagic disorders; intestinal infection; urinary tract infections; skin ulcers; complications of device, implant, or graft; and other injuries as diagnoses, and non-cardiac catheterization as a procedure. For AMI, the MB included atherosclerosis and rehabilitation care as diagnoses, and endoscopy of the urinary tract, echocardiogram, and other diagnostic procedures. For CHF, the MB included other nervous system disorders, essential hypertension, hypertension with complications, atherosclerosis, and respiratory failure as diagnoses; no procedures were included in the MB. Finally, for PN, the MB included diabetes without complications, hemorrhagic disorders, hypertension with complications, congestive heart failure, and other lower respiratory disease; no procedures were included in the MB.

6 Conclusion The US healthcare system is rife with opportunities for perverse incentives. Implementation of any new healthcare policy results in changes within the healthcare system aimed at minimizing the adverse consequences of the policy change for healthcare providers. The changes that began in 2012 under the Affordable Care Act can be expected to reduce the number of individuals receiving target diagnoses of AMI, CHF, and PN as healthcare providers move to reduce their exposure to the adverse consequences of hospital readmissions. In this paper, we propose two novel applications to the problem of under-diagnosis: Markov blankets and PageRank. We find that, for each Target DX as well as for sepsis, a small number of diagnoses and procedures can serve to shield the target diagnosis from the rest of the disease network.

Sepsis
  Diagnoses:
    148 Peritonitis and intestinal abscess
    244 Other injuries and cond. due to ext. causes
    249 Shock
  Procedures:
    34 Tracheostomy; temporary and permanent
    54 Other vasc. catheteriz.; not heart
    73 Ileostomy and other enterostomy
    216 Resp. intub. and mech. vent.
AMI
  Diagnoses: None
  Procedures:
    44 Coronary artery bypass graft (CABG)
    45 Percutan. translum. coron. angiopl. (PTCA)
    46 Coronary thrombolysis
    47 Diag. cardiac catheteriz.; coron. arteriogr.
    100 Endosc. and endosc. biopsy of the urin. tract
CHF
  Diagnoses:
    96 Heart valve disorders
    97 Peri-; endo-; and myoc.; cardiomyop. (except TB or STD)
    103 Pulmonary heart disease
  Procedures:
    43 Heart valve procedures
    204 Swan-Ganz catheterization for monitoring
PN
  Diagnoses:
    131 Resp. failure; insufficiency; arrest (adult)
    249 Shock
  Procedures:
    34 Tracheostomy; temporary and permanent
    37 Diag. bronchosc. and biopsy of bronchus
    41 Other non-OR therap. proc. on resp. sys.

Table 2: The diagnoses and procedures used to form the PageRank-approximated MB for each Target DX

Sepsis
  Diagnoses:
    62 Coagulation and hemorrhagic disorders
    135 Intestinal infection
    159 Urinary tract infections
    199 Chronic ulcer of skin
    237 Complication of device; implant or graft
    244 Other injuries and cond. due to ext. causes
  Procedures:
    54 Other vascular catheterization; not heart
AMI
  Diagnoses:
    101 Coron. atheroscl. and other heart dis.
    254 Rehab. care; fit. of prosth.; and adj. of devices
  Procedures:
    100 Endosc. and endosc. biopsy of the urinary tract
    193 Diag. ultrasound of heart (echocardiogram)
    227 Other diagnostic procedures
CHF
  Diagnoses:
    95 Other nervous system disorders
    98 Essential hypertension
    99 Hypert. with compl. and sec. hypert.
    101 Cor. atheroscl. and other heart dis.
    131 Resp. fail.; insuffic.; arrest (adult)
  Procedures: None
PN
  Diagnoses:
    49 Diabetes mellitus without complication
    62 Coagulation and hemorrhagic disorders
    99 Hypert. with compl. and sec. hypertension
    108 Congestive heart failure; nonhypertensive
    133 Other lower respiratory disease
  Procedures: None

Table 3: The diagnoses and procedures used to form the HSIC approximation of the MB for each Target DX

Classification using only this subset of diagnoses achieves generally high accuracy and precision. Additionally, these diagnoses and procedures often point to clinically meaningful patterns. In general, while nearly all of the selected diagnoses and procedures make sense, with considerable overlap between the MB and PageRank methods, the associations are generally more directly obvious for selections made via PageRank and somewhat subtler for selections made via MB. The results suggest that PageRank may be useful for providing an initial screening of potentially useful diagnoses. However, it is unclear which method will ultimately prove most useful as the network of diagnoses and procedures surrounding a Target DX changes in response to policy.


To some extent, this problem is likely to pose a continuously moving target, so future research should more fully develop an understanding of the temporal forces as well, to determine whether, for example, the indicators of AMI depend on the month of admission. The approach used here is likely to be useful in the analysis of healthcare data in several ways. First, it provides a set of associated diagnoses and procedures that can be used to “impute” missing or unobserved data in an effort to estimate the true prevalence of various diseases. Second, it can be used to estimate the extent of “gaming” of diagnoses in response to policy changes.

DX code     Acc (All)    Acc (PR)    Acc (HSIC)
Sepsis      91.4%        91.4%       91.45%
AMI         79.6%        71.2%       76%
CHF         77.5%        65%         72.6%
PN          73.4%        64.8%       74.3%

Table 4: The classification accuracy for each Target DX using the MB alone (generated by PageRank (PR) and by HSIC) compared with the entire set of DXs

This suggests that our approach may also prove useful for adjusting estimates for this kind of gaming, and it could also provide more robust methods to estimate true hospital readmission rates where intentional under-diagnosis of such sentinel diagnoses is likely. Historically, there are several precedents for this kind of under-reporting: the Omnibus Budget Reconciliation Act of 1987 (OBRA87) was observed to have a considerable impact on medical practice in nursing home settings [2].

7 Acknowledgements This research was supported by DARPA Grant FA9550-12-1-0406 negotiated by AFOSR and by NSF through grant number NSF-14476570. HCUP, Agency for Healthcare Research and Quality, provided the data used in this study.

References
[1] J. B. Braunstein, G. F. Anderson, G. Gerstenblith, W. Weller, M. Niefeld, R. Herbert, and A. W. Wu, Noncardiac comorbidity increases preventable hospitalizations and mortality among Medicare beneficiaries with chronic heart failure, Journal of the American College of Cardiology, 42 (2003), pp. 1226–1233.
[2] R. Elon and L. G. Pawlson, The impact of OBRA on medical practice within nursing facilities, Journal of the American Geriatrics Society, 40 (1992), pp. 958–963.
[3] K. E. Jones, N. G. Patel, M. A. Levy, A. Storeygard, D. Balk, J. L. Gittleman, and P. Daszak, Global trends in emerging infectious diseases, Nature, 451 (2008), pp. 990–993.
[4] P. S. Keenan, S.-L. T. Normand, Z. Lin, E. E. Drye, K. R. Bhat, J. S. Ross, J. D. Schuur, B. D. Stauffer, S. M. Bernheim, A. J. Epstein, et al., An administrative claims measure suitable for profiling hospital performance on the basis of 30-day all-cause readmission rates among patients with heart failure, Circulation: Cardiovascular Quality and Outcomes, 1 (2008), pp. 29–37.
[5] H. M. Krumholz, Z. Lin, E. E. Drye, M. M. Desai, L. F. Han, M. T. Rapp, J. A. Mattera, and S.-L. T. Normand, An administrative claims measure suitable for profiling hospital performance based on 30-day all-cause readmission rates among patients with acute myocardial infarction, Circulation: Cardiovascular Quality and Outcomes, 4 (2011), pp. 243–252.
[6] C.-C. Liu, Y.-T. Tseng, W. Li, C.-Y. Wu, I. Mayzus, A. Rzhetsky, F. Sun, M. Waterman, J. J. Chen, P. M. Chaudhary, et al., DiseaseConnect: a comprehensive web server for mechanism-based disease–disease connections, Nucleic Acids Research, 42 (2014), pp. W137–W146.
[7] Q. Lou and Z. Obradovic, Feature selection by approximating the Markov blanket in a kernel-induced space, in ECAI, 2010, pp. 797–802.
[8] Q. Lou and Z. Obradovic, Predicting viral infection by selecting informative biomarkers from temporal high-dimensional gene expression data, in Bioinformatics and Biomedicine (BIBM), 2012 IEEE International Conference on, IEEE, 2012, pp. 1–4.
[9] Q. Lou, H. P. Parkman, M. R. Jacobs, E. Krynetskiy, and Z. Obradovic, Exploring genetic variability in drug therapy by selecting a minimum subset of the most informative single nucleotide polymorphisms through approximation of a Markov blanket in a kernel-induced space, in Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), 2012 IEEE Symposium on, IEEE, 2012, pp. 156–163.
[10] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, 2014.
[11] T. H. Rice, The impact of changing Medicare reimbursement rates on physician-induced demand, Medical Care, 21 (1983), pp. 803–815.
[12] F. A. Spencer, M. Moscucci, C. B. Granger, J. M. Gore, R. J. Goldberg, P. G. Steg, S. G. Goodman, A. Budaj, G. FitzGerald, K. A. Fox, et al., Does comorbidity account for the excess mortality in patients with major bleeding in acute myocardial infarction?, Circulation, 116 (2007), pp. 2793–2801.
[13] G. Stiglic, A. Davey, and Z. Obradovic, Temporal evaluation of risk factors for acute myocardial infarction readmissions, in Healthcare Informatics (ICHI), 2013 IEEE International Conference on, IEEE, 2013, pp. 557–562.
[14] D. Weycker, K. S. Akhras, J. Edelsberg, D. C. Angus, and G. Oster, Long-term mortality and medical care charges in patients with severe sepsis, Critical Care Medicine, 31 (2003), pp. 2316–2323.
[15] S. Yende, D. C. Angus, I. S. Ali, G. Somes, A. B. Newman, D. Bauer, M. Garcia, T. B. Harris, and S. B. Kritchevsky, Influence of comorbid conditions on long-term mortality after pneumonia in older people, Journal of the American Geriatrics Society, 55 (2007), pp. 518–525.
[16] L. Yu and H. Liu, Feature selection for high-dimensional data: A fast correlation-based filter solution, in ICML, vol. 3, 2003, pp. 856–863.
