Journal of Biomedical Informatics 75 (2017) 35–47


Journal of Biomedical Informatics journal homepage: www.elsevier.com/locate/yjbin

Identifying prescription patterns with a topic model of diseases and medications

Sungrae Park a, Doosup Choi a, Minki Kim b, Wonchul Cha c, Chuhyun Kim d, Il-Chul Moon a,⇑

a Department of Industrial and Systems Engineering, KAIST, Yusung-gu, Daejon, Republic of Korea
b College of Business, KAIST, Seoul, Republic of Korea
c Department of Emergency Medicine, Samsung Medical Center, Seoul, Republic of Korea
d Department of Emergency Medicine, Inje University College of Medicine and Seoul Paik Hospital, Seoul, Republic of Korea

Article info

Article history:
Received 4 August 2016
Revised 4 September 2017
Accepted 5 September 2017
Available online 27 September 2017

Keywords:
Topic modeling
Probabilistic graphical models
Medical information

Abstract

Wide variance exists among individuals and institutions in treating patients with medicine. This paper analyzes prescription patterns using a topic model with more than four million prescriptions. Specifically, we propose the disease-medicine pattern model (DMPM) to extract patterns from a large collection of insurance data by considering disease codes joined with prescribed medicines. We analyzed insurance prescription data from 2011 with DMPM and found prescription patterns that could not be identified by traditional simple disease classification, such as the International Classification of Diseases (ICD). We analyzed the identified prescription patterns from multiple aspects. First, we found that our model explains unseen prescriptions better than other probabilistic models. Second, we analyzed the similarities of the extracted patterns to identify their characteristics. Third, we compared the identified patterns from DMPM to the known disease categorization, ICD. This comparison showed what additional information can be provided by the data-oriented bottom-up patterns in contrast to the knowledge-based top-down categorization. The comparison results showed that the bottom-up categorization allowed for the identification of (1) diverse treatment options for the same disease symptoms, and (2) diverse disease cases sharing the same prescription options. Additionally, the extracted bottom-up patterns revealed treatment differences based on basic patient information better than the top-down categorization. We conclude that this data-oriented analysis will be an effective alternative method for analyzing the complex interwoven disease-prescription relationship.

© 2017 Elsevier Inc. All rights reserved.

1. Introduction

The common cold is the most frequent respiratory disease with which people become infected, yet it is surprising to see such diverse treatments, from a grandmother’s chicken soup to the prescription of recent antibiotics. The diversity of treatments makes patients, and sometimes doctors, wonder which treatment would be best for their cases. The treatments are mixtures of curing the causes of illnesses and easing the painful symptoms, but there is no standard for creating these mixtures because this process largely depends on the prescribing doctor’s perspective as well as the presented case. That being said, although a case-by-case approach may be inevitable to a certain extent [1], medical prescriptions also require either

⇑ Corresponding author. E-mail addresses: [email protected] (S. Park), [email protected] (D. Choi), [email protected] (M. Kim), [email protected] (W. Cha), [email protected] (C. Kim), [email protected] (I.-C. Moon).
https://doi.org/10.1016/j.jbi.2017.09.003
1532-0464/© 2017 Elsevier Inc. All rights reserved.

standards or guidance to sort and categorize the diversity of drug options. One way to categorize prescriptions is to use prescription meta-information (e.g., the disease category [2], the prescribing doctor’s specialization [3], or the chief symptom or information reported by the patient [4,5]). However, this kind of categorization has limitations because it may only capture a single aspect of the prescription rather than a holistic view, and because the prescribed medication or diagnosis needs to be pre-categorized with domain knowledge. There are categorizations based on observing the doses or the existence of specific medications, but these categorizations often fail to include the entire spectrum of medications used in the real world. Therefore, our research question is how to categorize prescriptions from a medication-oriented perspective using the guidance of meta-information as well as the entire spectrum of medications.

There are multiple approaches to categorizing prescriptions. The first approach is a categorization by subject-matter experts.


S. Park et al. / Journal of Biomedical Informatics 75 (2017) 35–47

For example, some cancers have standard care procedures, from surgery to therapy, and this includes a number of prescription patterns for each particular disease [6,7]. The second approach is a categorization based on the key aspects of the prescriptions. Examining whether a prescription contains antibiotics is one way of dividing a set of prescriptions into two different patterns [8,9]. The third approach is a categorization driven by the data analysis of a large collection of prescriptions. This is fundamentally a bottom-up procedure in which the data collection determines the key aspects of a prescription through statistical inferences. This approach identifies prescription patterns from the data rather than from subject-matter experts’ knowledge or established theories. Therefore, this approach is prone to errors that would be easily caught by humans, but it is also capable of objectively drilling down into large-scale prescription collections to find new information [10,11].

This paper introduces a statistical model for categorizing prescriptions with large-scale data. Specifically, the introduced model is a variant of topic models for identifying prescription patterns that reflect prescription meta-information and the entire spectrum of medication usage. Here, we limit ourselves to finding patterns, which we define as prescription soft-clusters, not categorizations that would imply the hierarchical division of prescriptions. A prescription pattern extracted by our model consists of a set of diseases and a set of medicines. The diseases in a cluster are likely treated by the medicines in the same cluster. These clusters are found by investigating co-occurrences in a structured prescription dataset, and our model reflects this structure and co-occurrence as a probability model for statistical inference.
While we developed the statistical model, analyzing a large collection of prescriptions raised various analytical issues, from simple data handling to pattern finding, that could not have been foreseen without the data. In particular, pattern finding is challenging because the analysis requires an assessment of the context of each disease diagnosis and its corresponding prescriptions. Furthermore, the contexts are noisy and biased: some disease diagnoses are incorrectly recorded, some prescriptions are biased toward the doctor’s preferred medicine, and some records may reflect misdiagnosis or misprescription. Summarizing collected prescriptions, with their high variance and diverse contexts, cannot be performed using a conventional statistical analysis tool, so we developed a probabilistic model to overcome these difficulties. Our model summarizes the Health Insurance Review & Assessment Service (HIRA) dataset, which is a national-level collection of prescriptions over a year, into low-dimensional representations similar to previous medical information analyses [12,13]. In the machine learning field, dimensionality reduction is the extraction of low-dimensional representations for various variables; this low-dimensional information can be further used for decision-making, information retrieval, visualization, etc. The dimensionality reduction in our work facilitates the compact representation of prescriptions, and the compact representation entails the extraction of hidden patterns. Our statistical model identifies clusters, and each prescription has a proportion distribution over the clusters. From this proportion, a prescription can be represented as a mixture of clusters, and the proportion becomes the low-dimensional representation of the prescription.
After introducing the model, we applied it to identify prescription patterns for the following two sets of historical prescription records: (1) long-term patients’ records across all diseases and medications, and (2) respiratory system disease (J00-47) records, including cold, with the most diverse prescriptions [14,15]. Our model identified prescription patterns, or the clusters of prescriptions, from more than four million prescriptions with statistical significance.

2. Previous research

2.1. Prescription analysis

Generally, prescriptions contain information on patients, treated diseases, and prescribed medicines. There are too many types of diseases and medicines available today to be analyzed individually. Hence, previous studies defined a number of groups among medicines and then diverged into multiple analytical approaches. For instance, some studies calculated the proportion of the chosen medicines by population types, such as age groups. Afterwards, the proportions were compared using traditional statistical analysis tools, such as the t-test, linear regression, and analysis of variance (ANOVA). Ferrajolo et al. [16] analyzed the yearly usage of new statin treatments to investigate the change in statin prescribing patterns in outpatient treatment in Southern Italy. Tamaya et al. [17] evaluated the use of new anti-hyperglycemic drugs and found that treatments for diabetes have different patterns based on the region and the patients’ socioeconomic positions. Similarly, Park et al. [18] analyzed the socioeconomic and clinical factors of patients for which doctors prescribed first and second-generation anti-psychotics. Nakaoka et al. [19] analyzed the use of anti-Parkinson drugs from 2005 to 2010 in Japan to evaluate the trends for the annual proportion of patients subjected to these treatments. Olson et al. [20] analyzed the temporal dispensing patterns of medications prescribed to children in the United States. Specifically, they categorized the drugs into eight therapeutic categories by considering the highest prevalence, the usage over years, and the sex and age. The aggregated medicine proportions of the groups of interest can be used to compare the usage of the medicines across separate groups, but this approach is only appropriate for an aggregated-level comparison, not for each prescription.
Another approach is calculating the proportions of the chosen medicines per prescription; this approach analyzes the relationship between the proportions of some chosen medicines and factors such as the number of medicines and the prescription dosages. Vallano [21] used factors, such as the WHO/INRUD indicators, and additional factors, including the proportion of antibiotic medicines, the proportion of analgesics, or the proportion of non-steroidal anti-inflammatory drugs. They analyzed the relationship between the previously mentioned factors for extracting the prescribing patterns. The approach provides some insights into prescription patterns regarding the statistical aggregation of the chosen medicine groups. These studies represent typical research approaches from the medicine or medical informatics fields, and the statistical analyses clarify the fundamental relationships between patients and prescriptions.

Some researchers have analyzed information from prescriptions without the statistical aggregations of medicines. Nepolitano et al. [22] analyzed prescriptions with knowledge from clinical experts to classify inappropriate prescriptions for elderly patients. Tamaya et al. [17] found that diabetes treatments have different patterns based on the region and the patients’ socioeconomic positions. They extracted the top ten medicines from each patient group and compared the prescribed medicines per group. These studies analyzed the prescriptions without statistical aggregations, but they used the knowledge of clinical experts, or they only focused on the most used medicines without considering the correlations between medicines.

Some researchers analyzed relational patterns between medicines from electronic medical records by applying clustering techniques. Calderón-Larrañaga et al. [23] demonstrated the existence of systematic associations between drug prescriptions. Specifically, they extracted polypharmacy patterns by using exploratory


factor analyses, methods that are also used to identify multimorbidity patterns [24]. They successfully identified the medicine patterns for each group, which was categorized based on patient sex and age, but they did not consider the correlation between the diseases and medicines. To compensate for this limitation, this paper identifies the prescription patterns, which are represented by the probabilities over diseases and medicines, by considering the co-occurrence of diseases and medicines.

2.2. Topic models in biomedical domains

In the biomedical domain, some previous studies demonstrated the effectiveness of dimensionality reduction techniques using topic models in the text-mining fields with patient records. Topic models are statistical models for automatically extracting clusters of words hidden in a document collection, or a corpus; this document collection could be a set of academic papers as well as prescriptions. The clustered words show the key concepts of a single cluster, which are often interpreted as a topic or an issue in the document set, and the topic model is able to extract a certain number of clusters to divide and categorize the documents probabilistically. Blei et al. [25] analyzed natural language documents retrieved from the Caenorhabditis Genetics Center (CGC) bibliography using the latent Dirichlet allocation (LDA) model. Zheng et al. [26] examined the major topics in MEDLINE titles and abstracts by applying the LDA model. These applications showed that topical information from biomedical academic papers was useful in summarizing a large number of papers. Several researchers have proposed extended LDA models to extract additional information, e.g., a decision support system with topics from a collection of PubMed articles. Mörchen et al.
[27] suggested the topic-concept model, which analyzes articles in the medical literature from PubMed to extract frequent biomarkers, such as genes, antigens, receptors, proteins, and cells (MeSH Terms), appearing as words in the articles. The identified biomarkers were clustered by the topic of the literature, e.g., cancer-related articles, and the relationship between the biomarkers and the clusters was analogized to that between the concept words and topics in the texts. Wang et al. [28] proposed the bio-LDA model, which integrates biological terms and journal information into a single topic model to deliver a better summary of the biomedical corpus. From a topic extraction, they developed a network of keywords, genes, and drugs sharing the same topic that shows which diseases, genes, or drugs link different clinical clusters. Although these studies have shown that topic models are useful for analyzing natural language documents in the biomedical domain, the results were limited to applications on academic articles and did not cover prescriptions.

The recent trend is to apply these probabilistic models to clinical data, such as electronic medical records (EMR) and prescription records. For example, Halpern et al. [29] examined various dimensionality reduction models, i.e., LDA [30], supervised LDA [31], singular value decomposition [32], and MedLDA [33], to analyze the topical representations of each patient’s triage notes. Topic extraction was useful in providing a latent representation of the patients; this latent information was used to predict whether a patient had an infection and whether the patient needed to be admitted to the intensive care unit (ICU). Chan et al. [34] visualized the themes summarizing electronic patient records that featured clinical notes in unstructured natural language, using the dimensionality reduction techniques of a topic model to find trends and correlations between genetic variables and latent patient states.
Arnold and Speier [35] proposed a topic model that integrated individual disease progressions to analyze the progress distributions by each topic. They extended the LDA models to capture temporal topic patterns in individual patients’ clinical information, such as discharge summaries, radiology reports, and pathology reports. By inheriting and


following these previous lines of work, which analyzed prescription patterns with traditional and recent statistical models, this paper proposes an extended topic model adapted to prescription information. In particular, the proposed model focuses on identifying the interaction patterns between disease categorization and prescription information. The previous pattern analyses limited themselves to analyzing prescriptions for either a specific disease or all diseases, and this might not reveal the key patterns for prescriptions differentiated by the detailed categorization of diseases.

Some of the previous works are similar to the study presented in this paper. Among the similar works, there are papers presenting analysis results on medical data without any specialized model development. Chen et al. [36] used LDA to extract topics from the texts in electronic health records. That paper compared the LDA topics from two different datasets and found that the LDA topics could be another standard for matching multiple clinical datasets that have different standards for using clinical vocabularies. That paper is close to our research because the study hypothesized that the extracted topics could be matched to ICD codes, and we perform a similar analysis with a further specialized probabilistic model. Yao et al. [37] used labeled LDA, which is an extended version of the original LDA. However, there was no customization to reflect the medical data structure in the model because labeled LDA is a general-purpose model. Hasan et al. [38] studied automatic annotation methods for clinical texts, and the authors adopted DiscLDA, which is an extended version of LDA. DiscLDA is a general-purpose model for treating text labels as the conditions for the latent topics of texts. DiscLDA also does not incorporate the structure of the medical texts, particularly the prescriptions included.

Another group of papers presents topic models on electronic health records. Speier et al.
[39] proposes a probabilistic model for clinical reports. The model is similar to ours because both have plates of patients and documents. However, their proposed model is applied to medical reports while ours is applied to prescriptions. Because of this difference, our network, grouping, and ICD matching analyses are unique features that were not presented in their paper. Huang et al. [40] provides a probabilistic model for risk stratification. The model is applied to electronic health records and is specialized in estimating the risk level of patients, so it does not provide a pattern for diseases or their corresponding medicines. Huang et al. [41] presents a probabilistic model of clinical pathways, or CPM, extracted from clinical reports. The model uses event and activity data to show the path for treating patients. Their model is similar to our work as both are topic model variants, but the modeled data and objectives are different. CPM provides a more specific event-activity pair model, though it does not explicitly model patient aspects. Moreover, the diagnosis part is missing in CPM. Huang et al. [42] presents an extended version of CPM, called DTM, that adds a diagnosis model, which makes it more similar to our presented model, DMPM. However, there are multiple differences. First, DMPM has patient modeling with the patient plate, while DTM has no element for modeling the patient. Second, DMPM has a latent variable placed at each patient layer, while DTM has no latent variable model for patient characteristics. Third, DTM has more extensive modeling on the event-activity pair, while DMPM simplifies the event-activity to the medication. This makes DTM more microscopic and useful for short-period data analyses, whereas DMPM is characterized as a model for national and long-term analysis.
Because of these differences, DMPM provides pattern analysis that matches the ICD standard and is able to provide patient profiling with demographics. Finally, a group of papers introduces topic models on the prescription data. Pivovarov et al. [43] extends the original LDA for phenotyping analyses with EHR texts. Similar to our paper, the


paper introduced the UPhenome model and its inference procedure. However, the UPhenome model is different from DMPM because DMPM, which reflects the prescription data structure, has the patient plate nesting the patient’s prescriptions, which in turn nest each prescription’s medicines. UPhenome has a completely different plate structure that models the phenotyping and the medications. These structural differences originate from the ultimate goals of the models: diagnosis-medicine analysis with patient latents for DMPM, and phenotype-medication pattern finding for UPhenome. Lu et al. [44] proposed the MCLDA model, which also analyzed prescription data similar to the data used in this paper. Our DMPM model could be considered a more complex version of MCLDA that incorporates patient information. DMPM has a patient plate because our dataset was able to track individual patients over a year, while MCLDA does not have the patient tracking capability.

2.3. Contributions of this study

We expect the contributions of this paper to lie in two directions. The first layer of contribution is providing further insights for medical practitioners. The practitioners will be better informed when they use the prescription patterns to compare their practices to existing patterns. Previously, there was no golden rule for prescribing drugs for ordinary diseases, such as influenza, because of huge variations without a method to cluster and link the variations. Additionally, this pattern finding uncovers the latent relationship among diseases and medicines. Our patterns indicate that some diseases are quite similar and have multiple corresponding prescription patterns, a finding enabled by the specific probabilistic modeling structure presented in this paper. Moreover, other patterns suggest that one prescription pattern could be used for multiple diseases.
We also correlated the identified patterns with the prescription meta-information, such as patient ages and prescribing doctors’ specialties, to discover why multiple patterns were employed to cure a single disease cluster. These contributions for the practitioners are supported by the analysis on nationwide prescription data, which was not easy to collect due to possible privacy concerns. Finally, we analyzed how to match the extracted patterns to existing standards, i.e., ICD. Soft-clustering methods have been considered fuzzy, and there was no clear method for interpreting the patterns within the standard. We show that there are distinct pattern proportion changes across the ICD branches. Thus, this could be the first step in augmenting the collection of data regarding existing medical knowledge.

The second layer of contribution is the model development on the national-scale prescription dataset. A number of models [42] incorporate the details of clinical pathways over a short period of time. However, our model is different from the previous models because it has more plates and a latent variable dedicated to the patient model. Additionally, our model has a different structure compared to the electronic health record models because of distinct modeling on the diagnosis and medications. Finally, the post-analysis after the model inference is clearly different from the performance evaluation of the previous works. We performed network visualization and clustering to show the meta-grouping of patterns. We show the relevance of the probabilistic modeling results to the ICD and the patient demographics. From this aspect, the model development contribution and the insights found for medical practitioners merge.

3. Materials and methods

This section presents the utilized prescription dataset, the analysis process, and our probabilistic model used in the process. The

dataset consists of national prescription data collected with the support of the national health insurance service of South Korea. The dataset was processed and cleaned to evaluate a selected disease categorization branch. Then, we applied our model to the dataset to extract the patterns. Our model is a topic model variant that assumes the generative process of prescriptions from latent factors. The details of the dataset, the process, and the model are described in turn.

3.1. NPS dataset and its filtering

HIRA [45] manages the Korean health care data warehouse to improve service quality and to prevent illegal claims from hospitals and patients. One dataset provided by HIRA is the National Patients Sample (NPS) dataset (footnote 1), which includes 13% of the total outpatient prescription data from 2011. This dataset is anonymized to prevent identifying individuals, and the dataset is processed to track an anonymized entity over a single year. The dataset is a collection of prescriptions from approximately 1.1 million patients. A single prescription contains its ICD category, prescribed medicines, and side information on the hospital and the patient, such as the patient’s age, sex, and residing district. We analyzed the following two subsets of prescriptions: (1) all disease records for analyzing overall prescription patterns and (2) respiratory system disease (J00-47) records to understand the prescribing behaviors of a specific clinical domain. The following are the two reasons for choosing the specific diseases: (1) we tested the capability of our statistical models by applying the model to both the most general and more specific cases and (2) we intended to drill down into the more specific cases, which included the respiratory cases, without clearly separating the cases, such as cancer cases that do not belong to the specific case that we chose.
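As a rough illustration of how the respiratory subset could be selected, the following Python sketch keeps only records whose ICD category falls in the J00-47 range. The record layout, the field names `patient_id`, `icd`, and `medicines`, the example codes, and the helper `is_respiratory` are all hypothetical; the actual NPS schema is not described in this section.

```python
import re

# Hypothetical record layout; field names and codes are illustrative only.
records = [
    {"patient_id": "p1", "icd": "J20.9", "medicines": ["m101", "m202"]},
    {"patient_id": "p1", "icd": "I10", "medicines": ["m303"]},
    {"patient_id": "p2", "icd": "J459", "medicines": ["m404"]},
    {"patient_id": "p3", "icd": "J98.8", "medicines": ["m505"]},
]

def is_respiratory(icd_code, low=0, high=47):
    """Return True when the code is in chapter J with category number in [low, high].

    Accepts codes written with or without a dot (e.g. "J20.9" or "J209").
    """
    match = re.fullmatch(r"J(\d{2})(?:\.?\w*)", icd_code.strip().upper())
    return bool(match) and low <= int(match.group(1)) <= high

# Keep only prescriptions labeled with a J00-47 category.
respiratory = [r for r in records if is_respiratory(r["icd"])]
```

The range bounds are parameters so that the same check can be reused for another ICD branch within chapter J; a different chapter letter would need a small change to the pattern.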
We considered the prescriptions of patients who had more than 80 visit records for the most general dataset and more than 10 for the respiratory disease dataset. After these filters, the first dataset contained 189,086 prescriptions for 1514 patients, including 2394 ICD disease types as well as 1904 medications indexed by the main ingredient codes. The second dataset consists of 2,282,088 prescriptions for 116,718 patients, including 196 ICD disease types as well as 1978 medications.

3.2. A topic model on disease and medicine

We propose the disease-medicine pattern model (DMPM) for extracting patterns from the prescriptions by jointly considering the labeled disease and the prescribed medicines. We extended the LDA [30] to develop DMPM by adding a generative process for the prescribed medicines and the disease for each prescription. The key idea of DMPM is to assign a latent diagnosis variable z to a prescription; we assumed that the latent diagnosis influences the determination of the disease label and the prescribed medicines. Additionally, we assumed that the latent diagnosis of a prescription is generated from the prior configuration of the diagnosis per patient. These assumptions are a customized version of the assumptions commonly used in Bayesian statistics, particularly in the generative modeling community, using probabilistic graphical modeling techniques. The assumptions suggest that the prescriptions would be affected by the compositions of the diseases treated at hospitals. A practitioner would generally interpret the composition of the diseases as the probability. The traditional view of prescription and disease categorizations implies that the human anatomy is the center of a disease and that

1 http://opendata.hira.or.kr/op/opc/selectPatDataAplInfoView.do.


anatomy plays a key role in identifying disease types and their consequential prescriptions. Our view suggests that the prescription patterns are identifiable when we assume that the patterns are generated by the compositions of diseases and prescriptions per patient. For example, suppose there are two prescriptions with the same labeled disease, unspecified acute bronchitis, and two different prescribed medicines, acetaminophen and guaifenesin. In this example, the two prescriptions might be generated from different latent diagnoses because the practitioners might have felt that the two treated diseases were two different strains of unspecified acute bronchitis. This might also suggest that the disease can be treated in two different ways. In either case, we jointly model the disease and the medicine to identify patterns reflecting both the medicine and disease information.

Because DMPM is an extended model of LDA, other DMPM assumptions are similar to those for LDA. The following are the generating assumptions: (1) there are K patterns in the medical records, which can be represented as $\beta$ and $\gamma$; (2) each patient p has a pattern proportion $\theta$; (3) a prescription record r has a latent diagnosis z drawn from the pattern proportion $\theta$; (4) a disease d is probabilistically selected by the diagnosis z; (5) a medicine m is also stochastically selected from the diagnosis z; and (6) each pattern is defined as the disease proportions $\beta$ over the diseases and the medicine proportions $\gamma$ over the medicines. The DMPM generative process is outlined below:

For every patient p in P
  A. Generate $\theta_p \sim \mathrm{Dirichlet}(\alpha)$
  B. For every prescription record r
    i. Draw a pattern index $z_{p,r} \sim \mathrm{Multinomial}(\theta_p)$
    ii. Draw a disease $d_{p,r} \sim \mathrm{Multinomial}(\beta_{z_{p,r}})$
    iii. For every medicine m
      a. Draw a medicine $m_{p,r,n} \sim \mathrm{Multinomial}(\gamma_{z_{p,r}})$

In the generative process, "Multinomial" represents a multinomial distribution, and "Dirichlet" indicates a Dirichlet distribution. The rationale behind the multinomial distribution is that the hidden variable z, which follows the multinomial distribution, should represent a selection of joint pairs of diseases and medications. The Dirichlet distribution is applied to place the probabilistic model in the Bayesian framework, and the Dirichlet distribution is the conjugate prior of the multinomial distribution. The detailed probability density functions are as follows:

$$f_{\mathrm{Dirichlet}}(x_1, \ldots, x_K; \alpha_1, \ldots, \alpha_K) = \frac{\Gamma\left(\sum_i \alpha_i\right)}{\prod_i \Gamma(\alpha_i)} \prod_i x_i^{\alpha_i - 1}$$

and

$$f_{\mathrm{Multinomial}}(x_1, \ldots, x_K; \theta_1, \ldots, \theta_K) = \frac{\left(\sum_i x_i\right)!}{x_1! \cdots x_K!} \prod_i \theta_i^{x_i}.$$

Now, we can explain the parameter inference process of the proposed probabilistic model, DMPM. We used a statistical technique called variational inference to learn the parameters. Due to the difficulty of finding the exact posterior distribution in DMPM, we employed the variational method [46,47] to approximate the posterior of the latent variables, z and $\theta$. The idea behind the variational method is to optimize the free parameters of a distribution over the latent variables using the Kullback-Leibler (KL) divergence [48]. The mean-field approximation [49] of the variational posterior is as follows (see Table 1):

$$q(\theta, z) = \prod_{p=1}^{P} q(\theta_p \mid \lambda_p) \prod_{r=1}^{R_p} q(z_{pr} \mid \phi_{pr}) \tag{1}$$

Table 1 Notation table.

Symbol      Description
P           Number of patients
R           Number of prescriptions
N           Number of medicines
K           Number of topics (patterns)
$\alpha$    Model parameter indicating the beliefs of pattern proportions for patients
$\beta$     Model parameter indicating the disease distribution of each pattern
$\gamma$    Model parameter indicating the medicine distribution of each pattern
$\theta$    Latent random variable indicating a pattern proportion of each patient
Z           Pattern set of latent random variable z
z           Latent random variable indicating the pattern index of each prescription
D           Disease set of a random variable d
d           Random variable indicating a specific disease for each prescription
M           Medicine set for the random variable m
m           Random variable indicating a specific medicine in each prescription

The additional parameters, $\lambda$ and $\phi$, are the variational parameters of the variational posterior. The mean-field approximation assumes that each latent variable is independent of the other variables. This variational posterior is the same as in the LDA model due to the similar modeling structure of the latent variables in both models. Using this variational posterior, the log-likelihood is bounded by Jensen’s inequality as follows:

$$\begin{aligned} \log p(d, m) &\ge \int \sum_{z} q(\theta, z) \log \frac{p(\theta, z, d, m)}{q(\theta, z)} \, d\theta \\ &= E_q[\log p(\theta)] + E_q[\log p(m \mid z)] + E_q[\log p(z \mid \theta)] + E_q[\log p(d \mid z)] + H(q) \end{aligned} \tag{2}$$

To determine the model parameters, $\alpha$, $\beta$, and $\gamma$, we adapted the expectation-maximization (EM) algorithm [50]; the E-step finds the current variational posterior distribution, while the M-step identifies the model parameters that maximize the lower bound given by Jensen’s inequality. In the E-step, we found the variational parameters that optimize the lower bound when the model parameters are known:

$$\lambda_{pk} \propto \alpha_k + \sum_{r=1}^{R_p} \phi_{prk} \tag{3}$$

$$\phi_{prk} \propto \beta_{kd} \left( \prod_{n=1}^{N_{pr}} \gamma_{km} \right) \exp\left( \psi(\lambda_{pk}) - \psi\left( \sum_{i=1}^{K} \lambda_{pi} \right) \right) \tag{4}$$

Here, d and m denote the disease and the n-th medicine observed in record r of patient p, and $\psi$ is the digamma function.
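The E-step updates in Eqs. (3) and (4) can be sketched for a single patient as follows. This is a minimal illustration rather than the authors' implementation: the record layout (a disease index paired with a list of medicine indices), the function names, the fixed iteration count, and the hand-written `digamma` approximation are all assumptions made for the example. The term $\psi(\sum_i \lambda_{pi})$ is dropped because it is identical for every k and cancels when $\phi$ is normalized.

```python
import math


def digamma(x):
    """Approximate the digamma function psi(x) for x > 0."""
    result = 0.0
    # Shift x upward with psi(x) = psi(x + 1) - 1/x until the series is accurate.
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # Asymptotic expansion: ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6).
    return result + math.log(x) - 0.5 * inv - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))


def e_step_patient(records, alpha, beta, gamma, n_iter=50):
    """Variational E-step for one patient.

    records: list of (disease_index, [medicine_indices]) pairs (hypothetical layout).
    alpha:   list of K Dirichlet parameters.
    beta:    K x D nested list with beta[k][d] > 0.
    gamma:   K x M nested list with gamma[k][m] > 0.
    Returns (lam, phi): lam has K entries, phi has one K-vector per record.
    """
    K = len(alpha)
    R = len(records)
    phi = [[1.0 / K] * K for _ in range(R)]
    lam = [alpha[k] + R / K for k in range(K)]
    for _ in range(n_iter):
        for r, (d, meds) in enumerate(records):
            # Eq. (4) in log space; the psi(sum_i lam_i) term is constant in k
            # and is omitted because it cancels in the normalization below.
            log_w = []
            for k in range(K):
                value = math.log(beta[k][d]) + digamma(lam[k])
                value += sum(math.log(gamma[k][m]) for m in meds)
                log_w.append(value)
            top = max(log_w)
            weights = [math.exp(v - top) for v in log_w]
            total = sum(weights)
            phi[r] = [w / total for w in weights]
        # Eq. (3): lam_pk = alpha_k + sum_r phi_prk.
        lam = [alpha[k] + sum(phi[r][k] for r in range(R)) for k in range(K)]
    return lam, phi
```

Working in log space avoids numerical underflow when a record lists many medicines, since the product over n in Eq. (4) becomes a sum of logarithms.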

Then, we found the model parameters that maximize the lower bound when the variational parameters are known in the M-step:

$$\beta_{k,d} \propto \sum_{p=1}^{P} \sum_{r=1}^{R_p} \phi_{pr,k} \tag{5}$$

$$\gamma_{k,m} \propto \sum_{p=1}^{P} \sum_{r=1}^{R_p} \sum_{n=1}^{N_{pr}} \phi_{pr,k} \tag{6}$$
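The M-step updates in Eqs. (5) and (6) can be sketched as below. The sketch assumes, as in standard LDA, that the sums run only over the records whose observed disease is d (for $\beta$) and over the medicine slots whose observed medicine is m (for $\gamma$); the printed equations leave this indicator implicit. The data layout, the function name `m_step`, and the small `smoothing` constant are hypothetical choices for the example.

```python
def m_step(patients, phis, K, D, M, smoothing=1e-12):
    """Re-estimate beta (K x D) and gamma (K x M) from the variational phi.

    patients: list of patients, each a list of (disease_index, [medicine_indices]).
    phis:     phis[p][r][k] is the weight of pattern k for record r of patient p.
    smoothing is an assumed guard against all-zero rows, not part of the source.
    """
    beta_counts = [[smoothing] * D for _ in range(K)]
    gamma_counts = [[smoothing] * M for _ in range(K)]
    for records, phi in zip(patients, phis):
        for (d, meds), phi_r in zip(records, phi):
            for k in range(K):
                beta_counts[k][d] += phi_r[k]       # accumulates Eq. (5)
                for m in meds:
                    gamma_counts[k][m] += phi_r[k]  # accumulates Eq. (6)

    def normalize(row):
        total = sum(row)
        return [value / total for value in row]

    beta = [normalize(row) for row in beta_counts]
    gamma = [normalize(row) for row in gamma_counts]
    return beta, gamma
```

Each row is normalized after accumulation, which turns the proportionality in Eqs. (5) and (6) into a proper distribution over diseases or medicines.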

By iterating the EM steps, we were able to infer the model parameters from the clinical records. Each EM iteration in the DMPM algorithm consists of two steps, the E-step and the M-step. In the E-step, the computational complexity is $O(K \cdot P \cdot R \cdot M)$, where K is the number of patterns, P is the number of patients, R is the maximum number of records per patient, and M is the maximum number of prescribed medicines in a record. Alternatively, this can be written as $O(K \cdot \sum_{p,r} N_{pr})$, where $N_{pr}$ is the number of prescribed medicines in record r from patient p. The M-step has a complexity of $O(K \cdot P \cdot R)$, which can alternatively be represented as $O(K \cdot \sum_{p} N_p)$, where $N_p$ is the number of records for patient p.
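As a closing illustration of the generative process outlined earlier in this section, the following sketch samples synthetic prescription records for one patient. It is only a simulation aid under stated assumptions: the number of records and the number of medicines per record are fixed arguments here because the source does not model them, and all function names are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]


def sample_categorical(probs, rng):
    """Draw one index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_patient(alpha, beta, gamma, n_records, n_meds_per_record, rng):
    """Sample (theta, records) for one patient.

    Each record is a tuple (z, d, meds): the latent pattern index, the disease
    index, and a list of medicine indices, all drawn as in the generative process.
    """
    theta = sample_dirichlet(alpha, rng)                  # step A
    records = []
    for _ in range(n_records):                            # step B
        z = sample_categorical(theta, rng)                # step i
        d = sample_categorical(beta[z], rng)              # step ii
        meds = [sample_categorical(gamma[z], rng)         # step iii
                for _ in range(n_meds_per_record)]
        records.append((z, d, meds))
    return theta, records
```

Synthetic data generated this way can be used to check that an inference routine recovers known $\beta$ and $\gamma$ before it is applied to real records.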


4. Analysis results for respiratory disease prescriptions

This section focuses on analyzing the prescriptions and diagnoses for respiratory diseases. First, we compared the clustering performance of DMPM with that of LDA, which is the base model of DMPM. Second, we calibrated the analysis dataset through cross-validation with the held-out likelihoods. This calibration determined the number of potential prescription patterns in the dataset. Third, we enumerated the identified patterns and their characteristics. Additionally, we showed that the correlation between the patterns revealed (1) the groups of patterns sharing similar disease diagnoses and (2) the groups of patterns sharing similar medicines in their prescriptions. Finally, we compared and contrasted the prescriptions grouped by the patterns and by the ICD disease category.

4.1. Model evaluation

As mentioned, DMPM is an extension of LDA that jointly models the diseases and the medicines in the prescriptions. To compare the clustering performances of LDA and DMPM, we measured the perplexities [30], which indicate the clustering quality given the cluster parameters imposed on the dataset (see Fig. 1). To calculate the perplexities, we defined the predictive probability for the unseen dataset. However, both models, LDA and DMPM, contain latent variables at the patient level, such as the pattern proportion of the patient. To compensate for this, we provide an observed dataset from the test dataset to learn the latent variables for an unseen patient dataset [51]. The observed dataset was not used in the model parameter training steps; it was only used in the inference steps for the latent variables. Semantically, the predictive probability measures how well the model trained with patients in the training dataset predicts the held-out prescriptions given the observed prescriptions of a new patient. In other words, a lower perplexity indicates a higher predictive likelihood, which means that the model describes the unseen dataset well. Fig. 2 describes this data separation design. The following formula specifies the perplexity:

$$\mathrm{perplexity} = \exp\left( -\frac{1}{D} \log P(R_{heldout} \mid R_{obs}, \Theta) \right) \tag{7}$$

In the formulas, $\Theta$ indicates the trained model parameters, such as the disease pattern distributions. $R_{heldout}$ represents the unseen data in the test dataset, and $R_{obs}$ indicates the observed dataset from the test dataset. D is the number of unique diseases or medicines. The log-likelihood, $\log P(R_{heldout} \mid R_{obs}, \Theta)$, can be calculated as follows:

$$\log P(R_{heldout} \mid R_{obs}, \Theta) \approx \sum_{k} E[\theta_k \mid R_{obs}, \Theta] \, \log P(R_{heldout} \mid \Theta_k). \tag{8}$$

In the formula, $E[\theta_k \mid R_{obs}, \Theta]$ is the expectation of the pattern proportion for a new patient, obtained from the observed prescriptions by iterating only the E-step of the learning process. $\log P(R_{heldout} \mid \Theta_k)$ is the log-likelihood of the held-out prescriptions for a new patient; it can be regarded as $\sum_{d \in R_{heldout}} \log \beta_{k,d}$ in the disease perplexity case and $\sum_{r \in R_{heldout}} \sum_{m \in r} \log \gamma_{k,m}$ in the medicine perplexity case. We split the entire dataset into five batches and applied fivefold cross-validation, which means that one cross-validation case consists of four batches used to find the patterns and one batch for the evaluation. Specifically, the four training batches became the dataset used to infer the model parameters, namely $\Theta$. Then, the remaining batch was split in half. The first half became the observed dataset, $R_{obs}$, to infer $\theta$, and the other half became the held-out dataset, $R_{heldout}$, for calculating the likelihood given the previously inferred parameters. This validation was repeated five times. Fig. 3 shows the experimental results from applying the cross-validation while changing the parameter K, which is used in both DMPM and LDA. The figure demonstrates that DMPM performs better than the baseline LDA model on both perplexities.

Fig. 1. Probabilistic graphical model for disease and medicine (disease-medicine pattern model or DMPM). Nodes in the diagram represent the random variables, the directed links show the conditional dependencies, and the boxes specify the number of repetitions for the structures with nodes and links in the boxed areas.

Fig. 2. Dataset separation design for calculating predictive probabilities of the unseen held-out dataset.

4.2. Analysis on number of prescription patterns

We used the predictive log-likelihood [52] to determine the number of prescription patterns from the data. We split the dataset in the same way as for the perplexity analysis. The predictive log-likelihood represents the probability $P(R_{new})$, which is the likelihood of generating a new prescription with the parameters obtained from the training dataset, as in formula (8). Because DMPM is modeled with a set of diseases and medicines, the predictive log-likelihood contains the probabilities of both diseases and medicines as follows:

$$\log P(R_{heldout} \mid \Theta_k) \approx \sum_{r \in R_{heldout}} \sum_{m \in r} \log \gamma_{k,m} + \sum_{d \in R_{heldout}} \log \beta_{k,d}. \tag{9}$$
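Equations (7) to (9) can be combined in a short sketch. It assumes that the expectation $E[\theta_k \mid R_{obs}, \Theta]$ is taken as the mean of the Dirichlet variational posterior, $\lambda_k / \sum_i \lambda_i$, after running the E-step on the observed half; the record layout and function names are hypothetical, and the normalizer D of Eq. (7) is passed in by the caller.

```python
import math


def expected_theta(lam):
    """Mean of a Dirichlet(lam) posterior: E[theta_k] = lam_k / sum(lam)."""
    total = sum(lam)
    return [value / total for value in lam]


def pattern_loglik(heldout, beta, gamma, k):
    """Eq. (9): log-likelihood of held-out records under pattern k.

    heldout: list of (disease_index, [medicine_indices]) pairs.
    """
    value = 0.0
    for d, meds in heldout:
        value += math.log(beta[k][d])
        value += sum(math.log(gamma[k][m]) for m in meds)
    return value


def predictive_loglik(theta_mean, heldout, beta, gamma):
    """Eq. (8): expectation-weighted sum of the per-pattern log-likelihoods."""
    return sum(theta_mean[k] * pattern_loglik(heldout, beta, gamma, k)
               for k in range(len(theta_mean)))


def perplexity(loglik, D):
    """Eq. (7): exp(-loglik / D), with the normalizer D supplied by the caller."""
    return math.exp(-loglik / D)
```

Selecting the number of patterns then amounts to repeating this computation for each candidate K across the cross-validation folds and keeping the K with the highest average held-out log-likelihood.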

We applied the fivefold cross-validation to the dataset while changing the number of patterns, K. Fig. 4 indicates that the highest held-out log-likelihood was achieved with K = 15. This suggests that the model trained on the analyzed data would be best at predicting a missing prescription when we assume that there are 15 prescription patterns. The setting K = 15 is used for the rest of the paper.

Fig. 3. Perplexities for DMPM and LDA. (Top) Disease perplexities over the number of pattern settings. (Bottom) Medicine perplexities over the number of pattern settings. A lower perplexity indicates better prediction performance on the unseen dataset.

Fig. 4. The predictive log-likelihood based on the number of patterns. A higher value indicates better predictive performance by the model.

4.3. Identified patterns and pattern correlations

The identified patterns are stochastically inferred from the prescriptions by DMPM. Each pattern shares components with other patterns because of the soft-clustering property, and this section identifies such correlations between the patterns. First, we apply cosine similarity to pairs of patterns to confirm whether two patterns have significantly similar medicines or diseases [53]. Fig. 5 is a block diagram indicating the correlation strength by the brightness. Each cell represents the correlation between the patterns indexed by X and Y, which are the row and column. For example, Pattern 14 shares its disease probability distribution with Pattern 15 because the left panel of Fig. 5 shows a strong correlation at (14,15). The figure illustrates that the disease pattern similarities are different from the medicine pattern similarities. Pattern 1 has a disease distribution similar to those of Patterns 7, 8, 9, 10, 11, 12, and 13, but Pattern 1 shows different medicine distribution similarities. The green boxes indicate groups, which are collections of selected patterns with high similarities. Next, we turned this correlation analysis into the network representation in Fig. 6. The network links two patterns if their correlation exceeds a threshold, and we varied this threshold level to check its sensitivity. When we set the threshold to 0.75, the network split into two separate components: a large component and a component of Patterns 14 and 15. Additionally, we grouped the patterns that possess similar connections with other components, such as the green boxes in Fig. 5, and we extracted the top diseases and medicines for the groups at the bottom of Fig. 6.

Fig. 5. Pearson correlation table between the (left) disease and (right) medicine patterns.

Fig. 6. Network of related diseases and medicines. The pairing and correlation over a threshold are visualized across the diseases and the medicines: DP and MP mean the disease pattern and medicine pattern, respectively.

The prescription patterns on asthma in Group 4 are isolated from other patterns. This indicates that medical practitioners largely chose to treat asthma only with the medicines in Group 4. In contrast to these isolated patterns, the treatments of Groups 1 and 2 had a weak correlation regarding the prescribed medicines, but their diagnosed disease distributions were strongly correlated. This suggests that there are alternative medicines to treat the same disease. In short, patients with the same disease, such as acute bronchitis, were prescribed different medications, such as phenylephrine hydrochloride (a decongestant) or amoxicillin sodium (an antibiotic). In principle, the treatment of a viral infection with antibiotics is inappropriate, but we captured such a pattern with statistical significance. Further study on this pattern is needed.
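The similarity and thresholding analysis described in this section can be sketched as follows. The sketch follows the text in using cosine similarity between pattern distributions (rows of $\beta$ or $\gamma$) and links two patterns when their similarity exceeds a threshold such as 0.75. It is a simplified illustration: it builds one network from a single set of pattern vectors, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def similarity_matrix(patterns):
    """Pairwise cosine similarities between pattern distributions."""
    n = len(patterns)
    return [[cosine_similarity(patterns[i], patterns[j]) for j in range(n)]
            for i in range(n)]


def connected_components(sim, threshold):
    """Group pattern indices that are linked by similarities above threshold."""
    n = len(sim)
    seen = set()
    components = []
    for start in range(n):
        if start in seen:
            continue
        stack, group = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            group.append(node)
            for other in range(n):
                if other not in seen and sim[node][other] > threshold:
                    seen.add(other)
                    stack.append(other)
        components.append(sorted(group))
    return components
```

Rerunning `connected_components` over a range of thresholds gives a simple sensitivity check of the kind mentioned in the text.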


The disease distributions of Groups 2 and 3 are different, yet their prescribed medicines are significantly similar. In other words, clinical experts treated different diseases, such as acute bronchitis and gangrenous tonsillitis, with the same medicines. In these cases, the treated diseases are different, but they showed similar physiological characteristics, and similar medicines are used to treat the physiological symptoms. From this correlation analysis, we found which diseases are more likely to share their prescriptions and what the shared medicines are. We also found which diseases require specific prescriptions without sharing a medicine. Certainly, every patient is different, and that is why practitioners examine, diagnose, and prescribe for each individual patient. However, once a large-scale dataset is collected and analyzed, we can identify the bottom-up structure of similarity relationships between diagnosed diseases and prescribed medicines.

4.4. Pattern proportions using general patient characteristics

DMPM provides the prescription categories via bottom-up clustering from the nationwide prescription collections. All extracted categories possess clinical characteristics, which are reflected in the inferred disease and medicine distributions per cluster. There are differences between this bottom-up clustering and the top-down categorization, such as the ICD, and we reveal these differences by enumerating the descriptive statistics of the patient information per cluster and per category. Figs. 7 and 8 display the pattern proportions of the prescriptions by the patients’ ages. When we observed the ICD proportions over ages in Fig. 8, we found that some of the ICD proportions changed based on the patients’ ages. For example, the proportion of J40-47 (chronic lower respiratory diseases) tends to increase as the patients become older. This is essentially the disease proportion without the treatment information. In contrast, Fig. 7 presents the proportion trends over ages, and the proportions are calculated for clusters by modeling both diseases and medicines. For example, Patterns 4 and 5 have high proportions when the patients are infants and children. These patterns cannot be observed from the disease-only proportions in Fig. 8. To analyze the capability of capturing the patient background with prescription patterns, we calculated the similarities and the divergences on the ICD categories and the latent distributions for all patients. The applied measurements include the cosine similarity, Kullback-Leibler (KL) divergence, and Jensen-Shannon (JS) divergence. Additionally, we compared the DMPM results to the ICD categorization and to the disease and medicine distributions obtained with the original version of LDA. Table 2 shows the results for all the combination cases. The DMPM patterns showed the most distinctions

Fig. 7. The probability of DMPM pattern appearance by age.

Fig. 8. The probability of ICD code appearance by age.

Table 2 Similarities and divergences of the pattern proportions by patients’ characteristics from the general pattern proportions. Bold indicates the biggest difference.

Measure               Characteristic   ICD      LDA-disease   LDA-medicine   DMPM
Cosine similarity     Age              0.9537   0.9900        0.9264         0.8671
Cosine similarity     Sex              0.9992   0.9998        0.9967         0.9888
Cosine similarity     Location         0.9940   0.9974        0.9887         0.9822
KL(p||q) divergence   Age              0.1029   0.0101        0.0822         0.2724
KL(p||q) divergence   Sex              0.0034   0.0002        0.0033         0.0101
KL(p||q) divergence   Location         0.0195   0.0027        0.0113         0.0213
KL(q||p) divergence   Age              0.1144   0.0098        0.0835         0.3485
KL(q||p) divergence   Sex              0.0036   0.0002        0.0033         0.0102
KL(q||p) divergence   Location         0.0196   0.0027        0.0111         0.0219
JS divergence         Age              0.1086   0.0100        0.0829         0.3105
JS divergence         Sex              0.0035   0.0002        0.0033         0.0102
JS divergence         Location         0.0195   0.0027        0.0112         0.0216
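The divergence measures named for Table 2 can be sketched as below for two discrete proportion vectors, for example the general pattern proportions and the proportions within one age group. Which vector plays the role of p and which of q is an assumption here, and the source does not state the exact formula used for the JS divergence, so the sketch uses the standard mixture-based definition. Function names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions.

    Terms with p_i = 0 contribute 0; q_i must be positive wherever p_i > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js_divergence(p, q):
    """Standard Jensen-Shannon divergence via the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

The KL divergence is asymmetric, which is why Table 2 reports both KL(p||q) and KL(q||p), whereas the JS divergence gives the same value in either direction.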


reflecting the patients’ characteristics in all cases. These results quantitatively indicate that the DMPM patterns re-categorize the prescriptions in a way that is more relevant to the patients’ backgrounds.

4.5. Pattern matching to disease categories

Finally, we matched the patterns identified by DMPM to the disease categorization of ICD. ICD is the top-down authoritative categorization of diseases, and it inherently depends on disease categorizations based on the human anatomy. Fig. 9 shows the ICD categorization and the proportion of DMPM patterns per ICD category. Each ICD category branch has its own dominant prescription patterns from DMPM. For example, J13, pneumonia due to Streptococcus pneumoniae, demonstrates relatively high proportions of patterns in Group 4 from Fig. 6 when compared with the other codes in J09-18. This suggests that the ICD categories are also supported by the clustering of prescribed medicines from the bottom-up perspective. At the same time, there are still patterns that may not be dominant but cannot be ignored. For example, J30-J39, other diseases of the upper respiratory tract, displays a minor pattern group (Group 3). In the branch of J30-J39, the proportions of Group 3 are relatively higher than in the other branches.

This probabilistic assignment of prescription patterns to the ICD categorization is performed for multiple reasons. First, the DMPM model itself is a soft-clustering method for managing variations from prescriptions as well as practitioners. Second, our medical practitioners commented that there are some cases in which the disease mentioned on the prescription is intentionally misdiagnosed due to insurance problems related to the medicine cost. Therefore, DMPM could be a supporting tool to clarify the patterns that could also be linked to ICD. Third, the minor prescription patterns might actually represent cases that are rarely but necessarily treated by such medicines. Such patterns may result in more direct recommendations of prescriptions given the diagnosis.

This bottom-up clustering of prescription patterns is interesting from multiple perspectives. Primarily, no single pattern of treatments is dedicated to a single disease. A disease could present differently in patients, so there are various treatment methods for curing variations associated with one disease. If we investigate the diversity of diseases and their treatments given an authoritative disease categorization, such as ICD, DMPM shows what would be a minor yet noticeable prescription pattern for treating a certain category of diseases. If we wish to find prescriptions that exploit insurance policies, DMPM results could be a hint because such exploitation requires the intentional arrangement of medicines and diagnosed diseases; thus, we can seek these uncommon disease-medicine correlation patterns.

5. Results using all disease records

This section provides the analysis results from applying DMPM to the prescriptions for all diseases in the NPS dataset after applying its filters. We only provided the extracted patterns without the additional evaluation experiments conducted for the respiratory system disease records. However, it should be noted that DMPM can be applied for the same generalized analysis. Fig. 10 shows the network representation of the extracted patterns from all disease datasets. The network links two patterns if their correlation exceeds a threshold, and we varied this threshold level to check its sensitivity. When we set the threshold to 0.75, there were few similarities between medications, particularly in Groups 1 and 2. The network contains isolated and connected patterns for shared medications. When we considered the isolated

Fig. 9. Matching ICD categorization to DMPM patterns. An ICD categorization branch uses multiple prescription patterns with differentiated probabilities. Some dominant prescription patterns from DMPM emerged based on an ICD categorization branch.


Fig. 10. Network of related diseases and medicines. The pairing and correlation of the threshold are visualized across the diseases and the medicines. DP and MP mean the disease pattern and the medicine pattern, respectively.

patterns, some patterns had unique ICD alphabet codes, such as Code I in Pattern 1, whereas others had mixed ICD alphabet codes, such as Codes J and H in Pattern 20. The unique-code patterns indicated unique medicine usage patterns for the specific disease categories, whereas the mixed-code patterns represented similar medications in different disease categories. These results enable a high-level analysis of medication similarities between upper disease categories.

6. Conclusion

Common human diseases, such as the cold, have diverse treatments determined by practitioners’ diagnoses given the patients’ symptoms, and these diagnoses and symptoms truly differ case by case. The field of medicine has developed categorizations of diseases in a top-down manner, for example, by looking at the major anatomy affected by diseases, but the diagnoses based on this categorization may not be able to reveal prescription patterns that may be different even for the same disease. This paper proposes the DMPM method, which combines medicines and diseases to identify joint patterns between prescriptions and corresponding diseases. The patterns that emerge from prescription clustering show (1) which treatments are shared by which diseases, (2) which treatments are used based on the patients’ basic information (e.g., a certain age group), and (3) which category of diseases in the ICD uses which prescription patterns. This model reveals prescription practices from a broad perspective, and the resulting summary of practices will be a valuable source of information for practitioners looking for multiple patterns of care, as well as researchers searching for emerging medicine combination patterns. Furthermore, these findings could lead us to a more in-depth understanding of both diseases and medicines.

The data analysis in this paper could be coarse, leading to the re-discovery of well-known practices or phenomena. However, the replication of such coarse yet commonly accepted knowledge demonstrates that the proposed model identifies commonly accepted patterns from the dataset without detailed domain knowledge. We suggest that, in real-world usage, DMPM would be best used to handle a large-scale dataset from a bird’s-eye view. Each identified pattern from DMPM may appear coarse, but the pattern is purely created by datasets without prior knowledge or bias. This unbiased and high-level summary of national-level health care datasets could interest national health care policy makers and insurance companies rather than medical researchers and practitioners.

DMPM was proposed for clustering patients with multiple data sources, such as any information from electronic medical records, prescriptions, lab charts, or nurses’ care notes. The essence of DMPM from a modeling perspective is extracting a topic of joint information from two information sources. For example, this model can be adapted to extract patterns from a medical checkup list and a single disease, which might reveal abnormal medical checkup request patterns given the disease indication. Hence, DMPM is basically a simple expansion of LDA, but it is a useful structure that specifies a patient’s state and a list of treatments that the patient has received.

Conflict of interest

None.

Acknowledgments

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) and funded by the Korean Ministry of Science, ICT & Future Planning (2012R1A1A1044575).

References

[1] M. Simasek, D.A. Blandino, Treatment of the common cold, Am. Fam. Phys. 75 (4) (2007) 515–520. [2] D. Marsland, M. Wood, F. Mayo, Content of family practice. Part I. Rank order of diagnoses by frequency. Part II. Diagnoses by disease category and age/sex distribution, J. Fam. Pract. 3 (1) (1976) 37. [3] P.D. Stolley, L. Lasagna, Prescribing patterns of physicians, J. Chron. Dis. 22 (6) (1969) 395–405. [4] S.L. Devantier, J.P. Minda, M. Goldszmidt, W. Haddara, Categorizing patients in a forced-choice triad task: the integration of context in patient management, PloS One 4 (6) (2009) e5881. [5] M.A. Pfeifer, D.R. Ross, J.P. Schrage, D.A. Gelber, M.P. Schumer, G.M. Crain, S.J. Markwell, S. Jung, A highly successful and novel model for treatment of chronic painful diabetic peripheral neuropathy, Diab. Care 16 (8) (1993) 1103–1115. [6] J.G.C. Association et al., Japanese gastric cancer treatment guidelines 2010 (ver. 3), Gastr. Cancer 14 (2) (2011) 113–123. [7] S.B. Brown, E.A. Brown, I. Walker, The present and future role of photodynamic therapy in cancer treatment, The Lancet Oncol. 5 (8) (2004) 497–508. [8] J.H. Glauber, A.L. Fuhlbrigge, J.A. Finkelstein, C.J. Homer, S.T. Weiss, Relationship between asthma medication and antibiotic use, CHEST J. 120 (5) (2001) 1485–1492. [9] A.-C. Nyquist, R.
Gonzales, J.F. Steiner, M.A. Sande, Antibiotic prescribing for children with colds, upper respiratory tract infections, and bronchitis, Jama 279 (11) (1998) 875–877. [10] C. Munizza, G. Tibaldi, P. Bollini, E. Pirfo, F. Punzo, F. Gramaglia, Prescription pattern of antidepressants in out-patient psychiatric practice, Psychol. Med. 25 (04) (1995) 771–778. [11] B. Dean, M. Schachter, C. Vincent, N. Barber, Causes of prescribing errors in hospital inpatients: a prospective study, The Lancet 359 (9315) (2002) 1373– 1378. [12] X. Hu, M. Gallagher, W. Loveday, J.P. Connor, J. Wiles, Detecting anomalies in controlled drug prescription data using probabilistic models, in: Artificial Life and Computational Intelligence, Springer, 2015, pp. 337–349. [13] H. Park, M. Jeon, J.B. Rosen, Lower dimensional representation of text data in vector space based information retrieval, in: M. Berry (Ed.), Computational Information Retrieval, Proc. Comput. Inform. Retrieval Conf, SIAM, 2001, pp. 3– 23. [14] S.L. Spector, The common cold: current therapy and natural history, J. Aller. Clin. Immunol. 95 (5) (1995) 1133–1138. [15] T. Heikkinen, A. Järvinen, The common cold, The Lancet 361 (9351) (2003) 51– 59. [16] C. Ferrajolo, V. Arcoraci, M.G. Sullo, C. Rafaniello, L. Sportiello, R. Ferrara, A. Cannata, C. Pagliaro, M.G. Tari, A.P. Caputi, et al., Pattern of statin use in southern Italian primary care: can prescription databases be used for monitoring long-term adherence to the treatment?, PloS One 9 (7) (2014) e102146 [17] T. Tamayo, H. Claessen, I.-M. Rückert, W. Maier, M. Schunk, C. Meisinger, A. Mielck, R. Holle, B. Thorand, M. Narres, et al., Treatment pattern of type 2 diabetes differs in two german regions and with patients’ socioeconomic position, PloS One 9 (6) (2014) e99773. [18] S.-C. Park, M.-S. Lee, S.-G. Kang, S.-H. 
Lee, Patterns of antipsychotic prescription to patients with schizophrenia in Korea: results from the health insurance review & assessment service-national patient sample, J. Korean Med. Sci. 29 (5) (2014) 719–728. [19] S. Nakaoka, T. Ishizaki, H. Urushihara, T. Satoh, S. Ikeda, M. Yamamoto, T. Nakayama, Prescribing pattern of anti-parkinson drugs in japan: a trend analysis from 2005 to 2010, PloS One 9 (6) (2014) e99021.

[20] K.L. Olson, K.D. Mandl, Temporal patterns of medications dispensed to children and adolescents in a national insured population, PloS One 7 (7) (2012) e40991. [21] A. Vallano, E. Montane, J. Arnau, X. Vidal, C. Pallares, M. Coll, J. Laporte, Medical speciality and pattern of medicines prescription, Eur. J. Clin. Pharmacol. 60 (10) (2004) 725–730. [22] F. Napolitano, M.T. Izzo, G. Di Giuseppe, I.F. Angelillo, V. Castaldo, et al.C.W. Group, Frequency of inappropriate medication prescription in hospitalized elderly patients in italy, PloS One 8 (12) (2013) e82359. [23] A. Calderón-Larrañaga, L.A. Gimeno-Feliu, F. González-Rubio, B. Poblador-Plou, M. Lairla-San José, J.M. Abad-Díez, A. Poncel-Falcó, A. Prados-Torres, Polypharmacy patterns: unravelling systematic associations between prescribed medications, PloS One 8 (12) (2013) e84967. [24] H.M. Skerman, P.M. Yates, D. Battistutta, Multivariate methods to identify cancer-related symptom clusters, Res. Nurs. Health 32 (3) (2009) 345–360. [25] D.M. Blei, K. Franks, M.I. Jordan, I.S. Mian, Statistical modeling of biomedical corpora: mining the caenorhabditis genetic center bibliography for genes related to life span, BMC Bioinform. 7 (1) (2006) 250. [26] B. Zheng, D.C. McLean, X. Lu, Identifying biological concepts from a proteinrelated corpus with a probabilistic topic model, BMC Bioinform. 7 (1) (2006) 58. [27] F. Mörchen, M. Dejori, D. Fradkin, J. Etienne, B. Wachmann, M. Bundschus, Anticipating annotations and emerging trends in biomedical literature, in: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2008, pp. 954–962. [28] H. Wang, Y. Ding, J. Tang, X. Dong, B. He, J. Qiu, D.J. Wild, Finding complex biological relationships in recent pubmed articles using bio-lda, PLoS One 6 (3) (2011) e17243. [29] Y. Halpern, S. Horng, L.A. Nathanson, N.I. Shapiro, D. 
Sontag, A comparison of dimensionality reduction techniques for unstructured clinical text, in: ICML 2012 Workshop on Clinical Data Analysis, 2012. [30] D.M. Blei, A.Y. Ng, M.I. Jordan, Latent Dirichlet allocation, J. Mach. Learn. Res. 3 (2003) 993–1022. [31] J.D. Mcauliffe, D.M. Blei, Supervised topic models, in: Advances in Neural Information Processing Systems, 2008, pp. 121–128. [32] G.H. Golub, C. Reinsch, Singular value decomposition and least squares solutions, Numer. Math. 14 (5) (1970) 403–420. [33] J. Zhu, A. Ahmed, E.P. Xing, Medlda: maximum margin supervised topic models for regression and classification, in: Proceedings of the 26th Annual International Conference on Machine Learning, ACM, 2009, pp. 1257–1264. [34] K.R. Chan, X. Lou, T. Karaletsos, C. Crosbie, S. Gardos, D. Artz, G. Ratsch, An empirical analysis of topic modeling for mining cancer clinical notes, in: 2013 IEEE 13th International Conference on Data Mining Workshops (ICDMW), IEEE, 2013, pp. 56–63. [35] C. Arnold, W. Speier, A topic model of clinical reports, in: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2012, pp. 1031–1032. [36] Y. Chen, J. Ghosh, C.A. Bejan, C.A. Gunter, S. Gupta, A. Kho, D. Liebovitz, J. Sun, J. Denny, B. Malin, Building bridges across electronic health record systems through inferred phenotypic topics, J. Biomed. Inform. 55 (2015) 82–93. [37] L. Yao, Y. Zhang, B. Wei, W. Wang, Y. Zhang, X. Ren, Y. Bian, Discovering treatment pattern in traditional chinese medicine clinical cases by exploiting supervised topic model and domain knowledge, J. Biomed. Inform. 58 (2015) 260–267. [38] M. Hasan, A. Kotov, A.I. Carcone, M. Dong, S. Naar, K.B. Hartlieb, A study of the effectiveness of machine learning methods for classification of clinical interview fragments into a large number of categories, J. Biomed. Inform. 62 (2016) 21–31. [39] W. Speier, M.K. Ong, C.W. 
Arnold, Using phrases and document metadata to improve topic modeling of clinical reports, J. Biomed. Inform. 61 (2016) 260–266. [40] Z. Huang, W. Dong, H. Duan, A probabilistic topic model for clinical risk stratification from electronic health records, J. Biomed. Inform. 58 (2015) 28–36. [41] Z. Huang, W. Dong, L. Ji, C. Gan, X. Lu, H. Duan, Discovery of clinical pathway patterns from event logs using probabilistic topic models, J. Biomed. Inform. 47 (2014) 39–57. [42] Z. Huang, W. Dong, L. Ji, C. He, H. Duan, Incorporating comorbidities into latent treatment pattern mining for clinical pathways, J. Biomed. Inform. 59 (2016) 227–239. [43] R. Pivovarov, A.J. Perotte, E. Grave, J. Angiolillo, C.H. Wiggins, N. Elhadad, Learning probabilistic phenotypes from heterogeneous EHR data, J. Biomed. Inform. 58 (2015) 156–165. [44] H.-M. Lu, C.-P. Wei, F.-Y. Hsiao, Modeling healthcare data using multiple-channel latent Dirichlet allocation, J. Biomed. Inform. 60 (2016) 210–223. [45] K.D. Association, et al., Health insurance review & assessment service, Report of Task Force Team for Basic Statistical Study of Korean Diabetes Mellitus: Diabetes in Korea 2007, 2007, pp. 1–57. [46] M.I. Jordan, Z. Ghahramani, T.S. Jaakkola, L.K. Saul, An introduction to variational methods for graphical models, Mach. Learn. 37 (2) (1999) 183–233. [47] M.J. Wainwright, M.I. Jordan, Graphical models, exponential families, and variational inference, Found. Trends® Mach. Learn. 1 (1-2) (2008) 1–305. [48] S. Kullback, Information Theory and Statistics, Courier Corporation, 1997. [49] L.P. Kadanoff, More is the same; phase transitions and mean field theories, J. Stat. Phys. 137 (5-6) (2009) 777–797.

[50] A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, J. Roy. Stat. Soc. Ser. B (Methodol.) (1977) 1–38.
[51] M. Zhou, L. Carin, Negative binomial process count and mixture modeling, IEEE Trans. Pattern Anal. Mach. Intell. 37 (2) (2015) 307–320.


[52] M.D. Hoffman, D.M. Blei, C. Wang, J. Paisley, Stochastic variational inference, J. Mach. Learn. Res. 14 (1) (2013) 1303–1347.
[53] P.-N. Tan, M. Steinbach, V. Kumar, et al., Introduction to Data Mining, vol. 1, Pearson Addison Wesley, Boston, 2006.
