
What big data can do for treatment in psychiatry
Claire M. Gillan 1,2,3 and Robert Whelan 1,2,3

Treatments for psychiatric disorders are only as effective as the precision with which we administer them. We have treatments that work; we just cannot always accurately predict who they are going to work for and why. In this article, we discuss how big data can help identify robust, reproducible and generalizable predictors of treatment response in psychiatry. Specifically, we focus on how machine-learning approaches can facilitate a move beyond discovery studies and toward model validation. We will highlight some recent exemplary studies in this area, describe how one can assess the merits of studies reporting treatment biomarkers, and discuss what we consider to be best practice for prediction research in psychiatry.

Addresses
1 School of Psychology, Trinity College Dublin, Dublin 2, Ireland
2 Trinity College Institute of Neuroscience, Trinity College Dublin, Dublin 2, Ireland
3 Global Brain Health Institute, Trinity College Dublin, Dublin 2, Ireland

Corresponding author: Gillan, Claire M. ([email protected])

Current Opinion in Behavioral Sciences 2017, 18:34–42
This review comes from a themed issue on Big data in the behavioral sciences
Edited by Michal Kosinski and Tara Behrend

http://dx.doi.org/10.1016/j.cobeha.2017.07.003

2352-1546/© 2017 Elsevier Ltd. All rights reserved.

Introduction
Treatment response in psychiatry is highly variable. Most patients do not remit following their first course of treatment [1], and in practice multiple therapies are often trialed (and failed) before finding one that works. This means that patients may endure long periods of untreated symptoms before finding something effective — if at all. Unfortunately, we do not possess prognostic tools that can accurately predict a patient's response to a specific treatment. A key goal of modern psychiatry research is to remedy this: to identify the right treatment for each patient, first time around. Although there are myriad existing research studies that purport to distinguish, for example, antidepressant treatment responders from non-responders using cognitive, neuroimaging, neurochemical, demographic or clinical measurements, we remain without a single viable biomarker for treatment prediction in psychiatry. Why is this?

Issues surrounding reproducibility in science have never been more visible [2,3]. It is increasingly recognized that there is an over-reliance on null hypothesis significance testing, and a related focus on p-values [4,5]. Although social psychology has been the sacrificial lamb for the reproducibility debate in psychology thus far, neuroimaging has been subjected to similar critiques [6], where statistical power and researcher degrees of freedom have been identified as key problem areas. Most neuroimaging studies test small samples (typically <50), often with many more variables (e.g., tens of thousands of voxels) than subjects. This makes finding spurious results (i.e., overfitting) highly likely, because the ratio of cases to predictors in ordinary least-squares regression is directly related to the overestimation of a model's performance (see [7]). If held to the same standard, biomarker research studies in psychiatry would fare no better — failing the most crucial test: reproducibility. Although numerous brain measurements have been identified as potential biomarkers of treatment response (see Jollans and Whelan, 2016 for a review [8]), there is almost no consistency across studies. This is because these studies, with a few exceptions described later, do not incorporate adequate external validation steps — that is, testing the ability of a putative predictor to classify unseen data.

This paper seeks to emphasize the importance of big data and robust statistical methodologies in treatment prediction research. We discuss how, in the absence of pre-registration, methods like internal cross-validation and the use of 'hold-out' or external data (both are considered unseen data, but the latter are recruited independently of the training data, for example, taken from another study) are crucial tools for prediction research in psychiatry, and particularly for studies involving neuroimaging [9,10]. We will start by introducing the concept of machine-learning and discussing how it complements theory-driven approaches to understanding treatment response. Then we will describe recent exemplars that have successfully applied machine learning to treatment prediction in psychiatry. We will close with some best-practice guidelines for research in this area and some recommendations for collaborative research strategies (Box 1, Figure 1). These approaches align nicely with initiatives like the National Institute of Mental Health's Research Domain Criteria (RDoC), which aim to establish biologically-grounded alternatives to our current system of psychiatric diagnostic classification.


Box 1 Best practice for treatment prediction research

Large samples. Other considerations being equal, bigger sample sizes improve model reliability by reducing the tendency to overfit [10]: it becomes difficult for a model to fit random noise in the training data as the sample size increases. Larger effect sizes protect against overfitting, while high-dimensional feature spaces (e.g., voxels in neuroimaging) promote it.

Validation. The performance of any model should be referenced against unseen data. One approach is to split a dataset into three parts (the nomenclature of Hastie, Tibshirani & Friedman [12], but note that terminology can differ): a training set, and smaller validation and test ('hold-out') sets. The training set is used for model fitting; the validation set is used to measure the generalization error of the model, for example using nested cross-validation. The training and validation data can also be used for model selection, for example assessing how well competing models perform on validation data by varying model parameters (e.g., regularization constants) or type of algorithm (e.g., Elastic Net or SVM). In this case, a test set is critical, otherwise overfitting due to 'researcher degrees of freedom' will occur (e.g., choosing the best performing algorithm). The 'hold-out' test set is one that has been kept entirely separate from the rest of the data, and is only used to test the performance of one final selected model. Where possible, the test set would come from a different sample with similar characteristics (an external test), which is an even stronger test of generalization (a code sketch of this scheme follows this box).

Appropriate metrics. No one metric can capture model performance, because factors such as differences in base rates of response to treatment affect the interpretation of these metrics. Thus, a range of metrics is necessary and should include sensitivity, specificity, positive predictive value and negative predictive value (see Figure 2 for a detailed explanation).

Regret. Psychiatric treatments have a range of side-effects and financial burdens, and therefore some misclassifications are worse than others in terms of patient safety and expense. Machine learning algorithms that include 'regret' by incorporating a different cost for each error type are likely to be useful for treatment response prediction. For example, we may want to predict response to treatment with the goal of stopping ineffective treatment early for financial reasons. Here, false positives (predicting no improvement when the patient does improve) are riskier in terms of patient health than false negatives (predicting improvement when the patient does not improve). Thus, a classifier could be constructed so that, during training, false positives are twice as costly as false negatives [48].

Interpretability. Woo and colleagues [49] make three excellent suggestions for improving the interpretability of models. Briefly, the models should, first, be summarized (e.g., by applying a data reduction method) to present the most predictive features; second, be evaluated for neuroscientific plausibility (e.g., is the model concordant with known pathology?); and third, be examined for the potential of confounding variables to contribute to the model.

Open science. Data and models should be shared to, first, facilitate comparisons with previous and future models and, second, provide datasets for external model validation.
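To make the validation and 'regret' points above concrete, the minimal sketch below uses simulated data and scikit-learn; the particular algorithm, parameter grid and two-to-one cost ratio are illustrative assumptions rather than recommendations.

```python
# Sketch of Box 1's validation scheme: nested cross-validation on the
# training/validation data, a single evaluation on a hold-out test set, and a
# cost-sensitive ('regret') variant. All data here are simulated.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=50, random_state=0)

# Keep a hold-out test set entirely separate from fitting and model selection.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Inner loop: choose the regularization constant on the training/validation data.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
grid = GridSearchCV(pipe, {"logisticregression__C": [0.01, 0.1, 1, 10]}, cv=5)

# Outer loop: nested cross-validation estimates the generalization error of the
# whole selection procedure, not of one particular fitted model.
print("nested CV accuracy: %.2f" % cross_val_score(grid, X_trainval, y_trainval, cv=5).mean())

# Refit the selected model on all training/validation data, then evaluate it
# exactly once on the untouched hold-out set (stronger still: external data).
grid.fit(X_trainval, y_trainval)
print("hold-out accuracy: %.2f" % grid.score(X_test, y_test))

# 'Regret': up-weighting the negative class makes errors on negative cases
# (false positives) roughly twice as costly as false negatives during training.
costly = make_pipeline(StandardScaler(),
                       LogisticRegression(class_weight={0: 2.0, 1: 1.0}, max_iter=5000))
costly.fit(X_trainval, y_trainval)
```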

Data-driven and theory-driven approaches
Machine-learning (essentially synonymous with 'data-mining' or 'statistical learning') refers to a class of approaches that focus on prediction rather than interpretation or mechanism. Typically, an outcome variable such as responder/non-responder status is used to train an algorithm to identify some combination of features (e.g., self-report, demographic, cognitive or brain data) that are associated with the outcome. This type of question — responder vs. non-responder as the outcome variable — is often treated as a classification problem in machine learning. But the outcome variable can also be continuous, and in these cases a regression analysis is often used. From a machine-learning perspective, the same principles apply to both regression and classification questions.

Models must balance the need to accommodate the complexity of the data (i.e., to be flexible) and the need for interpretability (see [11], section 2.1.3 for a more detailed exposition). For example, linear regression is inflexible because only linear relationships are allowed; as a result, the output is easy to interpret — the outcome is a weighted linear combination of the features (e.g., younger people respond better to a specific treatment). By contrast, a support vector machine or random forest approach may yield more accurate predictions, but interpretation is more difficult [12]. The choice of method depends on the goal of the analysis, as does the choice of metric for quantifying model performance. For example, the model can be optimized for real-world implementation using terms like 'regret', which take into account the fact that in the context of patient care some misclassification errors are worse than others (e.g., false positive errors might be worse than false negative errors in some situations; Box 1).

Machine learning approaches are often contrasted with theory-driven approaches, such as those promoted by the computational psychiatry movement, which endeavor to explain psychiatric phenomena in terms of detailed models of brain function [13,14]. This theory-driven strategy might help improve treatment outcomes in one of two ways. First, it is thought that by linking clinical symptoms directly to theory-driven computational models of neural processes, new treatments could be designed to more precisely and effectively target these neural processes [15,16]. Alternatively, it is possible that the heterogeneity of response to existing treatments within diagnostic categories might be resolved if we redefine those categories based on commonalities in well-defined neurobiological processes rather than symptomatology. Enthusiasm for the latter approach has been borne out in work that has found new ways of parsing symptoms that link more closely to neural [17] and cognitive [18,19] processes than existing diagnostic categories. The computational psychiatry approach is appealing, yet it remains to be seen how these insights will transfer to the clinic. Aside from issues of scalability and implementation, in terms of both the reach and cost-effectiveness of its mainstay tools like functional imaging, seeking a near-perfect computational characterization of the brain processes linked to a given clinical symptom cluster might be a dead end, because there is no guarantee that this will produce insights for improving treatments. This is in part because, much like in general medicine, different underlying causes can produce similar symptoms (e.g., jaundice is a symptom of both gallstones and hepatitis).


Figure 1
[Schematic of the treatment prediction workflow, panels (a)–(d): (a) Define Feature and Target Data (features; target: treatment outcome); (b) Train Model (k-fold cross-validation over subjects S(1)–S(N), training vs. test folds); (c) Test on Unseen Data ('hold-out' or external data; model classifies each subject as responder or non-responder); (d) Randomized Controlled Trial using the algorithm (treatment as usual vs. algorithm-assisted treatment assignment; outcome: responder vs. non-responder).]

Treatment prediction research with big data. (a) Feature and target spaces are defined for a supervised learning approach. Features are variables that might predict future treatment outcome — examples include clinical, demographic, social, physiological, cognitive, neural or genetic information. The target variable is the outcome variable of interest — binary responder/non-responder status is a common metric, based on a threshold of reduction in symptoms. In this example, there are n subjects and i features. All things being equal, we should maximize the number of subjects relative to the number of features (i.e., the ratio n:i). (b) Machine learning algorithms are trained and tested repeatedly using k-fold cross-validation. In this example, k = 5; hence, for each fold, the model is trained on 80% of the data and tested on the remaining 20%. Thus, the model is always evaluated with respect to its performance on unseen data within the initial sample. (c) The 'winning' algorithm is next tested on unseen data using the same metrics as in (b). This can be 'hold-out' data that were collected alongside the training data but kept separate, or it can be independently collected data. Model performance on unseen data should be assessed using a variety of metrics (Figure 2). (d) Viable models — those that perform similarly on unseen internal and external datasets — should be brought forward to randomized controlled trials where treatment allocation by best-practice clinical guidelines can be compared to algorithm-assisted treatment assignment [34].
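As a minimal illustration of panels (a)–(c), the sketch below (simulated arrays and scikit-learn; the responder threshold is an arbitrary assumption) defines an n-by-i feature matrix and a binary target, runs 5-fold cross-validation, and leaves external testing as a final, separate step.

```python
# Panels (a)-(c) in miniature: feature matrix (n subjects x i features),
# binary responder target, 5-fold cross-validation within the sample.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n_subjects, n_features = 300, 30                   # keep the n:i ratio as large as possible
X = rng.normal(size=(n_subjects, n_features))      # clinical/cognitive/neural features (simulated)
symptom_change = rng.normal(size=n_subjects)
y = (symptom_change > 0).astype(int)               # responder = symptom reduction above a threshold

model = LogisticRegression(max_iter=5000)
fold_scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model.fit(X[train_idx], y[train_idx])                       # train on 80% of the sample
    fold_scores.append(model.score(X[test_idx], y[test_idx]))   # test on the held-out 20%
print("mean within-sample CV accuracy: %.2f" % np.mean(fold_scores))

# Panel (c): the selected model would then be applied, unchanged, to hold-out
# or externally collected data before any move toward an RCT (panel (d)).
```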

Moreover, in patients who share a common cause, understanding that cause may not necessarily lead one directly to its treatment. Instead of developing models of symptom states, psychiatry might therefore be better served by research aimed at fostering a computational understanding of how and when treatments work. These insights could then be used to develop prognostic tools that provide individual patient predictions to guide treatment decisions [15]. Although the computational psychiatry approach is scientifically rich and without doubt enticing, it is not the most expedient way to achieve the goal of a more personalized medicine approach in psychiatric treatment. This is where data-driven approaches like machine learning have the advantage. Machine learning can be used to develop single-subject prediction algorithms for treatment allocation from expansive datasets from which no obvious hypotheses can be postulated. In its purest form, machine learning maximizes accuracy at the expense of interpretability, and this sets it in stark contrast to theory-driven methods like computational psychiatry. The relative merits of theory-driven vs. data-driven approaches have been discussed and debated expertly by others [20,21], so we will not attempt to recapitulate these arguments. It will suffice to say that theory-driven approaches inevitably serve as a starting point for these models (i.e., for variable selection) and, as such, the methods should be considered complementary. More crucially, we make the case that psychiatry should not be afraid of 'black box' machine learning algorithms for two reasons. First, these algorithms help us to achieve the best prediction accuracy — often the goal of treatment prediction research. Second, although machine learning algorithms often appear scientifically bereft, there are methods (described later) to reverse-engineer mechanisms from these models. In doing so, they offer potential for scientific discovery that goes beyond the limits of hypotheses that can be derived from existing knowledge.
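To make the flexibility-versus-interpretability trade-off concrete, here is a minimal sketch with simulated data and scikit-learn; the specific models, and the use of permutation importance as one way to peek inside a 'black box', are illustrative choices, not the methods of any study discussed here.

```python
# An easily interpreted linear model vs. a more flexible random forest,
# plus a simple way to probe the forest after fitting.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=20, n_informative=5, random_state=1)

linear = LogisticRegression(max_iter=5000)
forest = RandomForestClassifier(n_estimators=300, random_state=1)

for name, model in [("linear model", linear), ("random forest", forest)]:
    acc = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: cross-validated accuracy = {acc:.2f}")

# The linear model's weights can be read off directly (a weighted linear
# combination of features, as described above).
print("linear weights:", np.round(linear.fit(X, y).coef_[0], 2))

# 'Reverse-engineering' the black box: permutation importance asks how much
# performance drops when each feature is shuffled - one crude view of mechanism.
forest.fit(X, y)
imp = permutation_importance(forest, X, y, n_repeats=10, random_state=1)
print("most important features for the forest:",
      np.argsort(imp.importances_mean)[::-1][:5])
```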


Machine-learning in treatment prediction research
To illustrate this, we will now describe how some researchers have taken advantage of the recent proliferation of large datasets to make single-patient predictions in relation to treatment response in psychiatry. One of the first studies to identify a robust predictor of treatment response was by Uher and colleagues [22!] in 2012. They tested the extent to which dimensions of depressive symptomatology could predict clinical outcome following 12 weeks of pharmacological treatment — taking what is known as an unsupervised approach. Unsupervised analyses aim to find hidden 'latent' structure in datasets (e.g., clusters or factors) and, in many cases, are a useful way of reducing the number of predictors in a model (thereby reducing the risk of over-fitting). Unlike supervised learning, this method does not attempt to predict an outcome measure, but rather searches for structure within a set of features. In this study, 9 partially nested symptom dimensions were defined from a 'discovery dataset' (n = 811) using a factor analysis of individual items from three popular scales that measure depression symptoms [23]. These factors were then entered as predictors in subsequent regression models, for which the authors adopted a stringent criterion that a candidate predictor survive correction for multiple comparisons in two separate drug groups, and be consistent across two outcome measures, one self-report and the other clinician administered. Just one predictor, an 'interest-activity' symptom dimension, met this criterion, and its association with clinical outcome was tested in an independent validation dataset (n = 3637) from the STAR*D study. This result generalized to the new data, showing that patients reporting the most prominent 'loss of interest' had the worst prognosis, independent of baseline illness severity.

Moving into supervised learning, which is used to identify the best predictors of a stated outcome or 'target' variable, another group of researchers [24,25] built a machine learning model using self-report data from over 8000 patients to predict chronicity of depression. Rather than following subjects over time, they relied on subjects' self-reported history of depression, such as number of years with depression, to define chronicity. Kessler and colleagues [26] recently took this further and tested whether this algorithm could also identify subjects who would experience chronic depression in the future. To do this, 1056 of the original patients were re-interviewed 10–12 years later and the original algorithm was fit to their prospective clinical outcome data, with similar chronicity measures examined at follow-up. These results generalized to prospective outcomes for these patients, with an area under the curve (AUC) of the receiver operating characteristic (ROC) ranging from 0.63 to 0.76 (for the various chronicity measures included).

While results are promising, it is important to note that this is not an independent validation, because the same subjects were used in both analyses. For example, subjects who reported retrospectively that they had chronic depression at time 1 would have an increased tendency to also show that pattern 10–12 years later. Given the non-independence of these tests, the generalizability of the model is uncertain and its true AUC likely lower.

A 2016 study demonstrated both the utility and challenges of a big data approach. Chekroud and colleagues [27!!] first used clinical variables collected at baseline from 1949 depressed patients enrolled in the STAR*D study [1] (e.g., presence of specific symptoms, medication history, employment status) to predict response to the selective serotonin reuptake inhibitor (SSRI) citalopram. Within this sample, they adopted a machine-learning approach that constructed and validated a model using internal cross-validation. The 25 best individual predictors were identified from a set of 164 features using regression with elastic net regularization, and these 25 were used to predict treatment response. This initial model performed at 64.6% accuracy on the internal test folds. Next, this model was applied (with no further training) to a new sample derived from the COMED study: model accuracy was reduced somewhat to just under 60% for this external validation (Box 2). Although the approach adopted in this study can be considered gold-standard, comprising both internal and external validation (Figure 1, Box 1), the rate of misclassification remains substantial at approximately 40%. Although a considerable step forward, as we will discuss later, it is possible that the addition of brain, cognitive, psychosocial or genetic data (but see a recent failure using genetics [28]) could increase sensitivity and specificity [29!]. A final note on this study is that there was modest evidence that the algorithm (which was trained on citalopram) generalized to escitalopram, another SSRI, but failed to predict response to a medication combination outside this class (venlafaxine plus mirtazapine group).

Box 2 How accurate is accurate enough?
The American Psychiatric Association (APA) recommended that for a biomarker to have clinical utility in psychiatry, it should have 80% accuracy in detecting a disorder [50]. While this might seem like a reasonable starting point for research aimed at developing prognostic tools too, ultimately the value of a tool comes down to a cost/benefit trade-off. For example, take a tool based on some algorithmic combination of a limited set of self-report questions that has 70% accuracy (with equal sensitivity and specificity) for predicting response to a first-line antidepressant in depression. If 10 patients were prescribed a treatment based on the output of the algorithm, 7 would respond and 3 would not (to give context, the STAR*D study found that out of 10 patients given first-line treatment of citalopram, 4 respond and 6 do not [1]). In terms of specificity, if 10 others were denied this treatment, 7 would have moved on more quickly to the second-line treatment and only 3 would have been denied a beneficial first-line drug. Thus, if that tool was quick to administer and readily available, despite its modest predictive power, it would have enormous value.
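A hedged sketch of the kind of cross-trial workflow described above (before Box 2) — elastic-net regularization retaining a small set of predictors, internal cross-validation, then application to an external sample with no further training — is given below; the arrays and dimensions are simulated stand-ins, not the actual STAR*D or COMED data.

```python
# Elastic-net feature selection with internal cross-validation, then a single
# external test. Data here are simulated placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_discovery = rng.normal(size=(1949, 164))          # baseline clinical features, trial 1
y_discovery = rng.integers(0, 2, 1949)              # responder / non-responder
X_external = rng.normal(size=(600, 164))            # independently collected trial 2
y_external = rng.integers(0, 2, 600)

# The elastic net mixes L1 and L2 penalties; the L1 component drives many
# coefficients to exactly zero, leaving a reduced set of retained predictors.
enet = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(penalty="elasticnet", solver="saga",
                         l1_ratios=[0.5], Cs=10, cv=5, max_iter=5000))
enet.fit(X_discovery, y_discovery)

print("predictors retained:", int(np.sum(enet[-1].coef_ != 0)))
print("external accuracy (no further training): %.2f"
      % enet.score(X_external, y_external))
```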



Although these data are not entirely compelling, the prospect of identifying algorithms with treatment-specificity, that is, those that can help clinicians determine which treatment they should recommend to which individual, is tantalizing, and analysis approaches expressly designed to arbitrate between alternative treatment options will be of interest moving forward [30]. However, pharmacological treatment specificity is not a necessary attribute of a useful treatment prediction algorithm in psychiatry, given the current state of the art. For example, an algorithm that can identify individuals who are unlikely to respond to a range of antidepressants could be used to fast-track those patients to alternative treatments [31].

Koutsouleris and colleagues [32] similarly trained a machine learning algorithm on a combination of clinician, staff and self-rated measures from a sample of 334 patients with first-episode psychosis who subsequently underwent treatment. Their nested cross-validated algorithm achieved 75% accuracy for 4-week outcomes and 73.8% for 52-week outcomes. Prioritizing clinical practicality, they retrained the 4-week algorithm with just the top 10 features from the initial analysis and, critically, tested this on an independent sample of 108 subjects. Impressively, this achieved a prediction accuracy of 71.7%. The most robust predictors of 4-week outcomes were unemployment, daytime activities, psychological distress, company, money and global assessment of functioning at baseline. Like the 2016 study from Chekroud and colleagues [27!!], there was some suggestion that the algorithm possessed treatment specificity (at 52 weeks). However, the specificity findings of Koutsouleris et al. require independent replication. The most cautious interpretation must be that these models are not yet ready to guide treatment decisions, but rather may indicate poor clinical prognosis or general treatment resistance, for which the implications for selecting the appropriate treatment strategy are less clear.

The final study that we will discuss begins to address this issue, highlighting how we might identify which symptoms a medication can effectively treat. In a 2017 paper, Chekroud and colleagues [33] used an unsupervised approach to identify clusters of symptoms based on the correlations across item-level scores on a common scale assessing depression symptomatology (which is otherwise typically aggregated into a total score). They identified three symptom clusters (sleep, core emotional and atypical) in a sample of 4017 subjects, and then replicated them in an independent set of 640 subjects. The 'sleep' cluster refers to various forms of insomnia, the 'core emotional' cluster includes symptoms relating to mood, energy, interest and guilt, and the 'atypical' cluster includes suicidality, hypersomnia, psychomotor slowing and agitation. They found that antidepressants were more effective for core emotional symptoms than for sleep symptoms, and were the least effective for atypical symptoms. Moreover, they showed that certain drugs, for example high-dose duloxetine, outperform others, like escitalopram, at treating core emotional symptoms, suggesting a possible future clinical application.
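A minimal sketch of this style of unsupervised, item-level analysis — clustering questionnaire items by their inter-item correlations rather than summing them into a total score — might look like the following; the item data are simulated, and the number of items and the choice of three clusters are assumptions for illustration.

```python
# Hierarchical clustering of symptom items based on their correlation structure.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
items = rng.normal(size=(4017, 17))                 # n subjects x 17 questionnaire items (simulated)

corr = np.corrcoef(items, rowvar=False)             # item-by-item correlation matrix
dist = 1.0 - corr                                   # turn similarity into a distance
np.fill_diagonal(dist, 0.0)

Z = linkage(squareform(dist, checks=False), method="average")
clusters = fcluster(Z, t=3, criterion="maxclust")   # request three symptom clusters
print("cluster assignment per item:", clusters)

# The same item-to-cluster mapping would then be applied, unchanged, to an
# independent replication sample to check that the clusters reproduce.
```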

Can brain imaging data improve prediction?
Magnetic resonance imaging (MRI) can measure in vivo brain activity with high spatial resolution. As such, it has great theoretical applicability for predicting response to treatments designed to alter brain function. Unfortunately, the high cost of MRI is a barrier to large (>100) sample sizes, yet large samples are all the more important given the high dimensionality (>10 000 voxels) of MRI data. Early studies with small sample sizes have highlighted some of the ways that MRI data might be incorporated into analyses aimed at predicting treatment response in the future. Redlich et al. [34] applied machine learning methods to structural gray-matter data to predict response to electroconvulsive therapy (ECT) with moderate performance (13 responders vs. 10 non-responders; 100% sensitivity and 50% specificity). Another study [35] compared 25 remitters versus 20 non-remitters to ECT using resting-state functional MRI. Two networks were identified that were predictive of outcome (sensitivity of 80–84% and specificity of 75–85%). A sophisticated analysis [36] used brain connectomics (i.e., correlated activity of brain regions), combining both structural (diffusion) and functional (resting state) MRI, to predict improvement in 38 individuals with social anxiety disorder (SAD) following 12 weekly sessions of CBT. Each MRI modality added prediction accuracy, with the best model resulting in a five-fold improvement in prediction, suggesting that a multi-modal approach and brain connectomics may be useful. Although these studies employed internal cross-validation, the small sample sizes and lack of testing on unseen data mean that generalizability cannot be assessed, and therefore these performance metrics cannot be taken at face value. As we recommend in Box 1, large samples and external validation are essential in constructing and assessing, respectively, the utility of predictions.

An example of a promising methodology comes from work focused on predicting future binge drinking in a cohort of adolescents using machine learning. MRI data at age 14 were obtained from 692 subjects, including structural MRI and fMRI during tasks that assayed inhibitory control, reward processing and emotional responding. A wide variety of personality and cognitive measures were also obtained. These age-14 data were used to predict alcohol misuse at age 16 (121 binge drinkers versus 150 non-drinkers), with an AUC of 0.75. This model was also able to predict binge drinking in a separate sample of adolescents with slightly different alcohol use characteristics at age 14 (n = 116), also with an AUC of 0.75.


The use of a large sample and external validation is unusual in MRI prediction studies for reasons of cost, but was made possible here through a multisite MRI data collection consortium that employed standardized data collection procedures and imaging parameters. There are just a couple of MRI studies in treatment prediction research that carry weight. Korgaonkar et al. [37] examined both gray-matter volumes (116 brain regions from an anatomical atlas) and fractional anisotropy of 46 white matter tracts. A series of decision trees was generated; these essentially identify the optimal ROC cut-off values for accurately separating the groups. The trees were generated on a sample of 74 subjects and validated on a set of 83 subjects, with results indicating that gray matter volume in the left middle frontal and the right angular gyrus identified a subset of patients who did not remit to any of the three prescribed antidepressant medications (55% of all non-remitters in the cohort were identified, with 82% accuracy). Although the sample size was relatively modest in the training set, the validation test on 83 unseen patients provides reassurance that the model was not overfit.

Finally, an outstanding recent example of a big data approach [38!!] involved the application of unsupervised clustering to brain connectomics data derived from resting-state fMRI (rsfMRI) in 1188 patients with a diagnosis of depression. This imaging modality is potentially more useful in a clinical setting because it has minimal patient demands and data can be more easily aggregated across sites. Here, the authors generated four 'biotypes' of patients, organized only by patterns of brain connectivity, that had high test–retest reliability. The biotypes were validated on an external sample, with 86.2% of patients correctly identified. Critically, the biotypes also had prognostic utility. Repetitive transcranial magnetic stimulation (rTMS), a neurostimulation treatment for medication-resistant depression, was administered to a separate cohort of 154 patients (who had recently undergone rsfMRI). Patients in one biotype subgroup were more likely to benefit from rTMS than those in other biotype subgroups (82% improvement versus 61%, 29.6% and 25%). Connectivity measures from rsfMRI and biotype diagnosis were markedly superior to treatment predictions based on typical clinical symptoms (89.6% vs. 62.6%).
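A shallow decision tree of the kind used in the Korgaonkar et al. analysis might be sketched as below; the data are simulated, and the depth and train/validation split only loosely mirror the description above.

```python
# Decision-tree stratification: learn cut-off values on one sample, then
# check them on a second, unseen sample. Data are simulated stand-ins for
# regional gray-matter volumes.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=157, n_features=10, n_informative=3, random_state=2)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=74, random_state=2)

tree = DecisionTreeClassifier(max_depth=2, random_state=2).fit(X_train, y_train)
print(export_text(tree))                                   # the learned cut-off values
print("validation accuracy: %.2f" % tree.score(X_valid, y_valid))
```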

Where next?

With the notable exceptions described above [37,38!!], the vast majority of MRI treatment prediction studies do not meet best practice guidelines in terms of sample size and/or appropriate validation (Figure 2). This means that the data are unlikely to generalize to new settings. We advocate that pre-registration and/or appropriate validation of putative biomarkers in unseen data should be a requirement for publication. If the sample is large enough, this means internal validation at a minimum, but preferably with external validation on unseen data.

The dearth of appropriately powered, large-scale predictive MRI research is likely due to the same factor that makes MRI unlikely to be practical in clinical practice — its high cost. By contrast, electroencephalography (EEG) is relatively inexpensive and therefore may have more potential for widespread use in prediction models. Capable of providing reliable measures of cognitive status, one EEG tool has already been shown to improve treatment decisions in depression when used as a supplement to clinical opinion [39]. Another study, by Leuchter and colleagues (2009), showed that EEG data collected following 1 week of treatment with selective serotonin reuptake inhibitors can predict responder status at week 8 with approximately 74% prediction accuracy [40]. One way we might incorporate MRI into this kind of large-scale, brain-based prediction research without the high price tag is by increasing the number of small-scale MRI studies that simultaneously collect EEG data. In this way, researchers can back-translate EEG insights (which have been shown to have direct treatment prediction power) to MRI, facilitating a richer examination of the neuronal mechanisms that predict response. A similar logic applies to cognitive tests, which can, albeit with less precision, assay specific aspects of brain function. Researchers have already begun to carry out preliminary investigations of this sort with classic cognitive tests [41], but there is much to be done. Critically, cognitive tests (unlike EEG) can be administered via the Internet, which allows researchers to feasibly collect new data (rather than relying on data from randomized controlled trials (RCTs)) from individuals around the world in comparatively short timescales and at low cost. The opportunity to extend this approach to prediction research in psychiatry is basically untouched, but the recent uptake of services like Amazon's Mechanical Turk for psychological research indicates that this will not be far off [42].

Finally, treatment prediction research in psychiatry will truly move into the realm of 'big data' when several complementary data sources are combined. These sources may include self-report data, MRI or EEG, genetics and perhaps epigenetics. With respect to neuroimaging data, these modalities could include structural information such as volumetrics, cortical thickness and white-matter tractography, and functional data that assay one or more relevant neurocognitive systems (e.g., reward responsivity). Indeed, it is possible to strategically combine data sources to increase prediction accuracy. Sophisticated methods, such as boosting or stacked generalization (stacking), are types of ensemble learners that can train a diverse set of models and learn from misclassified cases. In this way, results from several weak learners — perhaps each focusing on a particular data source — can be aggregated to produce a better prediction than any single model. While the inclusion of multiple modalities might improve prediction, this awaits direct testing.
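A minimal sketch of stacking across modalities — one weak learner per data source, combined by a meta-model — is given below; the modalities, dimensions and learners are simulated placeholders rather than a recommended recipe.

```python
# Stacked generalization across simulated 'modalities' (column blocks).
import numpy as np
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 180))    # columns 0-19 self-report, 20-79 EEG, 80-179 genetics (simulated)
y = rng.integers(0, 2, n)

def modality_learner(start, stop):
    # A weak learner restricted to one block of columns (one data source).
    pick = FunctionTransformer(lambda Z, s=start, e=stop: Z[:, s:e])
    return make_pipeline(pick, StandardScaler(), LogisticRegression(max_iter=5000))

stack = StackingClassifier(
    estimators=[("self_report", modality_learner(0, 20)),
                ("eeg", modality_learner(20, 80)),
                ("genetics", modality_learner(80, 180))],
    final_estimator=LogisticRegression(max_iter=5000),   # meta-model over base predictions
    cv=5)

print("stacked model CV accuracy: %.2f" % cross_val_score(stack, X, y, cv=5).mean())
```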


Figure 2
[Confusion matrices for two hypothetical treatments, cross-tabulating model-predicted responders against actual treatment outcome (responder vs. non-responder). (a) Treatment with moderate efficacy (57% responders): sensitivity = .70 = true positives / (true positives + false negatives); specificity = .66 = true negatives / (true negatives + false positives); positive predictive value = .74 = true positives / (true positives + false positives); negative predictive value = .63 = true negatives / (true negatives + false negatives). (b) Treatment with low efficacy (25% responders): sensitivity = .70; specificity = .66; positive predictive value = .41; negative predictive value = .87.]

Assessing model performance. (a) Hypothetical results of a treatment that has moderate efficacy and results in 20 responders and 15 non-responders. This model has a sensitivity of .70, which reflects its ability to correctly classify actual 'responders'. This model has slightly lower specificity, the ability to correctly identify those who will not respond to treatment — 'non-responders'. The area under the curve of the receiver-operating characteristic gives some useful information on model performance when groups are approximately balanced in size. However, other metrics should also be reported, particularly when treatment outcomes are unbalanced [44] (Panel b). These include positive and negative predictive value, which account for base rates in treatment response. (b) Hypothetical results of a treatment that has low efficacy and results in just 10 responders and 30 non-responders. Here, the sensitivity and specificity estimates are identical to Panel (a). However, the positive and negative predictive values differ markedly from those of Panel (a) because of the change in base rate with respect to treatment response. Positive predictive value refers to the probability that those individuals the model indicates should be given a treatment will in fact be 'responders'. Negative predictive value refers to the probability that those individuals the model indicates we should deny the treatment to would in fact not have had a positive response. If we imagine a treatment that is low 'cost' (e.g., few side-effects, rapid treatment course, inexpensive), then we would prefer a predictive tool with high negative predictive value — we do not want patients to miss out on a potentially helpful treatment. With more high-risk or invasive treatment options, positive predictive value might be more important.
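The cell counts behind Figure 2 are not printed in the panels, but they can be inferred from the reported rates; the quick check below recomputes the panel (a) and (b) metrics from such inferred counts (the exact integers are our assumption).

```python
# Recompute Figure 2's metrics from confusion-matrix counts.
def report(tp, fn, tn, fp):
    print(f"sensitivity = {tp / (tp + fn):.2f}   "
          f"specificity = {tn / (tn + fp):.2f}   "
          f"PPV = {tp / (tp + fp):.2f}   "
          f"NPV = {tn / (tn + fn):.2f}")

# Panel (a): 20 responders (14 TP, 6 FN), 15 non-responders (10 TN, 5 FP).
report(tp=14, fn=6, tn=10, fp=5)     # sensitivity .70, specificity ~.66, PPV .74, NPV .63

# Panel (b): 10 responders (7 TP, 3 FN), 30 non-responders (20 TN, 10 FP).
# Sensitivity and specificity match panel (a); PPV and NPV shift with the base rate.
report(tp=7, fn=3, tn=20, fp=10)     # sensitivity .70, specificity ~.66, PPV .41, NPV .87
```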




The unique contribution of any one modality in predicting treatment response can be quantified by selectively including or excluding each modality. For example, Whelan and colleagues [29!] used this method to show that life history, personality and neuroimaging data could predict future substance misuse, whereas features such as cognitive performance could not (over and above neuroimaging data). Notably, given the covariance among different features, such as behavioral assays and neuroimaging, removing an entire class of features sometimes did not substantially degrade prediction accuracy. Feature ablation — iteratively removing features from a model — can be used to identify the minimal set of features needed to achieve a certain level of model performance, which is essential for real-world applications where, for example, brain imaging with fMRI — even if proven useful — will not be practical.
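A sketch of this kind of modality ablation, with simulated feature blocks standing in for the real data sources, might look like this:

```python
# Drop one modality at a time, refit, and compare cross-validated accuracy.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 400
modalities = {"life history": rng.normal(size=(n, 10)),
              "personality": rng.normal(size=(n, 15)),
              "neuroimaging": rng.normal(size=(n, 50)),
              "cognition": rng.normal(size=(n, 12))}
y = rng.integers(0, 2, n)

model = LogisticRegression(max_iter=5000)
full = np.hstack(list(modalities.values()))
print("all modalities: %.2f" % cross_val_score(model, full, y, cv=5).mean())

for left_out in modalities:
    X_ablate = np.hstack([block for name, block in modalities.items() if name != left_out])
    score = cross_val_score(model, X_ablate, y, cv=5).mean()
    # A large drop relative to the full model suggests that modality carries unique information.
    print("without %s: %.2f" % (left_out, score))
```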

Conclusion
Despite the increasing opportunities for collecting large datasets inexpensively using online methods, we must also recognize that ultimately the problem we are trying to solve is too large for any one laboratory. Collaboration will be essential moving forward, as the need for large datasets becomes more and more apparent. An important step will be greater engagement in the standardization of instruments, from using the same self-report questionnaires, to cognitive tests and MRI acquisition methods. This will facilitate a balance between incremental hypothesis-testing on small datasets within labs, and large machine-learning studies carried out across centers. In some cases, such as structural MRI, where images are collected in a fairly standardized way, there is already potential for merging existing datasets to conduct 'mega-analyses' years later [43]. It might be more difficult to standardize cognitive tests, as numerous parameters (e.g., stimulus, timing, and reinforcement properties) typically vary. However, in the developmental literature there may be some emerging consensus on core tasks; for example, the European-based IMAGEN study (n > 2000) [44] and the US-based ABCD study (projected n = 10 000) [45] both employ the stop-signal task [46] and the monetary incentive delay task [47] as assays of inhibitory control and reward processing. We are just now starting to talk about the research infrastructure needed to apply 'big data' methodologies to the task of developing robust, multimodal prediction models for treatment response in psychiatry. It is clear that, more than just big data, we need big ideas to deliver on the collaborative and open research framework needed to make this possible.

Conflict of interest statement
Nothing declared.

Acknowledgements
C.M. Gillan is supported by MQ: transforming mental health (MQ16IP13). R. Whelan is supported by Science Foundation Ireland (16/ERCD/3797), the Health Research Board (HRA-POR-2015-1075) and a Brain and Behavior Research Foundation Young Investigator award (#23599).

References and recommended reading
Papers of particular interest, published within the period of review, have been highlighted as:
! of special interest
!! of outstanding interest

1. Rush AJ, Trivedi MH, Wisniewski SR, Nierenberg AA, Stewart JW, Warden D, Niederehe G, Thase ME, Lavori PW, Lebowitz BD et al.: Acute and longer-term outcomes in depressed outpatients requiring one or several treatment steps: a STAR*D report. Am J Psychiatry 2006, 163:1905-1917.

2. Open Science Collaboration: Estimating the reproducibility of psychological science. Science 2015, 349:aac4716.

3. Ioannidis JP: Why most published research findings are false. PLoS Med 2005, 2:e124.

4. Wasserstein R, Lazar N: The ASA's statement on p-values: context, process, and purpose. Am Stat 2016, 70:129-131.

5. Nuzzo R: Scientific method: statistical errors. Nature 2014, 506:150-152.

6. Button KS, Ioannidis JP, Mokrysz C, Nosek BA, Flint J, Robinson ES, Munafò MR: Power failure: why small sample size undermines the reliability of neuroscience. Nat Rev Neurosci 2013, 14:365-376.

7. Whelan R, Garavan H: When optimism hurts: inflated predictions in psychiatric neuroimaging. Biol Psychiatry 2014, 75:746-748.

8. Jollans L, Whelan R: The clinical added value of imaging: a perspective from outcome prediction. Biol Psychiatry Cogn Neurosci Neuroimaging 2016, 1:423-432.

9. Dubois J, Adolphs R: Building a science of individual differences from fMRI. Trends Cogn Sci 2016, 20:425-443.

10. Yarkoni T, Westfall J: Choosing prediction over explanation in psychology: lessons from machine learning. Perspect Psychol Sci 2017. in press.

11. James G, Witten D, Hastie T, Tibshirani R: An Introduction to Statistical Learning. New York: Springer; 2013.

12. Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning. New York: Springer; 2009.

13. Huys QJ, Maia TV, Frank MJ: Computational psychiatry as a bridge from neuroscience to clinical applications. Nat Neurosci 2016, 19:404-413.

14. Montague PR, Dolan RJ, Friston KJ, Dayan P: Computational psychiatry. Trends Cogn Sci 2012, 16.

15. Stephan KE, Schlagenhauf F, Huys QJ, Raman S, Aponte EA, Brodersen KH, Rigoux L, Moran RJ, Daunizeau J, Dolan RJ et al.: Computational neuroimaging strategies for single patient predictions. Neuroimage 2017, 145:180-199.

16. Wang XJ, Krystal JH: Computational psychiatry. Neuron 2014, 84:638-654.

17. Brodersen KH, Deserno L, Schlagenhauf F, Lin Z, Penny WD, Buhmann JM, Stephan KE: Dissecting psychiatric spectrum disorders by generative embedding. Neuroimage Clin 2014, 4:98-111.

18. Gillan CM, Kosinski M, Whelan R, Phelps EA, Daw ND: Characterizing a psychiatric symptom dimension related to deficits in goal-directed control. Elife 2016, 5.

19. Fair DA, Bathula D, Nikolas MA, Nigg JT: Distinct neuropsychological subgroups in typically developing youth inform heterogeneity in children with ADHD. Proc Natl Acad Sci U S A 2012, 109:6769-6774.

20. Pine DS, Leibenluft E: Biomarkers with a mechanistic focus. JAMA Psychiatry 2015, 72:633-634.

21. Paulus MP: Pragmatism instead of mechanism: a call for impactful biological psychiatry. JAMA Psychiatry 2015, 72:631-632.


22.! Uher R, Perlis RH, Henigsberg N, Zobel A, Rietschel M, Mors O, Hauser J, Dernovsek MZ, Souery D, Bajs M et al.: Depression symptom dimensions as predictors of antidepressant treatment outcome: replicable evidence for interest-activity symptoms. Psychol Med 2012, 42:967-980.
These authors showcase how, as an alternative to machine learning techniques, one can use discovery science methods to reveal robust predictors, so long as appropriate validation steps are carried out. Here they used factor analysis to reduce their feature space, employed stringent criteria for selecting viable predictors, and validated one such predictor in an independent dataset.

23. Uher R, Farmer A, Maier W, Rietschel M, Hauser J, Marusic A, Mors O, Elkin A, Williamson RJ, Schmael C et al.: Measuring depression: comparison and integration of three scales in the GENDEP study. Psychol Med 2008, 38:289-300.

24. van Loo HM, Cai T, Gruber MJ, Li J, de Jonge P, Petukhova M, Rose S, Sampson NA, Schoevers RA, Wardenaar KJ et al.: Major depressive disorder subtypes to predict long-term course. Depress Anxiety 2014, 31:765-777.

25. Wardenaar KJ, van Loo HM, Cai T, Fava M, Gruber MJ, Li J, de Jonge P, Nierenberg AA, Petukhova MV, Rose S et al.: The effects of co-morbidity in defining major depression subtypes associated with long-term course and severity. Psychol Med 2014, 44:3289-3302.

26. Kessler RC, van Loo HM, Wardenaar KJ, Bossarte RM, Brenner LA, Cai T, Ebert DD, Hwang I, Li J, de Jonge P et al.: Testing a machine-learning algorithm to predict the persistence and severity of major depressive disorder from baseline self-reports. Mol Psychiatry 2016.

27.!! Chekroud AM, Zotti RJ, Shehzad Z, Gueorguieva R, Johnson MK, Trivedi MH, Cannon TD, Krystal JH, Corlett PR: Cross-trial prediction of treatment outcome in depression: a machine learning approach. Lancet Psychiatry 2016, 3:243-250.
This paper represents the gold standard for treatment prediction studies using machine learning. Chekroud and colleagues employ feature reduction, internal cross-validation and external validation to find predictors of antidepressant response from clinical data.

28. García-González J, Tansey KE, Hauser J, Henigsberg N, Maier W, Mors O, Placentino A, Rietschel M, Souery D, agar T et al.: Pharmacogenetics of antidepressant response: a polygenic approach. Prog Neuropsychopharmacol Biol Psychiatry 2017, 75:128-134.

29.! Whelan R, Watts R, Orr CA, Althoff RR, Artiges E, Banaschewski T, Barker GJ, Bokde AL, Büchel C, Carvalho FM et al.: Neuropsychosocial profiles of current and future adolescent alcohol misusers. Nature 2014, 512:185-189.
Whelan and colleagues showed the importance of multimodal data in making the best predictions — here they predict binge-drinking status at age 16, using baseline data collected at age 14.

30. Huibers MJ, Cohen ZD, Lemmens LH, Arntz A, Peeters FP, Cuijpers P, DeRubeis RJ: Predicting optimal outcomes in cognitive therapy or interpersonal psychotherapy for depressed individuals using the personalized advantage index approach. PLOS ONE 2015, 10:e0140771.

31. Perlis RH: A clinical risk stratification tool for predicting treatment resistance in major depressive disorder. Biol Psychiatry 2013, 74:7-14.

32. Koutsouleris N, Kahn RS, Chekroud AM, Leucht S, Falkai P, Wobrock T, Derks EM, Fleischhacker WW, Hasan A: Multisite prediction of 4-week and 52-week treatment outcomes in patients with first-episode psychosis: a machine learning approach. Lancet Psychiatry 2016, 3:935-946.

33. Chekroud AM, Gueorguieva R, Krumholz HM, Trivedi MH, Krystal JH, McCarthy G: Reevaluating the efficacy and predictability of antidepressant treatments: a symptom clustering approach. JAMA Psychiatry 2017.

34. Redlich R, Opel N, Grotegerd D, Dohm K, Zaremba D, Bürger C, Münker S, Mühlmann L, Wahl P, Heindel W et al.: Prediction of individual response to electroconvulsive therapy via machine learning on structural magnetic resonance imaging data. JAMA Psychiatry 2016, 73:557-564.


35. van Waarde JA, Scholte HS, van Oudheusden LJ, Verwey B, Denys D, van Wingen GA: A functional MRI marker may predict the outcome of electroconvulsive therapy in severe and treatment-resistant depression. Mol Psychiatry 2015, 20:609-614.

36. Whitfield-Gabrieli S, Ghosh SS, Nieto-Castanon A, Saygin Z, Doehrmann O, Chai XJ, Reynolds GO, Hofmann SG, Pollack MH, Gabrieli JD: Brain connectomics predict response to treatment in social anxiety disorder. Mol Psychiatry 2016, 21:680-685.

37. Korgaonkar MS, Rekshan W, Gordon E, Rush AJ, Williams LM, Blasey C, Grieve SM: Magnetic resonance imaging measures of brain structure to predict antidepressant treatment outcome in major depressive disorder. EBioMedicine 2015, 2:37-45.

38.!! Drysdale AT, Grosenick L, Downar J, Dunlop K, Mansouri F, Meng Y, Fetcho RN, Zebley B, Oathes DJ, Etkin A et al.: Resting-state connectivity biomarkers define neurophysiological subtypes of depression. Nat Med 2017, 23:28-38.
The authors use unsupervised clustering based on resting state fMRI data from 1188 patients with depression to reveal 4 subgroups that were reproducible in an out-of-sample test. These subgroups could be used to predict treatment response to transcranial magnetic stimulation in 154 patients, achieving accuracy >87.5% in an independent replication set.

39. DeBattista C, Kinrys G, Hoffman D, Goldstein C, Zajecka J, Kocsis J, Teicher M, Potkin S, Preda A, Multani G et al.: The use of referenced-EEG (rEEG) in assisting medication selection for the treatment of depression. J Psychiatr Res 2011, 45:64-75.

40. Leuchter AF, Cook IA, Marangell LB, Gilmer WS, Burgoyne KS, Howland RH, Trivedi MH, Zisook S, Jain R, McCracken JT et al.: Comparative effectiveness of biomarkers and clinical indicators for predicting outcomes of SSRI treatment in Major Depressive Disorder: results of the BRITE-MD study. Psychiatry Res 2009, 169:124-131.

41. Etkin A, Patenaude B, Song YJ, Usherwood T, Rekshan W, Schatzberg AF, Rush AJ, Williams LM: A cognitive-emotional biomarker for predicting remission with antidepressant medications: a report from the iSPOT-D trial. Neuropsychopharmacology 2015, 40:1332-1342.

42. Gillan CM, Daw ND: Taking psychiatry research online. Neuron 2016, 91:19-23.

43. de Wit SJ, Alonso P, Schweren L, Mataix-Cols D, Lochner C, Menchón JM, Stein DJ, Fouche JP, Soriano-Mas C, Sato JR et al.: Multicenter voxel-based morphometry mega-analysis of structural brain scans in obsessive-compulsive disorder. Am J Psychiatry 2014.

44. Schumann G, Loth E, Banaschewski T, Barbot A, Barker G, Büchel C, Conrod PJ, Dalley JW, Flor H, Gallinat J et al.: The IMAGEN study: reinforcement-related behaviour in normal brain function and psychopathology. Mol Psychiatry 2010, 15:1128-1139.

45. ABCD Study Protocol. https://abcdstudy.org/images/Protocol_Brochure_Assessment.pdf.

46. Logan G: On the ability to inhibit thought and action: a user's guide to the stop signal paradigm. In Inhibitory Processes in Attention, Memory and Language. Edited by Dagenbach D, Carr TH. Academic Press; 1994.

47. Knutson B, Adams CM, Fong GW, Hommer D: Anticipation of increasing monetary reward selectively recruits nucleus accumbens. J Neurosci 2001, 21:RC159.

48. Zhao Y, Healy BC, Rotstein D, Guttmann CR, Bakshi R, Weiner HL, Brodley CE, Chitnis T: Exploration of machine learning techniques in predicting multiple sclerosis disease course. PLOS ONE 2017, 12:e0174866.

49. Woo CW, Chang LJ, Lindquist MA, Wager TD: Building better biomarkers: brain models in translational neuroimaging. Nat Neurosci 2017, 20:365-377.

50. APA: Consensus Report of the APA Work Group on Neuroimaging Markers of Psychiatric Disorders. Arlington, VA: American Psychiatric Association; 2012.

