Revealing strengths and weaknesses of methods for gene network inference

Daniel Marbach (a,b), Robert J. Prill (c), Thomas Schaffter (a), Claudio Mattiussi (a), Dario Floreano (a), and Gustavo Stolovitzky (c,1)

(a) Laboratory of Intelligent Systems, Ecole Polytechnique Fédérale de Lausanne (EPFL), 1015 Lausanne, Switzerland; (b) Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139; and (c) IBM T. J. Watson Research Center, Yorktown Heights, NY 10598

Edited by Charles R. Cantor, Sequenom Inc., San Diego, CA, and approved February 4, 2010 (received for review November 18, 2009)

Numerous methods have been developed for inferring gene regulatory networks from expression data; however, both their absolute and comparative performance remain poorly understood. In this paper, we introduce a framework for the critical performance assessment of methods for gene network inference. We present an in silico benchmark suite that we provided as a blinded, community-wide challenge within the context of the DREAM (Dialogue on Reverse Engineering Assessment and Methods) project. We assess the performance of 29 gene-network-inference methods, which were applied independently by participating teams. Performance profiling reveals that current inference methods are affected, to various degrees, by different types of systematic prediction errors. In particular, all but the best-performing method failed to accurately infer multiple regulatory inputs (combinatorial regulation) of genes. The results of this community-wide experiment show that reliable network inference from gene expression data remains an unsolved problem, and they indicate potential ways to improve network reconstruction.

DREAM challenge ∣ community experiment ∣ reverse engineering ∣ transcriptional regulatory networks ∣ performance assessment

Some of our best insights into biological processes originate in the elucidation of the interactions between molecular entities within cells. In the past, these molecular connections have been established at a rather slow pace. For example, it took more than a decade from the discovery of the well-known tumor suppressor gene p53 to determine that it formed a regulatory feedback loop with the protein MDM2, its key regulator (1). Indeed, the mapping of biological interactions in the intracellular realm remains the bottleneck in the pipeline that produces biological knowledge from high-throughput data. One of the promises of computational systems biology is algorithms that take data as input and output interaction networks consistent with those data. To accomplish this task, the importance of having accurate methods for network inference cannot be overestimated. Spurred by advances in experimental technology, a plethora of network-inference methods (also called reverse engineering methods) has been developed (2–10), at a rate that has been doubling every two years (11). However, the problem of rigorously assessing the performance of these methods has received little attention until recently (11, 12). Even though several interesting and telling efforts to compare different network-inference methods have been reported (13–15), these efforts typically compare a small number of algorithms that include methods developed by the same authors who perform the comparisons. Consequently, there remains a void in understanding the comparative advantages of inference methods in the context of blind and impartial performance tests. To foster a concerted effort to address this issue, some of us initiated the DREAM (Dialogue on Reverse Engineering Assessment and Methods) project (11, 16). One of the key aims of DREAM is the development of community-wide challenges for the objective assessment of reverse engineering methods for biological networks. Similar efforts have been highly successful in the


field of protein structure prediction (17). However, the design of such benchmarks for biological network inference is problematic. On the one hand, well-known networks cannot be used, because their identity is not easily hidden from the participants to create "blinded" challenges. On the other hand, there is not yet a gold-standard experiment for establishing the ground truth (the true network structure) of unknown in vivo networks. Consequently, in silico benchmarks (i.e., simulated networks and data) remain the predominant approach for performance assessment of reverse engineering methods: in simulation, the ground truth is known and predictions can be systematically evaluated (18, 19). In this paper, we describe the results of a gene-network reverse engineering challenge, the so-called DREAM3 in silico challenge, which was one of the four DREAM3 challenges that we organized within the context of the DREAM project. The challenge is based on a series of in silico networks (Fig. 1), which we created using a unique approach for the generation of biologically plausible network structures and dynamics. The DREAM3 in silico challenge, with 29 participating teams from over ten countries, has become by far the most widely used benchmark for gene-network reverse engineering. The participants submitted almost 400 network predictions, which we evaluated in a double-blind manner (Fig. 1). In what follows, we dissect the predictions and analyze the performance of the 29 methods that inferred networks for the challenge. We developed unique methodologies to extract lessons from the ensemble of submissions based on the efficacy of the different predictions in learning local connectivity patterns (network motifs (20, 21)) and combinatorial regulation (in-degree distribution (22)). Our analyses clearly show that some network motifs (fan-in, fan-out, and cascade motifs) were poorly predicted even by high-ranking submissions, indicating systematic errors in inference and potential ways to improve network reconstruction. The set of submitted networks forms a veritable dataset, contributed by the systems biology community and obtained by field experimentation (the DREAM challenges). The analysis of the results of this community-based experiment reveals the strengths and weaknesses of state-of-the-art efforts on network inference.

Results

The DREAM3 in Silico Challenge. To assess the performance of

gene network-inference methods in silico, it is essential that the benchmarks are biologically plausible. This involves generating realistic structures for the benchmark networks, generating the corresponding kinetic models, and using these models to

Author contributions: D.M., R.J.P., and G.S. designed research; D.M., R.J.P., T.S., and G.S. performed research; D.M., R.J.P., C.M., D.F., and G.S. analyzed data; and D.M., R.J.P., T.S., C.M., D.F., and G.S. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

To whom correspondence should be addressed. E-mail: [email protected].

This article contains supporting information online at www.pnas.org/cgi/content/full/0913357107/DCSupplemental.


Fig. 1. Double-blind performance assessment of network-inference methods. (A, B) From a set of in silico benchmark networks (the so-called gold standards), steady-state and time-series gene expression data was generated and provided as a community-wide reverse engineering challenge. (C, D) Participating teams were asked to predict the structure of the benchmark networks from this data. They were blind to the true structure of these networks. (E) We evaluated the submitted predictions, being blind to the inference methods that produced them. This allowed for a double-blind performance assessment.

produce synthetic gene expression data by simulating different biological experiments. The challenge was structured as three separate subchallenges with networks of 10, 50, and 100 genes, respectively. For each size, we generated five in silico networks. We produced realistic network structures by extracting modules from known biological interaction networks (19). For each size, we extracted two structures from an Escherichia coli transcriptional regulatory network (21), and three structures from a yeast genetic interaction network (23). Examples of networks are shown in Fig. 2A and Fig. S1. We endowed these networks with dynamics using the models of transcription and translation described in Methods.

[Fig. 2 graphic. Panels: (A) target network; (B) network prediction (the ranked edge list of Yip et al., shown as a precision color bar); (C) null model, where the example prediction scores P-value = 4.53 × 10⁻⁶ against random predictions.]

Fig. 2. Evaluation of network predictions. (A) The true connectivity of one of the benchmark networks of size 10. (B) Example of a submitted prediction (it is the prediction of Yip et al., the best-performer team). The format is a ranked list of predicted edges, represented here by the vertical colored bar. The white stripes indicate the true edges of the target network. A perfect prediction would have all white stripes at the top of the list. The inset shows the first ten predicted edges: the top four are correct, followed by an incorrect prediction, etc. The color indicates the precision at that point in the list. E.g., after the first ten predictions, the precision is 0.7 (7 correct predictions out of 10 predictions). (C) The network prediction is evaluated by computing a P-value that indicates its statistical significance compared to random network predictions.


Transcriptional regulation of genes was modeled using a standard approach based on thermodynamics (24). Both independent ("additive") and synergistic ("multiplicative") interactions occur in the networks. We used these in silico gene networks to produce different types of steady-state and time-series gene expression data that are commonly used for gene network inference (8, 12): steady-state expression levels of the unperturbed network, steady-state levels of knockout and knockdown experiments for every gene, and time-series data showing how the network recovers from multifactorial perturbations (see Methods). The gene expression data were provided to the participants in the form of mRNA concentrations with moderate additive Gaussian noise. The protein concentrations were not provided, as would be the case with experiments based purely on transcriptional data. Participants were asked to predict the underlying networks from the given gene expression datasets. Our Java tool used to generate the benchmarks is available open source (25).

Performance Metrics. We evaluated the ability of inference methods

to predict the presence of regulatory interactions between genes (some methods predict additional aspects, such as the kinetic parameters of the interactions, which were not considered here). Participants were asked to submit network predictions in the form of ranked lists of predicted edges (16). The lists had to be ordered according to the confidence of the predictions, so that the first entry corresponds to the edge predicted with the highest confidence. In other words, the edges at the top of the list were believed to be present in the network, and the edges at the bottom of the list were believed to be absent from the network. The number of possible edges in an N-gene network without autoregulatory interactions is $N(N-1)$. Autoregulatory edges were not expected in the predictions. Therefore, for networks of size 10, 50, and 100, the length of a complete list of predictions is 90, 2,450, and 9,900 edges, respectively. An example is shown in Fig. 2. As mentioned above, each subchallenge had five networks. To participate, teams were required to submit a prediction for each of the five networks. We statistically evaluated predictions by computing P-values indicating the probability that random lists of edge predictions would be of the same or better quality (see Fig. 2 and Methods). The final score that we used for the ranking was a negative log-transformed P-value: for example, a P-value of $10^{-2}$ gives a score of 2, and a P-value of $10^{-3}$ gives a score of 3. Thus, larger scores indicate smaller P-values and hence better predictions.
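To make the list format and the precision measure concrete, here is a minimal sketch (our illustration, not the challenge's official scoring code; the function and variable names are hypothetical) that computes the precision after each entry of a ranked edge list, i.e., the quantity visualized by the color bars in Figs. 2 and 3:

```python
import numpy as np

def precision_along_list(ranked_edges, true_edges):
    """Precision after each prediction in a ranked edge list.

    ranked_edges: list of (regulator, target) pairs, most confident
        first, e.g., all 90 candidate edges of a 10-gene network.
    true_edges: set of (regulator, target) pairs of the gold standard.
    Returns an array whose (k-1)-th entry is the precision after the
    first k predictions.
    """
    hits = np.cumsum([edge in true_edges for edge in ranked_edges])
    depth = np.arange(1, len(ranked_edges) + 1)
    return hits / depth  # e.g., 7 correct among the first 10 gives 0.7
```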


[Fig. 3 graphic. Panels: (A) network size 10, (B) network size 50, (C) network size 100; each panel shows the overall scores and precision color bars per method. Legend: A, Yip et al.; B–O, other participants (anonymous); P, perfect prediction (hypothetical); R, random prediction (hypothetical).]

Fig. 3. Average performance of the best ten teams for each of the three subchallenges. The bar plots on top show the overall scores, and the color bars below show the precision of the corresponding lists of predictions, as explained in Fig. 2 (since each subchallenge has five networks, this is the average precision over the five lists). In addition to the submitted network predictions (methods A–O), we always show the plots for a hypothetical perfect prediction P (all true edges at the top of the list) and a randomly generated prediction R, which allows one to visually appreciate the quality of the submitted predictions. Recall that for networks of size 10, 50, and 100, the length of the lists is 90, 2,450, and 9,900 edges, respectively. Note that for networks of size 50 and size 100, we have zoomed in to the top 20% and 10% of the lists, respectively.

Performance Assessment of Network-Inference Methods. In total, 29 teams participated in the challenges. The majority of teams submitted predictions for all three network sizes (10, 50, and 100 genes): the corresponding subchallenges had 29, 27, and 22 participants, respectively, totaling 390 submitted network predictions (there are five networks of each size). The scores of the top ten teams for each subchallenge are shown in Fig. 3. The complete set of results is available on the DREAM website (28). Note that participants are anonymous, except for the best performers and teams who voluntarily disclose their identity. The ranking is similar in the three subchallenges, i.e., the performance of most methods was consistent over different network sizes (Fig. S2). The method of Yip et al. (29) obtained the best performance on all three network sizes. A representative prediction of the best-performer method is shown in the example of Fig. 2: most links were correctly recovered, but there were also some incorrect predictions (false positives) among the high-confidence edges at the top of their prediction lists (the origin of these errors will become apparent in the network-motif analysis below). As can be seen in Fig. 3, several other methods achieved highly significant predictions. For example, on networks of size 100, the top four teams all had scores above 30 (i.e., P-values smaller than $10^{-30}$). However, for the majority of inference methods the precision of the predictions was rather low (<0.5, blue tones in Fig. 3). In addition, a surprisingly large number of methods (11 of the 29) produced network predictions that were, on average, not significantly better than random guessing (P-values >0.01). This is a sobering result for the efficacy of the network-inference community. It should be kept in mind that some participants may not be experienced in network inference, which could explain the low performance of some teams. However, many well-known practitioners in the field were spread over all ranks. According to a survey that we conducted among the participants, the applied inference methods span a wide range of approaches commonly used to reverse engineer gene networks, including correlation-based methods (6), information-theoretic methods (9, 27), Bayesian networks (4), and methods based on dynamical models (2, 3, 8). There seems to be no correlation between the general type of inference method used and the scores. Indeed, all four approaches mentioned above are represented among the top five inference methods of the challenge (see SI Text and Table S1). At the same time, all of these approaches were also used by teams that did not produce significant predictions, implying that success depends more on the details of the implementation than on the choice of general methodology. Concerning the type of data used to infer the networks, the top five teams all integrated both steady-state and time-series data, i.e., they took advantage of all provided data. In retrospect, the steady-state levels of the gene knockout experiments seem to have been the most informative: the score of the best-performer team was mainly due to predictions derived from the knockout datasets rather than from the time series (see SI Text).

Network-Motif Analysis Reveals Three Types of Systematic Prediction Errors. In order to understand the differences in performance of

inference methods, we need to know what types of prediction errors they make. To this end, we have analyzed the performance of the inference methods on the basic building blocks of networks, the network motifs (20, 21). More precisely, we have analyzed how well the inference methods predict edges pertaining to


different network motifs. The first column of Fig. 4 shows the four types of motifs that occur in the benchmark networks of the challenge (fan-in, fan-out, cascade, and feed-forward loop). As an illustrative example, the second column shows how well their links were predicted, on average, by the method that ranked second on the networks of size 100 (8). It can be seen that not all links of the motifs were predicted with the same median prediction confidence: some were predicted less reliably (at lower confidence) than others (the prediction confidence of edges was defined as their rank in the list of edge predictions, scaled such that the first edge in the list has confidence 100% and the last edge in the list has confidence 0%).

In order to evaluate whether some edges of motifs were systematically predicted less reliably than others, we compared their median prediction confidence to the background prediction confidence, which is the median prediction confidence of all links of the network (independently of which motifs they belong to; see SI Methods). If the motifs had no effect on the prediction confidence, the edges pertaining to different motifs would all be inferred, on average, with the background prediction confidence. This was not the case: the median prediction confidence of some motif edges diverged significantly from the background prediction confidence (evaluated at a level of 0.01 using a two-sided Wilcoxon-Mann-Whitney rank-sum test and Bonferroni correction for multiple hypothesis testing). We found three different types of significant, systematic errors in the prediction of motifs (cf. Fig. 4C): (i) The fan-out error corresponds to a tendency to incorrectly predict edges between coregulated nodes (2 → 3 and 3 → 2). The expression levels of coregulated genes are often correlated; the fan-out error occurs when this correlation is wrongly interpreted as an interaction between the two genes. (ii) The fan-in error is a reduced prediction confidence for multiple inputs. In other words, fan-in links (2 → 1 and 3 → 1) are predicted less reliably than other links of the target network. This error is due to difficulties in accurately modeling and inferring combinatorial regulation of genes (regulation of genes by several inputs). (iii) The cascade error is a tendency to incorrectly predict "shortcuts" in cascades. This error occurs when indirect regulation (1 → 2 → 3) is misinterpreted as direct regulation (1 → 3). In addition, (iv) the links 1 → 3 and 2 → 3 of feed-forward loops (FFLs) often have a reduced prediction confidence. These links form a fan-in, and their reduced prediction confidence can thus be explained in terms of the fan-in error (hence, we did not consider this as an additional type of systematic prediction error). Note that we analyze the different motif types independently from each other; possible effects due to overlapping motifs will be considered elsewhere.

We performed the network-motif analysis for all inference methods that were applied to the networks of size 50 and 100 (networks of size 10 are too small for a statistically significant analysis). We did not observe other types of systematic errors than the three discussed above. However, we found that inference methods are affected to various degrees by these errors—they have different error profiles. Whereas some inference methods are more robust to certain types of error, they are more strongly affected by other types of errors, i.e., they have different strengths and weaknesses (Fig. S3). For example, the best-performer method was the most robust to the fan-in error, but it was more strongly affected by the cascade error than other inference methods.
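As a minimal sketch of this significance test (our illustration under the definitions above; SciPy's rank-sum test and the function name are our assumptions, not the authors' exact implementation):

```python
from scipy.stats import mannwhitneyu

def motif_edge_significance(motif_confidences, all_confidences, n_tests):
    """Two-sided Wilcoxon-Mann-Whitney test of whether edges of one
    motif type are predicted with a different confidence than the
    background (all links of the network), with Bonferroni correction.

    motif_confidences: prediction confidences of the motif-type edges
    all_confidences: prediction confidences of all links (background)
    n_tests: number of hypotheses tested (Bonferroni correction factor)
    Returns (corrected p-value, significant at the 0.01 level?).
    """
    _, p = mannwhitneyu(motif_confidences, all_confidences,
                        alternative="two-sided")
    p_corrected = min(1.0, p * n_tests)  # Bonferroni correction
    return p_corrected, p_corrected < 0.01
```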

Fig. 4. Systematic errors in the prediction of motifs. (A) The true connectivity of the motifs. (B) As an example, we show how the motifs were predicted on average by the inference method that ranked second on the networks of size 100 (8). The darkness of the links indicates their median prediction confidence. (C) We can identify three types of systematic prediction errors: the fan-out error, the fan-in error, and the cascade error.


Most Inference Methods Fail to Accurately Predict Combinatorial Regulation. The network-motif analysis has shown that all inference

methods had a reduced prediction confidence for multiple inputs (combinatorial regulation) of genes (the fan-in error); here, we analyze this type of error in more detail. Specifically, we compare how well, on average, inference methods predict the regulatory input(s) of genes with a single input (in-degree 1), two inputs (in-degree 2), three inputs (in-degree 3), etc. The results of this analysis are shown in Fig. 5 for the best five inference methods on networks of size 100 (as mentioned above, the prediction confidence of edges was defined as their rank in the list of edge predictions, scaled such that the first edge in the list has confidence 100% and the last edge in the list has confidence 0%). These data show that several methods predict single inputs of genes with high confidence. However, for all but the best-performer method, the prediction confidence degrades drastically as the number of

[Fig. 5 graphic: median prediction confidence (rank, %) versus the indegree of the target gene, for the methods ranked first through fifth.]

Fig. 5. How the indegree of genes affects the prediction confidence. The plots show, for the best five methods on networks of size 100, the median prediction confidence for links that target genes of increasing indegree. The shaded areas indicate 95% confidence intervals for the medians. Single-input links were reliably predicted with a similar, high prediction confidence by the best four methods (points in the top left corner). However, for all but the best-performer method, the performance drops drastically for higher indegrees.

inputs increases. For example, the fourth-place method reliably identified links that are the only input of their targets (median prediction confidence 97%), but did no better than random guessing in predicting inputs of genes with in-degree nine (median prediction confidence 46%). We have performed this analysis for all methods that inferred networks of size 50 and 100, which confirmed that only the best-performer method had a robust performance on high indegrees (Fig. S4). It is not unexpected that edges that are the sole input of their target gene are easier to infer than edges towards genes with many inputs. If a gene has only one regulator, and this regulator is perturbed, the gene will show a clear response. In contrast, if a regulator of a gene with other regulatory inputs is perturbed, the effect may be partially buffered or even completely masked by the other inputs, which makes this edge more difficult to infer. Thus, what was surprising to us was not that all methods are affected by the fan-in error, but that they are affected to very different degrees.

Community Predictions Are More Reliable than Individual Inference Methods. In the previous sections, we have shown that inference

methods have different strengths and weaknesses. A natural corollary of this observation is that combining network reconstruction methods could be a good strategy for network inference. We formed "community predictions" by combining the prediction lists of multiple inference methods: we simply reranked the edge list according to the average rank of each edge across the individual lists (Fig. S5), as sketched below. To gain a sense of the performance of community predictions in the DREAM3 in silico challenge, we systematically formed communities composed of the top two methods, the top three methods, the top four methods, etc., up to the last community, which contains all applied methods of a particular subchallenge. In Fig. 6, we compare the scores of the community predictions with those of the individual teams for the networks of size 10. Some of the community predictions outperform the best-performer team (e.g., the community of the top five teams). More importantly, the performance of the community is robust to the inclusion of methods with very low scores. Even when combining all methods, the community prediction still ranks second. Similar observations can be made on networks of size 50 and 100 (Fig. S6). Note that in a real application to an unknown biological network, it is impossible to know in advance which inference method would perform best. Our results show that, instead of choosing a single method to trust, a more reliable strategy is to apply all methods at hand and form a community prediction.
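A minimal sketch of this rank-averaging scheme (our illustration; the data layout and function name are assumptions):

```python
def community_prediction(ranked_lists):
    """Combine several ranked edge lists into a community prediction.

    ranked_lists: list of rankings; each ranking contains the same
        edges (as hashable IDs, e.g., (regulator, target) pairs),
        ordered from most to least confident.
    Returns the edges reranked by their average rank across the lists.
    """
    avg_rank = {}
    for ranking in ranked_lists:
        for rank, edge in enumerate(ranking, start=1):
            avg_rank[edge] = avg_rank.get(edge, 0.0) + rank / len(ranked_lists)
    return sorted(avg_rank, key=avg_rank.get)  # smallest average rank first
```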




Fig. 6. Performance of community predictions for the networks of size 10. The circles are the scores of the individual teams. The diamonds correspond to the scores of the different community predictions, obtained by combining the two best teams, the three best teams, the four best teams, etc.

Discussion

A Word of Caution. The in silico benchmarks presented here are based on networks with structural properties and regulatory dynamics similar to those of biological gene networks. In particular, the network structures correspond to modules of known gene networks, and the kinetic model is based on a thermodynamic approach (24), which has been shown to provide a good approximation to different types of transcriptional regulation (30). However, this representation is a simplified model of real biological mechanisms. Furthermore, additional layers of control, such as posttranscriptional regulation and chromatin states, are not modeled. Even though these in silico benchmarks by no means replace the need for careful characterization of performance in vivo (12), they remain an important tool for systematically and efficiently validating the performance of inference methods over many networks. Furthermore, it is likely that methods that do not fare well on these benchmarks would fare even worse on real biological networks, as discussed in SI Text. A general issue of benchmarks, be they in silico or in vivo, is that the measured performance of methods is always specific to the particular networks that were used, and does not necessarily generalize to other, unknown networks, which may have different properties. Indeed, one of the main conclusions of this study is that the performance of current network-inference methods is strongly dependent on the properties of the network that is being inferred. For example, since methods were found to have very different network-motif error profiles, their performance depends on how many instances of each motif type are present in the network. Thus, the overall performance (score) of the inference methods should be considered with caution, as it may vary on networks with different properties. However, the systematic errors identified with the network-motif analysis are expected to be less variable across networks. For example, a method that failed to distinguish direct from indirect regulation (cascade error) would be expected to have similar difficulties on biological gene networks.

Reliable Gene Network Inference from Gene Expression Data Remains an Unsolved Problem. The two major difficulties in gene-network

reverse engineering are often considered to be the limited data, which may leave the inference problem underdetermined (10), and the difficulty of distinguishing direct from indirect regulation (the cascade error) (26). A number of approaches have been developed to overcome these difficulties; for example, partial correlation (26) and other methods (6, 27) have been proposed to counter potential cascade errors. However, the results of the community experiment reported here call attention to


additional difficulties that have to be overcome for reliable inference of gene networks. We provided more and better data than is typically available in real biological experiments, yet the overall performance of the 29 applied network-inference methods was not satisfactory. The best-performer method (29) worked remarkably well, given that it is based on possibly the simplest model of all applied inference methods (see SI Text). However, it could not distinguish direct from indirect regulation. Despite this increased error rate on the cascade motif, it had significantly better overall performance than other state-of-the-art methods based on more advanced probabilistic or dynamical models. Even though some of these methods were more robust to the cascade error, they were strongly affected by other systematic errors, in particular the fan-in error. We expect these errors to be even more pronounced in real biological applications, suggesting that the performance of gene-network-inference methods may previously have been overestimated due to the lack of rigorous, blinded benchmarks.

Conclusion. We have presented a framework for critical performance assessment of gene-network-inference methods. This framework has allowed us to compare a large number of inference methods—applied independently by different teams—on multiple benchmark networks. In addition to assessing the overall accuracy, we have evaluated the performance of inference methods on individual network motifs. This analysis revealed that current inference methods are affected, to various degrees, by three types of systematic prediction errors: the fan-out error (incorrect prediction of interactions between coregulated genes), the fan-in error (inaccurate prediction of combinatorial regulation), and the cascade error (failure to distinguish direct from indirect regulation). Distinguishing between direct and indirect regulation is a well-known difficulty in network inference (6, 26, 27), but it had never been quantitatively assessed. The network-motif analysis makes it possible to quantify how well this difficulty is resolved by different methods. Furthermore, it revealed two other types of systematic errors, the fan-in error and the fan-out error, which are equally important for the overall quality of predictions. One of the difficulties that participants of the DREAM challenge had to face was that they did not know details of the kinetic model that was used to generate the gene expression data. This difficulty is even more pronounced in biological applications, where the mechanisms and kinetics of gene regulation underlying the expression data are more complicated, and also not known in advance. Consequently, inference methods are bound to make simplifying assumptions. However, inaccurate assumptions were shown to induce systematic prediction errors, which profoundly affected the performance of the applied network-inference methods (see also SI Text). There is thus a need for the development of inference methods whose performance is more robust to uncertainty about the type of mechanisms (the "model") underlying the data. We have shown that one possible approach to achieve this goal is to combine the predictions from complementary inference methods to form more robust and accurate "community predictions".
We are convinced that a better understanding of the capabilities and limitations of existing inference methods will enable the development of unique approaches that will ultimately make the DREAM of accurate, high-throughput inference of gene networks come true.

Materials and Methods

Simulation of Expression Data. Gene networks were modeled by a system of ordinary differential equations describing the dynamics of the mRNA concentration $x_i$ and the protein concentration $y_i$ of every gene:

$$\frac{dx_i}{dt} = m_i \cdot f_i(\mathbf{y}) - \lambda_i^{\mathrm{RNA}} \cdot x_i \qquad [1]$$

$$\frac{dy_i}{dt} = r_i \cdot x_i - \lambda_i^{\mathrm{Prot}} \cdot y_i, \qquad [2]$$

where $m_i$ is the maximum transcription rate, $r_i$ the translation rate, $\lambda_i^{\mathrm{RNA}}$ and $\lambda_i^{\mathrm{Prot}}$ the mRNA and protein degradation rates, and $f_i(\cdot)$ the so-called input function of gene i. The input function computes the relative activation of the gene, which lies between 0 (the gene is shut off) and 1 (the gene is maximally activated), given the transcription-factor (TF) concentrations $\mathbf{y}$. The input function is derived using a standard thermodynamic approach (24), where binding of TFs to cis-regulatory sites is approximated using Hill-type kinetics (see SI Methods). Knockouts were simulated by setting the maximum transcription rate $m_i$ of the deleted gene to zero, knockdowns by dividing it by 2. Time-series experiments were simulated by integrating the networks using different initial conditions. For the networks of size 10, 50, and 100, we provided 4, 23, and 46 different time series, respectively, with 21 time points each. Gaussian noise was added to the data after the simulation. See SI Methods for details.
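As a minimal sketch of such a simulation (our illustration, not the authors' GeneNetWeaver implementation; the solver choice and the generic input functions are assumptions), Eqs. 1 and 2 can be integrated as follows:

```python
import numpy as np
from scipy.integrate import odeint

def make_rhs(m, r, lam_rna, lam_prot, input_fns):
    """Right-hand side of Eqs. 1 and 2 for an N-gene network.

    m, r, lam_rna, lam_prot: length-N parameter arrays;
    input_fns[i](y) returns the relative activation f_i(y) in [0, 1].
    """
    n = len(m)
    def rhs(state, t):
        x, y = state[:n], state[n:]               # mRNA and protein levels
        f = np.array([fn(y) for fn in input_fns])
        dx = m * f - lam_rna * x                  # Eq. 1
        dy = r * x - lam_prot * y                 # Eq. 2
        return np.concatenate([dx, dy])
    return rhs

def knockout(m, j):
    """Knockout of gene j: its maximum transcription rate is set to zero."""
    m = np.array(m, dtype=float)
    m[j] = 0.0
    return m

def knockdown(m, j):
    """Knockdown of gene j: its maximum transcription rate is halved."""
    m = np.array(m, dtype=float)
    m[j] /= 2.0
    return m

# A time-series experiment would then integrate the system from some
# initial condition, e.g.:
#   traj = odeint(make_rhs(m, r, lam_rna, lam_prot, fns), state0, t)
# with Gaussian noise added to the mRNA trajectories afterwards.
```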

Evaluation of Predictions. As described in Results, the submission format of predictions was a list of predicted edges with their assigned confidence measures, constructed in decreasing order of confidence from the most reliable to the least reliable prediction. The quality of the predictions was measured by the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPR) (16). In addition to the AUPR and AUROC values, we statistically evaluated predictions by computing corresponding P-values ($p_{\mathrm{AUROC}}$ and $p_{\mathrm{AUPR}}$), which are the probabilities that a random list of edge predictions would obtain the same or better AUROC and AUPR than a given network prediction. Distributions of AUROC and AUPR were estimated from 100,000 instances of random lists of edge predictions. The overall P-value of the five networks of a subchallenge was defined as the geometric mean of the individual P-values, $(p_1 \cdot p_2 \cdots p_5)^{1/5}$. The final score of a method is the log-transformed geometric mean of the overall AUROC P-value ($\bar{p}_{\mathrm{AUROC}}$) and the overall AUPR P-value ($\bar{p}_{\mathrm{AUPR}}$): $\mathrm{score} = -0.5 \cdot \log_{10}(\bar{p}_{\mathrm{AUROC}} \cdot \bar{p}_{\mathrm{AUPR}})$.
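For instance, a sketch of this scoring formula (illustrative names; in practice one would average the log P-values directly to avoid numerical underflow for extremely small P-values):

```python
import numpy as np

def dream_score(p_auroc, p_aupr):
    """Overall score from the per-network P-values of one subchallenge.

    p_auroc, p_aupr: sequences of the five AUROC and AUPR P-values.
    """
    p_bar_auroc = np.prod(p_auroc) ** (1.0 / len(p_auroc))  # geometric mean
    p_bar_aupr = np.prod(p_aupr) ** (1.0 / len(p_aupr))
    return -0.5 * np.log10(p_bar_auroc * p_bar_aupr)
```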

ACKNOWLEDGMENTS. We thank all participants of the challenge for their contribution. Three anonymous reviewers provided valuable feedback on the original manuscript. G.S. and R.P. acknowledge support of the NIH Roadmap Initiative, the Columbia University Center for Multiscale Analysis Genomic and Cellular Networks (MAGNet), and the IBM Computational Biology Center. The work was supported through Swiss National Science Foundation Grant 200021-112060 (to D.F., T.S., C.M., and D.M.), the SystemsX.ch initiative (WingX project), and a postdoctoral fellowship for D.M.

1. Levine AJ, Oren M (2009) The first 30 years of p53: Growing ever more complex. Nat Rev Cancer 9:749–758.
2. De la Fuente A, Brazhnik P, Mendes P (2002) Linking the genes: Inferring quantitative gene networks from microarray data. Trends Genet 18:395–398.
3. Gardner TS, di Bernardo D, Lorenz D, Collins JJ (2003) Inferring genetic networks and identifying compound mode of action via expression profiling. Science 301:102–105.
4. Friedman N (2004) Inferring cellular networks using probabilistic graphical models. Science 303:799–805.
5. Di Bernardo D, et al. (2005) Chemogenomic profiling on a genome-wide scale using reverse-engineered gene networks. Nat Biotechnol 23:377–383.
6. Rice JJ, Tu Y, Stolovitzky G (2005) Reconstructing biological networks using conditional correlation analysis. Bioinformatics 21:765–773.
7. De la Fuente A, Makhecha DP (2006) Unravelling gene networks from noisy underdetermined experimental perturbation data. Syst Biol (Stevenage) 153:257–262.
8. Bonneau R, et al. (2006) The Inferelator: An algorithm for learning parsimonious regulatory networks from systems-biology datasets de novo. Genome Biol 7:R36. Available at http://www.genomebiology.com/2006/7/5/R36.
9. Faith JJ, et al. (2007) Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol 5:e8. Available at http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.0050008.
10. Marbach D, Mattiussi C, Floreano D (2009) Replaying the evolutionary tape: Biomimetic reverse engineering of gene networks. Ann N Y Acad Sci 1158:234–245.
11. Stolovitzky G, Monroe D, Califano A (2007) Dialogue on reverse-engineering assessment and methods: The DREAM of high-throughput pathway inference. Ann N Y Acad Sci 1115:1–22.
12. Cantone I, et al. (2009) A yeast synthetic network for in vivo assessment of reverse-engineering and modeling approaches. Cell 137:172–181.
13. Michoel T, de Smet R, Joshi A, van de Peer Y, Marchal K (2009) Comparative analysis of module-based versus direct methods for reverse-engineering transcriptional regulatory networks. BMC Syst Biol 3:49. Available at http://www.biomedcentral.com/1752-0509/3/49.
14. Bansal M, Belcastro V, Ambesi-Impiombato A, di Bernardo D (2007) How to infer gene networks from expression profiles. Mol Syst Biol 3:78. Available at http://www.nature.com/msb/journal/v3/n1/full/msb4100120.html.
15. Margolin AA, et al. (2006) ARACNE: An algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics 7(Suppl 1):S7. Available at http://www.biomedcentral.com/1471-2105/7/S1/S7.
16. Stolovitzky G, Prill RJ, Califano A (2009) Lessons from the DREAM2 challenges. Ann N Y Acad Sci 1158:159–195.
17. Moult J, et al. (2007) Critical assessment of methods of protein structure prediction—Round VII. Proteins 69(Suppl 8):3–9.
18. Mendes P, Sha W, Ye K (2003) Artificial gene networks for objective comparison of analysis algorithms. Bioinformatics 19(Suppl 2):ii122–ii129.
19. Marbach D, Schaffter T, Mattiussi C, Floreano D (2009) Generating realistic in silico gene networks for performance assessment of reverse engineering methods. J Comput Biol 16(2):229–239.
20. Milo R, et al. (2002) Network motifs: Simple building blocks of complex networks. Science 298:824–827.
21. Shen-Orr SS, Milo R, Mangan S, Alon U (2002) Network motifs in the transcriptional regulation network of Escherichia coli. Nat Genet 31:64–68.
22. Barabási AL, Albert R (1999) Emergence of scaling in random networks. Science 286:509–512.
23. Reguly T, et al. (2006) Comprehensive curation and analysis of global interaction networks in Saccharomyces cerevisiae. J Biol 5:11.
24. Ackers GK, Johnson AD, Shea MA (1982) Quantitative model for gene regulation by lambda phage repressor. Proc Natl Acad Sci USA 79:1129–1133.
25. GeneNetWeaver project website: http://gnw.sourceforge.net.
26. De la Fuente A, Bing N, Hoeschele I, Mendes P (2004) Discovery of meaningful associations in genomic data using partial correlation coefficients. Bioinformatics 20:3565–3574.
27. Basso K, et al. (2005) Reverse engineering of regulatory networks in human B cells. Nat Genet 37:382–390.
28. DREAM project website: http://wiki.c2b2.columbia.edu/dream/results.
29. Yip KY, Alexander RP, Yan KK, Gerstein M (2010) Improved reconstruction of in silico gene regulatory networks by integrating knockout and perturbation data. PLoS One 5(1):e8121.
30. Setty Y, Mayo AE, Surette MG, Alon U (2003) Detailed map of a cis-regulatory input function. Proc Natl Acad Sci USA 100:7702–7707.


Supporting Information

Marbach et al. 10.1073/pnas.0913357107

SI Text

SI Results and Discussion. Best-performing methods did consistently well across sizes. We provided subchallenges with networks of size

10, 50, and 100 in order to compare the performance of inference methods on different network sizes. Of the 29 methods that were applied to the networks of size 10, all but two were also applied to the networks of size 50. The two teams that participated only in the subchallenge with networks of size 10 did not benefit from specializing on small networks: they ranked 22nd and 27th, and their predictions were not significantly better than random predictions (overall p-values of 0.24 and 0.44, respectively). After removing these two methods from the ranking, we compared the relative performance of methods on networks of size 10 and size 50 (Fig. S2A). The best and the second-best methods were the same in both subchallenges. The methods from ranks 3 to 7 are also the same for both network sizes, though not in the same order. Thus, the methods that performed best on the small networks of size 10 also performed best on the intermediate networks of size 50. Similar observations can be made when comparing the rankings of the 22 methods that were applied both to networks of size 50 and 100 (Fig. S2B). In this case, the correlation between the two rankings is even stronger. A notable exception is the method that ranked second on networks of size 10 and size 50, which ranked only 15th on networks of size 100. Even though the predictions of this method were of excellent quality for networks of size 50 (overall p-value $<10^{-31}$), they were barely significant for networks of size 100 (overall p-value = 0.01). In summary, the best-performing methods did consistently well across sizes, with few exceptions.

The top five methods are each based on a different approach. As mentioned in the main text, the most common types of approaches for network inference were all represented among the successful teams of the challenge. The identities of the top five teams are given in Table S1. The best-performer team relied strongly on a Gaussian model of the noise (2), which is discussed in more detail in the following sections. The teams Bonneau and Zuma inferred different types of dynamical network models. Team Bonneau applied and extended a previously described algorithm, the Inferelator (3, 4), which uses regression and variable selection (5). Team Zuma introduced a nonparametric dynamical model that was inferred using a Bayesian approach (6). The teams Usmtec347 and Intigern applied as yet unpublished inference methods. The method of team Usmtec347 was based on dynamic Bayesian networks. Team Intigern integrated predictions obtained using different correlation-based and information-theoretic measures.

Systematic prediction errors due to network motifs. We performed the network-motif analysis for all methods that inferred networks of size 50 and size 100 (networks of size 10 are too small for a statistically significant analysis). For the best five methods on the networks of size 100, the results are graphically represented and discussed in Fig. S3. A detailed description of the methodology is given in SI Methods. As discussed in the main text, we found that inference methods are affected by three types of systematic prediction errors. Fig. S3 shows that the inference methods have different strengths and weaknesses: whereas some are more robust to certain types of error, they are more strongly affected by other types of errors. In Fig. 4 of the main text, we discussed the three types of systematic prediction errors: the fan-out error (incorrect prediction

of interactions between coregulated genes), the fan-in error (inaccurate prediction of combinatorial regulation), and the cascade error (failure to distinguish direct from indirect regulation). In the more detailed analysis of Fig. S3C, two additional, minor effects can be observed. First, in addition to the incorrectly predicted "shortcuts" (1 → 3) in cascades, some inference methods have a slightly reduced prediction confidence for the true edges of cascades (1 → 2 and 2 → 3). This may be due to: (1) incorrect prediction of the shortcut 1 → 3 instead of (and not in addition to) the true links 1 → 2 → 3, and/or (2) a tendency for cascades to overlap with fan-ins.* Second, the three links of feed-forward loops (FFLs) often have a reduced prediction confidence. This can be explained by the same types of systematic errors that occur in fan-ins and cascades. The links 1 → 3 and 2 → 3 of FFLs form a fan-in, and are thus affected by the fan-in error. The links 1 → 2 → 3 form a cascade; thus, they have a reduced prediction confidence for the same reason as the corresponding links in the cascade motif. One minor effect remains to be explained: in FFLs, the prediction confidence was often slightly lower for edge 1 → 3 than for edge 2 → 3. For example, this is the case for all except the best-performer method in Fig. S3C. It seems that in addition to the fan-in error, which affects both these edges, the edge 1 → 3 is also affected by an additional systematic error, which could be called the FFL error. This error occurs when a method interprets the variation of gene 3 to be due solely to the indirect regulation via gene 2 (1 → 2 → 3), instead of both the indirect and the direct regulation (1 → 2 → 3 and 1 → 3). However, since this effect was negligible compared to the fan-in error that affects these edges, we did not consider it a fourth main type of prediction error. We conclude that the fan-out, fan-in, and cascade errors are the three main types of systematic prediction errors observed on three-node motifs, and that inference methods are differentially affected by these errors. The network-motif analysis, demonstrated here using three-node motifs, could potentially be extended to colored motifs, to higher-order motifs, or to the network dynamics by considering activity motifs (7), for instance.

Systematic prediction errors due to the indegree and outdegree of genes. Methods that are affected by the fan-in error have a reduced

prediction confidence for edges that target genes with two or more inputs. Fig. 5 of the main text shows how the indegree of genes affects the prediction confidence for the best five methods on the networks of size 100. We have performed this analysis for all inference methods that were applied to networks of size 50 and 100, which confirms that the best-performer method had the most robust performance on high indegrees (Fig. S4). As a control, we performed the same analysis for the outdegree of genes. As expected, in contrast to the indegree, an increasing outdegree of genes does not affect the prediction confidence of links (Fig. S4).
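A minimal sketch of this grouping (our illustration; the data layout and function name are assumptions):

```python
import numpy as np

def median_confidence_by_indegree(confidence, true_edges):
    """Median prediction confidence of true edges, grouped by the
    indegree of their target gene (cf. Fig. 5 and Fig. S4).

    confidence: dict mapping (regulator, target) -> confidence in %
    true_edges: list or set of (regulator, target) gold-standard pairs
    """
    indegree = {}
    for _, target in true_edges:                 # count inputs per gene
        indegree[target] = indegree.get(target, 0) + 1
    by_degree = {}
    for edge in true_edges:                      # group edge confidences
        by_degree.setdefault(indegree[edge[1]], []).append(confidence[edge])
    return {d: float(np.median(c)) for d, c in sorted(by_degree.items())}
```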

Inaccurate prior assumptions induce systematic prediction errors. One of the difficulties that participants of the DREAM challenge had to face was that they did not know details of the kinetic model that was used to generate the gene expression data. This difficulty is even more pronounced in biological applications, where the

*Note that motifs do not occur in isolation in networks and edges usually pertain to several "overlapping" motif instances. We intend to investigate potential effects due to overlapping motifs in future work.


mechanisms and kinetics of gene regulation underlying the expression data are more complicated, and also not known in advance. Thus, inference methods are bound to make simplifying prior assumptions, e.g., by adopting a linear gene network model. However, if the prior assumptions are inaccurate, they may lead to systematic prediction errors. For example, the best-performing team strongly relied on an inference method based on a very simple principle: it predicts an interaction from gene A to gene B if, after a knockout of A, there is a significant change in the expression level of B. The significance of a change in the expression level was estimated using a Gaussian model of the noise in the gene expression data (2). This method is based on the inaccurate prior assumption that the change in the expression level of gene B is always due to a direct regulatory interaction A → B (in reality, it may also be due to an indirect interaction via other genes, e.g., A → C → B). This inaccurate prior assumption induced systematic cascade errors (see Fig. S3). However, the best-performer method was also the most robust of all applied methods to the fan-in error (Fig. S3), i.e., it had better performance than other methods on genes that are combinatorially controlled by multiple regulatory inputs. Interestingly, the best-performer method makes a strong prior assumption on the type of noise in the gene expression data (Gaussian), but remains uncommitted to the type of regulatory dynamics in the networks. In contrast, according to our survey among the participants, other inference methods tend to make strong assumptions on the regulatory dynamics (e.g., by adopting simple, often linear, phenomenological functions to approximate combinatorial regulation of genes by multiple regulators). These assumptions are partly inaccurate compared to the more detailed kinetic model of the benchmarks (and may be even less accurate compared to the complicated mechanisms and kinetics of biological gene networks, as discussed in the following sections). Consequently, these methods have a strongly reduced performance on genes with multiple regulatory inputs (the fan-in error), where their prior assumptions are inaccurate.

Is simpler better? The relatively low performance of the majority

of methods might suggest that very sophisticated techniques were required to reliably infer the networks. This was not the case. The best-performer team used four different models to infer the networks, including the Gaussian model of the noise mentioned in the previous section and three ordinary differential equation (ODE) models (2). They intended to combine the predictions from the four models; in the end, however, they relied mainly on the predictions derived from the simple model of the noise and added only a few of the predictions from the ODE models at lower ranks. Thus, the excellent quality of their predictions is mainly due to the performance of the simple method based on the model of the noise. We have confirmed this by showing that a very similar, but even simpler, method would have obtained approximately the same performance. We used only the knockout dataset. For every gene i, we computed the mean expression level $\mu_i$ and the standard deviation $\sigma_i$. Next, we evaluated how much the expression level $x_{ij}$ of gene i changed in the knockout experiment of gene j by the Z-score $z_{ij} = (x_{ij} - \mu_i)/\sigma_i$. The absolute value of $z_{ij}$ was used as a measure of confidence for the edge from gene j to gene i, i.e., the lists of predictions were ordered according to the absolute values of the Z-scores. This method would have ranked second on the networks of size 10, first on the networks of size 50, and it would have tied for first place with the best-performer method on the networks of size 100. The Z-score method outlined in the previous paragraph is arguably one of the simplest conceivable network-inference approaches.
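A minimal sketch of this Z-score baseline (the matrix layout of the knockout data is our assumption):

```python
import numpy as np

def zscore_edge_list(ko):
    """Rank candidate edges j -> i by how strongly the knockout of
    gene j shifts the expression of gene i.

    ko: N x N array where ko[j, i] is the steady-state expression of
        gene i after knocking out gene j; mu_i and sigma_i are taken
        over all knockout experiments.
    Returns (regulator, target) pairs ordered by decreasing |z|.
    """
    mu = ko.mean(axis=0)              # mean expression of gene i
    sigma = ko.std(axis=0)            # standard deviation of gene i
    z = np.abs((ko - mu) / sigma)     # |z_ij| = |x_ij - mu_i| / sigma_i
    n = ko.shape[0]
    edges = [(j, i) for j in range(n) for i in range(n) if j != i]
    return sorted(edges, key=lambda e: z[e[0], e[1]], reverse=True)
```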

Like the best-performer method based on the Gaussian model of the noise, it is based on the inaccurate assumption that any change in the expression level of a gene is due to a direct regulatory interaction, which induces systematic cascade errors (see previous section). Still, this "naive" approach consistently outperformed more sophisticated, state-of-the-art methods in this benchmark. The fact that simple but effective methods can often outperform more advanced, theoretically motivated approaches had already been observed in several in vivo challenges of the previous edition of DREAM. For example, Baralla et al. used both a simple approach based on Pearson or Spearman correlation and a more advanced approach based on partial correlation analysis in challenges 1 and 3 of DREAM2 (9). The goal in DREAM2 challenge 1 was to predict the direct regulatory targets of the human transcription factor BCL6 from a collection of microarray experiments, and the goal of challenge 3 was to infer the structure of an in vivo synthetic-biology network in yeast from two time-series datasets (10, 11). As Baralla et al. note: "[the] simpler approaches completely outperformed the more elegant partial correlations approach, presumably because the latter approach relies more strongly on the correctness of underlying assumptions" (9). The results reported here support this conclusion: sophisticated methods that would in theory be expected to perform better than the "naive" approach described above were in practice more strongly affected by inaccurate prior assumptions, as discussed in the previous section. Note that these observations do not imply that simpler methods always perform better. If their underlying assumptions are correct, advanced methods based on more accurate models would be expected to outperform the simple approaches described above. In other words, the method and its assumptions have to be adapted to the type of data at hand.

Reliable gene network inference from gene expression data remains an unsolved problem. The performance of most reverse engineering

methods was unsatisfactory (see Discussion of the main text). Although the measured performance is in principle specific to the in silico networks used here, it is unlikely that a method would perform better in an equivalent in vivo application. First, we provided much more data, of better quality, than is typically available in biological applications (knockouts and knockdowns for every gene, plus dozens of time series). Second, the prior assumptions of inference methods are in general more accurate for the in silico networks than for biological gene regulatory networks, and would thus be expected to lead to better predictions. For example, biological gene networks have additional layers of control, such as posttranscriptional regulation, which are not modeled by prevalent gene network inference methods; many more such points could be listed. It should be noted that some methods may perform better using other types of experimental data. For example, linear models only provide a good approximation of the nonlinear dynamics for weak perturbations, as Tegner et al. note: "[…] if the perturbation is too large, the gene network will be pushed outside of its linear response regime and this will weaken the algorithms assumption of linearity. As a result, the error in the predicted network will increase" (8). For this reason, we provided in addition to the knockout data, where the expression of genes is completely suppressed, also knockdown data, where the expression of genes is only reduced by half. However, it may be that the knockdowns were still too strong a perturbation for the linear assumption to hold, and that the performance of the methods based on linear models would be better with weaker perturbations.

Community predictions are more reliable than individual inference methods. In the main text, we have discussed the performance of community predictions on the networks of size 10. Here, we first describe in more detail how we combined the predictions of participating teams, and then we analyze the performance of the resulting community predictions for all network sizes. Recall that the submission format of the DREAM3 in silico challenge was a ranked list of edge predictions. The question is thus how best to combine an ensemble of such lists, submitted by different participating teams, into a community prediction. We use a straightforward approach: for each edge, we simply take the average of its ranks across the individual lists. The community's list of edge predictions is then ordered according to these average ranks, as illustrated in Fig. S5.

As described in the main text, we systematically formed communities composed of the top-two methods, the top-three methods, the top-four methods, etc., until the last community, which contains all applied methods of a particular subchallenge (there are three subchallenges, corresponding to networks of size 10, 50, and 100). Here, we additionally formed community predictions without the best performer, i.e., we removed the best-performing team and then formed communities composed of the top-two methods, top-three methods, etc. For each of these communities, we derived community predictions for the five networks of the subchallenge. The scores of both the community predictions and the individual teams are shown in Fig. S6. Some of the community predictions outperform the best-performing team on networks of size 10 and size 50; for example, the community of the top-five teams would have won these two subchallenges. As more and more poorly performing teams are added to the community, the accuracy of the community prediction decreases. However, the performance is remarkably robust. Even when combining all methods, the majority of which have low scores (recall that about a third of the methods did not perform better than random guessing), the community prediction still ranks second on networks of size 10 and 100, and third on networks of size 50. The robustness of the community predictions is even more evident in the communities that do not include the best performer (the squares in Fig. S6). For example, the community composed of all remaining teams (second place to last place) outperforms the second-place team on networks of size 100. In summary, the community predictions are consistently as good as, or better than, the best methods of the ensemble, even when the majority of the methods in the ensemble perform poorly.

SI Methods.

Gene network model. We model transcriptional regulatory networks consisting of genes, mRNA, and proteins. The state of the network is given by the vector of mRNA concentrations x and protein concentrations y. We model only transcriptional regulation, where regulatory proteins (transcription factors) control the transcription rate (activation) of genes. The gene network is modeled by the system of differential equations

$$\frac{dx_i}{dt} = m_i \cdot f_i(\mathbf{y}) - \lambda_i^{\mathrm{RNA}} \cdot x_i, \qquad [S1]$$

$$\frac{dy_i}{dt} = r_i \cdot x_i - \lambda_i^{\mathrm{Prot}} \cdot y_i, \qquad [S2]$$

where m_i is the maximum transcription rate, r_i the translation rate, λ_i^RNA and λ_i^Prot are the mRNA and protein degradation rates, and f_i(·) is the so-called input function of gene i. The input function computes the relative activation of the gene, which is between 0 (the gene is shut off) and 1 (the gene is maximally activated), given the transcription-factor (TF) concentrations y.
Gene regulation is modeled using a standard approach based on thermodynamics (12, 13); good introductory texts are references (14, 15). The basic assumption of this approach is that the binding of TFs to cis-regulatory sites on the DNA is in quasi-equilibrium, since it is orders of magnitude faster than transcription and translation. In the simplest case, a gene i is regulated by a single TF j. Its promoter then has only two states: either the TF is bound (state S_1) or it is not bound (state S_0). The probability P{S_1} that gene i is in state S_1 at an instant in time is given by the fractional saturation, which depends on the TF concentration y_j:

$$P\{S_1\} = \frac{\chi_j}{1 + \chi_j} \quad\text{with}\quad \chi_j = \left(\frac{y_j}{k_{ij}}\right)^{n_{ij}}, \qquad [S3]$$

where k_ij is the dissociation constant and n_ij the Hill coefficient. At concentration y_j = k_ij the saturation is half-maximal, i.e., the promoter is bound by the TF 50% of the time. Many TFs bind DNA as homodimers or higher homooligomers; this and other mechanisms that affect the effective cooperativity of promoter binding are approximated by the Hill coefficient n_ij, which determines the "steepness" of the sigmoid described by Eq. S3. The bound TF activates or represses the expression of the gene: in state S_0 the relative activation is α_0, and in state S_1 it is α_1. Given P{S_1} and its complement P{S_0}, it is straightforward to derive the input function f_i(y_j), which computes the mean activation of the gene as a function of the TF concentration y_j:

$$f(y_j) = \alpha_0 P\{S_0\} + \alpha_1 P\{S_1\} = \frac{\alpha_0 + \alpha_1 \chi_j}{1 + \chi_j}. \qquad [S4]$$
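For concreteness, a direct transcription of Eqs. S3 and S4 into code (a minimal sketch; the function name and the parameter values in the check below are ours):

```python
def input_function_single_tf(y_j, k_ij, n_ij, alpha_0, alpha_1):
    """Mean activation of a gene regulated by a single TF (Eqs. S3 and S4):
    chi_j is the occupancy ratio, P{S1} = chi_j / (1 + chi_j) is the
    fractional saturation, and the activation is the probability-weighted
    mean of alpha_0 and alpha_1."""
    chi_j = (y_j / k_ij) ** n_ij
    return (alpha_0 + alpha_1 * chi_j) / (1.0 + chi_j)

# Sanity check: at y_j == k_ij the promoter is bound half the time, so the
# activation lies midway between alpha_0 and alpha_1.
assert abs(input_function_single_tf(1.0, 1.0, 2.0, 0.0, 1.0) - 0.5) < 1e-12
```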

This approach can be used for an arbitrary number of regulatory inputs. A gene that is controlled by N TFs has 2^N states: each of the TFs can be bound or not bound. Thus, the input function for N regulators is

$$f(\mathbf{y}) = \sum_{m=0}^{2^N - 1} \alpha_m P\{S_m\}. \qquad [S5]$$

Using thermodynamics, it is straightforward to compute the probability P{S_m} for every state m. For example, the resulting expression for a gene with two inputs is

$$f(y_1, y_2) = \frac{\alpha_0 + \alpha_1 v_1 + \alpha_2 v_2 + \alpha_3 \rho v_1 v_2}{1 + v_1 + v_2 + \rho v_1 v_2} \qquad [S6]$$

with v_j = (y_j / k_j)^{n_j}, where n_j is the Hill coefficient, k_j the dissociation constant, ρ the cooperativity factor, and the α's are the relative activations when none of the TFs (α_0), only the first (α_1), only the second (α_2), or both TFs (α_3) are bound.

We nondimensionalize the gene network models as described in the Supplementary Information of reference (16). As von Dassow et al. note: "This entails replacing every occurrence of dimensioned state variables (concentrations of molecular species and time) with scaled products yielding new state variables free of units. […] The dimensionless model is identical to the dimensional one since it is merely an algebraic transformation." (16). One of several advantages of nondimensionalization is that it is much easier to generate biologically plausible instances of the dimensionless model. We will provide a more detailed description of the gene network model and the nondimensionalization elsewhere.

Clearly, the model outlined above is still extremely simplified compared to the real biological mechanisms. However, two of the key simplifying assumptions of most reverse-engineering methods based on dynamical models are not made. First, mRNA and proteins are not lumped into a single state variable. Second, the gene regulation function is not phenomenologically approximated by additive or multiplicative terms, but is derived using the thermodynamic approach. One consequence of the thermodynamic model is that it can express all types of regulatory logic (AND, OR, etc.). We have initialized the networks of the DREAM3 in silico challenges such that there are approximately equal amounts of independent and synergistic gene regulation. Thus, a priori, neither the additive nor the multiplicative phenomenological modeling approach used in gene network reverse engineering should be favored in these benchmarks.

Simulation of gene expression data. Gene knockouts were simulated by setting the maximum transcription rate m_i of the deleted gene to zero, knockdowns by dividing it by two. Time-series experiments were simulated by integrating the networks using different initial conditions. For the networks of size 10, 50, and 100, we provided 4, 23, and 46 different time series, respectively.† For each time series, we used a different random initial condition for the mRNA and protein concentrations. The initial mRNA concentrations x_i(0) were obtained by adding a random number from a Gaussian distribution with mean zero and standard deviation 0.5 to the wild-type steady-state level of every gene. We assume that the multifactorial perturbation that led to the perturbed state x_i(0) was applied for sufficient time for the protein levels to stabilize at their new steady state. Thus, the initial protein concentrations were obtained by setting dy_i/dt = 0 in Eq. S2:

$$y_i(0) = \frac{r_i}{\lambda_i^{\mathrm{Prot}}} \cdot x_i(0). \qquad [S7]$$

Each time series consists of 21 time points (from t = 0 until t = 200). Trajectories were obtained by integrating the networks from the given initial conditions using a Runge-Kutta 4/5 solver with variable step size from the Open Source Physics library (17). White noise with a standard deviation of 0.05 was added to the generated gene expression data after the simulation. Concentrations that became negative due to the addition of noise were set to zero.

Network-motif analysis. The goal of the network-motif analysis is to evaluate, for a given network-inference method, whether some types of motif edges are systematically predicted less (or more) reliably than expected. This involves: (1) determining the prediction confidence of edges pertaining to different motifs, (2) determining the expected prediction confidence independently of motifs (the background prediction confidence), and (3) evaluating whether divergences of the prediction confidence of motif edges from the expected prediction confidence are statistically significant (cf. Fig. S3).

Definition of prediction confidence. Given a network prediction in the format described in the main text, we define the prediction confidence of an edge as its rank in the list of edge predictions. We scale the prediction confidence such that the first edge in the list has confidence 100%, and the last edge in the list has confidence 0%.

† For networks of size 50, we provided the same number of time series (23) as Pedro Mendes did for his networks of size 50 in the DREAM2 in silico challenges, to allow comparison of results between the two challenges. For networks of size 10 and 100, we scaled the number of time series with the network size (two times more for size 100, five times fewer for size 10).
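To show how Eqs. S1, S2, S6, and S7 fit together in the data-generation pipeline, here is a minimal sketch. It is illustrative only: the parameter values, the function names, the clipping of perturbed initial conditions to nonnegative values, and the use of SciPy's RK45 integrator in place of the Open Source Physics solver are our assumptions, not the actual GeneNetWeaver implementation:

```python
import numpy as np
from scipy.integrate import solve_ivp

def f_two_inputs(y1, y2, k, n, rho, alpha):
    """Thermodynamic input function for a gene with two TF inputs (Eq. S6)."""
    v1 = (y1 / k[0]) ** n[0]
    v2 = (y2 / k[1]) ** n[1]
    num = alpha[0] + alpha[1] * v1 + alpha[2] * v2 + alpha[3] * rho * v1 * v2
    return num / (1.0 + v1 + v2 + rho * v1 * v2)

def make_rhs(m, r, lam_rna, lam_prot, input_fns):
    """Right-hand side of Eqs. S1-S2; the state vector concatenates the
    mRNA levels x and the protein levels y."""
    G = len(m)
    def rhs(t, state):
        x, y = state[:G], state[G:]
        f = np.array([fi(y) for fi in input_fns])
        dx = m * f - lam_rna * x   # Eq. S1
        dy = r * x - lam_prot * y  # Eq. S2
        return np.concatenate([dx, dy])
    return rhs

def simulate_time_series(m, r, lam_rna, lam_prot, input_fns, x_wt, rng,
                         t_end=200.0, n_points=21, noise_sd=0.05):
    """One multifactorial time series: perturb the wild-type mRNA levels,
    put the proteins at the corresponding steady state (Eq. S7), integrate
    with an RK45 solver (mirroring the paper's Runge-Kutta 4/5 choice),
    then add white noise and set negative concentrations to zero."""
    x0 = np.clip(x_wt + rng.normal(0.0, 0.5, size=len(x_wt)), 0.0, None)
    y0 = r * x0 / lam_prot  # Eq. S7
    sol = solve_ivp(make_rhs(m, r, lam_rna, lam_prot, input_fns),
                    (0.0, t_end), np.concatenate([x0, y0]),
                    t_eval=np.linspace(0.0, t_end, n_points))
    G = len(x_wt)
    data = sol.y[:G].T + rng.normal(0.0, noise_sd, size=(n_points, G))
    return np.clip(data, 0.0, None)

# Knockout of gene i: set m[i] = 0; knockdown: m[i] /= 2, then re-simulate.
# Example with two genes (illustrative parameters only):
rng = np.random.default_rng(0)
input_fns = [
    lambda y: 0.5,  # gene 0: constitutive, half-maximal activation
    lambda y: f_two_inputs(y[0], y[1], k=[1.0, 1.0], n=[2.0, 2.0],
                           rho=1.0, alpha=[0.0, 0.5, 0.5, 1.0]),
]
m, r = np.ones(2), np.ones(2)
lam_rna, lam_prot = 0.2 * np.ones(2), 0.1 * np.ones(2)
x_wt = np.ones(2)  # assumed wild-type steady state for this sketch
data = simulate_time_series(m, r, lam_rna, lam_prot, input_fns, x_wt, rng)
```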


Determining prediction confidence of motif edges. First, we need to identify all three-node motif instances in the target network (the same approach could be used for higher-order motifs). We use the efficient algorithm of Wernicke for this purpose (18). Next, for every type of motif edge, we determine the prediction confidence of all its instances. For example, we identify all links of type 1 → 2 of cascades in the target network (the numbering of the nodes of the motifs is defined in Fig. S3), and we record the prediction confidences that were assigned to these links by the given network-inference method. More formally, we construct the set C^{1→2}_cascade = {c_k}, where c_k is the prediction confidence of the link 1 → 2 in the k'th cascade of the target network. Note that we also record the prediction confidences of "absent edges" of the motifs; for example, C^{1→3}_cascade is the set of prediction confidences of the "shortcuts". For every motif type m, we determine the set of prediction confidences assigned by the inference method to each of its six possible edges (C^{1→2}_m, C^{1→3}_m, C^{2→1}_m, C^{2→3}_m, C^{3→1}_m, and C^{3→2}_m). Note that fan-in as well as fan-out motifs are symmetric: nodes 2 and 3 can't be distinguished (see Fig. S3). Thus, both fan-ins and fan-outs have only three types of edges: 1 → 2, 2 → 1, and 2 → 3. The edges 1 → 3, 3 → 1, and 3 → 2 are equivalent to 1 → 2, 2 → 1, and 2 → 3, respectively. The prediction confidence of equivalent edges is recorded in the same set. For example, the set C^{1→2}_fanout contains all prediction confidences assigned by the inference method to outgoing edges of fan-outs (1 → 2 or 1 → 3).

Determining the background prediction confidence. Before we can analyze the effect of motifs on the edge-prediction confidence, we first need to determine the expected edge-prediction confidence independently of motifs. There are three types of predicted edges:

1. Predicted edges that are true edges of the target network. Their background prediction confidence is given by C_true_edge, the set of prediction confidences that were assigned by the inference method to the edges that are part of the target network.

2. Predicted edges for which the directionality is incorrect. We call these back edges. Their background prediction confidence is given by C_back_edge, the set of prediction confidences assigned to edges B → A that are not part of the target network, but for which the edge in the opposite direction, A → B, is part of the target network.

3. Predicted edges between two nodes that are not connected by an edge in the target network. We call these absent edges. Their background prediction confidence is given by C_absent_edge, the set of confidences of predicted edges between nodes that are not directly connected in the target network.

Note that in Fig. S3B, we have only shown the median prediction confidence of true edges (1 → 2) and back edges (2 → 1); the median prediction confidence of absent edges was not shown.

Evaluating the divergence of the prediction confidence of motif edges from the background prediction confidence. We use the Wilcoxon-Mann-Whitney rank-sum test to compare the motif edge-prediction confidences with their corresponding background prediction confidences. The prediction confidences of true edges of motifs (e.g., C^{1→2}_cascade and C^{2→3}_cascade) are compared with C_true_edge, the prediction confidences of back edges of motifs (e.g., C^{2→1}_cascade and C^{3→2}_cascade) are compared with C_back_edge, and those of absent edges of motifs (e.g., C^{1→3}_cascade and C^{3→1}_cascade) are compared with C_absent_edge. We use Bonferroni correction for multiple hypothesis testing.
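A minimal sketch of this procedure (function names are ours; it uses SciPy's implementation of the Wilcoxon-Mann-Whitney test and a simple Bonferroni-corrected threshold, under the assumption that the motif and background confidence sets have already been collected as described above):

```python
import numpy as np
from scipy.stats import mannwhitneyu

def confidences_from_ranked_list(n_edges):
    """Scale list ranks to prediction confidences: the first edge in the
    list gets 100%, the last gets 0% (see Definition above)."""
    ranks = np.arange(1, n_edges + 1)
    return 100.0 * (n_edges - ranks) / (n_edges - 1)

def motif_edge_divergence(motif_conf, background_conf, n_tests, alpha=0.01):
    """Two-sided Wilcoxon-Mann-Whitney rank-sum test comparing the
    confidences of one motif-edge type against its background set,
    with a Bonferroni-corrected significance threshold."""
    _, p = mannwhitneyu(motif_conf, background_conf, alternative="two-sided")
    return p, p < alpha / n_tests
```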

1. GeneNetWeaver project website: http://gnw.sourceforge.net
2. Yip KY, Alexander RP, Yan KK, Gerstein M (2010) Improved reconstruction of in silico gene regulatory networks by integrating knockout and perturbation data. PLoS One 5(1):e8121.
3. Bonneau R, et al. (2006) The Inferelator: An algorithm for learning parsimonious regulatory networks from systems-biology datasets de novo. Genome Biol 7:R36.
4. Bonneau R, et al. (2007) A predictive model for transcriptional control of physiology in a free living cell. Cell 131:1354–1365.
5. Madar A, Greenfield A, Vanden-Eijnden E, Bonneau R (2010) Network inference using dynamic context likelihood of relatedness and Inferelator 1.0. PLoS One, in press.
6. Äijö T, Lähdesmäki H (2009) Learning gene regulatory networks from gene expression measurements using nonparametric molecular kinetics. Bioinformatics 25:2937–2944.
7. Chechik G, et al. (2008) Activity motifs reveal principles of timing in transcriptional control of the yeast metabolic network. Nat Biotechnol 26:1251–1259.
8. Tegner J, Yeung MKS, Hasty J, Collins JJ (2003) Reverse engineering gene networks: Integrating genetic perturbations with dynamical modeling. Proc Natl Acad Sci USA 100:5944–5949.
9. Baralla A, Mentzen WI, de la Fuente A (2009) Inferring gene networks: Dream or nightmare? Ann N Y Acad Sci 1158:246–256.
10. Stolovitzky G, Prill RJ, Califano A (2009) Lessons from the DREAM2 challenges. Ann N Y Acad Sci 1158:159–195.
11. Cantone I, et al. (2009) A yeast synthetic network for in vivo assessment of reverse-engineering and modeling approaches. Cell 137:172–181.
12. Ackers GK, Johnson AD, Shea MA (1982) Quantitative model for gene regulation by lambda phage repressor. Proc Natl Acad Sci USA 79:1129–1133.
13. Shea MA, Ackers GK (1985) The OR control system of bacteriophage lambda: A physical-chemical model for gene regulation. J Mol Biol 181:211–230.
14. Bower JM, Bolouri H, eds (2004) Computational Modeling of Genetic and Biochemical Networks (MIT Press).
15. Bintu L, et al. (2005) Transcriptional regulation by the numbers: Models. Curr Opin Genet Dev 15:116–124.
16. von Dassow G, Meir E, Munro EM, Odell GM (2000) The segment polarity network is a robust developmental module. Nature 406:188–192.
17. Open Source Physics website: http://www.opensourcephysics.org
18. Wernicke S (2006) Efficient detection of network motifs. IEEE/ACM Trans Comput Biol Bioinform 3:347–359.

[Figure S1: two network graphs, panels A and B; edge legend: excitatory, inhibitory.]

Fig. S1. Two target (gold standard) networks of size 100. (A) was extracted from an E. coli and (B) from a yeast gene network (see Methods of the main text). The graphs of all target networks of the challenge are available in the file DREAM3 In Silico Challenges.zip on the GeneNetWeaver website (1).


[Figure S2: two scatter plots; (A) rank on size 10 vs. rank on size 50 (27 teams); (B) rank on size 50 vs. rank on size 100 (22 teams).]

Fig. S2. Best-performing methods did consistently well across sizes. (A) The rankings of the 27 teams that participated in both the subchallenges with networks of size 10 and 50. The methods with a good rank on networks of size 10 also performed well on networks of size 50 (points in the bottom-left corner of the plot). (B) The rankings of the 22 teams that participated in both the subchallenges with networks of size 50 and 100. The ranking is very similar in the two subchallenges. A notable exception is the second-place method on networks of size 50, which had a poor performance on networks of size 100 (indicated by the arrow).


[Figure S3: motif-level analysis for the top five methods on the networks of size 100. Rows: fan-out, fan-in, cascade, and FFL motifs; columns: true structure, best performer, second, third, fourth, and fifth place. Panels: (A) motif prediction confidence, (B) background prediction confidence, (C) divergence of motif from background prediction confidence. Color scale: prediction confidence (median rank in %), from ≤50 to 100; divergence scale from ≤1 to 20; dashed arrows indicate reduced and solid arrows increased prediction confidence; asterisks (*) mark edges that are not statistically significant.]

Fig. S3. Inference methods are differentially affected by systematic prediction errors. The first column shows the true connectivity of the motifs, the following columns show how they were predicted by the best five methods of the networks of size 100. (A) Median prediction confidence (see SI Methods for a definition) of the different edges (same as Fig. 4 of main text). Note that the third and fourth-place methods apparently can't distinguish the directionality of edges. (B) The background prediction confidence, which is the median prediction confidence of all links of the network, independently of which motifs they belong to. If the motifs had no effect on the prediction confidence, all edges of the motifs would be inferred with this background prediction confidence. (C) The divergence of the prediction confidence of the different motif edges from the background prediction confidence. The darkness of the links is proportional to the difference of the motif prediction confidence (A) from the background prediction confidence (B). Dashed arrows indicate a reduced, solid arrows an increased median prediction confidence. All differences are statistically significant at a level of 0.01 using a two-sided Wilcoxon-Mann-Whitney rank-sum test and Bonferroni correction for multiple hypothesis testing, except for the marked edges (*). The five methods are affected to various degrees by the three systematic prediction errors discussed in Fig. 4 of the main text. The second, third, and fourth-place methods have a tendency for false positives between coregulated genes (fan-out error). All methods have a reduced prediction confidence for edges that target genes with multiple inputs (fan-in error), but the best-performer method to a lesser degree than the others. However, the best-performer method has one of the strongest cascade errors (shortcuts 1 → 3). Since the third and fourth-place methods can't infer the directionality of interactions, the wrongly predicted "shortcut" appears in both directions.


[Figure S4: panels A–H, for networks of size 100 (left) and size 50 (right). Top row: median prediction confidence (rank in %) vs. indegree of target gene (A, E) and vs. outdegree of regulator (B, F). Bottom row: y-intercept (rank in %) vs. slope of the fitted lines for all methods (C, G: indegrees; D, H: outdegrees). Legend, size 100: Yip et al. (score inf), Method E (45), Method C (42), Method F (32), Method M (12), other methods, score < 10, score < 2. Legend, size 50: Yip et al. (40), Method B (31), Method G (18), Method E (16), Method C (14), other methods, score < 13, score < 2.]

Fig. S4. How the indegree and outdegree of genes affect the prediction confidence. The first row shows, for the best five methods on the networks of size 100 and size 50, the median prediction confidence assigned to edges of varying indegree of target genes and outdegree of regulators (the prediction confidence of an edge is its rank in the list of edge predictions, as defined above for the network-motif analysis). The shaded areas indicate 95% confidence intervals for the medians. Note that the top-five methods (first row) are not all the same for networks of size 100 and size 50 (see legends). (A, E) Single-input links were reliably predicted with a similar, high prediction confidence by several inference methods (points in the top left corner). However, for all but the best-performer method, the performance drops drastically for higher indegrees. Note that the results for the smaller networks of size 50 are noisier, and the confidence intervals wider, due to the smaller sample sizes. We summarized these plots by fitting, for every method, a line through the curve: (C, G) show the y-intercept and the slope of the fitted lines for all methods, which summarize the performance on single inputs and the robustness to the fan-in error, respectively. A linear regression analysis (dashed line, the shaded area is the 95% prediction interval) confirms that the best-performer method (Yip et al.) is an outlier. It has the most robust performance on high indegrees (flattest slope) of all methods that have a reasonably high score (y-intercept >75%). The results for networks of size 50 are noisier, but consistent with the observations on the networks of size 100. (B, F) In contrast to the indegree of target genes, the outdegree of regulators does not affect the prediction confidence of their links. (D, H) Consequently, the slopes of the fitted lines are close to zero. There is no significant correlation between the y-intercept and the slope. Thus, the regression done in the corresponding plots for the indegrees is not applicable here.
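The per-method line fits summarized in panels C, D, G, and H amount to a least-squares fit of median confidence against degree; a minimal sketch, with hypothetical input arrays:

```python
import numpy as np

def degree_summary(degrees, median_confidences):
    """Least-squares line through median prediction confidence (rank in %)
    as a function of degree. The y-intercept summarizes performance on
    single-input links; the slope summarizes robustness to the fan-in
    error (a flatter slope means a more robust method)."""
    slope, intercept = np.polyfit(degrees, median_confidences, deg=1)
    return intercept, slope
```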


[Figure S5: (A) target network, a loop of four genes; (B) two individual predictions, each a list of edges with confidence levels and ranks (ranks of edges with equal confidence levels are averaged); (C) community prediction, edges ordered by average rank.]

Fig. S5. Example of a community prediction formed from two individual network predictions. (A) The target network of this example is a loop of four genes. (B) Two possible lists of edge predictions. The lists are ranked according to the confidence levels of the edges; the most confident prediction, at the top of the list, has rank 1. The true edges of the target network are highlighted. (C) The community prediction is obtained by taking, for each edge, the average of its rank in the two individual predictions. Here, the community prediction is perfect (all true edges are at the top of the list). This example illustrates how a community prediction can be more accurate than the individual predictions it is composed of.
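A minimal sketch of this rank-averaging scheme (the function name is ours; it assumes that each individual list ranks the same complete set of candidate edges, with tied ranks already averaged as in panel B):

```python
from collections import defaultdict

def community_prediction(ranked_lists):
    """Combine an ensemble of ranked edge lists into a community
    prediction by ordering edges according to their average rank.

    ranked_lists : list of edge lists (e.g., (regulator, target) tuples),
                   each ordered from most to least confident.
    """
    rank_sum = defaultdict(float)
    for edge_list in ranked_lists:
        for rank, edge in enumerate(edge_list, start=1):
            rank_sum[edge] += rank
    # Average rank = rank_sum / len(ranked_lists); dividing by a constant
    # does not change the ordering, so we can sort by the sum directly.
    return sorted(rank_sum, key=rank_sum.get)
```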

[Figure S6: three panels; (A) network size 10, (B) network size 50, (C) network size 100. x-axis: team (up to 29, 27, and 22 teams, respectively); y-axis: score. Legend: individual predictions 1, 2, 3, …; community predictions {1,2}, {1,2,3}, {1,2,3,4}, …; communities without best method {2,3}, {2,3,4}, {2,3,4,5}, ….]

Fig. S6. Community predictions in the DREAM3 in silico challenge. The circles are the scores of the individual teams. The diamonds correspond to the scores of the different community predictions, obtained by combining the two best teams, the three best teams, the four best teams, etc. The squares correspond to the community predictions obtained without including the best performer. The performance of the community is remarkably robust. Even when including all teams (rightmost diamonds), the score of the community is better than the second-best method on networks of size 10 and 100, and better than the third-best method on networks of size 50.


Table S1. The identities and methods of the five top-performing teams in the DREAM3 in silico network inference challenge

| Team name | Members | Affiliation | Rank | Type of method | References |
|---|---|---|---|---|---|
| B Team | Kevin Y. Yip, Roger P. Alexander, Koon-Kiu Yan, and Mark Gerstein | Yale University, USA | 1st on all sizes | Statistical method | (2) |
| USMtec347 | Peng Li and Chaoyang Zhang | University of Southern Mississippi, USA | 2nd on size 10, 2nd on size 50 | Bayesian network | – |
| Bonneau | Aviv Madar, Alex Greenfield, Eric Vanden-Eijnden, and Richard Bonneau | New York University, USA | 2nd on size 100 | Dynamical model | (3, 4, 5) |
| Intigern | Xuebing Wu, Feng Zeng, and Rui Jiang | Tsinghua University, China | 3rd on size 10, 3rd on size 100 | Statistical method | – |
| Zuma | Tarmo Äijö and Harri Lähdesmäki | Tampere University of Technology, Finland | 3rd on size 50 | Dynamical model | (6) |

