Are biological networks scale-free graphs?
Diego Mauricio Riaño-Pachón, Ingo Dreyer, Bernd Mueller-Roeber
Department of Molecular Biology, Institute for Biochemistry and Biology, University of Potsdam†
email: {diriano,dreyer,bmr}@uni-potsdam.de
Introduction
Scale-free graphs were first described by Barabási and Albert [1], based on a study of the connectivity of the World Wide Web; similar observations were later reported for several biological networks (e.g. [2]). A graph is scale-free if the distribution of the vertex connectivity (degree, k) follows a power law of the form $P(k) \sim k^{-\gamma}$. In addition, it should have highly connected nodes, called hubs, which are central to the network topology and ‘keep the network together’. Targeted removal of such a hub is catastrophic for the network topology. Recently, some authors have pointed out that networks believed to be scale-free graphs are actually scale-rich. Here we follow a recommended procedure to study the degree distribution of biological graphs and to evaluate the importance of hubs in the graph topology (for details see [3, 4]).
Importance of hubs
It has been suggested that a power-law-like degree distribution is sufficient for the emergence of hubs that keep the network together. Li et al. [3] have shown that this is not the case: networks displaying power-law degree distributions may lack central hubs. To assess this, the authors introduce the s-metric of a graph g, s(g), and its normalized version, S(g).
Figure 3: PDF of degree distribution
The recommended way to study the degree distribution is through the Cumulative Distribution Function (CDF), which gives the probability of observing a degree larger than k. The CDF of a power-law function is also a power-law function, whose exponent is one unit less than in equation 1:
$$P(K > k) = C k^{-(\gamma - 1)} \qquad (2)$$
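The one-unit drop in the exponent can be checked with a short calculation. The sketch below treats k as a continuous variable and assumes γ > 1, which is only an approximation for integer degrees; C' is simply a name for the rescaled constant.

$$P(K > k) \approx \int_{k}^{\infty} C x^{-\gamma}\, dx = \frac{C}{\gamma - 1}\, k^{-(\gamma - 1)} = C' k^{-(\gamma - 1)}$$

On a double logarithmic plot the CDF therefore has slope −(γ − 1), so γ can be recovered as one minus the fitted slope.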
Graph theory concepts
Many systems can be represented as graphs: sets of nodes joined by links (edges) that represent some type of interaction or relationship. Examples include protein-protein interactions, transcriptional regulation and domain co-occurrences, among several others. Figure 1 depicts the representation of protein domain co-occurrences as a graph. In panel ‘A’, two proteins (a and b) are shown with colored boxes representing different functional domains. In panel ‘B’, the domains of those proteins are represented as colored nodes. Edges link nodes (domains) that appear together in a single protein. Applying this procedure to a full proteome results in a graph like the one in panel ‘C’ of Figure 1.
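The construction of a domain co-occurrence graph can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used for the poster: the function name `cooccurrence_graph`, the dictionary layout and the toy proteome with domains D1 to D4 are all assumptions made for the example.

```python
from itertools import combinations

def cooccurrence_graph(proteins):
    """Return an undirected graph as {domain: set of neighbouring domains}.

    `proteins` maps a protein name to the list of domains it contains.
    Two domains are linked if they appear together in at least one protein.
    """
    graph = {}
    for domains in proteins.values():
        unique = sorted(set(domains))       # ignore repeated domains in one protein
        for d in unique:
            graph.setdefault(d, set())      # keep domains that never co-occur
        for u, v in combinations(unique, 2):
            graph[u].add(v)
            graph[v].add(u)
    return graph

# Hypothetical toy proteome (protein and domain names are illustrative only)
proteins = {
    "a": ["D1", "D2", "D3"],
    "b": ["D3", "D4"],
}
graph = cooccurrence_graph(proteins)
```

In this toy case domain D3 ends up linked to D1, D2 and D4, because it is the only domain shared by both proteins.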
Cumulative distribution
Figure 4 shows the CDF of the same data used in panel A of Figure 3. Fitting a straight line to the CDF gives γ = 1.9, much closer to the true value of γ = 2 than the estimate of 0.8 obtained from the PDF.
Figure 4: CDF of degree distribution
$$s(g) = \sum_{(i,j) \in \varepsilon} d_i d_j$$
$$S(g) = \frac{s(g)}{s_{max}} \qquad (4)$$
where $d_i$ and $d_j$ are the degrees of nodes i and j, respectively, and ε is the set of edges of graph g; $s_{max}$ is the maximum attainable value of s(g) over graphs with the same degree sequence as g. This metric attempts to quantify the role of hubs in the topology of the graph. High values of S(g) indicate a ‘hub-like core’, in which the hubs play an important role in the connectivity of the graph (the $s_{max}$ graph was computed as described in appendix A.1 of [3]).

s-metric
Metric   S. cerevisiae   O. sativa
s(g)     32 672          558 635
S(g)     0.624           0.417
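As a rough illustration of the s-metric, the Python sketch below computes s(g) from an edge list. It assumes an undirected simple graph in which every edge is listed exactly once. The construction of the $s_{max}$ graph from appendix A.1 of [3] is not implemented here; `normalized_s_metric` expects $s_{max}$ to be supplied by the caller. Function names and toy graphs are hypothetical.

```python
def s_metric(edges):
    """s(g): sum over edges (i, j) of d_i * d_j for an undirected simple graph."""
    degree = {}
    for i, j in edges:
        degree[i] = degree.get(i, 0) + 1
        degree[j] = degree.get(j, 0) + 1
    return sum(degree[i] * degree[j] for i, j in edges)

def normalized_s_metric(edges, s_max):
    """S(g) = s(g) / s_max; s_max is not computed here and must be supplied."""
    return s_metric(edges) / s_max

# Hypothetical toy graphs with four edges each
star = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d")]
path = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
s_star = s_metric(star)   # every edge contributes 4 * 1
s_path = s_metric(path)   # 1*2 + 2*2 + 2*2 + 2*1
```

The star, where every edge touches the hub, reaches a higher s(g) than the path with the same number of edges, which is the behaviour the metric is meant to capture.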
Another option is to compute the centrality of the nodes. If hubs are central to the topology, a positive correlation between node degree and node centrality is expected, and low-degree nodes should have low centrality. Here we compute the betweenness centrality as implemented in Pajek [9]; it is the fraction of shortest paths that pass through the node of interest.
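For readers without access to Pajek, betweenness centrality can be sketched in plain Python. The version below uses Brandes' algorithm for unweighted, undirected graphs, which is a substitute chosen for this example and not necessarily how Pajek computes the value. The normalization by the number of node pairs excluding the node itself is likewise an assumption, as are the function name and the toy graph.

```python
from collections import deque

def betweenness(graph):
    """Betweenness centrality for an unweighted, undirected graph.

    `graph` maps each node to the set of its neighbours. Values are
    normalized by the number of node pairs that exclude the node itself.
    """
    score = {v: 0.0 for v in graph}
    for s in graph:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}
        dist = {v: -1 for v in graph}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies, starting from the farthest nodes.
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    n = len(graph)
    pairs = (n - 1) * (n - 2)   # ordered pairs: each unordered pair is visited twice
    return {v: (score[v] / pairs if pairs > 0 else 0.0) for v in graph}

# Hypothetical toy graph: one hub connected to four leaves
toy = {"hub": {"a", "b", "c", "d"},
       "a": {"hub"}, "b": {"hub"}, "c": {"hub"}, "d": {"hub"}}
centrality = betweenness(toy)
```

In the toy star every shortest path between two leaves passes through the hub, so degree and centrality are perfectly aligned. Comparing `len(graph[v])` with `centrality[v]` for every node of a real network gives the degree versus centrality relationship discussed above.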
Model Selection
To decide objectively which function best describes the degree distribution, we follow the procedure described in [7], which has three basic steps:
(I). Define a set of competing functions.
Figure 1: Biological networks as graphs
The degree (k) of a node is the number of edges incident to it. For example, in panel ‘B’ of Figure 1, the blue, green and orange nodes each have degree 3, the purple node has degree 5, and the yellow and light green nodes have degree 2 each. Different nodes thus have different degrees; this variability is characterized by the degree distribution function P(k), which gives the probability that a node has exactly k edges or, in other words, the observed frequency of nodes of degree k. Very good introductions to network analysis and theory can be found in [5] and [6].
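The definition of P(k) translates directly into code. The following Python sketch counts node degrees from an edge list and returns the observed frequency of each degree; the function name and the small edge list are hypothetical, and nodes without any edge are not represented because they never appear in the list.

```python
from collections import Counter

def degree_distribution(edges):
    """Return {k: P(k)}, the observed fraction of nodes with degree k."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    counts = Counter(degree.values())   # number of nodes for each degree
    n = len(degree)
    return {k: counts[k] / n for k in sorted(counts)}

# Hypothetical edge list: node "a" has degree 3, "b" and "c" degree 2, "d" degree 1
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
p = degree_distribution(edges)
```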
Probability distribution
The degree distribution of a scale-free graph can be described by a power law, especially for large values of k:
$$P(K = k) = C k^{-\gamma} \qquad (1)$$
where C is a constant such that $\sum_{k} P(k) = 1$. When plotted on a double logarithmic scale, this function appears as a straight line of slope −γ. This fact has been used as the gold-standard test to decide whether a graph is scale-free, and to estimate the value of γ. As shown in panel A of Figure 3, this frequency plot, or Probability Distribution Function (PDF), is very noisy for large k, leading to unreliable estimates of γ. In this example the true value is γ = 2, whereas fitting a straight line to these data gives γ = 0.8. In addition, an exponential function ($P(k) \sim e^{-\alpha k}$) can also look like a straight line in the double logarithmic plot for large values of k, as seen in panel B of Figure 3. Figure 3 shows n = 10000 integer values sampled from each of the two distributions, power law and exponential. The following listing shows the R code used to generate the data.
Figure 2: R code. Generating degree distributions.
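A comparable experiment can be sketched in Python using only the standard library. This is not the R listing of Figure 2: the inverse-transform sampler, the fixed seed and the helper names are assumptions made for illustration, and only the power-law sample is generated. The code uses P(K ≥ k) in place of P(K > k) so that the largest observed value keeps a non-zero probability on the logarithmic scale.

```python
import math
import random
from collections import Counter

def sample_power_law(n, gamma, rng):
    """Integer samples with P(K >= k) = k**-(gamma - 1) for k = 1, 2, ..."""
    return [int((1.0 - rng.random()) ** (-1.0 / (gamma - 1.0))) for _ in range(n)]

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    sxx = sum((a - mx) ** 2 for a in lx)
    return sxy / sxx

rng = random.Random(1)                       # fixed seed, arbitrary choice
degrees = sample_power_law(10000, 2.0, rng)  # true gamma = 2
counts = Counter(degrees)
ks = sorted(counts)
n = len(degrees)

pdf = [counts[k] / n for k in ks]            # observed P(K = k)
ccdf, remaining = [], n
for k in ks:                                 # observed P(K >= k)
    ccdf.append(remaining / n)
    remaining -= counts[k]

gamma_pdf = -loglog_slope(ks, pdf)           # PDF slope is -gamma
gamma_cdf = 1.0 - loglog_slope(ks, ccdf)     # CDF slope is -(gamma - 1)
```

Printing `gamma_pdf` and `gamma_cdf` allows the two estimates to be compared with the true exponent; the exact numbers depend on the seed and will differ from those quoted for Figures 3 and 4.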
(II). Compute Maximum Likelihood estimates of the parameters for each function.
Figure 6: Node centrality
(III). Decide which function best describes the data.
To illustrate this approach we use real data from the networks of domain co-occurrences in Saccharomyces cerevisiae and Oryza sativa. Exponential, stretched exponential, power-law with exponential cutoff and power-law functions are used as competing functions (step (I), Figure 5).

Name                                Function                                Abbreviation
Exponential                         $C e^{-\beta k}$                        F1
Stretched exponential               $C e^{-\beta k/\bar{k}} k^{-\gamma}$    F2
Power-law with exponential cutoff   $k^{-\gamma} e^{-k/k_{cut}}$            F3
Power-law                           $C k^{-\gamma}$                         F4
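The four competing functions can be written down as plain Python functions, which is convenient when passing them to a fitting routine. This is a sketch that follows the functional forms as we read them from the table of competing functions; the argument names are our own, and `k_bar` is assumed to be the mean degree, which the text does not state explicitly.

```python
import math

def exponential(k, C, beta):
    """F1: C * exp(-beta * k)."""
    return C * math.exp(-beta * k)

def stretched_exponential(k, C, beta, gamma, k_bar):
    """F2: C * exp(-beta * k / k_bar) * k**-gamma (k_bar assumed to be the mean degree)."""
    return C * math.exp(-beta * k / k_bar) * k ** (-gamma)

def power_law_cutoff(k, gamma, k_cut):
    """F3: k**-gamma * exp(-k / k_cut)."""
    return k ** (-gamma) * math.exp(-k / k_cut)

def power_law(k, C, gamma):
    """F4: C * k**-gamma."""
    return C * k ** (-gamma)

# Lookup by the abbreviations used in the text
COMPETING = {"F1": exponential, "F2": stretched_exponential,
             "F3": power_law_cutoff, "F4": power_law}
```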
The Maximum Likelihood estimates of the parameters of each function are obtained through non-linear least squares (step (II)). The best describing function is the one with the lowest Akaike Information Criterion (AIC). The AIC is a measure that rewards models for good fit but penalizes them for extra parameters:
$$\mathrm{AIC} = 2\left(-lk(\hat{\theta}) + d\right) \qquad (3)$$
where $lk(\hat{\theta})$ is the log-likelihood evaluated at the maximum-likelihood parameter estimates $\hat{\theta}$, and d is the number of adjusted parameters.
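The trade-off expressed by equation 3 is easy to see with a small worked example in Python. The log-likelihood values and model names below are hypothetical and chosen only to show the penalty at work; they are not results from this study.

```python
def aic(log_likelihood, n_params):
    """AIC = 2 * (-lk + d), following equation 3."""
    return 2.0 * (-log_likelihood + n_params)

# Hypothetical fits: (maximized log-likelihood, number of adjusted parameters)
fits = {"model_A": (-120.0, 1), "model_B": (-118.5, 3)}
scores = {name: aic(lk, d) for name, (lk, d) in fits.items()}
best = min(scores, key=scores.get)   # the lowest AIC wins
```

Here model_B fits slightly better, but its two additional parameters outweigh the gain in log-likelihood, so model_A has the lower AIC.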
Conclusions
• When analysing a degree distribution, always use the cumulative distribution.
• Assess statistically which model best describes the degree distribution of your network, using the Akaike Information Criterion. It is also possible to use the Bayesian Information Criterion.
• Evaluate the importance of hubs in the network. Two options are the s-metric and an index of betweenness centrality.
The domain co-occurrence networks shown here are clearly not scale-free graphs: their degree distribution is not well described by a power law, and many low-degree nodes are very important for the network topology.
Acknowledgements
This work was funded by the IZ-APT Uni-Potsdam.
References
[1] Barabási, A.-L. and Albert, R. (1999) Emergence of scaling in random networks. Science, 286, 509–512.
[2] Jeong, H., Tombor, B., Albert, R., Oltvai, Z. N., and Barabási, A.-L. (2000) The large-scale organization of metabolic networks. Nature, 407, 651–654.
[3] Li, L., Alderson, D., Tanaka, R., Doyle, J. C., and Willinger, W. (2005) Towards a theory of scale-free graphs: Definition, properties, and implications (extended version). Tech. rep., Engineering & Applied Sciences Division, California Institute of Technology, Pasadena, CA, USA.
Figure 5: Non-linear least squares ML fitting

AIC
Function   S. cerevisiae   O. sativa
F1         -107.15         -193.43
F2         -131.21         -242.07
F3         -100.04         -199.10
F4         -80.18          -183.24
For both species, the degree distribution of the domain co-occurrence network is better explained by a stretched exponential function than by a power law, in contrast to what was suggested in [8].
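The selection step can be reproduced from the reported AIC values with a few lines of Python. The numbers are copied from the AIC table; the dictionary layout and variable names are our own.

```python
# AIC values as reported in the AIC table (F1 to F4 are the competing functions)
aic_table = {
    "S. cerevisiae": {"F1": -107.15, "F2": -131.21, "F3": -100.04, "F4": -80.18},
    "O. sativa": {"F1": -193.43, "F2": -242.07, "F3": -199.10, "F4": -183.24},
}

# For each species, pick the function with the lowest AIC
best = {species: min(scores, key=scores.get)
        for species, scores in aic_table.items()}
```

For both species the minimum falls on F2, the stretched exponential, while F4, the pure power law, has the highest AIC of the four.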
[4] Tanaka, R., Yi, T.-M., and Doyle, J. (2005) Some protein interaction data do not exhibit power law statistics. FEBS Lett, 579, 5140–5144.
[5] Albert, R. and Barabási, A.-L. (2002) Statistical mechanics of complex networks. Rev Mod Phys, 74, 47–97.
[6] Newman, M. (2003) The structure and function of complex networks. SIAM Review, 45, 167–256.
[7] Stumpf, M. P. H. and Ingram, P. J. (2005) Probability models for degree distributions of protein interaction networks. Europhysics Letters, 71, 152.
[8] Wuchty, S. and Almaas, E. (2005) Evolutionary cores of domain co-occurrence networks. BMC Evol Biol, 5, 24.
[9] Batagelj, V. and Mrvar, A. (2005) Pajek: Program for Analysis and Visualization of Large Networks. Reference Manual: List of commands with short explanation, version 1.06.
† Address for correspondence: Universität Potsdam, Institut für Biochemie und Biologie, Karl-Liebknecht-Str. 24-25, Haus 20, 14476 Golm, Deutschland