Graph theory

From "A B C"
Jump to navigation Jump to search

Graph theory


This page is a placeholder, or under current development; it is here principally to establish the logical framework of the site. The material on this page is correct, but incomplete.


An introduction to graph theory for computational systems biology. This page focusses on the theoretical aspects of graphs, for biological pathways and networks, see there.


Introductory reading

Pavlopoulos et al. (2011) Using graph theory to analyze biological networks. BioData Min 4:10. (pmid: 21527005)

PubMed ] [ DOI ] Understanding complex systems often requires a bottom-up analysis towards a systems biology approach. The need to investigate a system, not only as individual components but as a whole, emerges. This can be done by examining the elementary constituents individually and then how these are connected. The myriad components of a system and their interactions are best characterized as networks and they are mainly represented as graphs where thousands of nodes are connected with thousands of vertices. In this article we demonstrate approaches, models and methods from the graph theory universe and we discuss ways in which they can be used to reveal hidden properties and features of a network. This network profiling combined with knowledge extraction will help us to better understand the biological significance of the system.


Contents

  • Graph types
  • Graph metrics
  • Graph operations


 

Further reading and resources

The "classic":

"...we show that, independent of the system and the identity of its constituents, the probability P(k) that a vertex in the network interacts with k other vertices decays as a power law, following P(k) ∼ k−γ. This result indicates that large networks self-organize into a scale-free state, a feature unpredicted by all existing random network models."
Barabasi & Albert (1999) Emergence of scaling in random networks. Science 286:509-12. (pmid: 10521342)

PubMed ] [ DOI ] Systems as diverse as genetic networks or the World Wide Web are best described as networks with complex topology. A common property of many large networks is that the vertex connectivities follow a scale-free power-law distribution. This feature was found to be a consequence of two generic mechanisms: (i) networks expand continuously by the addition of new vertices, and (ii) new vertices attach preferentially to sites that are already well connected. A model based on these two ingredients reproduces the observed stationary scale-free distributions, which indicates that the development of large networks is governed by robust self-organizing phenomena that go beyond the particulars of the individual systems.

Jeong et al. (2000) The large-scale organization of metabolic networks. Nature 407:651-4. (pmid: 11034217)

PubMed ] [ DOI ] In a cell or microorganism, the processes that generate mass, energy, information transfer and cell-fate specification are seamlessly integrated through a complex network of cellular constituents and reactions. However, despite the key role of these networks in sustaining cellular functions, their large-scale structure is essentially unknown. Here we present a systematic comparative mathematical analysis of the metabolic networks of 43 organisms representing all three domains of life. We show that, despite significant variation in their individual constituents and pathways, these metabolic networks have the same topological scaling properties and show striking similarities to the inherent organization of complex non-biological systems. This may indicate that metabolic organization is not only identical for all living organisms, but also complies with the design principles of robust and error-tolerant scale-free networks, and may represent a common blueprint for the large-scale organization of interactions among all cellular constituents.



Galas et al. (2014) Describing the complexity of systems: multivariable "set complexity" and the information basis of systems biology. J Comput Biol 21:118-40. (pmid: 24377753)

PubMed ] [ DOI ] Context dependence is central to the description of complexity. Keying on the pairwise definition of "set complexity," we use an information theory approach to formulate general measures of systems complexity. We examine the properties of multivariable dependency starting with the concept of interaction information. We then present a new measure for unbiased detection of multivariable dependency, "differential interaction information." This quantity for two variables reduces to the pairwise "set complexity" previously proposed as a context-dependent measure of information in biological systems. We generalize it here to an arbitrary number of variables. Critical limiting properties of the "differential interaction information" are key to the generalization. This measure extends previous ideas about biological information and provides a more sophisticated basis for the study of complexity. The properties of "differential interaction information" also suggest new approaches to data analysis. Given a data set of system measurements, differential interaction information can provide a measure of collective dependence, which can be represented in hypergraphs describing complex system interaction patterns. We investigate this kind of analysis using simulated data sets. The conjoining of a generalized set complexity measure, multivariable dependency analysis, and hypergraphs is our central result. While our focus is on complex biological systems, our results are applicable to any complex system.

Stumpf & Porter (2012) Mathematics. Critical truths about power laws. Science 335:665-6. (pmid: 22323807)

PubMed ] [ DOI ]

Kelly et al. (2012) The degree distribution of networks: statistical model selection. Methods Mol Biol 804:245-62. (pmid: 22144157)

PubMed ] [ DOI ] The degree distribution has been viewed as an important characteristic of network data. Many biological networks have been labelled scale-free as their degree distribution can be approximately described by a power-law probability distribution. This chapter presents a formal statistical model selection procedure that can determine which functional form, from a collection of specified models, best describes the degree distribution of network data. The degree distribution found for empirical data is viewed as belonging to a class of probability models and the model which best describes the data is determined in a maximum likelihood framework. In conclusion, it is important to note that these statistical tests do not confirm the true underlying distribution of the observed data, but instead show which models from a chosen set best describe the data. In reality, these approaches should be viewed as providing evidence for which probability models do not adequately (or optimally) describe the data, and give an indication of the underlying sampling and true interaction properties of the system considered.

Geraci et al. (2012) Algorithms for systematic identification of small subgraphs. Methods Mol Biol 804:219-44. (pmid: 22144156)

PubMed ] [ DOI ] The ability to analyze large biological networks proves to be a computationally expensive task, but the information one can gain is worth the cost and effort. In cancer research for example, one is able to derive knowledge about putative drug targets by revealing the strengths and weaknesses inherent in a protein-protein interaction (PPI) network. Further, network analyses can be used to optimize high-throughput genetic and proteomic experiments. In addition, the study of biological networks is now an active part of molecular biology. In this chapter, we review techniques for studying biological networks in general but with a focus on PPI networks, including an example of a bacterial PPI network. After a brief introduction, we concentrate on methods based on the analysis of subnetworks, namely, graph motifs and graphlets.

Klamt & von Kamp (2009) Computing paths and cycles in biological interaction graphs. BMC Bioinformatics 10:181. (pmid: 19527491)

PubMed ] [ DOI ] BACKGROUND: Interaction graphs (signed directed graphs) provide an important qualitative modeling approach for Systems Biology. They enable the analysis of causal relationships in cellular networks and can even be useful for predicting qualitative aspects of systems dynamics. Fundamental issues in the analysis of interaction graphs are the enumeration of paths and cycles (feedback loops) and the calculation of shortest positive/negative paths. These computational problems have been discussed only to a minor extent in the context of Systems Biology and in particular the shortest signed paths problem requires algorithmic developments. RESULTS: We first review algorithms for the enumeration of paths and cycles and show that these algorithms are superior to a recently proposed enumeration approach based on elementary-modes computation. The main part of this work deals with the computation of shortest positive/negative paths, an NP-complete problem for which only very few algorithms are described in the literature. We propose extensions and several new algorithm variants for computing either exact results or approximations. Benchmarks with various concrete biological networks show that exact results can sometimes be obtained in networks with several hundred nodes. A class of even larger graphs can still be treated exactly by a new algorithm combining exhaustive and simple search strategies. For graphs, where the computation of exact solutions becomes time-consuming or infeasible, we devised an approximative algorithm with polynomial complexity. Strikingly, in realistic networks (where a comparison with exact results was possible) this algorithm delivered results that are very close or equal to the exact values. This phenomenon can probably be attributed to the particular topology of cellular signaling and regulatory networks which contain a relatively low number of negative feedback loops. CONCLUSION: The calculation of shortest positive/negative paths and cycles in interaction graphs is an important method for network analysis in Systems Biology. This contribution draws the attention of the community to this important computational problem and provides a number of new algorithms, partially specifically tailored for biological interaction graphs. All algorithms have been implemented in the CellNetAnalyzer framework which can be downloaded for academic use at http://www.mpi-magdeburg.mpg.de/projects/cna/cna.html.

Ravasz (2009) Detecting hierarchical modularity in biological networks. Methods Mol Biol 541:145-60. (pmid: 19381526)

PubMed ] [ DOI ] Spatially or chemically isolated modules that carry out discrete functions are considered fundamental building blocks of cellular organization. However, detecting them in highly integrated biological networks requires a thorough understanding of the organization of these networks. In this chapter I argue that many biological networks are organized into many small, highly connected topologic modules that combine in a hierarchical manner into larger, less cohesive units. On top of a scale-free degree distribution, these networks show a power law scaling of the clustering coefficient with the node degree, a property that can be used as a signature of hierarchical organization. As a case study, I identify the hierarchical modules within the Escherichia coli metabolic network, and show that the uncovered hierarchical modularity closely overlaps with known metabolic functions.

Thorne & Stumpf (2007) Generating confidence intervals on biological networks. BMC Bioinformatics 8:467. (pmid: 18053130)

PubMed ] [ DOI ] BACKGROUND: In the analysis of networks we frequently require the statistical significance of some network statistic, such as measures of similarity for the properties of interacting nodes. The structure of the network may introduce dependencies among the nodes and it will in general be necessary to account for these dependencies in the statistical analysis. To this end we require some form of Null model of the network: generally rewired replicates of the network are generated which preserve only the degree (number of interactions) of each node. We show that this can fail to capture important features of network structure, and may result in unrealistic significance levels, when potentially confounding additional information is available. METHODS: We present a new network resampling Null model which takes into account the degree sequence as well as available biological annotations. Using gene ontology information as an illustration we show how this information can be accounted for in the resampling approach, and the impact such information has on the assessment of statistical significance of correlations and motif-abundances in the Saccharomyces cerevisiae protein interaction network. An algorithm, GOcardShuffle, is introduced to allow for the efficient construction of an improved Null model for network data. RESULTS: We use the protein interaction network of S. cerevisiae; correlations between the evolutionary rates and expression levels of interacting proteins and their statistical significance were assessed for Null models which condition on different aspects of the available data. The novel GOcardShuffle approach results in a Null model for annotated network data which appears better to describe the properties of real biological networks. CONCLUSION: An improved statistical approach for the statistical analysis of biological network data, which conditions on the available biological information, leads to qualitatively different results compared to approaches which ignore such annotations. In particular we demonstrate the effects of the biological organization of the network can be sufficient to explain the observed similarity of interacting proteins.