Graphs and Networks

Contents
Report topic
- Short Report options
Questions, comments
References

Expected Preparations:

	Calculus: Functions and equations; Polynomial functions, logarithms, trigonometric functions; Integrals and derivatives; Theorem and proof.		[RPR] Introduction
	If you are not already familiar with the prior knowledge listed above, you need to prepare yourself from other information sources.		The units listed above are part of this course and contain important preparatory material.

Keywords: Intoduction to graph theory and network science; iGraph

Objectives:

This unit will …

… introduce concepts and definitions of graph theory;
… demonstrate how to work with graphs in R, using the igraph:: package.

Outcomes:

After working through this unit you …

… know what the most important terms of graph theory mean;
… are familar with a number of quantitiative measures on graphs;
… can use igraph:: to construct graphs, plot graphs with a variety of layout algorithms, and analyse their degree distribution, centrality distribution, and community structure;
… can interpret such measures in a biological context.

Deliverables:

Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.

Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don’t overlook these.

Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.

Evaluation:

Material based on this learning unit can be submitted for formative feedback. To submit:

Create a new document in your shared Google drive folder.
Call your document FND-MAT-Graphs_and_networks-<your name>-2022
Write a short report on one of the topics defined below.
Include a (CC) license at the end of your document, as instructed at the beginning of the course.
When you are done with everything, go to the Assignments page on Quercus and open the first Feedback Unit that you have not submitted yet. Paste the URL of your report document into the form, and click on Submit Assignment. Your link can be submitted only once and not edited. Also: do not edit your document after it has been submitted.

“Traditional” bioinformatics has focussed on the properties of individual genes, but we are developing a much more functionally oriented view that investigates gene relationships. Analyzing relationships is the domain of graph theory, where relationships are represented as edges between nodes. This unit introduces key concepts and terminology and puts working with graphs into practice with the igraph:: package in R.

Task…

Read the introductory notes on graph theory and network sciencePDF.

For a useful overview of graph-theory and network analysis concepts you could additionally have a look at:

Koutrouli, Mikaela et al.. (2020). “A Guide to Conquer the Biological Network Era Using Graph Theory”. Frontiers in Bioengineering and Biotechnology 8:34 .
[PMID: 32083072] [DOI: 10.3389/fbioe.2020.00034]

Abstract …

Networks are one of the most common ways to represent biological systems as complex sets of binary interactions or relations between different bioentities. In this article, we discuss the basic graph theory concepts and the various graph types, as well as the available data structures for storing and reading graphs. In addition, we describe several network properties and we highlight some of the widely used network topological features. We briefly mention the network patterns, motifs and models, and we further comment on the types of biological and biomedical networks along with their corresponding computer- and human-readable file formats. Finally, we discuss a variety of algorithms and metrics for network analyses regarding graph drawing, clustering, visualization, link prediction, perturbation, and network alignment as well as the current state-of-the-art tools. We expect this review to reach a very broad spectrum of readers varying from experts to beginners while encouraging them to enhance the field further.

Pavlopoulos, Georgios A et al.. (2011). “Using graph theory to analyze biological networks”. Biodata Mining 4:10 .
[PMID: 21527005] [DOI: 10.1186/1756-0381-4-10]

Abstract …

Understanding complex systems often requires a bottom-up analysis towards a systems biology approach. The need to investigate a system, not only as individual components but as a whole, emerges. This can be done by examining the elementary constituents individually and then how these are connected. The myriad components of a system and their interactions are best characterized as networks and they are mainly represented as graphs where thousands of nodes are connected with thousands of vertices. In this article we demonstrate approaches, models and methods from the graph theory universe and we discuss ways in which they can be used to reveal hidden properties and features of a network. This network profiling combined with knowledge extraction will help us to better understand the biological significance of the system.

Task…

Open RStudio and load the ABC-units R project. If you have loaded it before, choose File ▹ Recent projects ▹ ABC-Units. If you have not loaded it before, follow the instructions in the RPR-Introduction unit.
Choose Tools ▹ Version Control ▹ Pull Branches to fetch the most recent version of the project from its GitHub repository with all changes and bug fixes included. This ensures that your data and code remain up to date when we update, or fix bugs.
Type init() if requested.
Open the file FND-MAT-Graphs_and_networks.R and follow the instructions.

Note: take care that you understand all of the code in the script. Evaluation in this course is cumulative and you may be asked to explain any part of code.

Report topic

Context

The R project for the learning units contains a network that I have prepared from functional interaction data from the STRING database, subset with GO annotations for yeast. STRING curates networks of genes that have functional interactions, based on a variety of data such as experimental evidence, literature mining and more. GO (Gene Ontology) categorizes genes with functional annotations. The network I have prepared contains all yeast genes that have been annotated to the “mitotic cell cycle” and that have high-confidence functional interactions as determined by STRING.

Preparation

You can load the edge list with the command …

scCCnet <- readRDS("./data/scCCnet.rds")

… which will place the “tibble” scCCnet (S. cerevisiae Cell Cycle network) into your workspace. tibbles are modern versions of data frames and can be used just like data frames for most practical purposes¹. The code I wrote to create this tibble is in ./scripts/ABC-makeScCCnet.R - it may be worthwhile for you to look at the script to get a better understanding what this data is - or how to build a functional subset of genes for an interaction network for your own projects. All report options require you to plot the network: I expect that igraph::layout_with_graphopt() will work reasonably well to define the layout, but igraph:: has many options for graph layout and it is worthwhile to try them. In all cases you may need to play with the parameters to find the proper balance of node distance and node size.

Short Report options

Write a short report on one of the topics below. The goal of this short report is to connect graph measures to the biology of the nodes. (All reports must have the R code you wrote in an appendix.)

Option: Degree distribution

Topic: Degree Distribution

(A) Create an informative overview plot of the scCCnet network to highlight the degree distribution. Color and scale the nodes according to their degree and take care to choose the layout algorithm, and the plotting parameters for node color, size, and label-size well. “Excellent” submissions will include a legend in the plot and a caption for the figure. Obviously, the full plotting command needs to be included with your submission. Interpret what you see.

(B) Plot a log-frequency against log-rank plot of the degree distribution. Interpret this result. Does this look like a scale-free network? How can you tell? What are the highest-degree nodes? What are the lowest-degree nodes? Do those genes have anything in common? How can you tell? Interpret what you see.

Option: Centrality

Topic: Centrality

(A) Create an informative overview plot of the scCCnet network to highlight the centrality scores of the nodes. Choose one of: betweenness centrality, closeness centrality, or eigenvector centrality for your analysis. I expect that igraph::layout_with_graphopt() will provide a good starting point, but you may need to adjust the parameters. Color and scale the nodes according to their centrality score and take care to choose the layout algorithm, and the plotting parameters for node color, size, and label-size well. “Excellent” submissions will include a legend in the plot and a caption for the figure. Obviously, the full plotting command needs to be included with your submission. Interpret what you see.

(B) Order the nodes according to their score. Choose the nodes with the 10 highest centrality scores. Interpret the results. What are the highest-centrality nodes? Do genes with high centrality have anything in common?

Option: Clusters

Topic: Clusters

(A) Create an informative overview plot of the network to highlight the community structure of the network. Choose igraph::cluster_infomap() and one other of the eight other clustering algorithms of igraph:: for comparison. Color and scale the nodes according to their community membership and take care to choose the layout algorithm, and the plotting parameters for node color, size, and label-size well. “Excellent” submissions will include a legend in the plot and a caption for the figure. Obviously, the full plotting command needs to be included with your submission. Interpret what you see.

(B) Compare the two methods. Do the clusters overlap? Are they very different? Considering the biology of the nodes, which clustering method was more successful?

(C) For the “better” of the two methods, analyse the top 2 communities. Why do these genes have more functional interactions with each other than with other genes? Do the members of these clusters have anything in common? What did you learn from this analysis?

Hint: The Saccharomycs Genome Database hosts YeastMine - a Web tool into which you can paste a number of yeast gene IDs, and retrieve summary information including symbol, name, and sequence length. The name is a good indicator of a gene’s biological role.

Questions, comments

If in doubt, ask! If anything about this contents is not clear to you, do not proceed but ask for clarification. If you have ideas about how to make this material better, let’s hear them. We are aiming to compile a list of FAQs for all learning units, and your contributions will count towards your participation marks.

Improve this page! If you have questions or comments, please post them on the Quercus Discussion board with a subject line that includes the name of the unit.

References

About this page …

[END]

Post to the discussion board in case you come into a situation where you are trying to do something that ought to have worked with a data frame, but does not work in the same way with a tibble. There are some differences …↩︎