Expected Preparations:
Keywords: Introduction to graph theory and network science; iGraph
Objectives:
This unit will …
Outcomes:
After working through this unit you …
Deliverables:
Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don’t overlook these.
Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.
Evaluation: Material based on this learning unit can be submitted for formative feedback. To submit:
“Traditional” bioinformatics has focussed on
the properties of individual genes, but we are developing a much
more functionally oriented view that investigates gene
relationships. Analyzing relationships is the domain of
graph theory, where relationships are represented as edges between
nodes. This unit introduces key concepts and terminology and puts
working with graphs into practice with the igraph:: package in R.
Task…
Koutrouli, Mikaela et al. (2020). “A Guide to Conquer the Biological
Network Era Using Graph Theory”. Frontiers in Bioengineering and
Biotechnology 8:34.
[PMID: 32083072]
[DOI: 10.3389/fbioe.2020.00034]
Pavlopoulos, Georgios A et al. (2011). “Using graph theory to analyze
biological networks”. BioData Mining 4:10.
[PMID: 21527005]
[DOI: 10.1186/1756-0381-4-10]
Task…
Load the ABC-units R project. If you have loaded it before, choose
File ▹ Recent projects ▹ ABC-Units. If you have not loaded it before,
follow the instructions in the RPR-Introduction unit.
Type init() if requested.
Open the file FND-MAT-Graphs_and_networks.R and follow the instructions.
Note: take care that you understand all of the code in the script. Evaluation in this course is cumulative and you may be asked to explain any part of the code.
Context
The R project for the learning units contains a network that I have
prepared from functional interaction data from the STRING database,
subset with GO annotations for yeast. STRING curates networks of genes
that have functional interactions, based on a variety of data such as
experimental evidence, literature mining and more. GO (Gene Ontology)
categorizes genes with functional annotations. The network I have
prepared contains all yeast genes that have been annotated to the
“mitotic cell cycle”
and that have high-confidence
functional interactions as determined by STRING.
Preparation
You can load the edge list with the command …
scCCnet <- readRDS("./data/scCCnet.rds")
… which will place the “tibble” scCCnet (S. cerevisiae Cell Cycle
network) into your workspace. Tibbles are modern versions of data frames
and can be used just like data frames for most practical purposes [1].
The code I wrote to create this tibble is in ./scripts/ABC-makeScCCnet.R.
It may be worthwhile for you to look at the script to get a better
understanding of what this data is, or of how to build a functional
subset of genes for an interaction network for your own projects. All
report options require you to plot the network: I expect that
igraph::layout_with_graphopt() will work reasonably well to define the
layout, but igraph:: has many options for graph layout and it is
worthwhile to try them. In all cases you may need to play with the
parameters to find the proper balance of node distance and node size.
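As a rough illustration of the plotting step, here is a minimal sketch that builds a graph from an edge list and lays it out with igraph::layout_with_graphopt(). It uses a small toy edge list rather than scCCnet: the column names a and b, the node labels, and all parameter values are assumptions chosen for the example, so inspect the real tibble (e.g. with str(scCCnet)) before adapting it.

```r
library(igraph)

# Toy edge list standing in for scCCnet. Column names and gene labels
# are hypothetical; the real tibble may have different columns.
edges <- data.frame(a = c("G1", "G1", "G2", "G3", "G4"),
                    b = c("G2", "G3", "G3", "G4", "G5"))

# The first two columns are interpreted as the two endpoints of an edge;
# any further columns would become edge attributes.
g <- graph_from_data_frame(edges, directed = FALSE)

set.seed(112358)                       # the layout starts from random positions
xy <- layout_with_graphopt(g, charge = 0.002)   # charge value is an arbitrary choice

plot(g,
     layout = xy,
     vertex.size = 12,                 # node size
     vertex.label.cex = 0.7,           # label size
     vertex.color = "#DDEEFF")
```

Computing the layout once and storing it in xy makes it possible to replot the same arrangement while changing only colors or sizes.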
Write a short report on one of the topics below. The goal of this short report is to connect graph measures to the biology of the nodes. (All reports must have the R code you wrote in an appendix.)
Topic: Degree Distribution
(A) Create an informative overview plot of the scCCnet network to
highlight the degree distribution. Color and scale the nodes according
to their degree and take care to choose the layout algorithm, and the
plotting parameters for node color, size, and label-size well.
“Excellent” submissions will include a legend in the plot and a caption
for the figure. Obviously, the full plotting command needs to be
included with your submission. Interpret what you see.
(B) Plot a log-frequency against log-rank plot of the degree distribution. Interpret this result. Does this look like a scale-free network? How can you tell? What are the highest-degree nodes? What are the lowest-degree nodes? Do those genes have anything in common? How can you tell? Interpret what you see.
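One possible way to approach such a plot is sketched below on a toy graph, not on scCCnet. The sketch tabulates how often each degree value occurs and plots log10(frequency) against log10(degree); treating the degree value itself as the "rank" is an assumption about what the task intends. The toy graph comes from igraph::sample_pa(), which grows a graph by preferential attachment.

```r
library(igraph)

set.seed(112358)
# Toy graph grown by preferential attachment (not the course network)
g <- sample_pa(200, power = 1, m = 2, directed = FALSE)

dg <- degree(g)
freq <- table(dg[dg > 0])              # how often each degree value occurs
x <- log10(as.numeric(names(freq)))    # log10 of the degree values
y <- log10(as.numeric(freq))           # log10 of their frequencies

plot(x, y, pch = 21, bg = "#A5F5CC",
     xlab = "log10(degree)", ylab = "log10(frequency)")
fit <- lm(y ~ x)
abline(fit, col = "#AA0000")           # an approximately linear trend is
                                       # what a power law would look like

# Nodes at the two ends of the distribution
head(sort(dg, decreasing = TRUE), 5)
head(sort(dg), 5)
```

On a graph with named nodes, degree() returns a named vector, so the sorted output directly lists the gene identifiers to look up.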
Topic: Centrality
(A) Create an informative overview plot of the scCCnet network to
highlight the centrality scores of the nodes. Choose one of: betweenness
centrality, closeness centrality, or eigenvector centrality for your
analysis. I expect that igraph::layout_with_graphopt() will provide a
good starting point, but you may need to adjust the parameters. Color
and scale the nodes according to their centrality score and take care to
choose the layout algorithm, and the plotting parameters for node color,
size, and label-size well. “Excellent” submissions will include a legend
in the plot and a caption for the figure. Obviously, the full plotting
command needs to be included with your submission. Interpret what you
see.
(B) Order the nodes according to their score. Choose the nodes with the 10 highest centrality scores. Interpret the results. What are the highest-centrality nodes? Do genes with high centrality have anything in common?
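The following sketch shows how centrality scores might be computed, ranked, and mapped to node color and size. It uses the Zachary karate club graph that ships with igraph as a stand-in for scCCnet; the node names, the palette, the number of color bins, and the size scaling are arbitrary choices made for illustration.

```r
library(igraph)

g <- make_graph("Zachary")             # toy graph with 34 nodes
V(g)$name <- paste0("n", seq_len(vcount(g)))   # hypothetical node names

bc <- betweenness(g)                   # named vector, one score per node
# Alternatives: closeness(g) or eigen_centrality(g)$vector

# The ten highest-scoring nodes, in decreasing order
top10 <- head(sort(bc, decreasing = TRUE), 10)
print(top10)

# Map scores to ten color bins and to node sizes
pal <- colorRampPalette(c("#FFFFFF", "#CC0000"))(10)
idx <- cut(bc, breaks = 10, labels = FALSE)
V(g)$color <- pal[idx]
V(g)$size <- 5 + 15 * bc / max(bc)

set.seed(112358)
plot(g, layout = layout_with_graphopt(g), vertex.label.cex = 0.6)
legend("topleft", title = "betweenness",
       legend = c("low", "high"),
       pch = 21, pt.bg = pal[c(1, 10)], bty = "n")
```

Binning the scores with cut() keeps the legend simple; a continuous color ramp would work as well but needs a more elaborate legend.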
Topic: Clusters
(A) Create an informative overview plot of the network to highlight the
community structure of the network. Choose igraph::cluster_infomap()
and, for comparison, one of the eight other clustering algorithms of
igraph::. Color and scale the nodes according to their community
membership and take care to choose the layout algorithm, and the
plotting parameters for node color, size, and label-size well.
“Excellent” submissions will include a legend in the plot and a caption
for the figure. Obviously, the full plotting command needs to be
included with your submission. Interpret what you see.
(B) Compare the two methods. Do the clusters overlap? Are they very different? Considering the biology of the nodes, which clustering method was more successful?
(C) For the “better” of the two methods, analyse the top 2 communities. Why do these genes have more functional interactions with each other than with other genes? Do the members of these clusters have anything in common? What did you learn from this analysis?
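A comparison of two clustering results could be set up along the lines of the sketch below. It again uses the Zachary karate club graph as a toy stand-in for scCCnet, and igraph::cluster_walktrap() as the second method; that choice is arbitrary and only serves the example.

```r
library(igraph)

g <- make_graph("Zachary")             # toy graph, not the course network

set.seed(112358)                       # infomap involves random trials
cIM <- cluster_infomap(g)
cWT <- cluster_walktrap(g)             # second method, chosen arbitrarily

mIM <- as.integer(membership(cIM))     # community index of each node
mWT <- as.integer(membership(cWT))

sizes(cIM)                             # number of nodes per community
sizes(cWT)

# Cross-tabulate the memberships to see how the communities overlap
table(infomap = mIM, walktrap = mWT)

# Normalized mutual information: 1 means identical partitions
nmi <- compare(cIM, cWT, method = "nmi")
print(nmi)

# Color the nodes by their infomap community
pal <- rainbow(length(cIM))
set.seed(112358)
plot(g, layout = layout_with_graphopt(g),
     vertex.color = pal[mIM], vertex.size = 10, vertex.label.cex = 0.6)
legend("topleft", title = "community",
       legend = seq_along(pal), pch = 21, pt.bg = pal, bty = "n")
```

The cross-table shows at a glance whether one method merely splits or merges the communities found by the other, or whether the two partitions are genuinely different.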
Hint: The Saccharomyces Genome Database hosts YeastMine, a Web tool into which you can paste a number of yeast gene IDs and retrieve summary information including symbol, name, and sequence length. The name is a good indicator of a gene’s biological role.
If in doubt, ask! If anything about this content is not clear to you, do not proceed but ask for clarification. If you have ideas about how to make this material better, let’s hear them. We are aiming to compile a list of FAQs for all learning units, and your contributions will count towards your participation marks.
Improve this page! If you have questions or comments, please post them on the Quercus Discussion board with a subject line that includes the name of the unit.
[END]
[1] Post to the discussion board if you run into a situation where you are trying to do something that ought to have worked with a data frame, but does not work in the same way with a tibble. There are some differences …