Difference between revisions of "FND-MAT-Graphs and networks"

From "A B C"
Jump to navigation Jump to search
m
m
Line 140: Line 140:
 
:*Write a function that takes an igraph graph as input and returns a random graph with the same degree distribution. Use the <code>sample_degseq()</code> function internally, with the <code>"vl"</code> method.
 
:*Write a function that takes an igraph graph as input and returns a random graph with the same degree distribution. Use the <code>sample_degseq()</code> function internally, with the <code>"vl"</code> method.
 
:*Write a script that produces a vector of 1,000 results from your first function, calculated for a graph produced by your second function using the <code>scCCnet</code> graph as input.
 
:*Write a script that produces a vector of 1,000 results from your first function, calculated for a graph produced by your second function using the <code>scCCnet</code> graph as input.
:*Plot and interpret the result.
+
:*Plot and interpret the result<ref>If you want to interpret this as a p-value, simply calculate the empirical p-value as (number of networks with size difference less or equal to <code>scCCnet</code> + 1) / (Number of random networks + 1).</ref>.
 
:*Create a new page on the student Wiki as a subpage of your User Page. Put your code there.
 
:*Create a new page on the student Wiki as a subpage of your User Page. Put your code there.
 
:*When you are done with developing this contents, add the following category tag to the page:
 
:*When you are done with developing this contents, add the following category tag to the page:

Revision as of 16:42, 2 November 2017

Graphs and Networks


 

Keywords:  Intoduction to graph theory and network science; iGraph


 



 


 


Abstract

"Traditional" bioinformatics has focussed on the properties of individual genes, but we are developing a much more functionally oriented view that investigates gene relationships. Analyzing relationships is the domain of graph theory, where relationships are represented as edges between nodes. This unit introduces key concepts and terminology and puts working with graphs into practice with the igraph package in R.


 


This unit ...

Prerequisites

You need the following preparation before beginning this unit. If you are not familiar with this material from courses you took previously, you need to prepare yourself from other information sources:

  • Calculus: functions and equations; polynomial functions, logarithms, trigonometric functions; integrals and derivatives; theorem and proof.

You need to complete the following units before beginning this one:


 


Objectives

This unit will ...

  • ... introduce concepts and definitions of graph theory;
  • ... demonstrate how to work with graphs in R, using the igraph package.


 


Outcomes

After working through this unit you ...

  • ... know what the most important terms of graph theory mean;
  • ... are familar with a number of quantitiative measures on graphs;
  • ... can use igraph to construct graphs, plot graphs with a variety of layout algorithms, and analyse their degree distribution, centrality distribution, and community structure;
  • ... can interpret such measures in a biological context.


 


Deliverables

  • Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
  • Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.
  • Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.


 


Evaluation

This learning unit can be evaluated for a maximum of 6 marks. If you want to submit tasks for this unit for credit you have the following options. If you have any questions about these options, discuss on the mailing list.

Context
The R project for the learning units contains a network that I have prepared from functional interaction data from the STRING database, subset with GO annotations for yeast. STRING curates networks of genes that have functional interactions, based on a variety of data such as experimental evidence, literature mining and more. GO (Gene Ontology) categorizes genes with functional annotations. The network I have prepared contains all yeast genes that have been annotated to the "mitotic cell cycle" and that have high-confidence functional interactions as determined by STRING. It contains 261 genes and 1280 interactions. You can load the edge list with the command ...
load("./data/scCCnet.RData")

... which will place the "tibble" scCCnet (S. cerevisiae Cell Cycle network) into your workspace. tibbles are modern versions of data frames and can be used just like data frames for most practical purposes[1]. The code I wrote to create this tibble is in ./scripts/ABC-makeScCCnet.R - it may be worthwhile for you to look at the script to get a better understanding what this data is. All options require you to plot the network, I expect that layout_with_graphopt() will work reasonably well to define the layout, but igraph has many options for graph layout and its worthwhile to try them. In all cases you may need to play with the parameters to find the proper balance of node distance and node size.

Short Report option
1. Create a new page on the student Wiki as a subpage of your User Page.
2. Write a short report on one of the three following topics - A, B, or C. The goal of this short report is to connect graph measures to the biology of the nodes. (All reports must have the R code you wrote in an appendix.)
A - Degree distribution
A.1 Create an informative overview plot of the network to highlight the degree distribution. Color and scale the nodes according to their degree.
A.2 Plot a log-frequency against log-rank plot of the degree distribution. Does this look like a scale-free network? Interpret this result. What are the highest-degree nodes? What are the lowest-degree nodes? Do genes in these categories have anything in common?
B - Centrality
B.1 Create an informative overview plot of the network to highlight the centrality scores of the nodes. Choose one of: betweenness centrality, closeness centrality, or eigenvector centrality for your analysis. I expect that layout_with_graphopt() will work reasonably well for the plot, but you may need to play with the parameters. Color and scale the nodes according to their centrality score.
B.2 Order the nodes according to their score. Choose the nodes with the 10 highest centrality scores. Interpret the results. What are the highest-centrality nodes? Do genes with high centrality have anything in common?
C - Clusters
C.1 Create an informative overview plot of the network to highlight the community structure of the network. Choose cluster_infomap() and one of the eight other clustering algorithms of igraph. Color and scale the nodes according to their community membership.
C.2 Compare the two methods. Do the clusters overlap? Are they very different? Considering the biology of the nodes, which clustering method was more successful?
C.3 For the "better" of the two methods, analyse the top 4 communities. Why do these genes have more functional interactions with each other than with other genes? Do the memebrs of these clusters have anything in common?
Hint: The Saccharomycs Genome Database hosts YeastMine - a Web tool into which you can paste a number of yeast gene IDs, and retrieve summary information including symbol, name, and sequence length. The name is a good indicator of a gene's biological role.
3. When you are done with your report, add the following category tag to the page:
[[Category:Eval-FND-Graphs]]
Do not change your submission page after this tag has been added. The page will be marked and the category tag will be removed by the instructor.


 


Quiz option
Open the signup-page for the quiz for this unit (linked from here) and add your name. Your name must be signed up by 12:00 of the day of the Quiz to ensure copies of the quiz are available for all participants.
Quizzes will be written in class, back-to-back if there is more than one quiz scheduled. We may begin at any time. We will have an open-ended Q&A session before the quiz. You can't take the quiz if you are not present in class when the question sheets are handed out, so don't be late. Once all scheduled quizzes are written, we will discuss and mark them. You will mark your own quiz. All marking must be done with a red pen - so you must bring a red pen to class in order to participate. The mark you give yourself may be revised by the instructor after spot-checking quizzes. If this is necessary, you will be notified. You must mark your quiz correctly and honestly - don't get into trouble with academic integrity rules: it will be an academic offence if you mark questions as correct that were discussed in class and should have been marked incorrect. When in doubt, ask.


 
R-code option
The graph defined by the scCCnet edgelist contains a pronounced community structure with several large modules of similar size. This has an obvious biological interpretation of systems of functionally interacting genes. But is this an independent feature that has evolved due to functional requirements in these genes? Or is it simply a consequence of the degree distribution? Your task is to assess this question by comparing the community structure of the scCCnet graph to random graphs with the same degree distribution.
Submit code according to the following requirements. Make sure your code is documented and that you have tested your functions to give correct output.
  • Write a function that takes an igraph graph as input and outputs the size difference between the largest and the second largest community, as defined by cluster_infomap().
  • Write a function that takes an igraph graph as input and returns a random graph with the same degree distribution. Use the sample_degseq() function internally, with the "vl" method.
  • Write a script that produces a vector of 1,000 results from your first function, calculated for a graph produced by your second function using the scCCnet graph as input.
  • Plot and interpret the result[2].
  • Create a new page on the student Wiki as a subpage of your User Page. Put your code there.
  • When you are done with developing this contents, add the following category tag to the page:
[[Category:EVAL-FND-Graphs]]
Do not change your submission page after this tag has been added. The page will be marked and the category tag will be removed by the instructor.


 
Option to write a "Self-Evaluation Question"
Write a "Self-evaluation Question" (with a model solution) that explores the actual use of centrality measures in the recent biomedical literature.
  • Find a recent article (not older than 2 years) that uses a centrality score in the analysis of biological networks. Summarize the use of this measure, and then quote a passage from the article that describes how the measure was used. Then ask the learner to paraphrase what this passage means, and how the measure relates to the biology that is being described. Ensure that the answer is feasible, given what was discussed in the unit, and what was summarized and quoted by you. The goal is for the learner to think about the biological interpretation of graph-theoretical measures obtained from biological networks. Apply the marking rubrics in spirit to satisfy yourself of the quality of your question. Use the format and code templates that you find on the Self evaluation questions page - but don't assume those examples are already models of excellent contributions. This will be a short-answer format question. Note: assume that approximately the same amount of work is expected for all evaluation options. Consequently, the standard of excellence for this option will be quite high.
  • Create a new page on the student Wiki as a subpage of your User Page. Develop your question there.
  • When you are done with developing this contents, add the following category tag to the page:
[[Category:EVAL-FND-Graphs]]
Do not change your submission page after this tag has been added. The page will be marked and the category tag will be removed by the instructor.



 


Contents

Task:

Pavlopoulos et al. (2011) Using graph theory to analyze biological networks. BioData Min 4:10. (pmid: 21527005)

PubMed ] [ DOI ] Understanding complex systems often requires a bottom-up analysis towards a systems biology approach. The need to investigate a system, not only as individual components but as a whole, emerges. This can be done by examining the elementary constituents individually and then how these are connected. The myriad components of a system and their interactions are best characterized as networks and they are mainly represented as graphs where thousands of nodes are connected with thousands of vertices. In this article we demonstrate approaches, models and methods from the graph theory universe and we discuss ways in which they can be used to reveal hidden properties and features of a network. This network profiling combined with knowledge extraction will help us to better understand the biological significance of the system.


 

Task:

 
  • Open RStudio and load the ABC-units R project. If you have loaded it before, choose FileRecent projectsABC-Units. If you have not loaded it before, follow the instructions in the RPR-Introduction unit.
  • Choose ToolsVersion ControlPull Branches to fetch the most recent version of the project from its GitHub repository with all changes and bug fixes included.
  • Type init() if requested.
  • Open the file FND-MAT-Graphs_and_networks.R and follow the instructions.


 

Note: take care that you understand all of the code in the script. Evaluation in this course is cumulative and you may be asked to explain any part of code.


 


 



 


Further reading, links and resources

 


Notes

  1. Post to the mailing list in case you come into a situation where you are trying to do something that ought to have worked with a data frame, but does not work in the same way with a tibble.
  2. If you want to interpret this as a p-value, simply calculate the empirical p-value as (number of networks with size difference less or equal to scCCnet + 1) / (Number of random networks + 1).


 


Self-evaluation

 



 




 

If in doubt, ask! If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.



 

About ...
 
Author:

Boris Steipe <boris.steipe@utoronto.ca>

Created:

2017-08-05

Modified:

2017-08-05

Version:

1.0

Version history:

  • 1.0 First live version
  • 0.1 Stub only

CreativeCommonsBy.png This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.