FND-MAT-Graphs and networks

From "A B C"
Jump to navigation Jump to search

Graphs and Networks

(Intoduction to graph theory and network science; iGraph)


 


Abstract:

"Traditional" bioinformatics has focussed on the properties of individual genes, but we are developing a much more functionally oriented view that investigates gene relationships. Analyzing relationships is the domain of graph theory, where relationships are represented as edges between nodes. This unit introduces key concepts and terminology and puts working with graphs into practice with the igraph:: package in R.


Objectives:
This unit will ...

  • ... introduce concepts and definitions of graph theory;
  • ... demonstrate how to work with graphs in R, using the igraph:: package.

Outcomes:
After working through this unit you ...

  • ... know what the most important terms of graph theory mean;
  • ... are familar with a number of quantitiative measures on graphs;
  • ... can use igraph:: to construct graphs, plot graphs with a variety of layout algorithms, and analyse their degree distribution, centrality distribution, and community structure;
  • ... can interpret such measures in a biological context.

Deliverables:

  • Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
  • Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.
  • Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.

  • Prerequisites:
    You need the following preparation before beginning this unit. If you are not familiar with this material from courses you took previously, you need to prepare yourself from other information sources:

    • Calculus: functions and equations; polynomial functions, logarithms, trigonometric functions; integrals and derivatives; theorem and proof.

    This unit builds on material covered in the following prerequisite units:


     



     



     


    Evaluation

    This learning unit can be evaluated for a maximum of 5 marks. There are several options for submission. Choose one option, then ...

    1. Create a new page on the student Wiki as a subpage of your User Page.
    2. Put all of your writing to submit on this one page.
    3. When you are done with everything, go to the Quercus Assignments page and open the first Learning Unit that you have not submitted yet. Paste the URL of your Wiki page into the form, and click on Submit Assignment.

    Your link can be submitted only once and not edited. But you may change your Wiki page at any time. However only the last version before the due date will be marked. All later edits will be silently ignored.


     
    Context
    The R project for the learning units contains a network that I have prepared from functional interaction data from the STRING database, subset with GO annotations for yeast. STRING curates networks of genes that have functional interactions, based on a variety of data such as experimental evidence, literature mining and more. GO (Gene Ontology) categorizes genes with functional annotations. The network I have prepared contains all yeast genes that have been annotated to the "mitotic cell cycle" and that have high-confidence functional interactions as determined by STRING. You can load the edge list with the command ...
    scCCnet <- readRDS("./data/scCCnet.rds")
    

    ... which will place the "tibble" scCCnet (S. cerevisiae Cell Cycle network) into your workspace. tibbles are modern versions of data frames and can be used just like data frames for most practical purposes[1]. The code I wrote to create this tibble is in ./scripts/ABC-makeScCCnet.R - it may be worthwhile for you to look at the script to get a better understanding what this data is. All options require you to plot the network, I expect that layout_with_graphopt() will work reasonably well to define the layout, but igraph:: has many options for graph layout and its worthwhile to try them. In all cases you may need to play with the parameters to find the proper balance of node distance and node size.

    Short Report option
    1. Create a new page on the student Wiki as a subpage of your User Page.
    2. Write a short report on one of the three following topics - A, B, or C. The goal of this short report is to connect graph measures to the biology of the nodes. (All reports must have the R code you wrote in an appendix.)
    A - Degree distribution
    A.1 Create an informative overview plot of the scCCnet network to highlight the degree distribution. Color and scale the nodes according to their degree. How well you choose the layout algorithm, and the plotting parameters for node color, size, and label-size, will determine part of your mark. "Excellent" submissions will include a legend in the plot. Obviously, the full plotting command will be included on your submitted page.
    A.2 Plot a log-frequency against log-rank plot of the degree distribution. Does this look like a scale-free network? Interpret this result. What are the highest-degree nodes? What are the lowest-degree nodes? Do genes in these categories have anything in common?
    B - Centrality
    B.1 Create an informative overview plot of the scCCnet network to highlight the centrality scores of the nodes. Choose one of: betweenness centrality, closeness centrality, or eigenvector centrality for your analysis. I expect that igraph::layout_with_graphopt() will provide a good starting point, but you may need to adjust the parameters. Color and scale the nodes according to their centrality score. How well you choose the layout algorithm, and the plotting parameters for node color, size, and label-size, will determine part of your mark. "Excellent" submissions will include a legend in the plot. Obviously, the full plotting command is to be included on your submitted page.
    B.2 Order the nodes according to their score. Choose the nodes with the 10 highest centrality scores. Interpret the results. What are the highest-centrality nodes? Do genes with high centrality have anything in common?
    C - Clusters
    C.1 Create an informative overview plot of the network to highlight the community structure of the network. Choose igraph::cluster_infomap() and one of the eight other clustering algorithms of igraph:: for comparison. Color and scale the nodes according to their community membership. How well you choose the layout algorithm, and the plotting parameters for node color, size, and label-size, will determine part of your mark. "Excellent" submissions will include a legend in the plot. Obviously, the full plotting command is to be included on your submitted page.
    C.2 Compare the two methods. Do the clusters overlap? Are they very different? Considering the biology of the nodes, which clustering method was more successful?
    C.3 For the "better" of the two methods, analyse the top 4 communities. Why do these genes have more functional interactions with each other than with other genes? Do the members of these clusters have anything in common?
    Hint: The Saccharomycs Genome Database hosts YeastMine - a Web tool into which you can paste a number of yeast gene IDs, and retrieve summary information including symbol, name, and sequence length. The name is a good indicator of a gene's biological role.
    • When you are done, submit the link to your page via Quercus as described above.


     



     
    R-code option
    The graph defined by the scCCnet edgelist contains a pronounced community structure with several large modules of similar size. This has an obvious biological interpretation of systems of functionally interacting genes. But what mechanism could have produced this graph-feature that has evolved to support functional requirements of the yeast cell cycle? Your task is to assess degree distribution and community structure of the scCCnet graph to random graphs that are produced with different algorithms. igraph:: has several stochastic network construction algorithms which it calls "games". You can find them in the igraph:: manual, searching for "game". Among them are preferential attachment, Watts-Strogatz, forest fire, and many more. Identify which algorithms are able to create a network from first principles (i.e. not according to a given degree distribution) that is similar to scCCnet.


    Submit code according to the following requirements. Make sure your code is documented and that you have tested your functions to give correct output.
    • Write a function that takes an igraph:: graph as input and computes the Kullback-Leibler divergence (cf. FND-STA-Probability_distribution) between a simulated graph and scCCnet. Validate your function with igraph::sample_degseq().
    • For three igraph:: "games", find parameters that minimize the KL-divergence between the produced graph and scCCnet.
    • Plot and interpret the results.
    • Create a new page on the student Wiki as a subpage of your User Page. Put your code there.
    • When you are done, submit the link to your page via Quercus as described above.


     
    Option to write a "Self-Evaluation Question"
    You can submit a "Self-Evaluation Question" for at most one of your assignments.
    Write a "Self-evaluation Question" (with a model solution) that explores the actual use of centrality measures in the recent biomedical literature.
    • Find a recent article (not older than 2 years) that uses a centrality score in the analysis of biological networks. Summarize the use of this measure, and then quote a passage from the article that describes how the measure was used. Then ask the learner to paraphrase what this passage means, and how the measure relates to the biology that is being described. Ensure that the answer is feasible, given what was discussed in the unit, and what was summarized and quoted by you. The goal is for the learner to think about the biological interpretation of graph-theoretical measures obtained from biological networks. Apply the marking rubrics in spirit to satisfy yourself of the quality of your question. Use the format and code templates that you find on the Self evaluation questions page - but don't assume those examples are already models of excellent contributions. This will be a short-answer format question. Note: assume that approximately the same amount of work is expected for all evaluation options. Consequently, the standard of excellence for this option will be very high.
    • Create a new page on the student Wiki as a subpage of your User Page. Develop your question there.
    • When you are done, submit the link to your page via Quercus as described above.


    Contents

    Task:

    • Read the introductory notes on graph theory and network science.
    • For a useful overview of graph-theory and network analysis concepts you could additionally have a look at:
    Koutrouli et al. (2020) A Guide to Conquer the Biological Network Era Using Graph Theory. Front Bioeng Biotechnol 8:34. (pmid: 32083072)

    PubMed ] [ DOI ] Networks are one of the most common ways to represent biological systems as complex sets of binary interactions or relations between different bioentities. In this article, we discuss the basic graph theory concepts and the various graph types, as well as the available data structures for storing and reading graphs. In addition, we describe several network properties and we highlight some of the widely used network topological features. We briefly mention the network patterns, motifs and models, and we further comment on the types of biological and biomedical networks along with their corresponding computer- and human-readable file formats. Finally, we discuss a variety of algorithms and metrics for network analyses regarding graph drawing, clustering, visualization, link prediction, perturbation, and network alignment as well as the current state-of-the-art tools. We expect this review to reach a very broad spectrum of readers varying from experts to beginners while encouraging them to enhance the field further.

    Pavlopoulos et al. (2011) Using graph theory to analyze biological networks. BioData Min 4:10. (pmid: 21527005)

    PubMed ] [ DOI ] Understanding complex systems often requires a bottom-up analysis towards a systems biology approach. The need to investigate a system, not only as individual components but as a whole, emerges. This can be done by examining the elementary constituents individually and then how these are connected. The myriad components of a system and their interactions are best characterized as networks and they are mainly represented as graphs where thousands of nodes are connected with thousands of vertices. In this article we demonstrate approaches, models and methods from the graph theory universe and we discuss ways in which they can be used to reveal hidden properties and features of a network. This network profiling combined with knowledge extraction will help us to better understand the biological significance of the system.


     

    Task:

     
    • Open RStudio and load the ABC-units R project. If you have loaded it before, choose FileRecent projectsABC-Units. If you have not loaded it before, follow the instructions in the RPR-Introduction unit.
    • Choose ToolsVersion ControlPull Branches to fetch the most recent version of the project from its GitHub repository with all changes and bug fixes included.
    • Type init() if requested.
    • Open the file FND-MAT-Graphs_and_networks.R and follow the instructions.


     

    Note: take care that you understand all of the code in the script. Evaluation in this course is cumulative and you may be asked to explain any part of code.


     


     


    Notes

    1. Post to the discussion board in case you come into a situation where you are trying to do something that ought to have worked with a data frame, but does not work in the same way with a tibble. There are some differences ...


     


    About ...
     
    Author:

    Boris Steipe <boris.steipe@utoronto.ca>

    Created:

    2017-08-05

    Modified:

    2020-10-07

    Version:

    1.2

    Version history:

    • 1.2 Edit policy update
    • 1.1 2020 Updates and revisions to deliverables
    • 1.0 First live version
    • 0.1 Stub only

    CreativeCommonsBy.png This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.