BIO Assignment Week 11

Assignment for Week 11
Protein-Protein Interactions

< Assignment 10

Note! This assignment is currently active. All significant changes will be announced on the mailing list.

Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.

Introduction

Task:

Carefully read the lecture notes for this unit, Week 11: Annotated Notes (PDF 12.2 MB).

For a useful overview of graph-theory concepts you could additionally have a look at:

Pavlopoulos et al. (2011) Using graph theory to analyze biological networks. BioData Min 4:10. (pmid: 21527005)

[ PubMed ] [ DOI ] Understanding complex systems often requires a bottom-up analysis towards a systems biology approach. The need to investigate a system, not only as individual components but as a whole, emerges. This can be done by examining the elementary constituents individually and then how these are connected. The myriad components of a system and their interactions are best characterized as networks and they are mainly represented as graphs where thousands of nodes are connected with thousands of vertices. In this article we demonstrate approaches, models and methods from the graph theory universe and we discuss ways in which they can be used to reveal hidden properties and features of a network. This network profiling combined with knowledge extraction will help us to better understand the biological significance of the system.

However, the concepts you need to know for this assignment should become clear from the notes.

Data Sources

Interaction databases face problems similar to those of sequence databases: the need for standards for abstracting biological concepts into computable objects, data integrity, search and retrieval, and the metrics of comparison. There is, however, an added complication: interactions are rarely all-or-none, and the high-throughput experimental methods have large false-positive and false-negative rates. This makes it necessary to define confidence scores for interactions. On top of experimental methods, there are also a variety of methods for computational interaction prediction. However, even though the "gold standard" is careful, small-scale laboratory experiments, different curation efforts on the same experimental publication usually lead to different results, with as little as 42% overlap between databases being reported.
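To make the idea of "overlap between databases" concrete, here is a minimal sketch of one way such a number could be computed. It is an illustration only: the source does not say how the reported 42% figure was calculated, so using the shared fraction of the union (the Jaccard index) is an assumption, and the protein names and interaction lists are made-up placeholders.

```python
def normalize(pairs):
    """Treat interactions as undirected: (A, B) and (B, A) are the same edge."""
    return {frozenset(pair) for pair in pairs}


def overlap(db_a, db_b):
    """Fraction of all distinct interactions that both curations report.

    Assumption: overlap is measured as |A and B| / |A or B| (Jaccard index).
    """
    a, b = normalize(db_a), normalize(db_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical curation results for the same publication (placeholder data).
db_a = [("A", "B"), ("C", "B"), ("A", "D")]
db_b = [("B", "A"), ("C", "B"), ("A", "E"), ("F", "C")]

score = overlap(db_a, db_b)
print(f"shared fraction: {score:.2f}")
```

Note that the pair ("A", "B") in one list matches ("B", "A") in the other only because the pairs are normalized first. Identifier mapping between databases is a separate and harder problem that this sketch ignores.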

Currently, the best integrated protein-protein interaction database is probably IntAct at the EBI, which, besides curating interactions from the literature, hosts interactions from the IMEx consortium, an extensive data-sharing agreement between a number of general and specialized source databases.

Task:

Access IntAct and enter the UniProt ID for yeast Mbp1: P39678.
Click on the "Graph" tab to load a network graph.
Switch "Merge edges" off to show the reported edges for this interaction individually. Which protein pair has the most interactions? Does this make sense?
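The question about which protein pair has the most interactions amounts to counting parallel edges in a multigraph. The following is a minimal sketch of that count, assuming the reported edges are available as a simple list of pairs. The identifiers P1, P2 and P3 are hypothetical placeholders, not IntAct data.

```python
from collections import Counter


def count_edges_per_pair(edges):
    """Count reported edges per unordered protein pair."""
    # Sorting each pair makes (P1, P2) and (P2, P1) count as the same pair.
    return Counter(tuple(sorted(edge)) for edge in edges)


# Hypothetical unmerged edge list: one entry per reported interaction.
reported = [("P1", "P2"), ("P2", "P1"), ("P1", "P3"), ("P1", "P2")]

counts = count_edges_per_pair(reported)
pair, n = counts.most_common(1)[0]
print(pair, n)
```

A high count for one pair may reflect how often that pair has been studied rather than how strong the interaction is, which is worth keeping in mind when answering "Does this make sense?".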

But then what?

If you are like me, you would now like to be able to link expression profiles, information about known complexes, GO annotations, knock-out phenotypes etc. etc. Too bad.

Working with biological graphs in R

Task:

Open RStudio.
Choose File → Recent Projects → BCH441_2016.
Pull the latest version of the project repository from GitHub.
Type init()
Open the file BCH441_A11.R and work through the entire tutorial.

At the end of the tutorial, you are asked to print R code and data on a sheet of paper and bring it to class. I will mark this; it is worth a maximum of 4 marks. Be careful to follow the instructions exactly, especially regarding how to use your student number as a randomization seed.

This is all that is required. There is optional material below that you may find interesting.

Optional: Data visualization and analysis

If you work a lot with interaction networks, sooner or later you will come across Cytoscape. It is more or less the standard among "professional" systems biologists. But it is not an online tool.

Task:

Navigate to the Cytoscape homepage and find out what the program does and how to install it. Many tutorials are available online. But this is software that needs to be downloaded and installed, and it definitely has a learning curve.

The state of integrated online interaction viewers these days could be improved. Have a look at this article that discusses the gap between what one would need to do, and what is offered:

Jeanquartier et al. (2015) Integrated web visualizations for protein-protein interaction databases. BMC Bioinformatics 16:195. (pmid: 26077899)

[ PubMed ] [ DOI ] BACKGROUND: Understanding living systems is crucial for curing diseases. To achieve this task we have to understand biological networks based on protein-protein interactions. Bioinformatics has come up with a great amount of databases and tools that support analysts in exploring protein-protein interactions on an integrated level for knowledge discovery. They provide predictions and correlations, indicate possibilities for future experimental research and fill the gaps to complete the picture of biochemical processes. There are numerous and huge databases of protein-protein interactions used to gain insights into answering some of the many questions of systems biology. Many computational resources integrate interaction data with additional information on molecular background. However, the vast number of diverse Bioinformatics resources poses an obstacle to the goal of understanding. We present a survey of databases that enable the visual analysis of protein networks. RESULTS: We selected M=10 out of N=53 resources supporting visualization, and we tested against the following set of criteria: interoperability, data integration, quantity of possible interactions, data visualization quality and data coverage. The study reveals differences in usability, visualization features and quality as well as the quantity of interactions. StringDB is the recommended first choice. CPDB presents a comprehensive dataset and IntAct lets the user change the network layout. A comprehensive comparison table is available via web. The supplementary table can be accessed on http://tinyurl.com/PPI-DB-Comparison-2015. CONCLUSIONS: Only some web resources featuring graph visualization can be successfully applied to interactive visual analysis of protein-protein interaction. Study results underline the necessity for further enhancements of visualization integration in biochemical analysis tools. 
Identified challenges are data comprehensiveness, confidence, interactive feature and visualization maturing.

The online resource that comes out as the best is the one at the String database.

Task:

Navigate to the String database and search for Saccharomyces cerevisiae Mbp1 interactors.
Visualize the network. Add a few proteins by clicking the (+) button two or three times.
Click on a node to get a synopsis of its function.
Explore the "confidence", "evidence" and "actions" networks for the retrieved interactors.
Not all interacting proteins are also predicted to have a functional relationship with Mbp1. Do you agree?
Explore the clustering and layout options. Do you understand what they do?
Explore the Views on:

Neighborhood (not relevant for our query though)
Fusion (also not relevant for our query)
Occurrence
Coexpression
Experiments
Database, and
Textmining

Each of these is a method for predicting functional relationships. Figure out how each one contributes to evidence of a functional interaction between Mbp1 and its predicted functional partners. I find the Occurrence view a unique and intriguing tool: visualizing in which organisms groups of genes are either all absent or all present allows one to quickly establish functional clusters.
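The idea behind the Occurrence view can be sketched as grouping genes by their presence/absence pattern across organisms. The example below is a deliberately simplified illustration: it groups only genes whose patterns match exactly, which is an assumption and not a description of how String scores co-occurrence. All gene and organism names are hypothetical.

```python
from collections import defaultdict


def group_by_profile(presence, organisms):
    """Group genes that are present in exactly the same set of organisms."""
    groups = defaultdict(list)
    for gene, present_in in presence.items():
        # The profile is a tuple of booleans, one per organism, in fixed order.
        profile = tuple(org in present_in for org in organisms)
        groups[profile].append(gene)
    # Keep only profiles shared by at least two genes.
    return [sorted(genes) for genes in groups.values() if len(genes) > 1]


# Hypothetical presence data: gene -> organisms in which a homologue is found.
organisms = ["org1", "org2", "org3", "org4"]
presence = {
    "geneA": {"org1", "org3"},
    "geneB": {"org1", "org3"},
    "geneC": {"org2"},
    "geneD": {"org1", "org3"},
    "geneE": {"org2"},
}

clusters = group_by_profile(presence, organisms)
print(clusters)
```

Genes that are always gained or lost together are candidates for a shared function. Real profiles are noisy, so a practical method would tolerate near matches instead of requiring identical patterns.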

In summary, String is a convincingly well built tool to explore functional relationships between proteins.

Links and resources

Razick et al. (2008) iRefIndex: a consolidated protein interaction database with provenance. BMC Bioinformatics 9:405. (pmid: 18823568)

[ PubMed ] [ DOI ] BACKGROUND: Interaction data for a given protein may be spread across multiple databases. We set out to create a unifying index that would facilitate searching for these data and that would group together redundant interaction data while recording the methods used to perform this grouping. RESULTS: We present a method to generate a key for a protein interaction record and a key for each participant protein. These keys may be generated by anyone using only the primary sequence of the proteins, their taxonomy identifiers and the Secure Hash Algorithm. Two interaction records will have identical keys if they refer to the same set of identical protein sequences and taxonomy identifiers. We define records with identical keys as a redundant group. Our method required that we map protein database references found in interaction records to current protein sequence records. Operations performed during this mapping are described by a mapping score that may provide valuable feedback to source interaction databases on problematic references that are malformed, deprecated, ambiguous or unfound. Keys for protein participants allow for retrieval of interaction information independent of the protein references used in the original records. CONCLUSION: We have applied our method to protein interaction records from BIND, BioGrid, DIP, HPRD, IntAct, MINT, MPact, MPPI and OPHID. The resulting interaction reference index is provided in PSI-MITAB 2.5 format at http://irefindex.uio.no. This index may form the basis of alternative redundant groupings based on gene identifiers or near sequence identity groupings.

Mora & Donaldson (2011) iRefR: an R package to manipulate the iRefIndex consolidated protein interaction database. BMC Bioinformatics 12:455. (pmid: 22115179)

[ PubMed ] [ DOI ] BACKGROUND: The iRefIndex addresses the need to consolidate protein interaction data into a single uniform data resource. iRefR provides the user with access to this data source from an R environment. RESULTS: The iRefR package includes tools for selecting specific subsets of interest from the iRefIndex by criteria such as organism, source database, experimental method, protein accessions and publication identifier. Data may be converted between three representations (MITAB, edgeList and graph) for use with other R packages such as igraph, graph and RBGL. The user may choose between different methods for resolving redundancies in interaction data and how n-ary data is represented. In addition, we describe a function to identify binary interaction records that possibly represent protein complexes. We show that the user choice of data selection, redundancy resolution and n-ary data representation all have an impact on graphical analysis. CONCLUSIONS: The package allows the user to control how these issues are dealt with and communicate them via an R-script written using the iRefR package - this will facilitate communication of methods, reproducibility of network analyses and further modification and comparison of methods by researchers.

Footnotes and references

Ask, if things don't work for you!

If anything about the assignment is not clear to you, please ask on the mailing list. You can be certain that others will have had similar problems. Success comes from joining the conversation.

Do consider how to ask your questions so that a meaningful answer is possible:
- How to create a Minimal, Complete, and Verifiable example on stackoverflow and ...
- How to make a great R reproducible example are required reading.
