Miscellaneous Databases for Bioinformatics

Contents
SGD - a Yeast Model Organism Database
STRING - functional interactions
Questions, comments
References

Expected Preparations:

	[BIN] Databases
	The units listed above are part of this course and contain important preparatory material.

Keywords: SGD; STRING; …

Objectives:

This unit will …

… introduce various database offerings and explore their use.

Outcomes:

After working through this unit you …

… can navigate and use the databases that are discussed here.

Deliverables:

Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.

Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don’t overlook these.

Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.

Evaluation:

NA: This unit is not evaluated for course marks.

This unit collects short explorations of various databases. It is probably best not to work through the sections all in one go, but to go through each one in the context of an actual use case, when you need information from that database. Currently we have SGD and STRING.

SGD - a Yeast Model Organism Database

Yeast happens to have a very well-maintained model organism database - a Web resource dedicated to Saccharomyces cerevisiae. Where such dedicated resources are available, they are very useful for the community. For the general case, however, we need to work with one of the large, general data providers: the NCBI or the EBI. But in order to get a sense of the type of data that is available, let’s explore the SGD database.

Task…

Access the information page on Mbp1 at the Saccharomyces Genome Database.

Browse through the Summary page and note the available information. You should see:
Information about the gene and the protein;
Information about its roles in the cell, curated at the Gene Ontology database;
Information about knock-out phenotypes; (Amazing. Would you have imagined that this is a non-essential gene?)
Information about protein-protein interactions;
Information about regulation and expression;
A curators’ summary of our understanding of the protein. Mandatory reading.
Key references.
Access the Protein tab and note the much more detailed information.
Domains and their classification;
Sequence;
Shared domains;
and much more…

You will notice that some of this information relates to the molecule itself, and some of it relates to its relationship with other molecules. Some of it is stored at SGD, and some of it is cross-referenced from other databases. And we have textual data, numeric data, and images.

If we were working on yeast, most of the data we need would be right here: curated, kept current and consistent, referenced to the literature, and ready to use. But if you are working on a different species, some “MYSPE”, you need to integrate the data yourself, from sources such as the NCBI or UniProt. The upside is that most information of this kind is available for many, many species. The downside is that you have to integrate information from many different sources, essentially “by hand”.

Task…

Navigate to the Analyze ▸ Gene Lists page. Paste the following identifiers. (This could be the result of some functional screen, or a set of differentially expressed genes, or other gene list returned from an assay or bioinformatics procedure…):

YAR014C YBR040W YBR200W YCL027W YCR089W YDL223C YDR085C YDR141C YER125W YER133W YER149C YHR102W YHR135C YHR158C YIL129C YKL048C YKL189W YKR031C YLL021W YLR229C YLR313C YLR332W YMR232W YNL154C YNR032C-A YOL111C YOR326W YPL123C YPR194C

What do these genes have in common? Can you identify a common theme of function?
Navigate to the Analyze ▸ GO Slim mapper page. Paste your gene list for “Step 1”. Choose “Yeast GO-Slim: Function” as the ontology of terms to search in for these genes for “Step 2”. Choose “SELECT ALL Terms…” for “Step 3”. Consider the results. Most of the genes have unknown functions, and there is no clear theme among those with a known function.
Now repeat the procedure for “GO-Slim: Process”. All genes are annotated to “cell morphogenesis” (which is not surprising, because that is how I selected them). And there are interesting and informative overlaps with other functional categories.
Click on “Download results”. What do you get? How would you read this data into R?
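Before pasting a gene list into a form like this, it can help to check that the identifiers are well formed. The following is a minimal sketch, not part of SGD's tools: it assumes the usual layout of systematic names for nuclear yeast ORFs ("Y", a chromosome letter A to P, the arm L or R, a three-digit number, the strand W or C, and an optional suffix such as "-A"). The function name `split_gene_list` and the example input are hypothetical.

```python
import re

# Assumed layout of a systematic nuclear ORF name, e.g. YNR032C-A.
# Mitochondrial ORFs and standard names such as MBP1 do not match.
ORF_PATTERN = re.compile(r"^Y[A-P][LR]\d{3}[WC](-[A-Z])?$")


def split_gene_list(text):
    """Split pasted text on whitespace or commas; return (valid, invalid)."""
    tokens = [t.upper() for t in re.split(r"[\s,]+", text.strip()) if t]
    valid = [t for t in tokens if ORF_PATTERN.match(t)]
    invalid = [t for t in tokens if not ORF_PATTERN.match(t)]
    return valid, invalid


# Hypothetical pasted input: three systematic names and one standard name.
pasted = "YAR014C YBR040W ynr032c-a  MBP1"
valid, invalid = split_gene_list(pasted)
print("systematic names:", valid)
print("needs a second look:", invalid)
```

Anything reported in the second list is not necessarily wrong; it may simply be a standard gene name or a mitochondrial ORF, which this sketch does not try to recognize.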

STRING - functional interactions

The essence of our “new” view of molecular biology is the study of interactions: after characterizing biomolecules individually, we are assembling networks of relationships through protein-protein and other interaction experiments. But visualizing the results is not trivial. We need to display genes as networks, define attributes of the nodes and edges and encode them in our visualization, develop quantitative measures that help us mine the data for information, and map the results back into the network to evaluate the influence of the network topology (gene “neighborhoods”) on our findings. Databases strive to build integrated viewers for this kind of data. However, there is much that still needs to be done. Have a look at this article that discusses the gap between what one would need to do and what is offered:

Jeanquartier, Fleur, Claire Jean-Quartier, and Andreas Holzinger. (2015). “Integrated web visualizations for protein-protein interaction databases”. BMC Bioinformatics 16:195.
[PMID: 26077899] [DOI: 10.1186/s12859-015-0615-z]

Abstract …

BACKGROUND: Understanding living systems is crucial for curing diseases. To achieve this task we have to understand biological networks based on protein-protein interactions. Bioinformatics has come up with a great amount of databases and tools that support analysts in exploring protein-protein interactions on an integrated level for knowledge discovery. They provide predictions and correlations, indicate possibilities for future experimental research and fill the gaps to complete the picture of biochemical processes. There are numerous and huge databases of protein-protein interactions used to gain insights into answering some of the many questions of systems biology. Many computational resources integrate interaction data with additional information on molecular background. However, the vast number of diverse Bioinformatics resources poses an obstacle to the goal of understanding. We present a survey of databases that enable the visual analysis of protein networks.

Among the online resources surveyed, the one that comes out best is the STRING database.

Task…

Review:

Szklarczyk, Damian et al. (2019). “STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets”. Nucleic Acids Research 47(D1):D607–D613.
[PMID: 30476243] [DOI: 10.1093/nar/gky1131]

Abstract …

Proteins and their functional interactions form the backbone of the cellular machinery. Their connectivity network needs to be considered for the full understanding of biological phenomena, but the available information on protein-protein associations is incomplete and exhibits varying levels of annotation granularity and reliability. The STRING database aims to collect, score and integrate all publicly available sources of protein-protein interaction information, and to complement these with computational predictions. Its goal is to achieve a comprehensive and objective global network, including direct (physical) as well as indirect (functional) interactions. The latest version of STRING (11.0) more than doubles the number of organisms it covers, to 5090. The most important new feature is an option to upload entire, genome-wide datasets as input, allowing users to visualize subsets as interaction networks and to perform gene-set enrichment analysis on the entire input. For the enrichment analysis, STRING implements well-known classification systems such as Gene Ontology and KEGG, but also offers additional, new classification systems based on high-throughput text-mining as well as on a hierarchical clustering of the association network itself. The STRING resource is available online at https://string-db.org/.

Navigate to the STRING database and search for Saccharomyces cerevisiae Mbp1 interactors.
Visualize the network. Add a few proteins by clicking the (+) button two or three times.
Click on a node to get a synopsis of its function.
Explore the “confidence”, “evidence” and “actions” networks for the retrieved interactors.
Not all interacting proteins are also predicted to have a functional relationship with Mbp1. Do you agree?
Explore the clustering and layout options. Do you understand what they do?
Explore the Views on
* Neighborhood (gene neighborhood is basically only relevant for prokaryotic operons, though)
* Fusion (gene fusion can identify proteins that stably interact in the cell)
* Occurrence
* Coexpression
* Experiments
* Database, and
* Textmining
Each of these is a method for predicting functional relationships. Figure out how each one contributes to evidence of a functional interaction between Mbp1 and its predicted functional partners. I find the Occurrence view a unique and intriguing tool: visualizing in which organisms groups of genes are either all absent or all present allows one to quickly establish functional clusters.
Explore the “Download” options. Some of this data will be used in other learning units.
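To get a feeling for how several weak lines of evidence can add up to a confident prediction, here is a minimal sketch of a probabilistic combination of per-channel scores. It is a simplification under stated assumptions, not STRING's actual implementation: the function name `combine_channel_scores` and the example scores are hypothetical, the channels are treated as independent, the default prior of 0.041 is an assumption that should be checked against STRING's own documentation, and any further corrections STRING applies are ignored.

```python
from math import prod


def combine_channel_scores(scores, prior=0.041):
    """Combine per-channel confidence scores (each in [0, 1]) into one score.

    Simplified sketch: remove an assumed prior from each channel, combine the
    channels as if they were independent, then add the prior back once.
    """
    # Remove the prior from each channel; scores below the prior count as zero.
    corrected = [max(s - prior, 0.0) / (1.0 - prior) for s in scores]
    # Probability that at least one channel reports a true association.
    combined = 1.0 - prod(1.0 - s for s in corrected)
    # Add the prior back exactly once.
    return combined * (1.0 - prior) + prior


# Hypothetical channel scores for one protein pair (not real STRING data).
example = {"coexpression": 0.30, "experiments": 0.60, "textmining": 0.50}
print(round(combine_channel_scores(example.values()), 3))
```

In this scheme a single channel returns its own score unchanged, and every additional channel can only raise the combined value. That is one way to read the individual views: each contributes a piece of evidence, and none of them has to be conclusive on its own.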

In summary, STRING is a convincingly well-built tool to explore functional relationships between proteins.

Questions, comments

If in doubt, ask! If anything about this content is not clear to you, do not proceed but ask for clarification. If you have ideas about how to make this material better, let’s hear them. We are aiming to compile a list of FAQs for all learning units, and your contributions will count towards your participation marks.

Improve this page! If you have questions or comments, please post them on the Quercus Discussion board with a subject line that includes the name of the unit.

References

[END]

Miscellaneous Databases for Bioinformatics

Boris Steipe
