CSB Web tools
CSB on the Web
Important tools and resources for CSB, available on the Web.
Contents
Introductory reading
Kowald & Wierling (2011) Standards, tools, and databases for the analysis of yeast 'omics data. Methods Mol Biol 759:345-65. (pmid: 21863497)
One of the major objectives of systems biology is the development of mathematical models for the quantitative description of complex biological systems, such as living cells. Biological data and software tools for the design, analysis, and simulation of models are two basic ingredients for the new field of systems biology. In this chapter we give an overview of databases and repositories that provide valuable information for the integrative analysis and modeling of data generated by the different omics techniques. We also provide a review of the most popular software tools currently used in computational systems biology studies. Standards for the annotation of biological data and for the analysis and exchange of models are fundamental for the success of systems biology and provide the glue that connects experimental data with mathematical models. We also discuss some broad trends regarding where systems biology is heading to.
Databases
Wheeler (2007) Using GenBank. Methods Mol Biol 406:23-59. (pmid: 18287687)
GenBank(R) is a comprehensive database of publicly available DNA sequences for more than 205,000 named organisms and for more than 60,000 within the embryophyta, obtained through submissions from individual laboratories and batch submissions from large-scale sequencing projects. Daily data exchange with the European Molecular Biology Laboratory (EMBL) in Europe and the DNA Data Bank of Japan ensures worldwide coverage. GenBank is accessible through the National Center for Biotechnology Information (NCBI) retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases with taxonomy, genome, mapping, protein structure, and domain information and the biomedical journal literature through PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available through FTP. GenBank usage scenarios ranging from local analyses of the data available through FTP to online analyses supported by the NCBI Web-based tools are discussed. To access GenBank and its related retrieval and analysis services, go to the NCBI Homepage at http://www.ncbi.nlm.nih.gov.
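Entrez records can also be requested programmatically through the NCBI E-utilities. The following is a minimal sketch that only builds an EFetch request URL and does not contact the server; the helper name `efetch_url` is an assumption made for this page, and the example uses the PubMed ID of the Wheeler (2007) chapter cited above.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(db: str, record_id: str, rettype: str, retmode: str = "text") -> str:
    """Build an EFetch request URL (hypothetical helper; no network access)."""
    query = urlencode(
        {"db": db, "id": record_id, "rettype": rettype, "retmode": retmode}
    )
    return f"{EUTILS_BASE}/efetch.fcgi?{query}"


# Abstract of the chapter cited above (pmid: 18287687)
pubmed_url = efetch_url("pubmed", "18287687", rettype="abstract")
print(pubmed_url)
```

The same helper should work for sequence records by switching `db` and `rettype`, for example `db="nuccore"` with `rettype="gb"` for a GenBank flat file. Check the NCBI usage guidelines on request rates before fetching URLs in bulk.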
Boutet et al. (2007) UniProtKB/Swiss-Prot. Methods Mol Biol 406:89-112. (pmid: 18287689)
The Swiss Institute of Bioinformatics (SIB), the European Bioinformatics Institute (EBI), and the Protein Information Resource (PIR) form the Universal Protein Resource (UniProt) consortium. Its main goal is to provide the scientific community with a central resource for protein sequences and functional information. The UniProt consortium maintains the UniProt KnowledgeBase (UniProtKB) and several supplementary databases including the UniProt Reference Clusters (UniRef) and the UniProt Archive (UniParc). (1) UniProtKB is a comprehensive protein sequence knowledgebase that consists of two sections: UniProtKB/Swiss-Prot, which contains manually annotated entries, and UniProtKB/TrEMBL, which contains computer-annotated entries. UniProtKB/Swiss-Prot entries contain information curated by biologists and provide users with cross-links to about 100 external databases and with access to additional information or tools. (2) The UniRef databases (UniRef100, UniRef90, and UniRef50) define clusters of protein sequences that share 100, 90, or 50% identity. (3) The UniParc database stores and maps all publicly available protein sequence data, including obsolete data excluded from UniProtKB. The UniProt databases can be accessed online (http://www.uniprot.org/) or downloaded in several formats (ftp://ftp.uniprot.org/pub). New releases are published every 2 weeks. The purpose of this chapter is to present a guided tour of a UniProtKB/Swiss-Prot entry, paying particular attention to the specificities of plant protein annotation. We will also present some of the tools and databases that are linked to each entry.
Web servers
Bhagwat & Aravind (2007) PSI-BLAST tutorial. Methods Mol Biol 395:177-86. (pmid: 17993673)
PSI-BLAST (Position-Specific Iterative Basic Local Alignment Search Tool) derives a position-specific scoring matrix (PSSM) or profile from the multiple sequence alignment of sequences detected above a given score threshold using protein-protein BLAST. This PSSM is used to further search the database for new matches, and is updated for subsequent iterations with these newly detected sequences. Thus, PSI-BLAST provides a means of detecting distant relationships between proteins. In this chapter, we discuss practical aspects of using PSI-BLAST and provide a tutorial on how to uncover distant relationships between proteins and use them to reach biologically meaningful conclusions.
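To make the idea of a PSSM concrete, here is a simplified sketch that derives log-odds scores from the columns of a small alignment. It is not the PSI-BLAST implementation: PSI-BLAST additionally weights sequences and uses data-dependent pseudocounts, whereas this version assumes a fixed pseudocount, a toy four-letter alphabet, and a uniform background. All names and values are illustrative.

```python
import math
from collections import Counter


def build_pssm(alignment, background, pseudocount=1.0):
    """Return one dict per alignment column mapping residue -> log2-odds score."""
    alphabet = sorted(background)
    pssm = []
    for column in zip(*alignment):
        # Count only residues in the alphabet, so gap characters are ignored
        counts = Counter(res for res in column if res in background)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pssm.append(
            {
                res: math.log2(((counts[res] + pseudocount) / total) / background[res])
                for res in alphabet
            }
        )
    return pssm


def score(pssm, segment):
    """Sum the column scores for an ungapped segment as long as the profile."""
    return sum(column[res] for column, res in zip(pssm, segment))


# Toy data: three aligned sequences, uniform background frequencies
toy_alignment = ["ACDA", "ACEA", "ACDA"]
toy_background = {"A": 0.25, "C": 0.25, "D": 0.25, "E": 0.25}
pssm = build_pssm(toy_alignment, toy_background)

print(score(pssm, "ACDA") > score(pssm, "CADE"))
```

Residues that are conserved in a column receive positive scores and unobserved residues receive negative ones. This lets a profile reward matches at conserved positions more strongly than a general substitution matrix would.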
Exercises
References
Further reading and resources
Links directory (bioinformatics.ca) [ link ] [ page ] bioinformatics.ca is the domain of the Canadian Bioinformatics Workshops, currently hosted by the Ontario Institute for Cancer Research. The links directory is a curated collection of databases and services that are useful for bioinformatics and computational biology. Links are browsable in several categories, such as Model Organisms, Expression, or Sequence Comparison, each with many subcategories. Importantly, the site contains links to all resources from the NAR database issues and the NAR web server issues in a searchable interface. The URL links to a search for the term "Systems Biology".
Bolser et al. (2012) MetaBase--the wiki-database of biological databases. Nucleic Acids Res 40:D1250-4. (pmid: 22139927)
Biology is generating more data than ever. As a result, there is an ever increasing number of publicly available databases that analyse, integrate and summarize the available data, providing an invaluable resource for the biological community. As this trend continues, there is a pressing need to organize, catalogue and rate these resources, so that the information they contain can be most effectively exploited. MetaBase (MB) (http://MetaDatabase.Org) is a community-curated database containing more than 2000 commonly used biological databases. Each entry is structured using templates and can carry various user comments and annotations. Entries can be searched, listed, browsed or queried. The database was created using the same MediaWiki technology that powers Wikipedia, allowing users to contribute on many different levels. The initial release of MB was derived from the content of the 2007 Nucleic Acids Research (NAR) Database Issue. Since then, approximately 100 databases have been manually collected from the literature, and users have added information for over 240 databases. MB is synchronized annually with the static Molecular Biology Database Collection provided by NAR. To date, there have been 19 significant contributors to the project; each one is listed as an author here to highlight the community aspect of the project.
Dreszer et al. (2012) The UCSC Genome Browser database: extensions and updates 2011. Nucleic Acids Res 40:D918-23. (pmid: 22086951)
The University of California Santa Cruz Genome Browser (http://genome.ucsc.edu) offers online public access to a growing database of genomic sequence and annotations for a wide variety of organisms. The Browser is an integrated tool set for visualizing, comparing, analyzing and sharing both publicly available and user-generated genomic data sets. In the past year, the local database has been updated with four new species assemblies, and we anticipate another four will be released by the end of 2011. Further, a large number of annotation tracks have been either added, updated by contributors, or remapped to the latest human reference genome. Among these are new phenotype and disease annotations, UCSC genes, and a major dbSNP update, which required new visualization methods. Growing beyond the local database, this year we have introduced 'track data hubs', which allow the Genome Browser to provide access to remotely located sets of annotations. This feature is designed to significantly extend the number and variety of annotation tracks that are publicly available for visualization and analysis from within our site. We have also introduced several usability features including track search and a context-sensitive menu of options available with a right-click anywhere on the Browser's image.
Maddatu et al. (2012) Mouse Phenome Database (MPD). Nucleic Acids Res 40:D887-94. (pmid: 22102583)
The Mouse Phenome Project was launched a decade ago to complement mouse genome sequencing efforts by promoting new phenotyping initiatives under standardized conditions and collecting the data in a central public database, the Mouse Phenome Database (MPD; http://phenome.jax.org). MPD houses a wealth of strain characteristics data to facilitate the use of the laboratory mouse in translational research for human health and disease, helping alleviate problems involving experimentation in humans that cannot be done practically or ethically. Data sets are voluntarily contributed by researchers from a variety of institutions and settings, or in some cases, retrieved by MPD staff from public sources. MPD maintains a growing collection of standardized reference data that assists investigators in selecting mouse strains for research applications; houses treatment/control data for drug studies and other interventions; offers a standardized platform for discovering genotype-phenotype relationships; and provides tools for hypothesis testing. MPD improvements and updates since our last NAR report are presented, including the addition of new tools and features to facilitate navigation and data mining as well as the acquisition of new data (phenotypic, genotypic and gene expression).
NAR database issue [ link ] [ page ] Every year the journal Nucleic Acids Research (NAR) compiles a special issue on important databases in molecular biology (in January), and on important web servers and other resources (in July). The articles are peer-reviewed, and inclusion in the issue is considered a quality endorsement. Both issues reflect the best practices in the field, as well as its rapidly changing nature. Links to databases and resources are searchable by keyword and topic in the bioinformatics.ca links directory.
NAR Web Server issue [ link ] [ page ] Every year the journal Nucleic Acids Research (NAR) compiles a special issue on important web servers in molecular biology (in July), and on important databases (in January). The articles are peer-reviewed, and inclusion in the issue is considered a quality endorsement. Both issues reflect the best practices in the field, as well as its rapidly changing nature. Links to databases and resources are searchable by keyword and topic in the bioinformatics.ca links directory.
The NCBI Gene database [ link ] [ page ] Gene is the NCBI's integrated database of gene information in the Entrez system. Records may include Reference Sequences (RefSeqs), maps, pathways, variations, and phenotypes compiled into the database itself, as well as links to genome-, phenotype-, and locus-specific resources worldwide. The URL links to the record for the human E2F1 transcription factor. For detailed information, see the Gene database information page.
UniProt [ link ] [ page ] UniProt is the protein sequence database maintained by the UniProt consortium of the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics (SIB), and the Protein Information Resource (PIR). It is an extraordinarily well constructed, curated, and integrated resource. As a public resource, its results are freely accessible world-wide. The "Knowledge Base" (UniProtKB), which is the database proper, contains two sections. Swiss-Prot is the manually curated and heavily annotated protein sequence repository; it is approximately equivalent to the NCBI RefSeq protein database, albeit usually with higher annotation levels. TrEMBL is much larger and contains sequences that have been computationally translated from the EMBL nucleotide sequence collection; it is approximately equivalent to the NCBI's Entrez protein database. The URL links to the entry for the Saccharomyces cerevisiae cell-cycle regulation transcription factor Mbp1.
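The two UniProtKB sections can be told apart in downloaded FASTA files, because the header line starts with `sp` for Swiss-Prot entries and `tr` for TrEMBL entries, followed by the accession and the entry name. The sketch below parses such a header. The function name is an assumption, and the accessions and entry names in the example headers are placeholders, not real UniProt records.

```python
import re

# ">db|accession|entry_name ..." where db is "sp" (Swiss-Prot) or "tr" (TrEMBL)
HEADER_PATTERN = re.compile(r"^>(sp|tr)\|([^|]+)\|(\S+)")

SECTION_NAMES = {"sp": "UniProtKB/Swiss-Prot", "tr": "UniProtKB/TrEMBL"}


def parse_uniprot_header(line):
    """Extract section, accession and entry name from a UniProtKB FASTA header."""
    match = HEADER_PATTERN.match(line)
    if match is None:
        raise ValueError(f"not a UniProtKB FASTA header: {line!r}")
    db, accession, entry_name = match.groups()
    return {
        "section": SECTION_NAMES[db],
        "accession": accession,
        "entry_name": entry_name,
    }


# Placeholder headers for illustration only
curated = parse_uniprot_header(">sp|X00001|DEMO_YEAST Demo protein OS=Saccharomyces cerevisiae")
automatic = parse_uniprot_header(">tr|X00002|X00002_YEAST Uncharacterized protein")

print(curated["section"], automatic["section"])
```

Filtering on the `section` field is one simple way to restrict an analysis to manually curated sequences.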
SGD: Saccharomyces Genome Database [ link ] [ page ] The Saccharomyces Genome Database is a curated database that integrates sequence, structure and function information for yeast molecular biology. It is one of the important model organism databases and can be considered a paradigm for the entire field. The URL links to the information page of the cell-cycle regulation transcription factor Mbp1.
MGI (Mouse Genome Informatics) [ link ] [ page ] The model organism database MGI (Mouse Genome Informatics) is the primary community database resource for the laboratory mouse. It integrates genomics, expression, tumor biology and metabolism information and actively curates GO annotations for mouse genes. The stated goal is to enhance the utility of mouse research for the study of human health and disease. For example, wherever available, human orthologues are cross-referenced with the respective mouse genes. The URL links to the gene details of the mouse orthologue of human E2F1.
GO: the Gene Ontology project [ link ] [ page ] Ontologies are important tools to organize and compute with non-standardized information, such as gene annotations. The Gene Ontology project (GO) constructs ontologies for gene and gene product attributes across numerous species. Three major ontologies are being developed: molecular function, biological process and cellular component. Each includes terms, their definitions, and their relationships. In addition, genes and gene products are being annotated with their GO terms and the type of evidence that underlies the annotation. A number of tools such as the AmiGO browser are available to analyse relationships, construct ontologies and curate annotations. Data can be freely downloaded in formats that are convenient for computation.
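Because GO terms are linked by relationships, a term can have several parents, and a gene annotated to a term is implicitly annotated to all of that term's ancestors. The sketch below collects the ancestors of a term by walking parent links. The term identifiers are made-up placeholders rather than real GO IDs, and the dictionary stands in for relationships that would normally be read from a downloaded ontology file.

```python
def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Toy ontology (placeholder IDs): TOY:0004 has two parents sharing one root
toy_parents = {
    "TOY:0004": ["TOY:0002", "TOY:0003"],
    "TOY:0002": ["TOY:0001"],
    "TOY:0003": ["TOY:0001"],
}

# A gene annotated to TOY:0004 implicitly carries all ancestor terms as well
implied_terms = {"TOY:0004"} | ancestors("TOY:0004", toy_parents)
print(sorted(implied_terms))
```

The `seen` set ensures that a shared ancestor is visited only once, even though several paths lead to it.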
The Gene Wiki project [ link ] [ page ] The Gene Wiki project aims to create Wikipedia articles for every human gene whose function has been assigned. This provides pages that are ideally suited for free, community-driven, integrated information resources. Access to the project is through the Gene Wiki Portal, which contains guidelines for contributors. The pages are easy to find since they are linked to the HGNC-recognized gene name. For example, the URL links to the human E2F1 transcription factor page.
Gene/Protein Synonym Database [ link ] [ page ] The ExPASy-hosted Gene/Protein Synonym Database collects gene name synonyms from the majority of model organism databases and UniProt, cross-references them and provides a searchable interface.
HUGO Gene Nomenclature Committee [ link ] [ page ] The HUGO Gene Nomenclature Committee (HGNC) has assigned unique gene symbols and names to more than 32,000 human loci, of which over 19,000 are protein coding. genenames.org is a curated online repository of HGNC-approved gene nomenclature and associated resources including links to genomic, proteomic and phenotypic information, as well as dedicated gene family pages. This site is the definitive resource to resolve gene name ambiguities. The URL links to the search results for Rbp3, which is both a deprecated synonym for the human E2F transcription factor 1, and the official name of retinol binding protein 3.
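The Rbp3 example shows why a symbol lookup should report every match together with its status, rather than a single result. The sketch below builds a small index from approved symbols and aliases. The two records encode only the ambiguity described above; the record layout and function name are assumptions for illustration, not the format of an HGNC download.

```python
from collections import defaultdict


def build_symbol_index(records):
    """Map each upper-cased name to a list of (approved_symbol, status) pairs."""
    index = defaultdict(list)
    for record in records:
        index[record["symbol"].upper()].append((record["symbol"], "approved"))
        for alias in record["aliases"]:
            index[alias.upper()].append((record["symbol"], "alias"))
    return index


# Illustrative records reflecting the ambiguity described above
records = [
    {"symbol": "E2F1", "aliases": ["RBP3"]},  # E2F transcription factor 1
    {"symbol": "RBP3", "aliases": []},        # retinol binding protein 3
]

index = build_symbol_index(records)
hits = sorted(index["Rbp3".upper()])
print(hits)
```

A query that returns more than one hit signals that the name must be disambiguated before it is used to join data sets.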
Reactome [ link ] [ page ] Reactome is a multi-site collaboration to develop an open source, curated bioinformatics database of human pathways and reactions. It includes annotations, pathways and tools for pathway browsing and analysis, including pathway assignment and overrepresentation analysis of user-supplied data sets. Making use of orthology prediction, Reactome also provides cross-species pathway inference for a large number of model organisms. The URL accesses the pathway for E2F-mediated regulation of DNA replication.
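Overrepresentation analysis asks whether a user-supplied gene list contains more members of a pathway than expected by chance. A common way to quantify this is a one-sided hypergeometric test, sketched below. This illustrates the general technique under that assumption and is not a description of Reactome's own implementation; the function name and the numbers are made up.

```python
from math import comb


def overrepresentation_p(population, pathway_size, sample_size, hits):
    """Probability of at least `hits` pathway genes in a random sample.

    population   -- number of genes in the background set
    pathway_size -- number of background genes annotated to the pathway
    sample_size  -- number of genes in the user-supplied list
    hits         -- number of genes in the list that belong to the pathway
    """
    upper = min(pathway_size, sample_size)
    tail = sum(
        comb(pathway_size, k) * comb(population - pathway_size, sample_size - k)
        for k in range(hits, upper + 1)
    )
    return tail / comb(population, sample_size)


# Toy numbers: 10 background genes, 5 in the pathway, a list of 2 genes, both hits
p_value = overrepresentation_p(population=10, pathway_size=5, sample_size=2, hits=2)
print(round(p_value, 4))
```

When many pathways are tested at once, the resulting p-values need a multiple-testing correction before they are interpreted.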
GMOD Generic Model Organism Database project [ link ] [ page ] GMOD (the Generic Model Organism Database project) is a collection of open source software tools for creating and managing genome-scale biological databases. GMOD tools are in use at many large and small community databases, especially for model organisms. They include the genome browser GBrowse, the CHADO relational database, the GFF annotation databases, and much more. The goal is to free developers of community-scale biomolecular databases from reinventing the wheel. A good overview of resources and principles is available on the GMOD wiki.