APB Functional annotation

From "A B C"

Computing Functional Annotations


This page is a placeholder, or under current development; it is here principally to establish the logical framework of the site. The material on this page is correct, but incomplete.


Summary ...



 

"Function" and "Annotation"

Functional annotation implies the combination of three fundamental components:

  1. we need to define function as a computable abstraction;
  2. we need to define the object we wish to annotate (we shall refer to this as a sequence for now);
  3. we need to associate a function with an object, or a part thereof (we shall refer to this association as "annotation").

Each of these components brings with it a number of non-trivial considerations.

"Function"

Function can be defined in multiple ways. A function is a description, an abstraction of observed behaviour that we apply to a molecule or molecular system. It is a conceptual shortcut to say a molecule has a function; sometimes this shortcut is useful and sometimes it will lead us to inappropriate conclusions.

Property
Sometimes the function of a protein derives from nothing more than a particular property that is due to the physical existence of a protein, such as shape (cellular scaffolding) or rigid filling of space with matter (eye-lens proteins) or oncotic pressure. Abstraction: description.
Physical or other activity
Selective channel proteins, fluorescent proteins, pigments. Abstraction: description.
(Bio)chemical activity
Enzymes. Abstraction: Graph of chemical entities (Substrate(s), Product(s)) connected by enzyme.
Mechanism
Structural mechanistic description of function. Abstraction: residue level description of beginning and end-states.
Position in pathway or network
Signalling or biochemical pathways, or regulatory networks. Abstraction: Graph of chemical entities (Substrate(s), Product(s)) connected by enzyme; or: dual of this graph: enzymes connected by chemical entities (product of one, substrate of another). Position can be further qualified according to role for flow (limiting step, committed step, branch-point, feedback, feed-forward ...)
Physiological role
High-order, conceptual annotation of "purpose". Abstraction: descriptive term, but terms themselves are taken from a graph of concepts since they have a relationship to each other.
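The pathway abstraction above (chemical entities connected by enzymes, and its dual in which enzymes are connected by chemical entities) can be sketched in a few lines. This is only an illustration: the enzyme names E1 to E3, the metabolites A to D and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy pathway: each enzyme converts substrates into products.
reactions = {
    "E1": {"substrates": ["A"], "products": ["B"]},
    "E2": {"substrates": ["B"], "products": ["C"]},
    "E3": {"substrates": ["B"], "products": ["D"]},
}


def metabolite_graph(reactions):
    """Edges between chemical entities, labelled with the connecting enzyme."""
    edges = []
    for enzyme, reaction in reactions.items():
        for substrate in reaction["substrates"]:
            for product in reaction["products"]:
                edges.append((substrate, product, enzyme))
    return edges


def consumers_of(reactions):
    """Map each metabolite to the enzymes that use it as a substrate."""
    consumers = defaultdict(list)
    for enzyme, reaction in reactions.items():
        for substrate in reaction["substrates"]:
            consumers[substrate].append(enzyme)
    return consumers


def enzyme_graph(reactions):
    """Dual graph: enzyme X -> enzyme Y if a product of X is a substrate of Y."""
    consumers = consumers_of(reactions)
    edges = []
    for enzyme, reaction in reactions.items():
        for product in reaction["products"]:
            for next_enzyme in consumers.get(product, []):
                edges.append((enzyme, product, next_enzyme))
    return edges


def branch_points(reactions):
    """Metabolites consumed by more than one enzyme."""
    return sorted(m for m, users in consumers_of(reactions).items() if len(users) > 1)
```

In this toy pathway B is a branch-point, which is the kind of positional qualifier (limiting step, branch-point ...) that could be attached to an annotation.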

Sequence

Sequence is an abstraction of biomolecules. For the purpose of annotation the biomolecule to be annotated needs to be defined, and its relevant features need to be represented by data that can be stored on a computer. We commonly use sequences of characters (strings) for this purpose, but this is neither a rich nor a flexible abstraction. Issues to consider include:

The Biomolecule
What biomolecule is being annotated, what are its database accession numbers and identifiers, what are its cross-references, names, synonyms ...
The Representation
How is the biomolecule represented (translated sequence?), what other features that define the molecule need to be represented: gene model, other gene elements (promoter, enhancer, silencer, TFBS, ...), chromosomal coordinates (organism, chromosome ... scaffolds, contigs, BACs, reads); are variants known (SNPs, mutations, alternative transcripts ...), non-linear features (disulfides), post-translational modifications; what reference numbering should be used ...
What parts of the representation does an annotation refer to
Annotations can refer to all or part of a biomolecular representation at different levels of granularity and this information must be maintained. These levels include
  • residues,
  • ranges of residues,
  • domains,
  • molecules, and
  • complexes.
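One possible way to keep track of the level of granularity is to store it explicitly with every annotation target. The sketch below is an assumption about how this could be done, not an agreed design; the class name, the field names and the accession "P_DEMO" are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

LEVELS = ("residue", "range", "domain", "molecule", "complex")


@dataclass(frozen=True)
class Target:
    """The part of a biomolecular representation an annotation refers to."""
    molecule_id: str               # hypothetical accession or complex identifier
    level: str                     # one of LEVELS
    start: Optional[int] = None    # 1-based, inclusive; None means "whole object"
    end: Optional[int] = None

    def __post_init__(self):
        if self.level not in LEVELS:
            raise ValueError(f"unknown level: {self.level}")
        if (self.start is None) != (self.end is None):
            raise ValueError("start and end must be given together")
        if self.start is not None and self.start > self.end:
            raise ValueError("start must not exceed end")


def overlaps(a: Target, b: Target) -> bool:
    """True if two targets refer to at least one common position."""
    if a.molecule_id != b.molecule_id:
        return False
    if a.start is None or b.start is None:
        return True  # a whole-object target covers every part of it
    return a.start <= b.end and b.start <= a.end
```

With such a representation one can ask, for example, whether a residue-level annotation falls inside an annotated domain: `overlaps(Target("P_DEMO", "residue", 57, 57), Target("P_DEMO", "domain", 40, 120))`.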

"Annotation"

In principle the process of functional annotation is the association of a function (or: functionally important feature) with a part of the representation of a biomolecule.

Types of annotation
  • Scalar value (pK, solvent accessible area ...)
  • Free text
  • Reference to other data (dictionary of controlled vocabulary, internal, external database)

Direct annotation

Literature annotation
Retrieval of expert opinions from text
Expert opinion
Retrieval of expert opinions from data resource
Database cross-references
retrieval of expert annotations (e.g. GO)
Perturbation analysis
Knockout or silencing studies, attempt to associate a specified genotype with an observed phenotype
Cosegregation analysis
Association studies analyse segregation of phenotypes and cosegregation of genetic markers to identify the genetic basis for a phenotype.

Inference from properties

Sequence property
Molecular weight and IEP, localisation, amino acid content/distribution
Sequence signals
posttranslational modification, signal peptides,
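Simple sequence properties can be computed directly from the string representation. The following is a minimal sketch for amino acid composition and molecular weight; it assumes an unmodified protein written with the 20 standard one-letter codes and uses average residue masses (in Da). The function names are hypothetical.

```python
from collections import Counter

# Average masses of amino acid residues (amino acid minus one water), in Da.
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # one water is added back for the free N- and C-termini


def composition(sequence):
    """Fraction of each amino acid that occurs in the sequence."""
    counts = Counter(sequence.upper())
    total = sum(counts.values())
    return {aa: n / total for aa, n in sorted(counts.items())}


def molecular_weight(sequence):
    """Average molecular weight of an unmodified protein, in Da."""
    if not sequence:
        raise ValueError("empty sequence")
    return sum(AVERAGE_RESIDUE_MASS[aa] for aa in sequence.upper()) + WATER
```

Non-standard characters (X, B, Z, gaps) raise a KeyError in `molecular_weight`; a real tool would have to decide how to treat them.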

Transfer of annotation from homology

Sequence similarity
Inference of homology, transfer of information from annotation of homologues
Structural similarity
Inference of homology, transfer of information from annotation of homologues, rarely conclusions from first principles (presence of metal- or cofactor binding sites, electron wires ...)

Transfer of annotation from context

Coexpression
"Guilt by association" - assumption that coexpression points to roles in same or similar function
Gene order / Neighbourhood
Conservation of gene order / neighbourhood between species suggests function in common system
Coregulation
assumption that the best explanation for coregulation is shared function and thus common expression requirements
Physical interaction
Physical interactions (transient, or in a stable complex) point to shared function or a role in the same pathway
Genetic interaction
Genetic interactions (such as synthetic lethality) point to complementary function, i.e. a role in a parallel pathway
Coevolution
If proteins interact functionally, they are expected to be either both absent or both present in an organism (phylogenetic profile)
[More ?]
...
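As an illustration of inference from context, phylogenetic profiles can be compared as presence/absence vectors over a fixed list of genomes. The sketch below uses the simple fraction of matching positions as the similarity measure; this measure, the threshold and the example profiles are assumptions for illustration only.

```python
def profile_similarity(a, b):
    """Fraction of genomes in which two proteins are both present or both absent."""
    if len(a) != len(b) or not a:
        raise ValueError("profiles must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def candidate_partners(query, profiles, threshold=0.9):
    """Names of proteins whose profile matches the query profile closely."""
    return sorted(
        name
        for name, profile in profiles.items()
        if name != query and profile_similarity(profiles[query], profile) >= threshold
    )


# Hypothetical profiles: 1 = homologue present in genome i, 0 = absent.
profiles = {
    "protA": [1, 1, 0, 0, 1, 0],
    "protB": [1, 1, 0, 0, 1, 0],
    "protC": [0, 1, 1, 0, 0, 1],
}
```

Here `candidate_partners("protA", profiles)` suggests protB as a functional partner of protA, while protC shares the state of protA in only two of six genomes.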

The process of annotation

Create a mapping between a representation of function and a representation of a biomolecule; store the source(s) and the confidence of the annotation.
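A single annotation could be held in a record like the one below. This is a sketch under assumptions: the field names, the confidence scale from 0 to 1 and the example values are hypothetical and not part of any agreed design.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Annotation:
    """Maps a representation of function onto (part of) a biomolecule."""
    molecule_id: str             # hypothetical accession of the annotated molecule
    function_term: str           # free text or a controlled-vocabulary term
    method: str                  # e.g. "literature", "homology", "coexpression"
    sources: List[str]           # where the evidence came from
    confidence: float            # assumed scale: 0 (no support) to 1 (certain)
    start: Optional[int] = None  # 1-based residue range; None = whole molecule
    end: Optional[int] = None

    def __post_init__(self):
        if not self.sources:
            raise ValueError("an annotation needs at least one source")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie between 0 and 1")


example = Annotation(
    molecule_id="P_DEMO",
    function_term="ATPase activity",
    method="homology",
    sources=["similarity to a characterized homologue"],
    confidence=0.8,
    start=10,
    end=250,
)
```

Rejecting annotations without a source at construction time keeps every stored statement traceable to its evidence.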

Storage and retrieval of annotations

Database of biomolecules, functions, mappings
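The three kinds of data named above suggest a small relational layout. The following sketch uses an in-memory SQLite database; the table and column names and the demo rows are assumptions, not a proposed final schema.

```python
import sqlite3

SCHEMA = """
CREATE TABLE biomolecules (
    id        INTEGER PRIMARY KEY,
    accession TEXT UNIQUE NOT NULL,
    sequence  TEXT
);
CREATE TABLE functions (
    id   INTEGER PRIMARY KEY,
    term TEXT UNIQUE NOT NULL
);
CREATE TABLE mappings (
    id            INTEGER PRIMARY KEY,
    biomolecule_id INTEGER NOT NULL REFERENCES biomolecules(id),
    function_id    INTEGER NOT NULL REFERENCES functions(id),
    start_pos      INTEGER,          -- NULL means the whole molecule
    end_pos        INTEGER,
    source         TEXT NOT NULL,
    confidence     REAL NOT NULL CHECK (confidence BETWEEN 0 AND 1)
);
"""


def build_demo_db():
    """Create an in-memory database with one molecule and two annotations."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO biomolecules (accession) VALUES (?)", ("P_DEMO",))
    conn.execute("INSERT INTO functions (term) VALUES (?)", ("kinase activity",))
    conn.executemany(
        "INSERT INTO mappings (biomolecule_id, function_id, start_pos, end_pos,"
        " source, confidence) VALUES (?, ?, ?, ?, ?, ?)",
        [(1, 1, 10, 250, "homology", 0.8), (1, 1, None, None, "coexpression", 0.4)],
    )
    return conn


def annotations_for(conn, accession):
    """All annotations of one molecule, best supported first."""
    return conn.execute(
        "SELECT f.term, m.start_pos, m.end_pos, m.source, m.confidence"
        " FROM mappings m"
        " JOIN biomolecules b ON b.id = m.biomolecule_id"
        " JOIN functions f ON f.id = m.function_id"
        " WHERE b.accession = ? ORDER BY m.confidence DESC",
        (accession,),
    ).fetchall()
```

Keeping the mappings in their own table allows several, possibly contradictory, annotations of the same molecule to coexist, each with its own source and confidence.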

Display of annotations

The ideal display of annotation would be very concise. However, because there is so much potential information and so many different types of annotation that need to be addressed, it is difficult to bring everything together into a single ‘unit’. As such, a hierarchical system of data arrangement could be effective. Users could access a Master Display, providing a very brief and general overview of the gene/protein of interest (perhaps graphical). Such a display would ideally feature some form of general Master Annotation (e.g. similar to a FunCat classification, but based upon all of the data obtained by the program). The page would also feature a series of links to different forms of annotation (e.g. involvement in a particular pathway, domain arrangement with predicted function, expression profiles etc.). Each of these links would provide a concise summary, in addition to the information used to generate said annotation, and the option to access more in-depth levels of information. The key to this system is that it provides functional annotation in simple, well-defined groups, which the user can access based on their own interest. Jamie 16:30, 30 January 2006 (EST)

For those parts of the annotation where it applies, genome-browser style displays show a lot of information in a concise format. They usually make it easy to show, hide or reorder the tracks in which the user is interested, making a useful customizable display. They also highlight how the different annotations fit together, whether they annotate the same part of the protein, and so on. The DAS paper pointed out that errors and contradictions in the annotations often become obvious in this format, as well. Examples: DAS protein annotation viewer, gbrowse, gbrowse tutorial Joy 21:58, 30 January 2006 (EST)


The display of annotations should be in a hierarchical arrangement, with each level linked to literature citations or ontology sites. Only a defined set of terms on which the scientific community has already agreed should be used to compose the hierarchical arrangement. The annotation should be available both in plain-text format and in a graphical (GUI) format. Rachel 08:39, 31 January 2006 (EST)


  • Interface with existing tools ?

Molecule-centric

Map annotations on a representation of the molecule ...

Function centric

Describe function, explain which molecules collaborate how towards this function

Analysis of annotations

When are we satisfied with the result?

In the context of my research, which uses a systems biology approach to identify and annotate abiotic stress-associated genes in Arabidopsis, I would be satisfied with my results if the unknown open reading frame could be categorized into a defined set of terms, such as Gene Ontology terms, so that the unknown gene's expression profile in a microarray, or its sequence motifs, would correspond with the defined set of terms in a cross-referenced database, i.e. Gene Ontology or Sequence Ontology. These results could then be used to link together large datasets of expression profiles with well-defined annotations. Rachel 09:06, 30 January 2006 (EST)

Since my thesis project is directly concerned with the characterization of members of the MoxR AAA+ family, a group of ATPases whose function is only poorly understood, a tool for the functional annotation of unknown proteins would be invaluable. It would allow automatic compilation of data about MoxR family members, as well as on known/putative interaction partners, which would help in elucidating the role and mechanism of action of these ATPases. The tool would be most useful to my project if it provided me with enough data to help in the design of further experimentation (e.g. to help verify suggested roles, mechanisms, involvement in pathways etc.). It's difficult to say exactly which information would be helpful, thus a range of diverse information would be best -- so long as it was presented concisely. At this point I would be happy with the annotation provided by the program. Results from this experimental evidence could then be fed back into the system to improve the quality of the annotation. Jamie 16:08, 30 January 2006 (EST)

I don't really have a "thesis-perspective" on when a functional annotation is useful, since I've never been involved in trying to understand the function of a particular gene or protein! Joy 22:04, 30 January 2006 (EST)

I would rather ask two different questions. Firstly, what exactly is our result? What am I interested in? Localization of my proteins, a pathway that they might act in or a biochemical function? For this question, the Gene Ontology project might be very useful. It provides the vocabulary for molecular functions, biological processes and cellular components (localization). Even if our annotation tool can find an annotation for only one of these three categories, it might still be useful to design experiments to find out about the other categories, for example. Secondly, how much confidence do we have in the result? We should develop a scoring scheme that assigns a quality score to each component of our result (function, process, localization, others?) based on the evidence gathered from all our different Sources of annotations. If the category we are interested in yields a very high score, we can be satisfied with our result. If it does not, we have to think about experiments to produce more evidence to increase the score ;-) Michaela
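A scoring scheme of the kind suggested above could, for instance, combine the confidences of all evidence that supports the same category. The sketch below uses a noisy-OR combination, which assumes that the sources are independent; that assumption, the combination rule and the example values are illustrative choices, not a scheme proposed in the discussion.

```python
from collections import defaultdict


def combine(confidences):
    """Noisy-OR: probability that at least one independent source is right."""
    remaining_doubt = 1.0
    for c in confidences:
        if not 0.0 <= c <= 1.0:
            raise ValueError("confidences must lie between 0 and 1")
        remaining_doubt *= 1.0 - c
    return 1.0 - remaining_doubt


def score_by_category(evidence):
    """evidence: iterable of (category, source, confidence) tuples."""
    grouped = defaultdict(list)
    for category, _source, confidence in evidence:
        grouped[category].append(confidence)
    return {category: combine(values) for category, values in grouped.items()}


# Hypothetical evidence for one protein.
evidence = [
    ("function", "homology", 0.6),
    ("function", "coexpression", 0.5),
    ("localization", "signal peptide prediction", 0.7),
]
scores = score_by_category(evidence)
```

In this example the "function" category reaches a score of about 0.8 from two moderately confident sources, while "process" receives no score at all and would be a candidate for further experiments.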

Prior Approaches

DAS

Overview

The Distributed Annotation System "allows sequence annotations to be decentralized among multiple third-party annotators and integrated on an as-needed basis by client-side software." [1]. The client-side software typically provides a simple graphical view, similar to a genome browser, with independent annotation tracks (called "layers" in DAS).

DAS has three components:

  1. Reference Server: Contains the reference sequence.
  2. Annotation Server: Provides annotations relative to the reference sequence.
  3. Client: Fetches the data from multiple servers and provides an integrated view.

All results from DAS queries are in XML format, with the exception of "link" queries, which return HTML pages. The format of annotation files is similar to the General Feature Format.

Sequence Numbering

Sequences are numbered relative to the reference sequence, stored on the reference server. The reference sequence contains a series of entry points, each of which has a length. For example, for a DNA sequence, the entry points might represent the beginnings of chromosomes or contigs. All annotations are then relative to these entry points. Each entry point can have substructure, which would be a series of subsequences together with their start and end points. This structure is recursive, that is, each subsequence can have further subsequences. Contradictions are not removed.
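The recursive structure of entry points and subsequences can be illustrated with a small coordinate conversion. This is a sketch of the idea only and does not reproduce the DAS data model; the class, the function and the segment names are hypothetical, and positions are assumed to be 1-based and given relative to the parent segment.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Segment:
    name: str
    start: int   # 1-based start, in the coordinates of the parent segment
    end: int     # inclusive end, in the coordinates of the parent segment
    children: List["Segment"] = field(default_factory=list)


def to_entry_point(root, name, position):
    """Convert a 1-based position within segment `name` to entry-point coordinates."""

    def find_offset(node, offset):
        # offset: value to add to a node-local position to reach root coordinates
        if node.name == name:
            return offset
        for child in node.children:
            found = find_offset(child, offset + child.start - 1)
            if found is not None:
                return found
        return None

    offset = find_offset(root, 0)
    if offset is None:
        raise KeyError(name)
    return position + offset


# Hypothetical entry point with one contig that has a further subsequence.
chromosome = Segment("chrI", 1, 10000, [
    Segment("contig1", 1001, 5000, [
        Segment("contig1.a", 101, 600),
    ]),
])
```

Position 10 of "contig1.a" is position 110 of "contig1" and therefore position 1110 of the entry point "chrI".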

The Annotated Object

The annotated object is a collection of the following data:

  • entry_points: the list of entry points and their sizes for a data source
  • dna: the DNA associated with a subsequence
  • type: a biologically significant description, e.g. "tRNA", "snoRNA" and "miscRNA" in an RNA track
  • features: the annotations for the segment, in a GFF-like format
  • link: an HTML page with information about the annotation
  • stylesheet: suggested display mode for the annotation

Example: a request might be of the form: features?segment=ZK154:1000,2000
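A client could assemble such a request as sketched below. The sketch assumes the DAS 1 convention of a URL of the form PREFIX/das/SOURCE/features with the range passed in a "segment" parameter; the server address and the data source name are hypothetical.

```python
from urllib.parse import urlencode


def parse_range(text):
    """Split 'ZK154:1000,2000' into ('ZK154', 1000, 2000)."""
    reference, _, span = text.partition(":")
    start, _, stop = span.partition(",")
    return reference, int(start), int(stop)


def features_url(server, source, reference, start, stop):
    """Build a features request for one segment of a reference sequence."""
    if start > stop:
        raise ValueError("start must not exceed stop")
    # Keep ':' and ',' readable instead of percent-encoding them.
    query = urlencode({"segment": f"{reference}:{start},{stop}"}, safe=":,")
    return f"{server.rstrip('/')}/das/{source}/features?{query}"


# Hypothetical server and data source.
url = features_url("http://das.example.org", "elegans", "ZK154", 1000, 2000)
```

The response to such a request would be an XML document listing the features that fall within the requested range.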

Example: Protein Annotation

The original DAS specifications were intended for genome annotation, but have since been extended to protein annotations. The Center for Biological Sequence Analysis's Protein Annotation Viewer makes use of the DAS specifications to integrate a number of their protein annotation tools. The annotations include several "highly cited methods", including SignalP and NetPhos. A simple graphical client, written in Perl, is provided to visualize the results. They note that if other annotations were available using DAS, they could easily add them into the viewer.

Advantages of DAS

  • annotators maintain control over their annotation: they can use their own database structure, update their annotations whenever they wish, etc.
  • annotations can disagree without breaking the system; the user can easily identify disagreements and deal with them individually
  • provides a common standard for multiple groups to integrate their data
  • makes it easy to add your own annotation to popular viewers

Relevance of DAS to us

DAS is, potentially, very useful as a common standard for providing annotations. I don't think that the original intentions of DAS are so relevant to us, since we aren't curating our own set of annotations relative to some reference database. On the other hand, if we wanted to have each agent do a single type of analysis and then output its results in DAS format, this could potentially be useful as a common format to collect the various kinds of analysis into a single viewer.

Useful Links

BioDas: DAS specifications and open source software. Implementations in Perl and Java.

GO

  • Gene Ontology (GO)

Summary: The Gene Ontology consortium uses a strict set of structured vocabularies for annotating genes, gene products and sequence features. The vocabulary uses phrases and terminology familiar to biologists. The terminology is most often applied to microarray data, where correlations between functional annotation and expression patterns can give insight into the underlying biological phenomena. Other uses for biologists are:

  1. gene function prediction
  2. construction and analysis of cellular pathways
  3. association of genes to genetically inherited diseases

The computer science community uses the GO as a test case for applying description logic to the creation of complete and logically consistent ontologies.
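Because GO terms form a directed acyclic graph in which a term may have several parents, an annotation to a specific term implies annotation to all of its ancestors. The sketch below illustrates this with a toy ontology; the identifiers (TOY:...) and term names in the comments are invented for the example and are not real GO terms.

```python
# Toy ontology: each term lists its direct parents.
PARENTS = {
    "TOY:0001": [],                        # root, e.g. "molecular function"
    "TOY:0002": ["TOY:0001"],              # e.g. "catalytic activity"
    "TOY:0003": ["TOY:0001"],              # e.g. "binding"
    "TOY:0004": ["TOY:0002", "TOY:0003"],  # a term with two parents
}


def ancestors(term, parents):
    """All terms reachable from `term` by following parent links."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def implied_terms(direct_annotations, parents):
    """Direct annotations of a gene product plus all ancestor terms."""
    implied = set(direct_annotations)
    for term in direct_annotations:
        implied |= ancestors(term, parents)
    return implied
```

Comparing implied rather than direct term sets lets two annotations at different levels of detail still be recognized as consistent.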

  • Additional Ontologies for Biology:
  1. Open Biomedical Ontologies [2]: its aim is to extend GO development into other biological domains (e.g. anatomy, development, proteomic information). This collection also includes the relationship types ontology and the Sequence Ontology [3]
  2. Sequence Ontology: its purpose is to provide structured terms and relationships for describing the features and attributes of biological sequences, i.e. DNA, RNA and proteins

Rachel 11:18, 31 January 2006 (EST)

Integrated Systems

  • AutoFACT

A fully automated tool for the functional annotation of unknown gene/protein sequences. It is a command-line-driven program developed in Perl which runs on Linux/UNIX systems. A web-based version of the program is also available, but can only handle a very limited number of sequences at one time. The program is based on a sequence similarity approach, wherein query sequences (either protein or nucleic acid) are compared, using BLAST analysis, to a number of different databases, as selected by the user, and undergo a hierarchical filtering process. Significant matches are detected based upon user-defined cutoff scores and the descriptive lines of these matches are then screened for the presence of ‘informative’ and ‘uninformative’ keywords which describe function. Sequences which match only ‘uninformative’ sequences, or which do not have any significant hits in the databases, are compared to the SMART and PFAM databases to look for recognizable domains/motifs. Nucleotide sequences are also compared to an EST database. The results of these analyses are used to assign the sequence to one of six ‘annotation classes’. ‘Informative’ sequences are further assigned to additional classifications (e.g. KEGG Pathways, COG functional groups etc.). The ‘informative’ terms from the descriptive lines are also used to functionally annotate these sequences. Data is output in a viewer-friendly HTML format, as well as in two alternative formats useful for further data manipulation. The program also keeps a log so that all decision-making steps in the annotation process are recorded.
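The keyword screening step can be illustrated as follows. This is not AutoFACT's code (AutoFACT is written in Perl): the keyword list, the E-value cutoff and the function names are assumptions chosen only to show the principle.

```python
# Illustrative keywords only; not the list used by AutoFACT.
UNINFORMATIVE_KEYWORDS = ("hypothetical", "unknown", "unnamed", "uncharacterized")


def is_informative(description):
    """A description line is informative if it contains no uninformative keyword."""
    lowered = description.lower()
    return not any(keyword in lowered for keyword in UNINFORMATIVE_KEYWORDS)


def first_informative_hit(hits, max_evalue=1e-5):
    """Return the description of the best significant, informative hit.

    hits: list of (description, evalue) tuples, ordered best hit first.
    Returns None if no hit qualifies; such a sequence would be passed on
    to the domain/motif search.
    """
    for description, evalue in hits:
        if evalue <= max_evalue and is_informative(description):
            return description
    return None


# Hypothetical similarity search results for one query sequence.
hits = [
    ("hypothetical protein", 1e-50),
    ("ATP-dependent protease", 1e-30),
    ("unknown", 1e-3),
]
```

For these hits the annotation is taken from the second hit, since the best hit carries only an uninformative description.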

Although a powerful program, its assignment of function is based solely on a sequence similarity approach, and is also heavily dependent upon the descriptive annotations currently available in the databases. A functional annotation program that considers additional forms of data in its analysis would be a much more powerful tool. However, AutoFACT may be useful as a component of our annotation package, possibly providing information and variables that our program can consider during its functional assignment process. Jamie 16:36, 30 January 2006 (EST)


  • FunCat

A hierarchical classification system based upon the specific biochemical pathways / systems in which proteins are involved. It represents a form of structured, ‘functional vocabulary’. Proteins are assigned to a main functional category, and then to additional subcategories. Proteins may be assigned to multiple subgroups, in order to fully describe their function. Although the system is not generally species specific, certain subcategories exist for cases in which certain functions are confined to particular groups of organisms. The FunCat system has been used to assign proteins manually or automatically (e.g. by sequence similarity). A major benefit of the FunCat system is that it is designed to cover both Prokaryotes and Eukaryotes. A publicly accessible repository, the FunCatDB, stores information associated with the different categories and provides interfaces and access to more in depth information.
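A hierarchy of main categories and subcategories lends itself to dotted numeric codes, from which the parent categories can be read off directly. The sketch below assumes codes of this form; the specific codes and protein names are illustrative and are not taken from the FunCat catalogue.

```python
def ancestors(code):
    """Parent categories of a dotted code, most general first."""
    parts = code.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts))]


def main_categories(assignments):
    """Main functional categories of each protein.

    assignments: dict mapping a protein name to a list of category codes;
    a protein may be assigned to several subcategories.
    """
    return {
        protein: sorted({code.split(".")[0] for code in codes})
        for protein, codes in assignments.items()
    }


# Hypothetical assignments.
assignments = {
    "protA": ["01.03.16", "10.01"],
    "protB": ["01.03"],
}
```

A protein assigned to "01.03.16" is thereby also a member of "01.03" and of the main category "01", which is what allows a display to start from a general overview and then refine it.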

The system is interesting, although it is (intentionally) very general in nature. This helps make the system easy to use, but does not provide a wealth of specific information on a particular protein. As such, it does not really provide a ‘complete’ functional annotation (although considerable information can be obtained for a given sequence using the FunCatDB). It is also, in and of itself, not a predictive tool. The annotation tool we are attempting to develop is ‘predictive’ in nature, and it would be useful if it provided more information than is available through a FunCat classification alone. It is possible, however, that the FunCat categories may be useful as a part of our overall annotation, or in part of our evaluation of the function of our unknown gene. In order to use FunCat effectively, however, it would be necessary to develop a method of automatically assigning proteins to different FunCat groups. Such an automated system would ideally be based on the compilation of data from various different sources and not simply sequence similarity, as the latter can be misleading without sufficient additional support.

The FunCatDB is an interesting resource, and can be used to access a good deal of additional information pertaining to specific proteins currently classified within the FunCat system. Although powerful, its information is still confined to a subset of organisms and does not contain an interface for use as a predictive tool. Jamie 16:36, 30 January 2006 (EST)




   

Further reading and resources