BIO project GO-term table
Constructing the GO term table
Extracting gene functions from GO and GOA
The table of selected biological process GO terms is a starting point for system definition and gene selection.
The objective of selecting GO terms from the ontology is to find nodes that have a small number of genes associated with them (or their children). If the number of annotated genes is too small, not enough may be known about the system to usefully model it. If the number is too large, modelling the system will be too time consuming for the scope of a class project. It is also highly likely that such a system should actually be broken down into subsystems, which ought to be modelled individually. I have chosen GO terms with three to five annotated genes as a cutoff that seems workable.
Roughly, this selection works as follows:
1. Retrieve the ontology terms, and the list of annotated human genes from the database.
2. Remove all terms that are not in the biological process ontology.
3. Associate all human genes with their respective terms.
4. Remove all terms that are leaves (no children) AND have no genes annotated.
5. For each term, find all descendants.
6. Of all descendants, collect all annotated genes.
7. Make the list of genes unique. This is important since there may be more than one path to a gene and we don't want to double count it.
8. Store the number of genes associated to a term and its descendants.
9. Select terms that have 3 to 5 associated genes.
10. Remove terms that contain certain keywords that may make the process less suitable for our purposes, such as "development", "morphogenesis", "behavior".
11. Write the remaining terms into a Wiki table format, with links to GO terms and genes.
Details below.
Creation of the Process Table
The Process Table was created in the following steps:
- 1. I downloaded the most recent version of goa_human.gaf from the GOA repository at the EBI, released on November 22, 2016. The file contains gene names and GO terms. Here is one line of the contents (selected columns):
UniProtID | GeneSymbol | GOtermID | Reference | EvidenceCode | Description | taxID
Q14209 | E2F2 | GO:0051726 | GO_REF:0000019 | IEA | Transcription factor E2F2 | taxon:9606
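The workflow described here was carried out in R; as a rough illustration only, the following Python sketch shows how the selected columns could be pulled out of a GAF line. It assumes the tab-separated GAF 2.x layout with 17 columns and `!` comment lines. The names `GAF_COLUMNS` and `read_gaf` are hypothetical, and the sample line leaves empty every column that is not shown in the excerpt above (the `UniProtKB`, `P` and `protein` values are assumed for illustration).

```python
import io

# 0-based positions of the selected fields in a GAF 2.x line (assumed layout).
GAF_COLUMNS = {
    "UniProtID": 1,     # DB Object ID
    "GeneSymbol": 2,    # DB Object Symbol
    "GOtermID": 4,      # GO ID
    "Reference": 5,     # DB:Reference
    "EvidenceCode": 6,  # Evidence Code
    "Description": 9,   # DB Object Name
    "taxID": 12,        # Taxon
}


def read_gaf(handle):
    """Yield one dict per annotation line, skipping '!' comments and blanks."""
    for line in handle:
        if line.startswith("!") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        yield {name: fields[i] for name, i in GAF_COLUMNS.items()}


# Illustrative line built from the excerpt; columns not shown there stay empty.
SAMPLE_FIELDS = [
    "UniProtKB", "Q14209", "E2F2", "", "GO:0051726", "GO_REF:0000019",
    "IEA", "", "P", "Transcription factor E2F2", "", "protein",
    "taxon:9606", "", "", "", "",
]
SAMPLE_GAF = "!gaf-version: 2.1\n" + "\t".join(SAMPLE_FIELDS) + "\n"

records = list(read_gaf(io.StringIO(SAMPLE_GAF)))
```

For the real file, `read_gaf(open("goa_human.gaf"))` would be the analogous call, assuming the file has been decompressed first.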
- 2. I downloaded the GO term file go-basic.obo from its repository at GO, the Gene Ontology Consortium. It contains the actual ontology terms (45,896 terms); they look like this (some information omitted):
[Term]
id: GO:0051726
name: regulation of cell cycle
namespace: biological_process
def: "Any process that modulates the rate or extent of progression through the cell cycle." [GOC:ai, GOC:dph, GOC:tb]
is_a: GO:0050794 ! regulation of cellular process
relationship: regulates GO:0007049 ! cell cycle
- Note the is_a relationship that points to the parent term, GO:0050794, of which GO:0051726 is a specialization. Each term lists its parents but not its children, so we can walk up to the root from any term, but we can't descend. To navigate both up and down through the ontology, we need to record the child terms for each node.
- 3. Open this file in R and read all the terms that are labelled as namespace: biological_process into a data frame storing ID, name, definition, and parent term IDs. These are 29,309 terms. Then add each term ID to its parent terms as a child term.
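As an illustration of this step (in Python rather than the R used for the actual table), the sketch below reads [Term] stanzas, keeps only one namespace, and then inverts the is_a links so each term also knows its children. It assumes stanzas are separated by blank lines; `read_obo_terms` and `add_children` are hypothetical names, and the sample text is abbreviated to the fields the sketch uses.

```python
import io


def read_obo_terms(handle, namespace="biological_process"):
    """Parse [Term] stanzas into {id: {"id", "name", "def", "parents", ...}}."""
    terms = {}
    for stanza in handle.read().split("\n\n"):
        lines = [ln.strip() for ln in stanza.strip().splitlines()]
        if not lines or lines[0] != "[Term]":
            continue  # skip the file header and non-term stanzas
        term = {"parents": []}
        for line in lines[1:]:
            key, sep, value = line.partition(": ")
            if not sep:
                continue
            if key == "is_a":
                # "GO:0050794 ! regulation of cellular process" -> "GO:0050794"
                term["parents"].append(value.split(" ! ")[0].strip())
            elif key in ("id", "name", "namespace", "def"):
                term[key] = value
        if term.get("namespace") == namespace:
            terms[term["id"]] = term
    return terms


def add_children(terms):
    """Record each term ID as a child of every parent that is in the set."""
    for term in terms.values():
        term["children"] = []
    for term_id, term in terms.items():
        for parent_id in term["parents"]:
            if parent_id in terms:
                terms[parent_id]["children"].append(term_id)
    return terms


# Abbreviated sample: the stanza shown above, its parent, and one term
# from another namespace that should be filtered out.
SAMPLE_OBO = '''
[Term]
id: GO:0051726
name: regulation of cell cycle
namespace: biological_process
def: "Any process that modulates the rate or extent of progression through the cell cycle." [GOC:ai, GOC:dph, GOC:tb]
is_a: GO:0050794 ! regulation of cellular process
relationship: regulates GO:0007049 ! cell cycle

[Term]
id: GO:0050794
name: regulation of cellular process
namespace: biological_process

[Term]
id: GO:0005634
name: nucleus
namespace: cellular_component
'''

terms = add_children(read_obo_terms(io.StringIO(SAMPLE_OBO)))
```

Only is_a links are followed here, matching the note above; other relationship types are ignored in this sketch.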
- 4. Open the human GOA file and read all gene annotations into a data frame. These are 392,942. That's a lot - we expect at least three annotations for each of the 19,225 genes that GOA currently contains, one for each ontology, but we find on average 20 annotations. Next, add each gene symbol to the term data frame, if the annotation's GO ID is in our set.
- 5. Next, remove all terms that are leaves (i.e. they don't have children), AND have no annotated genes. These are terms that might be specific to fungi, prokaryotes, viruses or similar. Since our goal is to count the number of genes associated with each term, those can't contribute since their own count is zero. Deleting those nodes obviously creates new leaves in the ontology, so this process has to be iterated, until no more nodes can be removed. This deletes 11,733 terms from the ontology.
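The iteration can be sketched as follows. This is a minimal Python illustration on a toy ontology, not the R code that produced the table; `prune_empty_leaves` and the toy term names are hypothetical.

```python
def prune_empty_leaves(children, genes):
    """Repeatedly drop terms that have no children and no annotated genes.

    children: {term: set of child terms}; genes: {term: set of gene symbols}.
    Returns the pruned child map and the number of removed terms.
    """
    children = {term: set(kids) for term, kids in children.items()}  # work on a copy
    removed = 0
    while True:
        empty_leaves = {
            term for term, kids in children.items()
            if not kids and not genes.get(term)
        }
        if not empty_leaves:
            break  # nothing left to remove
        for term in empty_leaves:
            del children[term]
        # Removing leaves may turn their parents into new leaves.
        for kids in children.values():
            kids -= empty_leaves
        removed += len(empty_leaves)
    return children, removed


# Toy ontology: the branch A -> B carries no genes; C is annotated, D is not.
toy_children = {
    "root": {"A", "C"},
    "A": {"B"},
    "B": set(),
    "C": {"D"},
    "D": set(),
}
toy_genes = {"C": {"GENE1"}}

pruned, n_removed = prune_empty_leaves(toy_children, toy_genes)
```

In the toy example, B and D go in the first pass, A becomes a leaf and goes in the second, and the third pass finds nothing more to remove.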
- 6. We need a function that finds all descendants of a term. In order to find and visit all descendants, start from a term and put its child(ren) into a queue. While the queue is not empty, remove the first ID from the queue, and add those of its children to the queue that we haven't visited before.
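A minimal Python sketch of this queue-based traversal, using a toy child map, might look like this. The function name `descendants` and the toy terms are hypothetical; the original was written in R.

```python
from collections import deque


def descendants(term, children):
    """Return the set of all descendants of `term` (breadth-first)."""
    visited = set(children.get(term, ()))
    queue = deque(visited)
    while queue:
        current = queue.popleft()  # remove the first ID from the queue
        for child in children.get(current, ()):
            if child not in visited:  # only enqueue terms not seen before
                visited.add(child)
                queue.append(child)
    return visited


# Toy ontology with two paths from root to C (via A and via B).
toy_children = {
    "root": {"A", "B"},
    "A": {"C"},
    "B": {"C"},
    "C": {"D"},
    "D": set(),
}

all_below_root = descendants("root", toy_children)
```

The `visited` check is what keeps C from being processed twice even though it is reachable through both A and B.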
- 7. For each term, count all genes that are associated with it or with any of its descendants. The root term, GO:0008150, has 17,030 associated genes.
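The counting can be illustrated with sets, which make the gene list unique automatically. The sketch below is Python on toy data with hypothetical names (`gene_counts`, `G1` to `G3`); it is not the original R implementation.

```python
def descendants(term, children):
    """Collect all descendants of `term`, visiting each term once."""
    visited = set()
    pending = list(children.get(term, ()))
    while pending:
        current = pending.pop()
        if current not in visited:
            visited.add(current)
            pending.extend(children.get(current, ()))
    return visited


def gene_counts(children, genes):
    """Count unique genes annotated to each term or any of its descendants."""
    counts = {}
    for term in children:
        pooled = set(genes.get(term, ()))
        for desc in descendants(term, children):
            pooled |= genes.get(desc, set())  # set union avoids double counting
        counts[term] = len(pooled)
    return counts


# Toy data: G3 is annotated to both A and its child C, and C has two parents.
toy_children = {"root": {"A", "B"}, "A": {"C"}, "B": {"C"}, "C": set()}
toy_genes = {"A": {"G1", "G3"}, "B": {"G2"}, "C": {"G3"}}

counts = gene_counts(toy_children, toy_genes)
```

Summing per-term annotation counts along each path would give 4 or more for the toy root; pooling into a set gives the 3 distinct genes.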
- 8. Next, select all terms that have from three to five associated genes AND contain none of a list of keywords. These keywords include: "proliferation", "development", "morphogenesis", "regression", "induction", "maturation", "formation", "growth", "*bolic process", "biosynthetic process", and "behavior". I feel that even though the GO terms may be manageable, the actual system would likely contain an unwieldy number of genes. I also removed all "positive regulation of ...", and "negative regulation of ..." terms. These always have a parent term: "regulation of ..." and that term would be the right one to consider.
- 9. Finally, the 1,224 selected terms were written into a table in an editable Wiki code format, including links to GO and UniProt.
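The selection rule in step 8 can be sketched as a simple predicate. This Python illustration assumes plain substring matching for the keywords (so "*bolic process" becomes the substring "bolic process") and prefix matching for the positive and negative regulation terms; both are assumptions, as are the names `is_selected`, `EXCLUDED_KEYWORDS` and `EXCLUDED_PREFIXES` and the gene counts used in the example.

```python
EXCLUDED_KEYWORDS = (
    "proliferation", "development", "morphogenesis", "regression",
    "induction", "maturation", "formation", "growth",
    "bolic process",  # stands in for the "*bolic process" wildcard
    "biosynthetic process", "behavior",
)
EXCLUDED_PREFIXES = ("positive regulation of", "negative regulation of")


def is_selected(name, n_genes, low=3, high=5):
    """True if a term has low..high genes and no excluded keyword or prefix."""
    if not low <= n_genes <= high:
        return False
    if name.startswith(EXCLUDED_PREFIXES):
        return False
    return not any(keyword in name for keyword in EXCLUDED_KEYWORDS)


# Hypothetical (name, gene count) pairs for illustration only.
candidates = [
    ("phagolysosome assembly", 4),
    ("positive regulation of cell cycle", 4),
    ("glucose metabolic process", 3),
    ("lung development", 5),
    ("regulation of cell cycle", 40),
]
selected = [name for name, n in candidates if is_selected(name, n)]
```

Note that plain substring matching is deliberately broad: "formation" would also exclude a term containing "transformation", which may or may not be what the original filter did.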
How to use the table to adopt a function and define a "system" for the project
I have randomly chosen a term – GO:0001845 (phagolysosome assembly) – to illustrate the process:
- 1. Browse the table and find a function that interests you. It's very diverse and will give you a good first sense of the complexity of functions in the cell and our state of knowledge about them. Have a look at the term information. There is probably a link to an associated "GONUTS" page with more information.
- 2. Look at the function information available when you follow the links of the annotated genes to UniProt. Most of these genes have more than one GO term annotated to them, and taken together, these GO terms shed more light on how the "system" is more than, and different from, its annotated GO terms. For example, from the genes in GO:0001845 let's look at "SRPX". This gene codes for P78539 (SRPX). Here are the annotated processes in which this protein participates:
- Just looking at these processes tells me a lot. The annotation is for one protein, but it is implicated in several aspects of biology. Some of them are causal to others, and some of them are generalizations. Here, the phagolysosome needs to be assembled in order to function. Once it has been made, this enables stress response, and contact inhibition, likely via apoptotic pathways and autophagy. This phagolysosome assembly sounds like a good system to work on. But is the term too broad, or too narrow? Could I find a better name for it?
- 3. Follow the link to the phagolysosome assembly GO term page at QuickGO, and explore the tabs that contain the Ancestor Chart and the Child Terms:
Relationship to GO:0001845 | Child Term | Child Term Name
Part of | GO:0090384 | phagosome-lysosome docking
Part of | GO:0090385 | phagosome-lysosome fusion
Is a | GO:0090387 | phagolysosome assembly involved in apoptotic cell clearance
Explore how many proteins are annotated to this term. There are probably VERY many - so set a taxon filter on this table (taxon: 9606) to view only human proteins.
As a result of my first exploration I see:
- there are probably around 20 proteins that I'll need to consider as system components;
- most of them are RAB genes which have a reasonably well understood function. It will be interesting to find out how this function enables the process;
- the mechanism of this system will likely include proteins that are specific to this process, and others that are involved in membrane fusion in a general sense and are recruited to this specific task.
This looks like a good starting point to define the components of a system. For now, I would be OK with calling it the "Phagosome / Lysosome Fusion System".
Considerations for choosing a system
Keep your systems simple. I would avoid choosing systems/processes that integrate sensory, nervous, hormonal and cellular components. This may become too complex. Narrowing it down to a manageable "subsystem" is a valuable exercise in itself. Such a system may implement
- integrating input,
- transmitting input signals to their effectors,
- regulating the process,
- providing resources,
- defining setpoints,
- assembling or disassembling the system,
- mediating interactions with other systems,
- or similar...
Keep your systems manageable. When considering how many genes are associated with a system, check the taxon section of the relevant GO terms' statistics on QuickGO. The number of genes involved in the process in humans is likely as large as the largest number for ANY species, although many of the human genes may not have been annotated for that process (yet). For example, if the mouse (Mus musculus) has 20 annotated genes and humans have only two, that probably does not mean that humans accomplish with two genes what takes the mouse twenty. In this situation, looking for orthologues of the mouse genes should lead you to human candidate genes. But as a corollary, if the mouse has many, many genes annotated, that particular process might not be so suitable for this project after all.
Resources