BIO Assignment Week 6

From "A B C"
Jump to navigation Jump to search

Assignment for Week 6
Function

< Assignment 5 Assignment 7 >

Note! This assignment is currently inactive. Major and minor unannounced changes may be made at any time.

 
 

Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.




 

Introduction

 

In this assignment we will perform a few computations with coordinate data in PDB files, look more in depth at domain annotations, and compare them across different proteins related to yeast Mbp1. We will write a function in R to help us plot a graphic of the comparison and collaborate to share data for our plots. But first we need to go through a few more R concepts.


 


Stereo vision

Task:

Continue with your stereo practice.

Practice at least ...

  • two times daily,
  • for 3-5 minutes each session.
  • Measure your interocular distance and your fusion distance as explained here on the Student Wiki and add it to the table.

Keep up your practice throughout the course. Once again: do not go through your practice sessions mechanically. If you are not making constant progress in your practice sessions, contact me so we can help you on the right track.

Programming R code

First, we will cover essentials of R programming: the fundamental statements that are needed to write programs–conditional expressions and loops, and how to define functions that allow us to use programming code. But let's start with a few more data types of R so we can use the concepts later on: matrices, lists and data frames.


Task:

Please begin by working through the short R - tutorial: matrices section and the following sections on "Lists" and "Data frames".

Note that what we have done here is just the bare minimum on vectors, matrices and lists. The concepts are very generally useful, and there are many useful ways to extract subsets of values. We'll come across these in various parts of R sample code. But unless you read the provided code examples line by line, make sure you understand every single statement and ask if you are not clear about the syntax, this will all be quickly forgotten. Use it or lose it!


R is a full featured programming language and in order to be that, it needs the ability to manipulate Control Flow, i.e. the order in which instructions are executed. By far the most important control flow statement is a conditional branch i.e. the instruction to execute code at alternative points of the program, depending on the presence or absence of a condition. Using such conditional branch instructions, two main types of programming constructs can be realized: conditional expressions and loops.

Conditional expressions

The template for conditional expression in R is:

if( <expression 1> ) {
  <statement 1>
} 
else if ( <expresssion 2> ) {
  <statement 2>
} 
else {
  <statement 3>
}

...where both the else if (...) { ... } and the else (...) block are optional. We have encountered this construct previously, when we assigned the appropriate colors for amino acids in the frequency plot:

if      (names(logRatio[i]) == "F") { barColors[i] <- hydrophobic }
else if (names(logRatio[i]) == "G") { barColors[i] <- plain }
   [... etc ...]
else                                { barColors[i] <- plain }

Logical expressions

We have to consider the <expression> in a bit more detail: anything that is, or produces, or can be interpreted as, a Boolean TRUE or FALSE value can serve as an expression in a conditional statement.

Task:
Here are some examples. Copy the code to an R script, predict what will happen on in each line and try it out:

# A boolean constant is interpreted as is:
if (TRUE)    {print("true")} else {print("false")}
if (FALSE)   {print("true")} else {print("false")}

# Strings of "true" and "false" are coerced to their 
# Boolean equivalent, but contrary to other programming languages
# arbitrary, non-empty or empty strings are not interpreted.
if ("true") {print("true")} else {print("false")}
if ("false") {print("true")} else {print("false")}
if ("widdershins")    {print("true")} else {print("false")}
if ("") {print("true")} else {print("false")}

# All non-zero, defined numbers are TRUE
if (1)      {print("true")} else {print("false")}
if (0)      {print("true")} else {print("false")}
if (-1)     {print("true")} else {print("false")}
if (pi)     {print("true")} else {print("false")}
if (NULL)   {print("true")} else {print("false")}
if (NA)     {print("true")} else {print("false")}
if (NaN)    {print("true")} else {print("false")}
if (Inf)    {print("true")} else {print("false")}

# functions can return Boolean values
affirm <- function() { return(TRUE) }
deny <- function() { return(FALSE) }
if (affirm())    {print("true")} else {print("false")}
if (deny())    {print("true")} else {print("false")}

# N.B. coercion of Booleans into numbers can be done as well
and is sometimes useful: consider ...
a <- c(TRUE, TRUE, FALSE, TRUE, FALSE)
a
as.numeric(a)
sum(a)

# ... or coercing the other way ...
as.logical(-1:1)

Logical operators

To actually write a conditional statement, we have to be able to test a condition and this is what logical operators do. Is something equal to something else? Is it less? Does something exist? Is it a number?

Task:
Here are some examples. Again, predict what will happen ...

TRUE             # Just a statement.

#   unary operator
! TRUE           # NOT ...

# binary operators
FALSE > TRUE     # GREATER THAN ...
FALSE < TRUE     # ... these are coerced to numbers
FALSE < -1        
0 == FALSE       # Careful! == compares, = assigns!!!

"x" == "u"       # using lexical sort order ...
"x" >= "u"
"x" <= "u"
"x" != "u"
"aa" > "u"       # ... not just length, if different.
"abc" < "u"  

TRUE | FALSE    # OR: TRUE if either is true
TRUE & FALSE    # AND: TRUE if both are TRUE

# equality and identity
?identical
a <- c(TRUE)
b <- c(TRUE)
a; b
a == b
identical(a, b)

b <- 1
a; b
a == b
identical(a, b)   # Aha: equal, but not identical


# some other useful tests for conditional expressions
?all
?any
?duplicated
?exists
?is.character
?is.factor
?is.integer
?is.null
?is.numeric
?is.unsorted
?is.vector


Loops

Loops allow you to repeat tasks many times over. The template is:

for (<name> in <vector>) {
   <statement>
}

Task:
Consider the following: Again, copy the code to a script, study it, predict what will happen and then run it.

# simple for loop
for (i in 1:10) {
	print(c(i, i^2, i^3))
}

# Compare excution times: one million square roots from a vector ...
n <- 1000000
x <- 1:n
y <- sqrt(x)

# ... or done explicitly in a for-loop
for (i in 1:n) {
  y[i] <- sqrt (x[i])
}

If you can achieve your result with an R vector expression, it will be faster than using a loop. But sometimes you need to do things explicitly, for example if you need to access intermediate results.


Here is an example to play with loops: a password generator. Passwords are a pain. We need them everywhere, they are frustrating to type, to remember and since the cracking programs are getting smarter they become more and more likely to be broken. Here is a simple password generator that creates random strings with consonant/vowel alterations. These are melodic and easy to memorize, but actually as strong as an 8-character, fully random password that uses all characters of the keyboard such as )He.{2jJ or #h$bB2X^ (which is pretty much unmemorizable). The former is taken from 207 * 77 1015 possibilities, the latter is from 948 ~ 6*1015 possibilities. HIgh-end GPU supported password crackers can test about 109 passwords a second, the passwords generated by this little algorithm would thus take on the order of 106 seconds or eleven days to crack[1]. This is probably good enough to deter a casual attack.

Task:
Copy, study and run ...

# Suggest memorizable passwords
# Below we use the functions: 
?nchar
?sample
?substr
?paste
?print

#define a string of  consonants ...
con <- "bcdfghjklmnpqrstvwxz"
# ... and a string of of vowels
vow <- "aeiouy"

for (i in 1:10) {  # ten sample passwords to choose from ...
    pass = rep("", 14)  # make an empty character vector
    for (j in 1:7) {    # seven consonant/vowel pairs to be created ...
        k   <- sample(1:nchar(con), 1)  # pick a random index for consonants ...
        ch  <- substr(con,k,k)          #  ... get the corresponding character ...
        idx <- (2*j)-1                  # ... compute the position (index) of where to put the consonant ...
        pass[idx] <- ch                 # ...  and put it in the right spot

        # same thing for the vowel, but coded with fewer intermediate assignments
        # of results to variables
        k <- sample(1:nchar(vow), 1)
        pass[(2*j)] <- substr(vow,k,k) 
    }
    print( paste(pass, collapse="") )  # collapse the vector in to a string and print
}

Functions

Finally: functions. Functions look very much like the statements we have seen above. the template looks like:

<name> <- function (<parameters>) {
   <statements>
}

In this statement, the function is assigned to the name - any valid name in R. Once it is assigned, it the function can be invoked with name(). The parameter list (the values we write into the parentheses followin the function name) can be empty, or hold a list of variable names. If variable names are present, you need to enter the corresponding parameters when you execute the function. These assigned variables are available inside the function, and can be used for computations. This is called "passing the variable into the function".

You have encountered a function to choose YFO names. In this function, your Student ID was the parameter. Here is another example to play with: a function that calculates how old you are. In days. This is neat - you can celebrate your 10,000 birthday - or so.

Task:
Copy, explore and run ...

Define the function ...
# A lifedays calculator function

myLifeDays <- function(date = NULL) { # give "date" a default value so we can test whether it has been set
    if (is.null(date)) {
        print ("Enter your birthday as a string in \"YYYY-MM-DD\" format.")
        return()
    }
    x <- strptime(date, "%Y-%m-%d") # convert string to time
    y <- format(Sys.time(), "%Y-%m-%d") # convert "now" to time
    diff <- round(as.numeric(difftime(y, x, unit="days")))
    print(paste("This date was ", diff, " days ago."))
}
Use the function (example)
   myLifeDays("1932-09-25")  # Glenn Gould's birthday

Here is a good opportunity to play and practice programming: modify this function to accept a second argument. When a second argument is present (e.g. 10000) the function should print the calendar date on which the input date will be that number of days ago. Then you could use it to know when to celebrate your 10,000th lifeDay, or your 777th anniversary day or whatever.

Enjoy.

The PDB

Search for GO and EC numbers at PDB...

The search options in the PDB structure database are as sophisticated as those at the NCBI. For now, we will try a simple keyword search to get us started.


Task:

  1. Visit the RCSB PDB website at http://www.pdb.org/
  2. Briefly orient yourself regarding the database contents and its information offerings and services.
  3. Enter Mbp1 into the search field.
  4. In your journal, note down the PDB IDs for the three Saccharomyces cerevisiae Mbp1 transcription factor structures your search has retrieved.
  5. Click on one of the entries and explore the information and services linked from that page.

 

CDD domain annotation

In the last assignment, you followed a link to CDD Search Results from the RefSeq record for yeast Mbp1 and briefly looked at the information offered by the NCBI's Conserved Domain Database, a database of Position Specific Scoring Matrices that embody domain definitions. Rather than access precomputed results, you can also search CDD with sequences: assuming you have saved the YFO Mbp1 sequence in FASTA format, this is straightforward. If you did not save this sequence, return to Assignment 3 and retrieve it again.


Task:

  1. Access the CDD database at http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
  2. Read the information. CDD is a superset of various other database domain annotations as well as NCBI-curated domain definitions.
  3. Copy the YFO Mbp1 FASTA sequence, paste it into the search form and click Submit.
    1. On the result page, clik on View full result
    2. Note that there are a number of partially overlapping ankyrin domain modules. We will study ankyrin domains in a later assignment.
    3. Also note that there may be blocks of sequence colored cyan in the sequence bar. Hover your mouse over the blocks to see what these blocks signify.
    4. Open the link to Search for similar domain architecture in a separate window and study it. This is the CDART database. Think about what these results may be useful for.
    5. Click on one of the ANK superfamily graphics and see what the associated information looks like: there is a summary of structure and function, links to specific literature and a tree of the relationship of related sequences.

SMART domain annotation

The SMART database at the EMBL in Heidelberg offers an alternative view on domain architectures. I personally find it more useful for annotations because it integrates a number of additional features. You can search by sequence, or by accession number and that raises the question of how to retrieve a database cross-reference from an NCBI sequence identifier to a UniProt sequence ID:


ID mapping

Task:

  1. Access the UniProt ID mapping service at http://www.uniprot.org/mapping/
  2. Paste the RefSeq identifier for YFO Mbp1 into the search field.
  3. Use the menu to choose From RefSeq Protein and To UniprotKB AC–the UniProt Knowledge Base Accession number.
  4. Click on Map to execute the search.
  5. Note the ID - it probably starts with a letter, followed by numbers and letters. Here are some examples for fungal Mbp1-like proteins: P39678 Q5B8H6 Q5ANP5 P41412 etc.
  6. Click on the link, and explore how the UniProt sequence page is similar or different from the RefSeq page.

SMART search

Task:

  1. Access the SMART database at http://smart.embl-heidelberg.de/
  2. Click the lick to access SMART in the normal mode.
  3. Paste the YFO Mbp1 UniProtKB Accession number into the Sequence ID or ACC field.
  4. Check the boxes for:
    1. PFAM domains (domains defined by sequence similarity in the PFAM database)
    2. signal peptides (using the Gunnar von Heijne's SignalP 4.0 server at the Technical University in Lyngby, Denmark)
    3. internal repeats (using the programs ariadne and prospero at the Wellcome Trust Centre for Human Genetics at Oxford University, England)
    4. intrinsic protein disorder (using Rune Linding's DisEMBL program at the EMBL in Heidelberg, Germany)
  5. Click on Sequence SMART to run the search and annotation. (In case you get an error like: "Sorry, your entry seems to have no SMART domain ...", let me know and repeat the search with the actual FASTA sequence instead of the accession number.)

Study the results. Specifically, have a look at the proteins with similar domain ORGANISATION and COMPOSITION. This is similar to the NCBI's CDART.

Visual comparison of domain annotations in R

The versatile plotting functions of R allow us to compare domain annotations. The distribution of segments that are annotated as being "low-complexity" or "disordered is particulalry interesting: these are functional features of the amino acid sequence that are often not associated with sequence similarity.

In the following code tutorial, we create a plot similar to the CDD and SMART displays. It is based on the SMART domain annotations of the six-fungal reference species for the course.

Task:
Copy the code to an R script, study and execute it.

# plotDomains
# tutorial and functions to plot a colored rectangle from a list of domain annotations


# First task: create a list structure for the annotations: this is a list of lists
# As you see below, we need to mix strings, numbers and vectors of numbers. In R
# such mixed data types must go into a list.

Mbp1Domains <- list()   # start with an empty list

# For each species annotation, compile the SMART domain annotations in a list.
Mbp1Domains <- rbind(Mbp1Domains, list(  # rbind() appends the list to the existing
    species = "Saccharomyces cerevisiae",
    code    = "SACCE",
    ACC     = "P39678",
    length  = 833,
    KilAN   = c(18,102),  # Note: Vector of (start, end) pairs
    AThook  = NULL,       # Note: NULL, because this annotation was not observed in this sequence
    Seg     = c(108,122,236,241,279,307,700,717),
    DisEMBL = NULL,
    Ankyrin = c(394,423,427,463,512,541),   # Note: Merge overlapping domains, if present
    Coils   = c(633, 655)
    ))

Mbp1Domains <- rbind(Mbp1Domains, list(
    species = "Emericella nidulans",
    code    = "ASPNI",
    ACC     = "Q5B8H6",
    length  = 695,
    KilAN   = c(23,94),
    AThook  = NULL,
    Seg     = c(529,543),
    DisEMBL = NULL,
    Ankyrin = c(260,289,381,413),
    Coils   = c(509,572)
    ))

Mbp1Domains <- rbind(Mbp1Domains, list(
    species = "Candida albicans",
    code    = "CANAL",
    ACC     = "Q5ANP5",
    length  = 852,
    KilAN   = c(19,102),
    AThook  = NULL,
    Seg     = c(351,365,678,692),
    DisEMBL = NULL,
    Ankyrin = c(376,408,412,448,516,545),
    Coils   = c(665,692)
    ))

Mbp1Domains <- rbind(Mbp1Domains, list(
    species = "Neurospora crassa",
    code    = "NEUCR",
    ACC     = "Q7RW59",
    length  = 833,
    KilAN   = c(31,110),
    AThook  = NULL,
    Seg     = c(130,141,253,266,514,525,554,564,601,618,620,629,636,652,658,672,725,735,752,771),
    DisEMBL = NULL,
    Ankyrin = c(268,297,390,419),
    Coils   = c(500,550)
    ))

Mbp1Domains <- rbind(Mbp1Domains, list(
    species = "Schizosaccharomyces pombe",
    code    = "SCHPO",
    ACC     = "P41412",
    length  = 657,
    KilAN   = c(21,97),
    AThook  = NULL,
    Seg     = c(111,125,136,145,176,191,422,447),
    DisEMBL = NULL,
    Ankyrin = c(247,276,368,397),
    Coils   = c(457,538)
    ))

Mbp1Domains <- rbind(Mbp1Domains, list(
    species = "Ustilago maydis",
    code    = "USTMA",
    ACC     = "Q4P117",
    length  = 956,
    KilAN   = c(21,98),
    AThook  = NULL,
    Seg     = c(106,116,161,183,657,672,776,796),
    DisEMBL = NULL,
    Ankyrin = c(245,274,355,384),
    Coils   = c(581,609)
    ))


# Working with data in lists and dataframes can be awkward, since the result
# of filters and slices are themselves lists, not vectors.
# Therefore we need to use the unlist() function a lot. When in doubt: unlist()

#### Boxes #####
# Define a function to draw colored boxes, given input of a vector with
# (start,end) pairs, a color, and the height where the box should go.
drawBoxes <- function(v, c, h) {   # vector of xleft, xright pairs; color; height
    if (is.null(v)) { return() }
    for (i in seq(1,length(v),by=2)) {
        rect(v[i], h-0.1, v[i+1], h+0.1, border="black", col=c)
    }
}

#### Annotations ####
# Define a function to write the species code, draw a grey
# horizontal line and call drawBoxes() for every annotation type
# in the list
drawGene <- function(rIndex) {
    # define colors:
    kil <- "#32344F"
    ank <- "#691A2C"
    seg <- "#598C9E"
    coi <- "#8B9998"
    xxx <- "#EDF7F7"
    
    text (-30, rIndex, adj=1, labels=unlist(Mbp1Domains[rIndex,"code"]), cex=0.75 )
    lines(c(0, unlist(Mbp1Domains[rIndex,"length"])), c(rIndex, rIndex), lwd=3, col="#999999")

    drawBoxes(unlist(Mbp1Domains[rIndex,"KilAN"]),   kil, rIndex)
    drawBoxes(unlist(Mbp1Domains[rIndex,"AThook"]),  xxx, rIndex)
    drawBoxes(unlist(Mbp1Domains[rIndex,"Seg"]),     seg, rIndex)
    drawBoxes(unlist(Mbp1Domains[rIndex,"DisEMBL"]), xxx, rIndex)
    drawBoxes(unlist(Mbp1Domains[rIndex,"Ankyrin"]), ank, rIndex)
    drawBoxes(unlist(Mbp1Domains[rIndex,"Coils"]),   coi, rIndex)
}

#### Plot ####
# define the size of the plot-frame
yMax <- length(Mbp1Domains[,1])   # number of domains in list
xMax <- max(unlist(Mbp1Domains[,"length"]))  # largest sequence length

# plot an empty frame
plot(1,1, xlim=c(-100,xMax), ylim=c(0, yMax) , type="n", yaxt="n", bty="n", xlab="sequence number", ylab="")

# Finally, iterate over all species and call drawGene()
for (i in 1:length(Mbp1Domains[,1])) {
    drawGene(i)
}

# end


When you execute the code, your plot should look similar to this one:

SMART domain annotations for Mbp1 proteins from six fungal species.

Task:
On the Student Wiki, add the annotations for YFO to the plot:

  1. Copy one of the list definitions for Mbp1 domains and edit it with the appropriate values for your own annotations.
  2. Test that you can add the YFO annotation to the plot.
  3. Submit your validated code block to the Student Wiki here. The goal is to compile an overview of all species we are studying in class.
  4. If your working annotation block is in the Wiki before noontime on Wednesday, you will be awarded a 10% bonus on the quiz.


EC

GO

Introduction

Harris (2008) Developing an ontology. Methods Mol Biol 452:111-24. (pmid: 18563371)

PubMed ] [ DOI ]

Hackenberg & Matthiesen (2010) Algorithms and methods for correlating experimental results with annotation databases. Methods Mol Biol 593:315-40. (pmid: 19957156)

PubMed ] [ DOI ]

GO

The Gene Ontology project is the most influential contributor to the definition of function in computational biology and the use of GO terms and GO annotations is ubiquitous.

GO: the Gene Ontology project


link ] [ page ]
size=200px
du Plessis et al. (2011) The what, where, how and why of gene ontology--a primer for bioinformaticians. Brief Bioinformatics 12:723-35. (pmid: 21330331)

PubMed ] [ DOI ]

The GO actually comprises three separate ontologies:

Molecular function
...


Biological Process
...


Cellular component
...


GO terms

GO terms comprise the core of the information in the ontology: a carefully crafted definition of a term in any of GO's separate ontologies.


GO relationships

The nature of the relationships is as much a part of the ontology as the terms themselves. GO uses three categories of relationships:

  • is a
  • part of
  • regulates


GO annotations

The GO terms are conceptual in nature, and while they represent our interpretation of biological phenomena, they do not intrinsically represent biological objects, such a specific genes or proteins. In order to link molecules with these concepts, the ontology is used to annotate genes. The annotation project is referred to as GOA.

Dimmer et al. (2007) Methods for gene ontology annotation. Methods Mol Biol 406:495-520. (pmid: 18287709)

PubMed ] [ DOI ]


GO evidence codes

Annotations can be made according to literature data or computational inference and it is important to note how an annotation has been justified by the curator to evaluate the level of trust we should have in the annotation. GO uses evidence codes to make this process transparent. When computing with the ontology, we may want to filter (exclude) particular terms in order to avoid tautologies: for example if we were to infer functional relationships between homologous genes, we should exclude annotations that have been based on the same inference or similar, and compute only with the actual experimental data.

The following evidence codes are in current use; if you want to exclude inferred anotations you would restrict the codes you use to the ones shown in bold: EXP, IDA, IPI, IMP, IEP, and perhaps IGI, although the interpretation of genetic interactions can require assumptions.

Automatically-assigned Evidence Codes
  • IEA: Inferred from Electronic Annotation
Curator-assigned Evidence Codes
  • Experimental Evidence Codes
    • EXP: Inferred from Experiment
    • IDA: Inferred from Direct Assay
    • IPI: Inferred from Physical Interaction
    • IMP: Inferred from Mutant Phenotype
    • IGI: Inferred from Genetic Interaction
    • IEP: Inferred from Expression Pattern
  • Computational Analysis Evidence Codes
    • ISS: Inferred from Sequence or Structural Similarity
    • ISO: Inferred from Sequence Orthology
    • ISA: Inferred from Sequence Alignment
    • ISM: Inferred from Sequence Model
    • IGC: Inferred from Genomic Context
    • IBA: Inferred from Biological aspect of Ancestor
    • IBD: Inferred from Biological aspect of Descendant
    • IKR: Inferred from Key Residues
    • IRD: Inferred from Rapid Divergence
    • RCA: inferred from Reviewed Computational Analysis
  • Author Statement Evidence Codes
    • TAS: Traceable Author Statement
    • NAS: Non-traceable Author Statement
  • Curator Statement Evidence Codes
    • IC: Inferred by Curator
    • ND: No biological Data available

For further details, see the Guide to GO Evidence Codes and the GO Evidence Code Decision Tree.


 

GO tools

For many projects, the simplest approach will be to download the GO ontology itself. It is a well constructed, easily parseable file that is well suited for computation. For details, see Computing with GO on this wiki.



AmiGO

practical work with GO: at first via the AmiGO browser AmiGO is a GO browser developed by the Gene Ontology consortium and hosted on their website.

AmiGO - Gene products

Task:

  1. Navigate to the GO homepage.
  2. Enter SOX2 into the search box to initiate a search for the human SOX2 transcription factor (WP, HUGO) (as gene or protein name).
  3. There are a number of hits in various organisms: sulfhydryl oxidases and (sex determining region Y)-box genes. Check to see the various ways by which you could filter and restrict the results.
  4. Select Homo sapiens as the species filter and set the filter. Note that this still does not give you a unique hit, but ...
  5. ... you can identify the Transcription factor SOX-2 and follow its gene product information link. Study the information on that page.
  6. Later, we will need Entrez Gene IDs. The GOA information page provides these as GeneID in the External references section. Note it down. With the same approach, find and record the Gene IDs (a) of the functionally related Oct4 (POU5F1) protein, (b) the human cell-cycle transcription factor E2F1, (c) the human bone morphogenetic protein-4 transforming growth factor BMP4, (d) the human UDP glucuronosyltransferase 1 family protein 1, an enzyme that is differentially expressed in some cancers, UGT1A1, and (d) as a positive control, SOX2's interaction partner NANOG, which we would expect to be annotated as functionally similar to both Oct4 and SOX2.


AmiGO - Associations

GO annotations for a protein are called associations.

Task:

  1. Open the associations information page for the human SOX2 protein via the link in the right column in a separate tab. Study the information on that page.
  2. Note that you can filter the associations by ontology and evidence code. You have read about the three GO ontologies in your previous assignment, but you should also be familiar with the evidence codes. Click on any of the evidence links to access the Evidence code definition page and study the definitions of the codes. Make sure you understand which codes point to experimental observation, and which codes denote computational inference, or say that the evidence is someone's opinion (TAS, IC etc.). Note: it is good practice - but regrettably not universally implemented standard - to clearly document database semantics and keep definitions associated with database entries easily accessible, as GO is doing here. You won't find this everywhere, but as a user please feel encouraged to complain to the database providers if you come across a database where the semantics are not clear. Seriously: opaque semantics make database annotations useless.
  3. There are many associations (around 60) and a good way to select which ones to pursue is to follow the most specific ones. Set IDA as a filter and among the returned terms select GO:0035019somatic stem cell maintenance in the Biological Process ontology. Follow that link.
  4. Study the information available on that page and through the tabs on the page, especially the graph view.
  5. In the Inferred Tree View tab, find the genes annotated to this go term for homo sapiens. There should be about 55. Click on the number behind the term. The resulting page will give you all human proteins that have been annotated with this particular term. Note that the great majority of these is via the IEA evidence code.


Semantic similarity

A good, recent overview of ontology based functional annotation is found in the following article. This is not a formal reading assignment, but do familiarize yourself with section 3: Derivation of Semantic Similarity between Terms in an Ontology as an introduction to the code-based annotations below.

Gan et al. (2013) From ontology to semantic similarity: calculation of ontology-based semantic similarity. ScientificWorldJournal 2013:793091. (pmid: 23533360)

PubMed ] [ DOI ]


Practical work with GO: bioconductor.

The bioconductor project hosts the GOSemSim package for semantic similarity.

Task:

  1. Work through the following R-code. If you have problems, discuss them on the mailing list. Don't go through the code mechanically but make sure you are clear about what it does.
# GOsemanticSimilarity.R
# GO semantic similarity example
# B. Steipe for BCB420, January 2014

setwd("~/your-R-project-directory")

# GOSemSim is an R-package in the bioconductor project. It is not installed via
# the usual install.packages() comand (via CRAN) but via an installation script
# that is run from the bioconductor Website.

source("http://bioconductor.org/biocLite.R")
biocLite("GOSemSim")

library(GOSemSim)

# This loads the library and starts the Bioconductor environment.
# You can get an overview of functions by executing ...
browseVignettes()
# ... which will open a listing in your Web browser. Open the 
# introduction to GOSemSim PDF. As the introduction suggests,
# now is a good time to execute ...
help(GOSemSim)

# The simplest function is to measure the semantic similarity of two GO
# terms. For example, SOX2 was annotated with GO:0035019 (somatic stem cell
# maintenance), QSOX2 was annotated with GO:0045454 (cell redox homeostasis),
# and Oct4 (POU5F1) with GO:0009786 (regulation of asymmetric cell division),
# among other associations. Lets calculate these similarities.
goSim("GO:0035019", "GO:0009786", ont="BP", measure="Wang")
goSim("GO:0035019", "GO:0045454", ont="BP", measure="Wang")

# Fair enough. Two numbers. Clearly we would appreciate an idea of the values
# that high similarity and low similarity can take. But in any case - 
# we are really less interested in the similarity of GO terms - these
# are a function of how the Ontology was constructed. We are more
# interested in the functional similarity of our genes, and these
# have a number of GO terms associated with them.

# GOSemSim provides the functions ...
?geneSim()
?mgeneSim()
# ... to compute these values. Refer to the vignette for details, in
# particular, consider how multiple GO terms are combined, and how to
# keep/drop evidence codes.
# Here is a pairwise similarity example: the gene IDs are the ones you
# have recorded previously. Note that this will download a package
# of GO annotations - you might not want to do this on a low-bandwidth
# connection.
geneSim("6657", "5460", ont = "BP", measure="Wang", combine = "BMA")
# Another number. And the list of GO terms that were considered.

# Your task: use the mgeneSim() function to calculate the similarities
# between all six proteins for which you have recorded the GeneIDs
# previously (SOX2, POU5F1, E2F1, BMP4, UGT1A1 and NANOG) in the 
# biological process ontology. 

# This will run for some time. On my machine, half an hour or so. 

# Do the results correspond to your expectations?

GO reading and resources

General
Sauro & Bergmann (2008) Standards and ontologies in computational systems biology. Essays Biochem 45:211-22. (pmid: 18793134)

PubMed ] [ DOI ]


Phenotype etc. Ontologies
Human Phenotype Ontology
See also:
Köhler et al. (2014) The Human Phenotype Ontology project: linking molecular biology and disease through phenotype data. Nucleic Acids Res 42:D966-74. (pmid: 24217912)

PubMed ] [ DOI ]

Schriml et al. (2012) Disease Ontology: a backbone for disease semantic integration. Nucleic Acids Res 40:D940-6. (pmid: 22080554)

PubMed ] [ DOI ]

Evelo et al. (2011) Answering biological questions: querying a systems biology database for nutrigenomics. Genes Nutr 6:81-7. (pmid: 21437033)

PubMed ] [ DOI ]

Oti et al. (2009) The biological coherence of human phenome databases. Am J Hum Genet 85:801-8. (pmid: 20004759)

PubMed ] [ DOI ]

Groth et al. (2007) PhenomicDB: a new cross-species genotype/phenotype resource. Nucleic Acids Res 35:D696-9. (pmid: 16982638)

PubMed ] [ DOI ]


Semantic similarity
Wu et al. (2013) Improving the measurement of semantic similarity between gene ontology terms and gene products: insights from an edge- and IC-based hybrid method. PLoS ONE 8:e66745. (pmid: 23741529)

PubMed ] [ DOI ]

Gan et al. (2013) From ontology to semantic similarity: calculation of ontology-based semantic similarity. ScientificWorldJournal 2013:793091. (pmid: 23533360)

PubMed ] [ DOI ]

Alvarez & Yan (2011) A graph-based semantic similarity measure for the gene ontology. J Bioinform Comput Biol 9:681-95. (pmid: 22084008)

PubMed ] [ DOI ]

Jain & Bader (2010) An improved method for scoring protein-protein interactions using semantic similarity within the gene ontology. BMC Bioinformatics 11:562. (pmid: 21078182)

PubMed ] [ DOI ]

Yu et al. (2010) GOSemSim: an R package for measuring semantic similarity among GO terms and gene products. Bioinformatics 26:976-8. (pmid: 20179076)

PubMed ] [ DOI ]

GO
Gene Ontology Consortium (2012) The Gene Ontology: enhancements for 2011. Nucleic Acids Res 40:D559-64. (pmid: 22102568)

PubMed ] [ DOI ]

Bastos et al. (2011) Application of gene ontology to gene identification. Methods Mol Biol 760:141-57. (pmid: 21779995)

PubMed ] [ DOI ]

Gene Ontology Consortium (2010) The Gene Ontology in 2010: extensions and refinements. Nucleic Acids Res 38:D331-5. (pmid: 19920128)

PubMed ] [ DOI ]

Carol Goble on the tension between purists and pragmatists in life-science ontology construction. Plenary talk at SOFG2...

Goble & Wroe (2004) The Montagues and the Capulets. Comp Funct Genomics 5:623-32. (pmid: 18629186)

PubMed ] [ DOI ]


That is all.


 

Links and resources

 


Footnotes and references

  1. That's assuming the worst case in that the attacker needs to know the pattern with which the password is formed, i.e. the number of characters and the alphabet that we chose from. But note that there is an even worse case: if the attacker had access to our code and the seed to our random number generator. When the random number generator starts up, a new seed is generated from system time, thus the possible space of seeds can be devastatingly small. But even if a seed is set explicitly with the set.seed() function, that seed is a 32-bit integer and thus can take only a bit more than 4*109 values, six orders of magnitude less than the 1015 password complexity we thought we had. It turns out that the code may be a much greater vulnerability than the password itself. Keep that in mind. Keep it secret. Keep it safe.


 

Ask, if things don't work for you!

If anything about the assignment is not clear to you, please ask on the mailing list. You can be certain that others will have had similar problems. Success comes from joining the conversation.



< Assignment 5 Assignment 7 >