BIO Assignment Week 6

Assignment for Week 6
Function

Note! This assignment is currently inactive. Major and minor unannounced changes may be made at any time.

Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.

Introduction

In this assignment we will perform a few computations with coordinate data in PDB files, look more in depth at domain annotations, and compare them across different proteins related to yeast Mbp1. We will write a function in R to help us plot a graphic of the comparison and collaborate to share data for our plots. But first we need to go through a few more R concepts.

Stereo vision

Task:

Continue with your stereo practice.

Practice at least ...

two times daily,
for 3-5 minutes each session.

Measure your interocular distance and your fusion distance as explained here on the Student Wiki and add it to the table.

Keep up your practice throughout the course. Once again: do not go through your practice sessions mechanically. If you are not making constant progress in your practice sessions, contact me so we can help you on the right track.

Programming R code

First, we will cover essentials of R programming: the fundamental statements that are needed to write programs–conditional expressions and loops, and how to define functions that allow us to use programming code. But let's start with a few more data types of R so we can use the concepts later on: matrices, lists and data frames.

Task:

Please begin by working through the short R - tutorial: matrices section and the following sections on "Lists" and "Data frames".

Note that what we have done here is just the bare minimum on vectors, matrices and lists. The concepts are very generally useful, and there are many useful ways to extract subsets of values. We'll come across these in various parts of R sample code. But unless you read the provided code examples line by line, make sure you understand every single statement and ask if you are not clear about the syntax, this will all be quickly forgotten. Use it or lose it!

R is a full featured programming language and in order to be that, it needs the ability to manipulate Control Flow, i.e. the order in which instructions are executed. By far the most important control flow statement is a conditional branch i.e. the instruction to execute code at alternative points of the program, depending on the presence or absence of a condition. Using such conditional branch instructions, two main types of programming constructs can be realized: conditional expressions and loops.

Conditional expressions

The template for conditional expression in R is:

if( <expression 1> ) {
  <statement 1>
} 
else if ( <expresssion 2> ) {
  <statement 2>
} 
else {
  <statement 3>
}

...where both the else if (...) { ... } and the else (...) block are optional. We have encountered this construct previously, when we assigned the appropriate colors for amino acids in the frequency plot:

if      (names(logRatio[i]) == "F") { barColors[i] <- hydrophobic }
else if (names(logRatio[i]) == "G") { barColors[i] <- plain }
   [... etc ...]
else                                { barColors[i] <- plain }

Logical expressions

We have to consider the <expression> in a bit more detail: anything that is, or produces, or can be interpreted as, a Boolean TRUE or FALSE value can serve as an expression in a conditional statement.

Task:
Here are some examples. Copy the code to an R script, predict what will happen on in each line and try it out:

# A boolean constant is interpreted as is:
if (TRUE)    {print("true")} else {print("false")}
if (FALSE)   {print("true")} else {print("false")}

# Strings of "true" and "false" are coerced to their 
# Boolean equivalent, but contrary to other programming languages
# arbitrary, non-empty or empty strings are not interpreted.
if ("true") {print("true")} else {print("false")}
if ("false") {print("true")} else {print("false")}
if ("widdershins")    {print("true")} else {print("false")}
if ("") {print("true")} else {print("false")}

# All non-zero, defined numbers are TRUE
if (1)      {print("true")} else {print("false")}
if (0)      {print("true")} else {print("false")}
if (-1)     {print("true")} else {print("false")}
if (pi)     {print("true")} else {print("false")}
if (NULL)   {print("true")} else {print("false")}
if (NA)     {print("true")} else {print("false")}
if (NaN)    {print("true")} else {print("false")}
if (Inf)    {print("true")} else {print("false")}

# functions can return Boolean values
affirm <- function() { return(TRUE) }
deny <- function() { return(FALSE) }
if (affirm())    {print("true")} else {print("false")}
if (deny())    {print("true")} else {print("false")}

# N.B. coercion of Booleans into numbers can be done as well
and is sometimes useful: consider ...
a <- c(TRUE, TRUE, FALSE, TRUE, FALSE)
a
as.numeric(a)
sum(a)

# ... or coercing the other way ...
as.logical(-1:1)

Logical operators

To actually write a conditional statement, we have to be able to test a condition and this is what logical operators do. Is something equal to something else? Is it less? Does something exist? Is it a number?

Task:
Here are some examples. Again, predict what will happen ...

TRUE             # Just a statement.

#   unary operator
! TRUE           # NOT ...

# binary operators
FALSE > TRUE     # GREATER THAN ...
FALSE < TRUE     # ... these are coerced to numbers
FALSE < -1        
0 == FALSE       # Careful! == compares, = assigns!!!

"x" == "u"       # using lexical sort order ...
"x" >= "u"
"x" <= "u"
"x" != "u"
"aa" > "u"       # ... not just length, if different.
"abc" < "u"  

TRUE | FALSE    # OR: TRUE if either is true
TRUE & FALSE    # AND: TRUE if both are TRUE

# equality and identity
?identical
a <- c(TRUE)
b <- c(TRUE)
a; b
a == b
identical(a, b)

b <- 1
a; b
a == b
identical(a, b)   # Aha: equal, but not identical


# some other useful tests for conditional expressions
?all
?any
?duplicated
?exists
?is.character
?is.factor
?is.integer
?is.null
?is.numeric
?is.unsorted
?is.vector

Loops

Loops allow you to repeat tasks many times over. The template is:

for (<name> in <vector>) {
   <statement>
}

Task:
Consider the following: Again, copy the code to a script, study it, predict what will happen and then run it.

# simple for loop
for (i in 1:10) {
	print(c(i, i^2, i^3))
}

# Compare excution times: one million square roots from a vector ...
n <- 1000000
x <- 1:n
y <- sqrt(x)

# ... or done explicitly in a for-loop
for (i in 1:n) {
  y[i] <- sqrt (x[i])
}

If you can achieve your result with an R vector expression, it will be faster than using a loop. But sometimes you need to do things explicitly, for example if you need to access intermediate results.

Here is an example to play with loops: a password generator. Passwords are a pain. We need them everywhere, they are frustrating to type, to remember and since the cracking programs are getting smarter they become more and more likely to be broken. Here is a simple password generator that creates random strings with consonant/vowel alterations. These are melodic and easy to memorize, but actually as strong as an 8-character, fully random password that uses all characters of the keyboard such as )He.{2jJ or #h$bB2X^ (which is pretty much unmemorizable). The former is taken from 20⁷ * 7⁷ 10¹⁵ possibilities, the latter is from 94⁸ ~ 6*10¹⁵ possibilities. HIgh-end GPU supported password crackers can test about 10⁹ passwords a second, the passwords generated by this little algorithm would thus take on the order of 10⁶ seconds or eleven days to crack^[1]. This is probably good enough to deter a casual attack.

Task:
Copy, study and run ...

# Suggest memorizable passwords
# Below we use the functions: 
?nchar
?sample
?substr
?paste
?print

#define a string of  consonants ...
con <- "bcdfghjklmnpqrstvwxz"
# ... and a string of of vowels
vow <- "aeiouy"

for (i in 1:10) {  # ten sample passwords to choose from ...
    pass = rep("", 14)  # make an empty character vector
    for (j in 1:7) {    # seven consonant/vowel pairs to be created ...
        k   <- sample(1:nchar(con), 1)  # pick a random index for consonants ...
        ch  <- substr(con,k,k)          #  ... get the corresponding character ...
        idx <- (2*j)-1                  # ... compute the position (index) of where to put the consonant ...
        pass[idx] <- ch                 # ...  and put it in the right spot

        # same thing for the vowel, but coded with fewer intermediate assignments
        # of results to variables
        k <- sample(1:nchar(vow), 1)
        pass[(2*j)] <- substr(vow,k,k) 
    }
    print( paste(pass, collapse="") )  # collapse the vector in to a string and print
}

Functions

Finally: functions. Functions look very much like the statements we have seen above. the template looks like:

<name> <- function (<parameters>) {
   <statements>
}

In this statement, the function is assigned to the name - any valid name in R. Once it is assigned, it the function can be invoked with name(). The parameter list (the values we write into the parentheses followin the function name) can be empty, or hold a list of variable names. If variable names are present, you need to enter the corresponding parameters when you execute the function. These assigned variables are available inside the function, and can be used for computations. This is called "passing the variable into the function".

You have encountered a function to choose YFO names. In this function, your Student ID was the parameter. Here is another example to play with: a function that calculates how old you are. In days. This is neat - you can celebrate your 10,000 birthday - or so.

Task:
Copy, explore and run ...

Define the function ...

# A lifedays calculator function

myLifeDays <- function(date = NULL) { # give "date" a default value so we can test whether it has been set
    if (is.null(date)) {
        print ("Enter your birthday as a string in \"YYYY-MM-DD\" format.")
        return()
    }
    x <- strptime(date, "%Y-%m-%d") # convert string to time
    y <- format(Sys.time(), "%Y-%m-%d") # convert "now" to time
    diff <- round(as.numeric(difftime(y, x, unit="days")))
    print(paste("This date was ", diff, " days ago."))
}

Use the function (example)

   myLifeDays("1932-09-25")  # Glenn Gould's birthday

Here is a good opportunity to play and practice programming: modify this function to accept a second argument. When a second argument is present (e.g. 10000) the function should print the calendar date on which the input date will be that number of days ago. Then you could use it to know when to celebrate your 10,000^th lifeDay, or your 777^th anniversary day or whatever.

Enjoy.

The PDB

Search for GO and EC numbers at PDB...

The search options in the PDB structure database are as sophisticated as those at the NCBI. For now, we will try a simple keyword search to get us started.

Task:

Visit the RCSB PDB website at http://www.pdb.org/
Briefly orient yourself regarding the database contents and its information offerings and services.
Enter Mbp1 into the search field.
In your journal, note down the PDB IDs for the three Saccharomyces cerevisiae Mbp1 transcription factor structures your search has retrieved.
Click on one of the entries and explore the information and services linked from that page.

CDD domain annotation

In the last assignment, you followed a link to CDD Search Results from the RefSeq record for yeast Mbp1 and briefly looked at the information offered by the NCBI's Conserved Domain Database, a database of Position Specific Scoring Matrices that embody domain definitions. Rather than access precomputed results, you can also search CDD with sequences: assuming you have saved the YFO Mbp1 sequence in FASTA format, this is straightforward. If you did not save this sequence, return to Assignment 3 and retrieve it again.

Task:

Access the CDD database at http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
Read the information. CDD is a superset of various other database domain annotations as well as NCBI-curated domain definitions.
Copy the YFO Mbp1 FASTA sequence, paste it into the search form and click Submit.
1. On the result page, clik on View full result
2. Note that there are a number of partially overlapping ankyrin domain modules. We will study ankyrin domains in a later assignment.
3. Also note that there may be blocks of sequence colored cyan in the sequence bar. Hover your mouse over the blocks to see what these blocks signify.
4. Open the link to Search for similar domain architecture in a separate window and study it. This is the CDART database. Think about what these results may be useful for.
5. Click on one of the ANK superfamily graphics and see what the associated information looks like: there is a summary of structure and function, links to specific literature and a tree of the relationship of related sequences.

SMART domain annotation

The SMART database at the EMBL in Heidelberg offers an alternative view on domain architectures. I personally find it more useful for annotations because it integrates a number of additional features. You can search by sequence, or by accession number and that raises the question of how to retrieve a database cross-reference from an NCBI sequence identifier to a UniProt sequence ID:

ID mapping

Task:

Access the UniProt ID mapping service at http://www.uniprot.org/mapping/
Paste the RefSeq identifier for YFO Mbp1 into the search field.
Use the menu to choose From RefSeq Protein and To UniprotKB AC–the UniProt Knowledge Base Accession number.
Click on Map to execute the search.
Note the ID - it probably starts with a letter, followed by numbers and letters. Here are some examples for fungal Mbp1-like proteins: P39678 Q5B8H6 Q5ANP5 P41412 etc.
Click on the link, and explore how the UniProt sequence page is similar or different from the RefSeq page.

SMART search

Task:

Access the SMART database at http://smart.embl-heidelberg.de/
Click the lick to access SMART in the normal mode.
Paste the YFO Mbp1 UniProtKB Accession number into the Sequence ID or ACC field.
Check the boxes for:
1. PFAM domains (domains defined by sequence similarity in the PFAM database)
2. signal peptides (using the Gunnar von Heijne's SignalP 4.0 server at the Technical University in Lyngby, Denmark)
3. internal repeats (using the programs ariadne and prospero at the Wellcome Trust Centre for Human Genetics at Oxford University, England)
4. intrinsic protein disorder (using Rune Linding's DisEMBL program at the EMBL in Heidelberg, Germany)
Click on Sequence SMART to run the search and annotation. (In case you get an error like: "Sorry, your entry seems to have no SMART domain ...", let me know and repeat the search with the actual FASTA sequence instead of the accession number.)

Study the results. Specifically, have a look at the proteins with similar domain ORGANISATION and COMPOSITION. This is similar to the NCBI's CDART.

Introduction

Integrating evolutionary information with structural information allows us to establish which residues are invariant in a family–these are presumably structurally important sites–and which residues are functionally important, since they are invariant within, but changeable between subfamilies.

To visualize these relationships, we will load an MSA of APSES domains with VMD and color it by conservation.

The DNA binding site

Now, that you know how YFO Mbp1 aligns with yeast Mbp1, you can evaluate functional conservation in these homologous proteins. You probably already downloaded the two Biochemistry papers by Taylor et al. (2000) and by Deleeuw et al. (2008) that we encountered in Assignment 2. These discuss the residues involved in DNA binding^[2]. In particular the residues between 50-74 have been proposed to comprise the DNA recognition domain.

Task:

Using the APSES domain alignment you have just constructed, find the YFO Mbp1 residues that correspond to the range 50-74 in yeast.
Note whether the sequences are especially highly conserved in this region.
Using Chimera, look at the region. Use the sequence window to make sure that the sequence numbering between the paper and the PDB file are the same (they are often not identical!). Then select the residues - the proposed recognition domain - and color them differently for emphasis. Study this in stereo to get a sense of the spatial relationships. Check where the conserved residues are.
A good representation is stick - but other representations that include sidechains will also serve well.
Calculate a solvent accessible surface of the protein in a separate representation and make it transparent.
You could combine three representations: (1) the backbone (in ribbon view), (2) the sidechains of residues that presumably contact DNA, distinctly colored, and (3) a transparent surface of the entire protein. This image should show whether residues annotated as DNA binding form a contiguous binding interface.

DNA binding interfaces are expected to comprise a number of positively charged amino acids, that might form salt-bridges with the phosphate backbone.

Task:

Study and consider whether this is the case here and which residues might be included.

APSES domains in Chimera (from A4)

What precisely constitutes an APSES domain however is a matter of definition, as you can explore in the following (optional) task.

ExpandOptional: Load the structure in Chimera, like you did in the last assignment and switch on stereo viewing ... (more)

Display the protein in ribbon style, e.g. with the Interactive 1 preset.
Access the Interpro information page for Mbp1 at the EBI: http://www.ebi.ac.uk/interpro/protein/P39678
In the section Domains and repeats, mouse over the red annotations and note down the residue numbers for the annotated domains. Also follow the links to the respective Interpro domain definition pages.

At this point we have definitions for the following regions on the Mbp1 protein ...

The KilA-N (pfam 04383) domain definition as applied to the Mbp1 protein sequence by CDD;
The InterPro KilA, N-terminal/APSES-type HTH, DNA-binding (IPR018004) definition annotated on the Mbp1 sequence;
The InterPro Transcription regulator HTH, APSES-type DNA-binding domain (IPR003163) definition annotated on the Mbp1 sequence;
(... in addition – without following the source here – the UniProt record for Mbp1 annotates a "HTH APSES-type" domain from residues 5-111)

... each with its distinct and partially overlapping sequence range. Back to Chimera:

In the sequence window, select the sequence corresponding to the Interpro KilA-N annotation and colour this fragment red. Remember that you can get the sequence numbers of a residue in the sequence window when you hover the pointer over it - but do confirm that the sequence numbering that Chimera displays matches the numbering of the Interpro domain definition.
Then select the residue range(s) by which the CDD KilA-N definition is larger, and colour that fragment orange.
Then select the residue range(s) by which the InterPro APSES domain definition is larger, and colour that fragment yellow.
If the structure contains residues outside these ranges, colour these white.
Study this in a side-by-side stereo view and get a sense for how the extra sequence beyond the Kil-A N domain(s) is part of the structure, and how the integrity of the folded structure would be affected if these fragments were missing.
Display Hydrogen bonds, to get a sense of interactions between residues from the differently colored parts. First show the protein as a stick model, with sticks that are thicker than the default to give a better sense of sidechain packing:

(i) Select → Select all

(ii) Actions → Ribbon → hide

(iii) Select → Structure → protein

(iv) Actions → Atoms/Bonds → show

(v) Actions → Atoms/Bonds → stick

(vi) click on the looking glass icon at the bottom right of the graphics window to bring up the inspector window and choose Inspect ... Bond. Change the radius to 0.4.
Then calculate and display the hydrogen bonds:

(vii) Tools → Surface/Binding Analysis → FindHbond

(viii) Set the Line width to 3.0, leave all other parameters with their default values an click Apply

Clear the selection.
Study this view, especially regarding side chain H-bonds. Are there many? Do side chains interact more with other sidechains, or with the backbone?
Let's now simplify the scene a bit and focus on backbone/backbone H-bonds:

(ix) Select → Structure → Backbone → full

(x) Actions → Atoms/Bonds → show only

Clear the selection.
In this way you can appreciate how H-bonds build secondary structure - α-helices and β-sheets - and how these interact with each other ... in part across the KilA N boundary.
Save the resulting image as a jpeg no larger than 600px across and upload it to your Lab notebook on the Wiki.
When you are done, congratulate yourself on having earned a bonus of 10% on the next quiz.

There is a rather important lesson in this: domain definitions may be fluid, and their boundaries may be computationally derived from sequence comparisons across many families, and do not necessarily correspond to individual structures. Make sure you understand this well. }}

Given this, it seems appropriate to search the sequence database with the sequence of an Mbp1 structure–this being a structured, stable, subdomain of the whole that presumably contains the protein's most unique and specific function. Let us retrieve this sequence. All PDB structures have their sequences stored in the NCBI protein database. They can be accessed simply via the PDB-ID, which serves as an identifier both for the NCBI and the PDB databases. However there is a small catch (isn't there always?). PDB files can contain more than one protein, e.g. if the crystal structure contains a complex^[3]. Each of the individual proteins gets a so-called chain ID–a one letter identifier– to identify them uniquely. To find their unique sequence in the database, you need to know the PDB ID as well as the chain ID. If the file contains only a single protein (as in our case), the chain ID is always A^[4]. make sure you understand the concept of protein chains, and chain IDs.

Chimera "sequence": implicit or explicit ?

We discussed the distinction between implicit and explicit sequence. But which one does the Chimera sequence window display? Let's find out.

Task:

Open Chimera and load the 1BM8 structure from the PDB.
Save the ccordinate file with File → Save PDB ..., use a filename of test.pdb.
Open this file in a plain text editor: notepad, TextEdit, nano or the like, but not MS Word! Make sure you view the file in a fixed-width font, not proportionally spaced, i.e. Courier, not Arial. Otherwise the columns in the file won't line up.
Find the records that begin with SEQRES ... and confirm that the first amino acid is GLN.
Now scroll down to the ATOM section of the file. Identify the records for the first residue in the structure. Delete all lines for side-chain atoms except for the CB atom. This changes the coordinates for glutamine to those of alanine.
Replace the GLN residue name with ALA (uppercase). This relabels the residue as Alanine in the coordinate section. Therefore you have changed the implicit sequence. Implicit and explicit sequence are now different. The second atom record should now look like this:

ATOM 2 CA ALA A 4 -0.575 5.127 16.398 1.00 51.22 C

Save the file and load it in Chimera.
Open the sequence window: does it display Q or A as the first reside?

Therefore, does Chimera use the implicit or explicit sequence in the sequence window?

Coloring by conservation

With VMD, you can import a sequence alignment into the MultiSeq extension and color residues by conservation. The protocol below assumes that an MSA exists - you could have produced it in many different ways, for convenience, I have precalculated one for you. This may not contain the sequences from YFO, if you are curious about these you are welcome to add them and realign.

Task:

Load the Mbp1 APSES alignment into MultiSeq.

Access the set of MUSCLE aligned and edited fungal APSES domains.
Copy the alignment and save it into a convenient directory on your computer as a plain text file. Give it the extension .aln .
Open VMD and load the 1BM8 structure.
As usual, turn the axes off and display your structure in side-by-side stereo.
Visualize the structure as New Cartoon with Index coloring to re-orient yourself. Identify the recognition helix and the "wing".
Open Extensions → Analysis → Multiseq.
You can answer No to download metadata databases, we won't need them here.
In the MultiSeq Window, navigate to File → Import Data...; Choose "From Files" and Browse to the location of the alignment you have saved. The File navigation window gives you options which files to enable: choose to Enable ALN files (these are CLUSTAL formatted multiple sequence alignments).
Open the alignment file, click on Ok to import the data. If the data can't be loaded, the file may have the wrong extension: .aln is required.
find the Mbp1_SACCE sequence in the list, click on it and move it to the top of the Sequences list with your mouse (the list is not static, you can re-order the sequences in any way you like).

You will see that the 1BM8 sequence and the Mbp1_SACCA APSES domain sequence do not match: at the N-terminus the sequence that corresponds to the PDB structure has extra residues, and in the middle the APSES sequences may have gaps inserted.

Task:

Bring the 1MB1 sequence in register with the APSES alignment.

MultiSeq supports typical text-editor selection mechanisms. Clicking on a residue selects it, clicking on a row selects the whole sequence. Dragging with the mouse selects several residues, shift-clicking selects ranges, and option-clicking toggles the selection on or off for individual residues. Using the mouse and/or the shift key as required, select the entire first column of the Sequences you have imported. Note: don't include the 1BM8 sequence - this is just for the aligned sequences.
Select Edit → Enable Editing... → Gaps only to allow changing indels.
Pressing the spacebar once should insert a gap character before the selected column in all sequences. Insert as many gaps as you need to align the beginning of sequences with the corresponding residues of 1BM8: S I M ... . Note: Have patience - the program's response can be a bit sluggish.
Now insert as many gaps as you need into the 1BM8 structure sequence, to align it completely with the Mbp1_SACCE APSES domain sequence. (Simply select residues in the sequence and use the space bar to insert gaps. (Note: I have noticed a bug that sometimes prevents slider or keyboard input to the MultiSeq window; it fails to regain focus after operations in a different window. I don't know whether this is a Mac related problem or a more general bug in MultiSeq. When this happens I quit VMD and restore a saved session. It is a bit annoying but not mission-critical. But to be able to do that, you might want to save your session every now and then.)
When you are done, it may be prudent to save the state of your alignment. Use File → Save Session...

Task:

Color by similarity

Use the View → Coloring → Sequence similarity → BLOSUM30 option to color the residues in the alignment and structure. This clearly shows you where conserved and variable residues are located and allows to analyze their structural context.
Navigate to the Representations window and create a Tube representation of the structure's backbone. Use User coloring to color it according to the conservation score that the Multiseq extension has calculated.
Create a new representation, choose Licorice as the drawing method, User as the coloring method and select (sidechain or name CA) and not element H (note: CA, the C-alpha atom must be capitalized.)
Double-click on the NewCartoon representation to hide it.
You can adjust the color scale in the usual way by navigating to VMD main → Graphics → Colors..., choosing the Color Scale tab and adjusting the scale midpoint.

Study this structure in some detail. If you wish, you could load and superimpose the DNA complexes to determine which conserved residues are in the vicinity of the double helix strands and potentially able to interact with backbone or bases. Note that the most highly conserved residues in the family alignment are all structurally conserved elements of the core. Solvent exposed residues that comprise the surface of the recognition helix are quite variable, especially at the binding site. You may also find - if you load the DNA molecules, that residues that contact the phosphate backbone in general tend to be more highly conserved than residues that contact bases.

Visual comparison of domain annotations in R

The versatile plotting functions of R allow us to compare domain annotations. The distribution of segments that are annotated as being "low-complexity" or "disordered is particulalry interesting: these are functional features of the amino acid sequence that are often not associated with sequence similarity.

In the following code tutorial, we create a plot similar to the CDD and SMART displays. It is based on the SMART domain annotations of the six-fungal reference species for the course.

Task:
Copy the code to an R script, study and execute it.

# plotDomains
# tutorial and functions to plot a colored rectangle from a list of domain annotations


# First task: create a list structure for the annotations: this is a list of lists
# As you see below, we need to mix strings, numbers and vectors of numbers. In R
# such mixed data types must go into a list.

Mbp1Domains <- list()   # start with an empty list

# For each species annotation, compile the SMART domain annotations in a list.
Mbp1Domains <- rbind(Mbp1Domains, list(  # rbind() appends the list to the existing
    species = "Saccharomyces cerevisiae",
    code    = "SACCE",
    ACC     = "P39678",
    length  = 833,
    KilAN   = c(18,102),  # Note: Vector of (start, end) pairs
    AThook  = NULL,       # Note: NULL, because this annotation was not observed in this sequence
    Seg     = c(108,122,236,241,279,307,700,717),
    DisEMBL = NULL,
    Ankyrin = c(394,423,427,463,512,541),   # Note: Merge overlapping domains, if present
    Coils   = c(633, 655)
    ))

Mbp1Domains <- rbind(Mbp1Domains, list(
    species = "Emericella nidulans",
    code    = "ASPNI",
    ACC     = "Q5B8H6",
    length  = 695,
    KilAN   = c(23,94),
    AThook  = NULL,
    Seg     = c(529,543),
    DisEMBL = NULL,
    Ankyrin = c(260,289,381,413),
    Coils   = c(509,572)
    ))

Mbp1Domains <- rbind(Mbp1Domains, list(
    species = "Candida albicans",
    code    = "CANAL",
    ACC     = "Q5ANP5",
    length  = 852,
    KilAN   = c(19,102),
    AThook  = NULL,
    Seg     = c(351,365,678,692),
    DisEMBL = NULL,
    Ankyrin = c(376,408,412,448,516,545),
    Coils   = c(665,692)
    ))

Mbp1Domains <- rbind(Mbp1Domains, list(
    species = "Neurospora crassa",
    code    = "NEUCR",
    ACC     = "Q7RW59",
    length  = 833,
    KilAN   = c(31,110),
    AThook  = NULL,
    Seg     = c(130,141,253,266,514,525,554,564,601,618,620,629,636,652,658,672,725,735,752,771),
    DisEMBL = NULL,
    Ankyrin = c(268,297,390,419),
    Coils   = c(500,550)
    ))

Mbp1Domains <- rbind(Mbp1Domains, list(
    species = "Schizosaccharomyces pombe",
    code    = "SCHPO",
    ACC     = "P41412",
    length  = 657,
    KilAN   = c(21,97),
    AThook  = NULL,
    Seg     = c(111,125,136,145,176,191,422,447),
    DisEMBL = NULL,
    Ankyrin = c(247,276,368,397),
    Coils   = c(457,538)
    ))

Mbp1Domains <- rbind(Mbp1Domains, list(
    species = "Ustilago maydis",
    code    = "USTMA",
    ACC     = "Q4P117",
    length  = 956,
    KilAN   = c(21,98),
    AThook  = NULL,
    Seg     = c(106,116,161,183,657,672,776,796),
    DisEMBL = NULL,
    Ankyrin = c(245,274,355,384),
    Coils   = c(581,609)
    ))


# Working with data in lists and dataframes can be awkward, since the result
# of filters and slices are themselves lists, not vectors.
# Therefore we need to use the unlist() function a lot. When in doubt: unlist()

#### Boxes #####
# Define a function to draw colored boxes, given input of a vector with
# (start,end) pairs, a color, and the height where the box should go.
drawBoxes <- function(v, c, h) {   # vector of xleft, xright pairs; color; height
    if (is.null(v)) { return() }
    for (i in seq(1,length(v),by=2)) {
        rect(v[i], h-0.1, v[i+1], h+0.1, border="black", col=c)
    }
}

#### Annotations ####
# Define a function to write the species code, draw a grey
# horizontal line and call drawBoxes() for every annotation type
# in the list
drawGene <- function(rIndex) {
    # define colors:
    kil <- "#32344F"
    ank <- "#691A2C"
    seg <- "#598C9E"
    coi <- "#8B9998"
    xxx <- "#EDF7F7"
    
    text (-30, rIndex, adj=1, labels=unlist(Mbp1Domains[rIndex,"code"]), cex=0.75 )
    lines(c(0, unlist(Mbp1Domains[rIndex,"length"])), c(rIndex, rIndex), lwd=3, col="#999999")

    drawBoxes(unlist(Mbp1Domains[rIndex,"KilAN"]),   kil, rIndex)
    drawBoxes(unlist(Mbp1Domains[rIndex,"AThook"]),  xxx, rIndex)
    drawBoxes(unlist(Mbp1Domains[rIndex,"Seg"]),     seg, rIndex)
    drawBoxes(unlist(Mbp1Domains[rIndex,"DisEMBL"]), xxx, rIndex)
    drawBoxes(unlist(Mbp1Domains[rIndex,"Ankyrin"]), ank, rIndex)
    drawBoxes(unlist(Mbp1Domains[rIndex,"Coils"]),   coi, rIndex)
}

#### Plot ####
# define the size of the plot-frame
yMax <- length(Mbp1Domains[,1])   # number of domains in list
xMax <- max(unlist(Mbp1Domains[,"length"]))  # largest sequence length

# plot an empty frame
plot(1,1, xlim=c(-100,xMax), ylim=c(0, yMax) , type="n", yaxt="n", bty="n", xlab="sequence number", ylab="")

# Finally, iterate over all species and call drawGene()
for (i in 1:length(Mbp1Domains[,1])) {
    drawGene(i)
}

# end

When you execute the code, your plot should look similar to this one:

SMART domain annotations for Mbp1 proteins from six fungal species.

Task:
On the Student Wiki, add the annotations for YFO to the plot:

Copy one of the list definitions for Mbp1 domains and edit it with the appropriate values for your own annotations.
Test that you can add the YFO annotation to the plot.
Submit your validated code block to the Student Wiki here. The goal is to compile an overview of all species we are studying in class.
If your working annotation block is in the Wiki before noontime on Wednesday, you will be awarded a 10% bonus on the quiz.

EC

GO

Introduction

Harris (2008) Developing an ontology. Methods Mol Biol 452:111-24. (pmid: 18563371)

BIO Assignment Week 6

Contents

Introduction

Stereo vision

Programming R code

Conditional expressions

Logical expressions

Logical operators

Loops

Functions

The PDB

CDD domain annotation

SMART domain annotation

ID mapping

SMART search

Introduction

The DNA binding site

APSES domains in Chimera (from A4)

Chimera "sequence": implicit or explicit ?

Coloring by conservation

Visual comparison of domain annotations in R

EC

GO

Introduction

GO

GO terms

GO relationships

GO annotations

GO evidence codes

GO tools

AmiGO

AmiGO - Gene products

AmiGO - Associations

Semantic similarity

GO reading and resources

Links and resources

Footnotes and references

Ask, if things don't work for you!

Navigation menu

Search