Difference between revisions of "BIO Assignment Week 3"

From "A B C"
Jump to navigation Jump to search
(Created page with "<div id="BIO"> <div class="b1"> Assignment for Week 3<br /> <span style="font-size: 70%">Sequence Analysis</span> </div> <!-- {{Template:Active}} --> {{Template:Inactive}} C...")
 
m
Line 5: Line 5:
 
</div>
 
</div>
  
<!-- {{Template:Active}} -->
+
<!-- {{Template:Inactive}} -->
{{Template:Inactive}}
+
{{Template:Active}}
  
 
Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.  
 
Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.  
Line 14: Line 14:
  
  
 +
&nbsp;
 +
==Introduction==
  
  
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
+
In this assignment we will
 +
* use the sequence search program BLAST to retrieve a sequence similar to yeast Mbp1 in YFO.
 +
* use a number of tools to annotate the sequence.
  
=== The EBI (1 mark) ===
+
&nbsp;
 +
 
 +
==Retrieve==
 +
 
 +
 
 +
In [[BIO_Assignment_Week_2#Protein|Assignment 2]] you looked at sequences in YFO that are [http://www.ncbi.nlm.nih.gov/protein?LinkName=protein_protein&from_uid=6320147 related to yeast Mbp1], by following a link from the RefSeq record. I mentioned that there are more principled ways to find related proteins: that principle is to search for similar sequences. Exactly how this works will be the subject of later lectures, but the tool that is most commonly used for this task is called '''BLAST''' (Basic Local Alignment And Search Tool). The task of this assignment is to perform a number of sequence annotations to the sequence from YFO that is '''most similar''' to Mbp1, or, more precisely, that contains an APSES domain that is most similar<ref>As you will see later on in the assignment, Mbp1-related proteins contain "Ankyrin" domains, a very widely distributed protein-protein interaction motif that may give rise to false-positive similarities for full-length sequence searches. Therefore, we search only with the DNA binding domain sequence, since this is the functionality that best characterizes the "function" of the protein we are interested in.</ref>.
 +
 
 +
&nbsp;
 +
===Search input===
 +
 
 +
 
 +
First, we need to '''define the sequence''' we will search with, as the search input. I have highlighted the extent of the APSES domain sequence in the previous assignment, but when you explored the corresponding structure in VMD, you have seen that the structured protein domain is larger and the additional secondary structure elements are in fact well integrated into the overall domain. This is not surprising: canonical domain definitions are compiled from many species and examples, and they generally comprise only the common core. Looking up the source of the domain annotations for Mbp1 is very easy:
 +
 
 +
 
 +
{{task|1=
 +
<ol>
 +
<li> Access the [http://www.ncbi.nlm.nih.gov/protein/NP_010227 RefSeq record for yeast Mbp1].</li>
 +
<li>  Look for the link to the '''CDD Search Results''' in the '''Related Information''' section and follow it.</li>
 +
</ol>
 +
 
 +
 
 +
This is a domain annotation: CDD is the NCBI's '''C'''onserved '''D'''omain '''D'''atabase and the annotation was done by a tool that scanned the sequence of Mbp1 for segments that are similar to any of the domain definitions stored in the CDD. We will return to CDD in the next assignment.
 +
 
 +
 
 +
<ol start="3">
 +
<li>Click on the box labeled Kila-N to access the CDD entry for this domain.</li>
 +
<li>Read the abstract. You should understand the relationship between Kila-N and APSES domains. One is a subfamily of the other.</li>
 +
<li>Confirm that the domain definition for Mbp1 (labeled as "query") corresponds to the region we highlighted in the last assignment.</li>
 +
</ol>
 +
 
 +
 
 +
<div class="mw-collapsible mw-collapsed" data-expandtext="Expand" data-collapsetext="Collapse" style="border:#000000 solid 1px; padding: 10px; margin-left:25px; margin-right:25px;">Finally load the molecule in VMD, like you did in the last assignment and study the domain architecture in stereo. <div  class="mw-collapsible-content">Use a '''New Cartoon''' representation for a selection of the "extra" residues at N- and C- terminus and color it yellow. Use another '''New Cartoon''' representation for the APSES domain proper, and color it by '''Index'''. Study this in a side-by-side stereo image and get a sense for how the ''extra'' sequence is part of the domain, and what would happen if it were removed. Orient this stereo-image for a clear view of the domain architecture in question, scale to fill the available window well, and render it. Convert the resulting image to jpeg format, resize it to be no larger than 600px across and post it to the group. The first five such posts will receive 10% bonus on the next quiz.</div>
 
</div>
 
</div>
In many ways the European EBI is complementary to the US NCBI. A data-sharing agreement for instance guarantees that the contents of the EMBL Nucleotide Sequence Database, GenBank and the Japanese DDBJ are synchronized on a daily basis. But there are of course also unique and uniquely valuable resources at the EBI. In this part of the assignment
 
*you should explore the EBI Web site, familiarize yourself with its contents and services and explore the resources to become confident you will find information that you are looking for.
 
*You should read the 2can tutorial on database browsing and the UniProt knowledgebase.
 
*You should compared a UniProt record with the corresponding GenPept record and use the ensembl browser to access a gene report.
 
  
  
#Enter the '''EBI Website''' at http://www.ebi.ac.uk/ Look for the site-map and explore the contents of this site, the databases, the services and its other offerings. Spend some time getting an idea of what is being offered here.
+
There is a rather important lesson in this: domain definitions may be fluid, and their boundaries may be computationally derived from sequence comparisons across many families, and do not necessarily correspond to individual structures. Make sure you understand this well.
# Visit the '''2can''' education support portal at http://www.ebi.ac.uk/2can/home.html . Explore its offerings, in particular, follow the links <u>Bioinformatics tutorials</u> &rarr; <u>Database browsing</u> and read the section on the different interface systems. You have encountered Entrez previously, now find out more about SRS, BioMart and UniProt Search.
+
}}
# To learn more about the '''UniProt''' database: access the UniProt user manual at http://ca.expasy.org/sprot/userman.html and read through sections '''1''' and '''2''' of the manual.
 
##Contrast the contents of a Uniprot record with a GenPept record: for example [http://www.uniprot.org/uniprot/P39678 '''MBP1_YEAST'''] and [http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&id=6320147 '''NP_010227'''].
 
# Follow the link to '''Ensembl''', click on ''saccharomyces cerevisiae'' and then on chromosome IV. Access the regions from basepair 340000 to 380000; contrast the display with the NCBI MapViewer. Identify the Mbp1 gene and click on it to retrieve its Gene report (under the systematic name: YDL056W). Find your way from this Gene report to the expressed protein sequence and list the steps you have gone through.
 
  
  
 +
Given this, it seems appropriate to search with the sequence of an Mbp1 structure&ndash;this being a structured, stable, subdomain of the whole that presumably contains the protein's most unique and specific function. Let us retrieve this sequence. All PDB structures have their sequences stored in the NCBI protein database. They can be accessed simply via the PDB-ID, which serves as an identifier both for the NCBI and the PDB databases. However there is a small catch (isn't there always?). PDB files can contain more than one protein, e.g. if the crystal structure contains a complex<ref>Think of the [http://www.pdb.org/pdb/101/motm.do?momID=121 ribosome] or [http://www.pdb.org/pdb/101/motm.do?momID=3 DNA-polymerase] as extreme examples.</ref>. Each of the individual proteins gets a so-called '''chain ID'''&ndash;a one letter identifier&ndash; to identify them uniquely. To find their unique sequence in the database, you need to know the PDB ID as well as the chain ID. If the file contains only a single protein (as in our case), the chain ID is always '''<code>A</code>'''<ref>Otherwise, you need to study the PDB Web page for the structure, or the text in the PDB file itself, to identify which part of the complex is labeled with which chain ID. For example, immunoglobulin structures some time label the ''light-'' and ''heavy chain'' fragments as "L" and "H", and sometimes as "A" and "B"&ndash;there are no fixed rules. You can also load the structure in VMD, color "by chain" and use the mouse to click on residues in each chain to identify it.</ref>. make sure you understand the concept of protein chains, and chain IDs.
  
  
===VMD===
+
{{task|1=
 +
<ol>
 +
<li> Back at the [http://www.ncbi.nlm.nih.gov/protein/NP_010227 RefSeq record for yeast Mbp1], enter the '''PDB-ID''', an underscore, and the '''chain ID''' for one of the crystal structures into the search field. You can use <code>1MB1_A</code> or <code>1BM8_A</code>, but don't use <code>1L3G</code>: this NMR structure includes a large stretch of unstructured residues.</li>
 +
<li> Click on '''Display settings''' and choose '''FASTA (text)'''. You should get something like:
 +
<source lang="text">
 +
>gi|157830387|pdb|1BM8|A Chain A, Dna-Binding Domain Of Mbp1
 +
QIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGFGKY
 +
QGTWVPLNIAKQLAEKFSVYDQLKPLFDF
 +
</source></li>
 +
<li> Save this sequence as a file, in case we need it later.</li>
 +
</ol>
 +
}}
  
{{task|
+
 
* Access the [[VMD|'''VMD''' page]].  
+
Next, we use this sequence to find its most similar relative in YFO using BLAST
* Install the program as per the instructions in the section: "Installing VMD".
+
 
* In the tutorial section work through
+
 
** Part 1 (Introduction), and  
+
{{task|1=
** Part 2 (Working with a single molecule).
+
# Navigate to the [http://www.ncbi.nlm.nih.gov '''BLAST''' entry page at the NCBI].
 +
# Click on '''protein blast''' as the BLAST program to run.
 +
# Paste the sequence of the yeast Mbp1 DNA-binding domain into the search field.
 +
# Set the following parameters:
 +
## As '''Database''' option choose '''Reference proteins (refseq_protein)'''
 +
## As '''Organism''' enter the binomial name of YFO. Make sure you spell it right, the page will try to autocomplete your entry. Species level is detailed enough, you don't have to specify the strain (e.g. I would specify "''Ustilago maydis''" '''not''' "''Ustilago maydis'' 521").
 +
# Then click on the '''BLAST''' button and wait for the result to appear. You will first see a graph of any conserved domains in your query sequence, that is not yet what you are waiting for...
 +
# Patience.
 +
# Patience. The database is large.
 +
# Patience. It took about five minutes to complete for me, but execution times vary greatly by time of day.
 +
# The top "hit" on the results page is what you are looking for. Its alignment and alignment score are shown in the '''Alignments''' section a bit further down the page. Your hit should have on the order of more than 40% identities to the query and match at least 80 residues or so. <small>If your match seems less and worse than that, please eMail me to troubleshoot.</small>
 +
# The first item for each hit is a link to its database entry, right next to the checkbox.  It says something like <code>ref&#124;NP_123456789</code> or <code>ref&#124;XP_123456789</code> ... follow that link.
 +
# Note the RefSeq ID, and save the sequence in FASTA format, as above. For simplicity I will refer to this sequence as "''YFO'' Mbp1" in the future.
 
}}
 
}}
  
== Stereo vision (1 mark):===
+
&nbsp;
 +
 
 +
==Annotate==
 +
 
 +
Let us perform a few simple sequence annotations with the EMBOSS package, using the ''Saccharomyces'' and ''YFO'' Mbp1 sequences. As I noted in class, a large number of simple but fundamental sequence analysis tools are combined in the EMBOSS package. This package can be [http://emboss.sourceforge.net/download/ installed locally on your own machine], or run via a public Web interface<ref>Google for [http://www.google.ca/search?q=EMBOSS+explorer EMBOSS explorer], public access points include http://emboss.bioinformatics.nl/ and http://bioinfo.unice.fr/emboss/index.html </ref>.
 +
 
 +
Access one of the EMBOSS Explorer services and explore some of the tools:
 +
 
 +
 
 +
{{task|1=
 +
;Local composition
 +
# Find <code>pepinfo</code> under the '''PROTEIN COMPOSITION''' heading.
 +
# Paste the YFO Mbp1 sequence in FASTA format into the input field.
 +
# Run with default parameters.
 +
# Do the same in a separate window for yeast Mbp1.
 +
# Try to compare ... kind of hard without reference, overlay and expectation, isn't it?
 +
}}
 +
 
 +
 
 +
{{task|1=
 +
;Global composition
 +
# Find <code>pepstats</code> under the '''PROTEIN COMPOSITION''' heading.
 +
# Paste the YFO Mbp1 sequence in FASTA format into the input field.
 +
# Run with default parameters.
 +
# Do the same in a separate window for yeast Mbp1.
 +
# Try to compare ... are there significant and unexpected differences?
 +
}}
 +
 
 +
 
 +
{{task|1=
 +
;Motifs
 +
# Find <code>patmatmotifs</code> and <code>pepcoils</code> .
 +
# Run these with the YFO Mbp1 sequence and yeast Mbp1.
 +
# Try to compare ... do both sequences have the same motif annotated in approximately comparable regions of the protein?
 +
}}
  
{{task|
 
Access the '''[[Stereo Vision]]''' tutorial and practice viewing molecular structures in stereo.
 
  
Practice at least ...
+
{{task|1=
* two times daily,
+
;Transmembrane sequences
* for 3-5 minutes each session,
+
# Find <code>tmap</code>. Also find <code>shuffleseq</code>.
 +
# Use your YFO Mbp1 sequence to annotate transmembrane helices for it and for a few shuffled sequences. YFO Mbp1 should have neither, and the shuffled sequences work as negative controls in any case.
 +
# Also compare the following positive contros:
 +
<div class="mw-collapsible mw-collapsed" style="margin-left:25px; width:80%; border:#001166 solid 1px; padding: 10px; ">
 +
Gef1 - a yeast chloride channel with 10 trans-membrane helices and outward localized N-terminus:
 +
<div class="mw-collapsible-content">
 +
<source lang="text">
 +
>gi|6322500|ref|NP_012574.1| Gef1p [Saccharomyces cerevisiae S288c]
 +
MPTTYVPINQPIGDGEDVIDTNRFTNIPETQNFDQFVTIDKIAEENRPLSVDSDREFLNSKYRHYREVIW
 +
DRAKTFITLSSTAIVIGCIAGFLQVFTETLVNWKTGHCQRNWLLNKSFCCNGVVNEVTSTSNLLLKRQEF
 +
ECEAQGLWIAWKGHVSPFIIFMLLSVLFALISTLLVKYVAPMATGSGISEIKVWVSGFEYNKEFLGFLTL
 +
VIKSVALPLAISSGLSVGKEGPSVHYATCCGYLLTKWLLRDTLTYSSQYEYITAASGAGVAVAFGAPIGG
 +
VLFGLEEIASANRFNSSTLWKSYYVALVAITTLKYIDPFRNGRVILFNVTYDRDWKVQEIPIFIALGIFG
 +
GLYGKYISKWNINFIHFRKMYLSSWPVQEVLFLATLTALISYFNEFLKLDMTESMGILFHECVKNDNTST
 +
FSHRLCQLDENTHAFEFLKIFTSLCFATVIRALLVVVSYGARVPAGIFVPSMAVGATFGRAVSLLVERFI
 +
SGPSVITPGAYAFLGAAATLSGITNLTLTVVVIMFELTGAFMYIIPLMIVVAITRIILSTSGISGGIADQ
 +
MIMVNGFPYLEDEQDEEEEETLEKYTAEQLMSSKLITINETIYLSELESLLYDSASEYSVHGFPITKDED
 +
KFEKEKRCIGYVLKRHLASKIMMQSVNSTKAQTTLVYFNKSNEELGHRENCIGFKDIMNESPISVKKAVP
 +
VTLLFRMFKELGCKTIIVEESGILKGLVTAKDILRFKRIKYREVHGAKFTYNEALDRRCWSVIHFIIKRF
 +
TTNRNGNVI
 +
</source>
 +
</div>
 +
</div>
 +
# Evaluate the output: does the algorithm (wrongly) predict tm-helices in your protein? In the shuffled sequences? Does it find the tm-helices in Gef1?
 
}}
 
}}
  
Keep up your practice throughout the course. '''Stereo viewing will be required in the final exam,''' but more importantly, it is a wonderful skill that will greatly support any activity of yours related to structural molecular biology. Practice with different molecules and try out different colours and renderings.
 
  
'''Note: do not go through your practice sessions mechanically. If you are not making any progress with stereo vision, contact me so we can help you on the right track.'''
+
Try to familiarize yourself with the offerings in the EMBOSS package. I find some of the nucleic acid tools indispensable in the lab, such as restriction-site mapping tools, and I frequently use the alignment tools <code>Needle</code> and <code>Water</code>, but by and large the utility of many of the components&ndash;while fast, efficient and straightforward to use&ndash; suffers from lack of reference and comparison and from terse output. The routines show their conceptual origin in the 1970s and 1980s. We will encounter alternatives in later assignments.
 +
 
 +
 
 +
=== Plotting sequence composition in R ===
 +
 
 +
 
 +
It is rather straightforward to code a simple amino acid composition measurement in R and I have shown you an example in the lecture. You need to familiarize yourself with
 +
 
 +
{{task|1=
 +
# While you already have '''R''' open, access the  [[R tutorial|'''R tutorial''']] and work through the sections on
 +
## [[R tutorial#Scripts|'''Scripts''']];
 +
## [[R tutorial#Variables|'''Variables''']];
 +
## [[R tutorial#Scalar data|'''Scalar data''']]; and
 +
## [[R tutorial#Vectors|'''Vectors''']]. This should help you understand the following code in principle.
 +
# Study the following code:
 +
 
 +
 
 +
<source lang="R">
 +
# aaFreq.R
 +
# plot amino-acid frequency excess
 +
# Boris Steipe for BCH441, Sept. 2012
 +
 
 +
# install.packages("seqinr")
 +
library("seqinr")
 +
 
 +
#load a fasta-formatted sequence from your working directory
 +
mbp1 <- read.fasta("mbp1.fa", seqtype="AA", set.att=FALSE)
 +
 
 +
# define reference composition, a database average
 +
# (data taken from http://web.expasy.org/docs/relnotes/relstat.html)
 +
 
 +
refData <- c(
 +
    "A"=8.26,
 +
    "Q"=3.93,
 +
    "L"=9.66,
 +
    "S"=6.56,
 +
    "R"=5.53,
 +
    "E"=6.75,
 +
    "K"=5.84,
 +
    "T"=5.34,
 +
    "N"=4.06,
 +
    "G"=7.08,
 +
    "M"=2.42,
 +
    "W"=1.08,
 +
    "D"=5.45,
 +
    "H"=2.27,
 +
    "F"=3.86,
 +
    "Y"=2.92,
 +
    "C"=1.37,
 +
    "I"=5.96,
 +
    "P"=4.70,
 +
    "V"=6.87
 +
    )
 +
refData = refData / sum(refData) #Normalize
 +
 
 +
# how many different characters are in the dataset?
 +
lref <- length(refData)
 +
   
 +
# tabulate  the sequence of interest and normalize
 +
obsData <- table(mbp1)            # count occurrences
 +
obsData = obsData / sum(obsData)  # Normalize
 +
 
 +
# initialize a vector of zeros, which will hold our results
 +
logRatio <- rep(lref, 0)
  
=='''R'''==
+
# loop over all elements of the reference and calculate log-ratios
 +
for (i in 1:lref) {
 +
aa <- names(refData)[i] # get the name of that amino acid
 +
fObs <- obsData[aa]  # retrieve the frequency for that name
 +
fRef <- refData[aa]
 +
logRatio[aa] <- log(fObs / fRef) / log(2)
 +
}
 +
logRatio <- sort(logRatio, decreasing = TRUE) # sort values descending
  
{{#lst:R|intro}}
+
# generate a y-axis label that also contains a subscript
 +
# (see text() for details)
 +
label <- expression(paste(log[2],"( f(obs) / f(ref) )", sep = ""))
  
'''R''' is not a main focus of the course, but an important tool I would like you to pick up "on the side".
+
# define colors for the bars
 +
chargePlus  <- "#282A3F"
 +
chargeMinus <- "#531523"
 +
hydrophilic <- "#47707F"
 +
hydrophobic <- "#6F7A79"
 +
plain      <- "#EDF7F7"
  
{{task|
+
#hydrophilic <- "#194759"
* Access the [[R tutorial|'''R tutorial''']] on this site.
+
#hydrophobic <- "#3E8C84"
* Work through the sections [[R tutorial#Installation|'''Installation''']][[R tutorial#User interface|'''User interface''']],  and [[R tutorial#Packages|'''Packages''']].
+
#plain      <- "#EDF7F7"
 +
 
 +
# Assign the colors to the different amino acid types
 +
barColors <- rep(length(refData), 0)
 +
for (i in 1:length(refData)) {
 +
if      (names(logRatio[i]) == "A") { barColors[i] <- hydrophobic }
 +
else if (names(logRatio[i]) == "C") { barColors[i] <- plain }
 +
else if (names(logRatio[i]) == "D") { barColors[i] <- chargeMinus }
 +
else if (names(logRatio[i]) == "E") { barColors[i] <- chargeMinus }
 +
else if (names(logRatio[i]) == "F") { barColors[i] <- hydrophobic }
 +
else if (names(logRatio[i]) == "G") { barColors[i] <- plain }
 +
else if (names(logRatio[i]) == "H") { barColors[i] <- chargePlus }
 +
else if (names(logRatio[i]) == "I") { barColors[i] <- hydrophobic }
 +
else if (names(logRatio[i]) == "K") { barColors[i] <- chargePlus }
 +
else if (names(logRatio[i]) == "L") { barColors[i] <- hydrophobic }
 +
else if (names(logRatio[i]) == "M") { barColors[i] <- hydrophobic }
 +
else if (names(logRatio[i]) == "N") { barColors[i] <- hydrophilic }
 +
else if (names(logRatio[i]) == "P") { barColors[i] <- plain }
 +
else if (names(logRatio[i]) == "Q") { barColors[i] <- hydrophilic }
 +
else if (names(logRatio[i]) == "R") { barColors[i] <- chargePlus }
 +
else if (names(logRatio[i]) == "S") { barColors[i] <- hydrophilic }
 +
else if (names(logRatio[i]) == "T") { barColors[i] <- hydrophilic }
 +
else if (names(logRatio[i]) == "V") { barColors[i] <- hydrophobic }
 +
else if (names(logRatio[i]) == "W") { barColors[i] <- hydrophobic }
 +
else if (names(logRatio[i]) == "Y") { barColors[i] <- hydrophobic }
 +
else                                { barColors[i] <- plain }
 +
}
 +
 
 +
# create a plot
 +
barplot(logRatio, ylim=c(-4,1), col = barColors, ylab = label, cex.names=0.9)
 +
 
 +
# draw a line at 0
 +
abline(h=0)
 +
 
 +
# create a legend that indicates what the colors mean
 +
legend (x = 1,
 +
        y = -1,
 +
        legend = c("charged (+)",
 +
                  "charged (-)",
 +
                  "hydrophilic",
 +
                  "hydrophobic",
 +
                  "plain"),
 +
        bty = "n",
 +
        fill = c(chargePlus,
 +
                chargeMinus,
 +
                hydrophilic,
 +
                hydrophobic,
 +
                plain)
 +
        )
 +
       
 +
</source>
 +
 
 +
 
 +
:3. Copy the code and execute it in R. First for yeast Mbp1 (you need the appropriate FASTA file in your working directory), then edit and run for YFO. The yeast protein is enriched in hydrophilic residues and has a slight excess of (+) charged residues. This might be compatible with a partially natively disordered protein with DNA-binding properties. How is YFO Mbp1 similar or different?
 
}}
 
}}
  
 +
 +
 +
;That is all.
 +
 +
 +
&nbsp;
 +
 +
== Links and resources ==
 +
 +
<!-- {{#pmid: 19957275}} -->
 +
<!-- {{WWW|WWW_GMOD}} -->
 +
<!-- <div class="reference-box">[http://www.ncbi.nlm.nih.gov]</div> -->
 +
 +
 +
&nbsp;
 +
{{#lst:BIO_Assignment_Week_1|assignment_footer}}
 +
 +
 +
&nbsp;
 
[[Category:Bioinformatics]]
 
[[Category:Bioinformatics]]
 
</div>
 
</div>

Revision as of 17:36, 28 September 2012

Assignment for Week 3
Sequence Analysis

Note! This assignment is currently active. All significant changes will be announced on the mailing list.

 
 

Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.



 

Introduction

In this assignment we will

  • use the sequence search program BLAST to retrieve a sequence similar to yeast Mbp1 in YFO.
  • use a number of tools to annotate the sequence.

 

Retrieve

In Assignment 2 you looked at sequences in YFO that are related to yeast Mbp1, by following a link from the RefSeq record. I mentioned that there are more principled ways to find related proteins: that principle is to search for similar sequences. Exactly how this works will be the subject of later lectures, but the tool that is most commonly used for this task is called BLAST (Basic Local Alignment And Search Tool). The task of this assignment is to perform a number of sequence annotations to the sequence from YFO that is most similar to Mbp1, or, more precisely, that contains an APSES domain that is most similar[1].

 

Search input

First, we need to define the sequence we will search with, as the search input. I have highlighted the extent of the APSES domain sequence in the previous assignment, but when you explored the corresponding structure in VMD, you have seen that the structured protein domain is larger and the additional secondary structure elements are in fact well integrated into the overall domain. This is not surprising: canonical domain definitions are compiled from many species and examples, and they generally comprise only the common core. Looking up the source of the domain annotations for Mbp1 is very easy:


Task:

  1. Access the RefSeq record for yeast Mbp1.
  2. Look for the link to the CDD Search Results in the Related Information section and follow it.


This is a domain annotation: CDD is the NCBI's Conserved Domain Database and the annotation was done by a tool that scanned the sequence of Mbp1 for segments that are similar to any of the domain definitions stored in the CDD. We will return to CDD in the next assignment.


  1. Click on the box labeled Kila-N to access the CDD entry for this domain.
  2. Read the abstract. You should understand the relationship between Kila-N and APSES domains. One is a subfamily of the other.
  3. Confirm that the domain definition for Mbp1 (labeled as "query") corresponds to the region we highlighted in the last assignment.


Finally load the molecule in VMD, like you did in the last assignment and study the domain architecture in stereo.
Use a New Cartoon representation for a selection of the "extra" residues at N- and C- terminus and color it yellow. Use another New Cartoon representation for the APSES domain proper, and color it by Index. Study this in a side-by-side stereo image and get a sense for how the extra sequence is part of the domain, and what would happen if it were removed. Orient this stereo-image for a clear view of the domain architecture in question, scale to fill the available window well, and render it. Convert the resulting image to jpeg format, resize it to be no larger than 600px across and post it to the group. The first five such posts will receive 10% bonus on the next quiz.


There is a rather important lesson in this: domain definitions may be fluid, and their boundaries may be computationally derived from sequence comparisons across many families, and do not necessarily correspond to individual structures. Make sure you understand this well.


Given this, it seems appropriate to search with the sequence of an Mbp1 structure–this being a structured, stable, subdomain of the whole that presumably contains the protein's most unique and specific function. Let us retrieve this sequence. All PDB structures have their sequences stored in the NCBI protein database. They can be accessed simply via the PDB-ID, which serves as an identifier both for the NCBI and the PDB databases. However there is a small catch (isn't there always?). PDB files can contain more than one protein, e.g. if the crystal structure contains a complex[2]. Each of the individual proteins gets a so-called chain ID–a one letter identifier– to identify them uniquely. To find their unique sequence in the database, you need to know the PDB ID as well as the chain ID. If the file contains only a single protein (as in our case), the chain ID is always A[3]. make sure you understand the concept of protein chains, and chain IDs.


Task:

  1. Back at the RefSeq record for yeast Mbp1, enter the PDB-ID, an underscore, and the chain ID for one of the crystal structures into the search field. You can use 1MB1_A or 1BM8_A, but don't use 1L3G: this NMR structure includes a large stretch of unstructured residues.
  2. Click on Display settings and choose FASTA (text). You should get something like:
    >gi|157830387|pdb|1BM8|A Chain A, Dna-Binding Domain Of Mbp1
    QIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGFGKY
    QGTWVPLNIAKQLAEKFSVYDQLKPLFDF
  3. Save this sequence as a file, in case we need it later.


Next, we use this sequence to find its most similar relative in YFO using BLAST


Task:

  1. Navigate to the BLAST entry page at the NCBI.
  2. Click on protein blast as the BLAST program to run.
  3. Paste the sequence of the yeast Mbp1 DNA-binding domain into the search field.
  4. Set the following parameters:
    1. As Database option choose Reference proteins (refseq_protein)
    2. As Organism enter the binomial name of YFO. Make sure you spell it right, the page will try to autocomplete your entry. Species level is detailed enough, you don't have to specify the strain (e.g. I would specify "Ustilago maydis" not "Ustilago maydis 521").
  5. Then click on the BLAST button and wait for the result to appear. You will first see a graph of any conserved domains in your query sequence, that is not yet what you are waiting for...
  6. Patience.
  7. Patience. The database is large.
  8. Patience. It took about five minutes to complete for me, but execution times vary greatly by time of day.
  9. The top "hit" on the results page is what you are looking for. Its alignment and alignment score are shown in the Alignments section a bit further down the page. Your hit should have on the order of more than 40% identities to the query and match at least 80 residues or so. If your match seems less and worse than that, please eMail me to troubleshoot.
  10. The first item for each hit is a link to its database entry, right next to the checkbox. It says something like ref|NP_123456789 or ref|XP_123456789 ... follow that link.
  11. Note the RefSeq ID, and save the sequence in FASTA format, as above. For simplicity I will refer to this sequence as "YFO Mbp1" in the future.

 

Annotate

Let us perform a few simple sequence annotations with the EMBOSS package, using the Saccharomyces and YFO Mbp1 sequences. As I noted in class, a large number of simple but fundamental sequence analysis tools are combined in the EMBOSS package. This package can be installed locally on your own machine, or run via a public Web interface[4].

Access one of the EMBOSS Explorer services and explore some of the tools:


Task:

Local composition
  1. Find pepinfo under the PROTEIN COMPOSITION heading.
  2. Paste the YFO Mbp1 sequence in FASTA format into the input field.
  3. Run with default parameters.
  4. Do the same in a separate window for yeast Mbp1.
  5. Try to compare ... kind of hard without reference, overlay and expectation, isn't it?


Task:

Global composition
  1. Find pepstats under the PROTEIN COMPOSITION heading.
  2. Paste the YFO Mbp1 sequence in FASTA format into the input field.
  3. Run with default parameters.
  4. Do the same in a separate window for yeast Mbp1.
  5. Try to compare ... are there significant and unexpected differences?


Task:

Motifs
  1. Find patmatmotifs and pepcoils .
  2. Run these with the YFO Mbp1 sequence and yeast Mbp1.
  3. Try to compare ... do both sequences have the same motif annotated in approximately comparable regions of the protein?


Task:

Transmembrane sequences
  1. Find tmap. Also find shuffleseq.
  2. Use your YFO Mbp1 sequence to annotate transmembrane helices for it and for a few shuffled sequences. YFO Mbp1 should have neither, and the shuffled sequences work as negative controls in any case.
  3. Also compare the following positive contros:

Gef1 - a yeast chloride channel with 10 trans-membrane helices and outward localized N-terminus:

>gi|6322500|ref|NP_012574.1| Gef1p [Saccharomyces cerevisiae S288c]
MPTTYVPINQPIGDGEDVIDTNRFTNIPETQNFDQFVTIDKIAEENRPLSVDSDREFLNSKYRHYREVIW
DRAKTFITLSSTAIVIGCIAGFLQVFTETLVNWKTGHCQRNWLLNKSFCCNGVVNEVTSTSNLLLKRQEF
ECEAQGLWIAWKGHVSPFIIFMLLSVLFALISTLLVKYVAPMATGSGISEIKVWVSGFEYNKEFLGFLTL
VIKSVALPLAISSGLSVGKEGPSVHYATCCGYLLTKWLLRDTLTYSSQYEYITAASGAGVAVAFGAPIGG
VLFGLEEIASANRFNSSTLWKSYYVALVAITTLKYIDPFRNGRVILFNVTYDRDWKVQEIPIFIALGIFG
GLYGKYISKWNINFIHFRKMYLSSWPVQEVLFLATLTALISYFNEFLKLDMTESMGILFHECVKNDNTST
FSHRLCQLDENTHAFEFLKIFTSLCFATVIRALLVVVSYGARVPAGIFVPSMAVGATFGRAVSLLVERFI
SGPSVITPGAYAFLGAAATLSGITNLTLTVVVIMFELTGAFMYIIPLMIVVAITRIILSTSGISGGIADQ
MIMVNGFPYLEDEQDEEEEETLEKYTAEQLMSSKLITINETIYLSELESLLYDSASEYSVHGFPITKDED
KFEKEKRCIGYVLKRHLASKIMMQSVNSTKAQTTLVYFNKSNEELGHRENCIGFKDIMNESPISVKKAVP
VTLLFRMFKELGCKTIIVEESGILKGLVTAKDILRFKRIKYREVHGAKFTYNEALDRRCWSVIHFIIKRF
TTNRNGNVI
  1. Evaluate the output: does the algorithm (wrongly) predict tm-helices in your protein? In the shuffled sequences? Does it find the tm-helices in Gef1?


Try to familiarize yourself with the offerings in the EMBOSS package. I find some of the nucleic acid tools indispensable in the lab, such as restriction-site mapping tools, and I frequently use the alignment tools Needle and Water, but by and large the utility of many of the components–while fast, efficient and straightforward to use– suffers from lack of reference and comparison and from terse output. The routines show their conceptual origin in the 1970s and 1980s. We will encounter alternatives in later assignments.


Plotting sequence composition in R

It is rather straightforward to code a simple amino acid composition measurement in R and I have shown you an example in the lecture. You need to familiarize yourself with

Task:

  1. While you already have R open, access the R tutorial and work through the sections on
    1. Scripts;
    2. Variables;
    3. Scalar data; and
    4. Vectors. This should help you understand the following code in principle.
  2. Study the following code:


# aaFreq.R
# plot amino-acid frequency excess
# Boris Steipe for BCH441, Sept. 2012

# install.packages("seqinr")
library("seqinr")

#load a fasta-formatted sequence from your working directory
mbp1 <- read.fasta("mbp1.fa", seqtype="AA", set.att=FALSE)

# define reference composition, a database average 
# (data taken from http://web.expasy.org/docs/relnotes/relstat.html)

refData <- c(
    "A"=8.26,
    "Q"=3.93,
    "L"=9.66,
    "S"=6.56,
    "R"=5.53,
    "E"=6.75,
    "K"=5.84,
    "T"=5.34,
    "N"=4.06,
    "G"=7.08,
    "M"=2.42,
    "W"=1.08,
    "D"=5.45,
    "H"=2.27,
    "F"=3.86,
    "Y"=2.92,
    "C"=1.37,
    "I"=5.96,
    "P"=4.70,
    "V"=6.87
    )
refData = refData / sum(refData) #Normalize

# how many different characters are in the dataset?
lref <- length(refData) 
    
# tabulate  the sequence of interest and normalize
obsData <- table(mbp1)             # count occurrences
obsData = obsData / sum(obsData)   # Normalize

# initialize a vector of zeros, which will hold our results 
logRatio <- rep(lref, 0) 

# loop over all elements of the reference and calculate log-ratios
for (i in 1:lref) {
	aa <- names(refData)[i] # get the name of that amino acid
	fObs <- obsData[aa]  # retrieve the frequency for that name
	fRef <- refData[aa]
	logRatio[aa] <- log(fObs / fRef) / log(2)
}
logRatio <- sort(logRatio, decreasing = TRUE) # sort values descending 

# generate a y-axis label that also contains a subscript
# (see text() for details)
label <- expression(paste(log[2],"( f(obs) / f(ref) )", sep = ""))

# define colors for the bars
chargePlus  <- "#282A3F"
chargeMinus <- "#531523"
hydrophilic <- "#47707F"
hydrophobic <- "#6F7A79"
plain       <- "#EDF7F7"

#hydrophilic <- "#194759"
#hydrophobic <- "#3E8C84"
#plain       <- "#EDF7F7"

# Assign the colors to the different amino acid types
barColors <- rep(length(refData), 0)
for (i in 1:length(refData)) {
	if      (names(logRatio[i]) == "A") { barColors[i] <- hydrophobic }
	else if (names(logRatio[i]) == "C") { barColors[i] <- plain }
	else if (names(logRatio[i]) == "D") { barColors[i] <- chargeMinus }
	else if (names(logRatio[i]) == "E") { barColors[i] <- chargeMinus }
	else if (names(logRatio[i]) == "F") { barColors[i] <- hydrophobic }
	else if (names(logRatio[i]) == "G") { barColors[i] <- plain }
	else if (names(logRatio[i]) == "H") { barColors[i] <- chargePlus }
	else if (names(logRatio[i]) == "I") { barColors[i] <- hydrophobic }
	else if (names(logRatio[i]) == "K") { barColors[i] <- chargePlus }
	else if (names(logRatio[i]) == "L") { barColors[i] <- hydrophobic }
	else if (names(logRatio[i]) == "M") { barColors[i] <- hydrophobic }
	else if (names(logRatio[i]) == "N") { barColors[i] <- hydrophilic }
	else if (names(logRatio[i]) == "P") { barColors[i] <- plain }
	else if (names(logRatio[i]) == "Q") { barColors[i] <- hydrophilic }
	else if (names(logRatio[i]) == "R") { barColors[i] <- chargePlus }
	else if (names(logRatio[i]) == "S") { barColors[i] <- hydrophilic }
	else if (names(logRatio[i]) == "T") { barColors[i] <- hydrophilic }
	else if (names(logRatio[i]) == "V") { barColors[i] <- hydrophobic }
	else if (names(logRatio[i]) == "W") { barColors[i] <- hydrophobic }
	else if (names(logRatio[i]) == "Y") { barColors[i] <- hydrophobic }
	else                                { barColors[i] <- plain }
}

# create a plot
barplot(logRatio, ylim=c(-4,1), col = barColors, ylab = label, cex.names=0.9)

# draw a line at 0
abline(h=0)

# create a legend that indicates what the colors mean
legend (x = 1,
        y = -1,
        legend = c("charged (+)",
                   "charged (-)",
                   "hydrophilic",
                   "hydrophobic",
                   "plain"),
        bty = "n",
        fill = c(chargePlus,
                 chargeMinus,
                 hydrophilic,
                 hydrophobic, 
                 plain)
        )


3. Copy the code and execute it in R. First for yeast Mbp1 (you need the appropriate FASTA file in your working directory), then edit and run for YFO. The yeast protein is enriched in hydrophilic residues and has a slight excess of (+) charged residues. This might be compatible with a partially natively disordered protein with DNA-binding properties. How is YFO Mbp1 similar or different?


That is all.


 

Links and resources

 


Footnotes and references

  1. As you will see later on in the assignment, Mbp1-related proteins contain "Ankyrin" domains, a very widely distributed protein-protein interaction motif that may give rise to false-positive similarities for full-length sequence searches. Therefore, we search only with the DNA binding domain sequence, since this is the functionality that best characterizes the "function" of the protein we are interested in.
  2. Think of the ribosome or DNA-polymerase as extreme examples.
  3. Otherwise, you need to study the PDB Web page for the structure, or the text in the PDB file itself, to identify which part of the complex is labeled with which chain ID. For example, immunoglobulin structures some time label the light- and heavy chain fragments as "L" and "H", and sometimes as "A" and "B"–there are no fixed rules. You can also load the structure in VMD, color "by chain" and use the mouse to click on residues in each chain to identify it.
  4. Google for EMBOSS explorer, public access points include http://emboss.bioinformatics.nl/ and http://bioinfo.unice.fr/emboss/index.html


 

Ask, if things don't work for you!

If anything about the assignment is not clear to you, please ask on the mailing list. You can be certain that others will have had similar problems. Success comes from joining the conversation.