BIO Assignment Week 6
 
<div class="b1">
 
<div class="b1">
 
Assignment for Week 6<br />
 
Assignment for Week 6<br />
<span style="font-size: 70%">Sensitive database searches with PSI-BLAST</span>
+
<span style="font-size: 70%">Function</span>
 
</div>
 
</div>
 +
<table style="width:100%;"><tr>
 +
<td style="height:30px; vertical-align:middle; text-align:left; font-size:80%;">[[BIO_Assignment_Week_5|&lt;&nbsp;Assignment&nbsp;5]]</td>
 +
<td style="height:30px; vertical-align:middle; text-align:right; font-size:80%;">[[BIO_Assignment_Week_7|Assignment&nbsp;7&nbsp;&gt;]]</td>
 +
</tr></table>
  
{{Template:Active}}
+
{{Template:Inactive}}
  
 
Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.  
 
Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.  
__TOC__


==Introduction==
  
<div style="padding: 2px; background: #F0F1F7;  border:solid 1px #AAAAAA; font-size:125%;color:#444444">
+
&nbsp;
 +
 
 +
In this assignment we will first download a number of APSES domain containing sequences into our database - and we will automate the process. Then we will annotate them with domain data. First manually, and then again, we will automate this. Next we will extract the APSES domains from our database according to the annotations. And finally we will align them, and visualize domain conservation in the 3D model to study parts of the protein that are conserved.
  
  
&nbsp;<br>
+
&nbsp;
  
;Take care of things, and they will take care of you.
+
==Downloading Protein Data From the Web==
:''Shunryu Suzuki''
 
</div>
 
  
  
In [[BIO_Assignment_Week_3|Assignment 3]] we created a schema for a local protein sequence collection and implemented it as an R list. We added sequences to this database by hand, but since the information should be cross-referenced and available based on a protein's RefSeq ID, we should really have a function that automates this process. It is far too easy to make mistakes and enter erroneous information otherwise. As a reminder, a sketch of the database structure is shown below.
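
For orientation, here is a minimal sketch of such a list-based collection. This is not the authoritative Assignment 3 schema - the <code>sequence</code> and <code>species</code> columns are illustrative assumptions - but the <code>$protein</code> and <code>$taxonomy</code> elements and the <code>name</code>, <code>refSeqID</code> and <code>taxID</code> columns are the ones we query below.

<source lang="R">
# A minimal, hand-built sketch of the database structure:
# one data frame for proteins, one for taxonomy,
# cross-referenced via the taxID column.
myDB <- list(
    protein = data.frame(
        name     = "Mbp1p",
        refSeqID = "NP_010227",
        taxID    = 559292,
        sequence = "MSNQIYSARY...",   # truncated for illustration
        stringsAsFactors = FALSE),
    taxonomy = data.frame(
        taxID   = 559292,
        species = "Saccharomyces cerevisiae",
        stringsAsFactors = FALSE)
)

# Cross-referencing means matching IDs between the tables:
myDB$protein[myDB$protein$taxID == 559292, "refSeqID"]
</source>
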
{{task|1=
Work through the following code examples.
<source lang="R">
# To begin, we load some libraries with functions
# we need...

# httr sends and receives information via the http
# protocol, just like a Web browser.
if (!require(httr, quietly=TRUE)) {
	install.packages("httr")
	library(httr)
}

# NCBI's eUtils send information in XML format; we
# need to be able to parse XML.
if (!require(XML, quietly=TRUE)) {
	install.packages("XML")
	library(XML)
}

# stringr has a number of useful utility functions
# for working with strings, e.g. a function that
# strips leading and trailing whitespace from
# strings.
if (!require(stringr, quietly=TRUE)) {
	install.packages("stringr")
	library(stringr)
}


# We will walk through the process with the refSeqID
# of yeast Mbp1
refSeqID <- "NP_010227"


# UniProt.
# The UniProt ID mapping service supports a "RESTful
# API": responses can be obtained simply via a Web
# browser's request. Such requests are commonly sent
# via the GET or POST verbs that a Web server responds
# to when a client asks for data. GET requests are
# visible in the URL of the request; POST requests
# are not directly visible - they are commonly used
# to send the contents of forms, or when transmitting
# larger, complex data items. The UniProt ID mapping
# service can accept long lists of IDs, thus using the
# POST mechanism makes sense.

# R has a POST() function as part of the httr package.

# It's very straightforward to use: just define the URL
# of the server and send a list of items as the
# body of the request.

# UniProt ID mapping service
URL <- "http://www.uniprot.org/mapping/"
response <- POST(URL,
                 body = list(from = "P_REFSEQ_AC",
                             to = "ACC",
                             format = "tab",
                             query = refSeqID))

response

# If the query is successful, tabbed text is returned
# and we capture the fourth element as the requested
# mapped ID.
unlist(strsplit(content(response), "\\s+"))

# If the query can't be fulfilled because of a problem
# with the server, a Web page is returned. But the
# server status is also returned and we can check the
# status code. I have lately gotten many "503" status
# codes: Server Not Available...

if (response$status_code == 200) { # 200: OK
	uniProtID <- unlist(strsplit(content(response), "\\s+"))[4]
	if (is.na(uniProtID)) {
		warning(paste("UniProt ID mapping service returned NA.",
		              "Check your RefSeqID."))
	}
} else {
	uniProtID <- NA
	warning(paste("No UniProt ID mapping available:",
	              "server returned status",
	              response$status_code))
}

uniProtID  # Let's see what we got...
           # This should be "P39678"
           # (or NA if the query failed)
</source>

Next, we'll retrieve data from the various NCBI databases.

It has become unreasonably difficult to screenscrape the NCBI site, since the actual page contents are dynamically loaded via AJAX. This may be intentional, or just overengineering. While NCBI offers a subset of their data via the eUtils API, and that works well enough, some of the data that is available to a Web browser's eyes is not served to a program.

The eUtils API returns data in XML format. Have a look at the following URL in your browser to see what that looks like:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=protein&term=NP_010227

<source lang="R">
  
So in my opinion, you should search with the yeast Mbp1 APSES domain, i.e. the sequence which you have previously studied in the crystal structure. Where is that? Well, you might have saved it in your journal, or you can get it again from the [http://www.pdb.org/pdb/explore/explore.do?structureId=1BM8 '''PDB'''] (i.e. [http://www.pdb.org/pdb/files/fasta.txt?structureIdList=1BM8 here], or from [[BIO_Assignment_Week_3#Search input|Assignment 3]].
+
# In order to parse such data, we need tools from the  
 +
# XML package.  
  
</div>
+
# First we build a query URL...
</div>
+
eUtilsBase <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
}}
 
  
&nbsp;
 
  
==Selecting species for a PSI-BLAST search==
+
# Then we assemble an URL that will search for get the
 +
# unique, NCBI internal identifier,  the GI number,
 +
# for our refSeqID...
 +
URL <- paste(eUtilsBase,
 +
            "esearch.fcgi?",    # ...using the esearch program
 +
                                  # that finds an entry in an
 +
                                  # NCBI database
 +
            "db=protein",
 +
            "&term=", refSeqID,
 +
            sep="")
 +
# Copy the URL and paste it into your browser to see
 +
# what the response should look like.
 +
URL
  
 +
# To fetch a response in R, we use the function htmlParse()
 +
# with our URL as its argument.
 +
response <- htmlParse(URL)
 +
response
  
As discussed in the introduction, in order to use our sequence set for studying structural and functional features and conservation patterns of our APSES domain proteins, we should start with a well selected dataset of APSES domain containing homologs in YFO. Since these may be quite divergent, we can't rely on '''BLAST''' to find all of them, we need to use the much more sensitive search of '''PSI-BLAST''' instead. But even though you are interested only in YFO's genes, it would be a mistake to restrict the PSI-BLAST search to YFO. PSI-BLAST becomes more sensitive if the profile represents more diverged homologs. Therefore we should always search with a broadly representative set of species, even if we are interested only in the results for one of the species. This is important. Please reflect on this for a bit and make sure you understand the rationale why we include sequences in the search that we are not actually interested in.
+
# This is XML. We can take the response apart into
 +
# its indvidual components with the xmlToList function.
  
 +
xmlToList(response)
  
But you can also search with '''too many''' species: if the number of species is large and PSI-BLAST finds a large number of results:
+
# Note how the XML "tree" is represented as a list of
# it becomes unwieldy to check the newly included sequences at each iteration, inclusion of false-positive hits may result, profile corruption and loss of specificity. The search will fail.
+
# lists of lists ...
# since genomes from some parts of the Tree Of Life are over represented, the inclusion of all sequences leads to selection bias and loss of sensitivity.
+
# If we know exactly what elelement we are looking for,
 +
# we can extract it from this structure:
 +
xmlToList(response)[["body"]][["esearchresult"]][["idlist"]][["id"]]
  
 +
# But this is not very robus, it would break with the
 +
# slightest change that the NCBI makes to their response
 +
# and the NCBI changes things A LOT!
  
We should therefore try to find a subset of species
+
# Somewhat more robust is to specify the type of element
# that represent as large a '''range''' as possible on the evolutionary tree;
+
# we want - its the text contained in an <id>...</id>
# that are as well '''distributed''' as possible on the tree; and
+
# elelement, and use the XPath XML parsing language to
# whose '''genomes''' are fully sequenced.
+
# retrieve it.
  
These criteria are important. Again, reflect on them and understand their justification. Choosing your species well for a PSI-BLAST search can be crucial to obtain results that are robust and meaningful.
+
# getNodeSet() lets us fetch tagged contents by
 +
# applying toString.XMLNode() to it...
  
How can we '''define''' a list of such species, and how can we '''use''' the list?
+
node <- getNodeSet(response, "//id/text()")
 +
unlist(lapply(node, toString.XMLNode))  # "6320147 "
  
The definition is a rather typical bioinformatics task for integrating datasources: "retrieve a list of reresentative fungi with fully sequenced genomes".  Unfortunately, to do this in a principled way requires tools that you can't (yet) program: we would need to use a list of genome sequenced fungi, estimate their evolutionary distance and select a well-distributed sample. But we can come close enough to this with the following steps:
+
# We will be doing this a lot, so we write a function
 +
# for it...
 +
node2string <- function(doc, tag) {
 +
    # an extractor function for the contents of elements
 +
    # between given tags in an XML response.
 +
    # Contents of all matching elements is returned in
 +
    # a vector of strings.
 +
path <- paste("//", tag, "/text()", sep="")
 +
nodes <- getNodeSet(doc, path)
 +
return(unlist(lapply(nodes, toString.XMLNode)))
 +
}
  
# Use a list of genome sequenced fungi (from NCBI);
+
# using node2string() ...
# BLAST the yeast Mbp1 APSES domain against that list;
+
GID <- node2string(response, "id")
# Evaluate the taxonomy report that BLAST generates
+
GID
# Select species of approximately similar ''taxonomic rank''.
 
  
Again: reflect on this process and make sure you understand the principle. You should be able to ask yourself: how would I do this for a protein I work with after the course... ? (And know the answer.)
+
# The GI is the pivot for all our data requests at the
 +
# NCBI.  
  
 +
# Let's first get the associated data for this GI
 +
URL <- paste(eUtilsBase,
 +
            "esummary.fcgi?",
 +
            "db=protein",
 +
            "&id=",
 +
            GID,
 +
            "&version=2.0",
 +
            sep="")
 +
response <- htmlParse(URL)
 +
URL
 +
response
  
{{task|1=
+
taxID <- node2string(response, "taxid")
 +
organism <- node2string(response, "organism")
 +
taxID
 +
organism
  
# Navigate to the [http://blast.ncbi.nlm.nih.gov/ '''BLAST'''] home page.
 
# Find the link to '''list all genomic BLAST databases''' and follow it. This list will take you to a selection of genome-sequenced fungi.
 
# Find the section of '''Fungi''' and click on the small triangle if it is not yet "open".
 
# Don't be deceived: there are more species in the database than these. You could follow the links if you wanted to search in '''one particular genome'''. We will search in a '''set of genomes''' instead. Click on the small, round '''B''' icon, next to the group label '''Fungi'''. You should arrive at [http://www.ncbi.nlm.nih.gov/sutils/genom_table.cgi?organism=fungi this] page.
 
#From the drop-down menus select:
 
## Query: '''Protein'''
 
## Database: '''Protein'''
 
## BLAST-Program: '''blastp'''
 
# Check the boxes next to all species that have a pale yellow background. As you can read in the header of the page, these are completed genomic sequences. As of today, there are 34 such genomes.
 
# Paste your search sequence - i.e. the sequence of the yeast Mbp1 APSES domain into the field.
 
# Click on '''BLAST''', then on '''View results''' on the next page.
 
# In the header section of the BLAST report, find the line '''Other reports''' and open the '''Taxonomy report''' in a separate tab or window.
 
# For completeness, scroll through the list of '''Descriptions''' - "Sequences producing significant alignments" and look at he accession numbers. Most of these are RefSeq IDs (either <code>NP_...</code> or <code>XP_...</code>). Make sure that for all of the species that do '''not''' have RefSeq identifiers there are variants or strains that do. The reason is: we would not want to inadvertently exclude species in favour of closely related other species for which the genome has not yet been imported into RefSeq. Since we will be doing our full-scale search on RefSeq, we want to ensure all our species are actually represented there. Please reflect on this for a moment and make sure you understand this point.
 
# Now examine the taxonomy report. The page has three sections: a '''Lineage Report''', an '''Organism Report''' and the '''Taxonomy report'''.
 
  
}}
+
# Next, fetch the actual sequence
 +
URL <- paste(eUtilsBase,
 +
            "efetch.fcgi?",
 +
            "db=protein",
 +
            "&id=",
 +
            GID,
 +
            "&retmode=text&rettype=fasta",
 +
            sep="")
 +
response <- htmlParse(URL)
 +
URL
 +
response
  
To make use of the Taxonomy report, you should know that {{WP|Biological classification|biological classification}} provides a hierarchical system that defines relationships for all living entities. The levels of the hierarchy are so called {{WP|Taxonomic rank|'''taxonomic ranks'''}}. These ranks are defined in ''Codes of Nomenclature'' that are curated by the self-governed international associations of scientists working in the field. The number of ranks is not specified: there is a general consensus on seven principal ranks (see below, in bold) but many subcategories exist and may be newly introduced. It is desired&ndash;but not mandated&ndash;that ranks represent ''clades'' (a group of related species, or a "branch" of a phylogeny), and it is desired&ndash;but not madated&ndash;that the rank is sharply defined. The system is based on subjective dissimilarity. Needless to say that it is in flux. However the coarse outlines are basically stable and will serve for our purpose of identifying a number of well-distributed species from a set.
+
fasta <- node2string(response, "p")
 +
fasta
  
If we follow a link to an entry in the NCBI's Taxonomy database, eg. [http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=559292 ''Saccharomyces cerevisiae S228c''], the strain from which the original "yeast genome" was sequenced in the late 1990s, we see the following specification of its taxonomic lineage:
+
seq <- unlist(strsplit(fasta, "\\n"))[-1] # Drop the first elelment,
 +
                                          # it is the FASTA header.
 +
seq
  
  
<source lang="text">
+
# Next, fetch the crossreference to the NCBI Gene
cellular organisms; Eukaryota; Opisthokonta; Fungi; Dikarya;
+
# database
Ascomycota; Saccharomyceta; Saccharomycotina; Saccharomycetes;
+
URL <- paste(eUtilsBase,
Saccharomycetales; Saccharomycetaceae; Saccharomyces; Saccharomyces cerevisiae
+
            "elink.fcgi?",
</source>
+
            "dbfrom=protein",
 +
            "&db=gene",
 +
            "&id=",
 +
            GID,
 +
            sep="")
 +
response <- htmlParse(URL)
 +
URL
 +
response
  
 +
geneID <- node2string(response, "linksetdb/id")
 +
geneID
  
These names can be mapped into taxonomic ranks ranks, since the suffixes of these names e.g. ''-mycotina'', ''-mycetaceae'' are specific to defined ranks. (NCBI does not provide this mapping, but {{WP|Taxonomic rank|Wikipedia}} is helpful here.)
+
# ... and the actual Gene record:
 +
URL <- paste(eUtilsBase,
 +
            "esummary.fcgi?",
 +
            "&db=gene",
 +
            "&id=",
 +
            geneID,
 +
            sep="")
 +
response <- htmlParse(URL)
 +
URL
 +
response
  
<table>
+
name <- node2string(response, "name")
 +
genome_xref <- node2string(response, "chraccver")
 +
genome_from <- node2string(response, "chrstart")[1]
 +
genome_to <- node2string(response, "chrstop")[1]
 +
name
 +
genome_xref
 +
genome_from
 +
genome_to
  
<tr class="sh">
+
# So far so good. But since we need to do this a lot
<td>Rank</td>
+
# we need to roll all of this into a function.
<td>Suffix</td>
 
<td>Example</td>
 
</tr>
 
  
<tr class="s1">
+
# I have added the function to the dbUtilities code
<td>Domain</td>
+
# so you can update it easily.
<td></td>
 
<td>Eukaryota (Eukarya)</td>
 
</tr>
 
  
<tr class="s2">
+
# Run:
<td>&nbsp;&nbsp;Subdomain</td>
 
<td>&nbsp;</td>
 
<td>Opisthokonta</td>
 
</tr>
 
  
<tr class="s1">
+
updateDbUtilities("55ca561e2944af6e9ce5cf2a558d0a3c588ea9af")
<td>'''Kingdom'''</td>
 
<td>&nbsp;</td>
 
<td>Fungi</td>
 
</tr>
 
  
<tr class="s2">
+
# If that is successful, try these three testcases
<td>&nbsp;&nbsp;Subkingdom</td>
 
<td>&nbsp;</td>
 
<td>Dikarya</td>
 
</tr>
 
  
<tr class="s1">
+
myNewDB <- createDB()
<td>'''Phylum'''</td>
+
tmp <- fetchProteinData("NP_010227") # Mbp1p
<td>&nbsp;</td>
+
tmp
<td>Ascomycota</td>
+
myNewDB <- addToDB(myNewDB, tmp)
</tr>
+
myNewDB
  
<tr class="s2">
+
tmp <- fetchProteinData("NP_011036") # Swi4p
<td>&nbsp;&nbsp;''rankless taxon''<ref>The -myceta are well supported groups above the Class rank. See {{WP|Leotiomyceta|''Leotiomyceta''}} for details and references.</ref></td>
+
tmp
<td>-myceta</td>
+
myNewDB <- addToDB(myNewDB, tmp)
<td>Saccharomyceta</td>
+
myNewDB
</tr>
 
  
<tr class="s1">
+
tmp <- fetchProteinData("NP_012881") # Phd1p
<td>&nbsp;&nbsp;Subphylum</td>
+
tmp
<td>-mycotina</td>
+
myNewDB <- addToDB(myNewDB, tmp)
<td>Saccharomycotina</td>
+
myNewDB
</tr>
 
  
<tr class="s2">
 
<td>'''Class'''</td>
 
<td>-mycetes</td>
 
<td>Saccharomycetes</td>
 
</tr>
 
  
<tr class="s1">
 
<td>&nbsp;&nbsp;Subclass</td>
 
<td>-mycetidae</td>
 
<td>&nbsp;</td>
 
</tr>
 
  
<tr class="s2">
+
</source>

}}


This new <code>fetchProteinData()</code> function seems to be quite convenient. I have compiled a [[Reference_APSES_proteins_(reference_species)|set of APSES domain proteins]] for ten fungal species and loaded the 48 proteins' data into an R database in a few minutes. This "reference database" will be automatically loaded for you with the '''next''' dbUtilities update. Note that it will be recreated every time you start up '''R'''. This means two things: (i) if you break something in the reference database, it's not a problem; (ii) if you store your own data in it, it will be lost. In order to add your own genes, you need to make a working copy for yourself.
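
To give a sense of what such a function does, here is a simplified sketch, assembled from the eUtils steps above. This is not the actual implementation in dbUtilities.R - the name <code>fetchProteinDataSketch()</code> and the exact return value are assumptions for illustration only.

<source lang="R">
# Simplified sketch of a fetchProteinData()-style function.
# It reuses htmlParse() and node2string() from the code above.
fetchProteinDataSketch <- function(refSeqID) {
    base <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

    # 1. Resolve the RefSeq ID to its GI number.
    response <- htmlParse(paste(base, "esearch.fcgi?db=protein",
                                "&term=", refSeqID, sep=""))
    GID <- node2string(response, "id")

    # 2. Fetch the summary record for taxonomy information.
    response <- htmlParse(paste(base, "esummary.fcgi?db=protein",
                                "&id=", GID, "&version=2.0", sep=""))
    taxID    <- node2string(response, "taxid")
    organism <- node2string(response, "organism")

    # 3. Fetch the sequence itself and drop the FASTA header.
    response <- htmlParse(paste(base, "efetch.fcgi?db=protein",
                                "&id=", GID,
                                "&retmode=text&rettype=fasta", sep=""))
    fasta <- node2string(response, "p")
    seq <- paste(unlist(strsplit(fasta, "\\n"))[-1], collapse="")

    # Return one protein's data as a list.
    return(list(refSeqID = refSeqID,
                GID      = GID,
                taxID    = taxID,
                organism = organism,
                sequence = seq))
}
</source>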

====Computer literacy====


;Digression - some musings on computer literacy and code engineering.
It's really useful to get into a consistent habit of giving your files a meaningful name. The name should include something that tells you what the file contains, and something that tells you the date or version. I give versions major and minor numbers, and - knowing how much things always change - I write major version numbers with a leading zero, e.g. <code>04</code>, so that they will be correctly sorted by name in a directory listing. The same goes for dates: always write <code>YYYY-MM-DD</code> to ensure proper sorting.
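
A quick illustration of this convention (the filename stem is a made-up example):

<source lang="R">
# Compose a sortable filename from content, date and a
# zero-padded version number.
version  <- 4
fileName <- paste("proteinDB_",
                  format(Sys.Date(), "%Y-%m-%d"),
                  sprintf("_v%02d", version),
                  ".RData", sep="")
fileName  # e.g. "proteinDB_2015-11-17_v04.RData"
</source>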
  
On the topic of versions: creating the database with its data structures and the functions that operate on them is an ongoing process, and changes in one part of the code may have important consequences for another part. Imagine I made a poor choice of a column name early on: changing that would need to be done in every single function of the code that reads or writes or analyzes data. Once the code reaches a certain level of complexity, organizing it well is just as important as writing it in the first place. In the new update of <code>dbUtilities.R</code>, a database has a <code>$version</code> element, and every function checks that the database version matches the version for which the function was written. Obviously, this also means the developer must provide tools to migrate contents from an older version to a newer version. And since migrating can run into trouble and leave all data in an inconsistent and unfixable state, it's a good time to remind you to back up important data frequently. Of course you will want to save your database once you've done any significant work with it. And you will especially want to save the databases you create for your Term Project. But you should also (and perhaps more importantly) save the script that you use to create the database in the first place. And on that note: when was the last time you made a full backup of your computer's hard-drive? Too long ago? I thought so.
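
A sketch of what such a version guard could look like. This is an assumption about the mechanism, not the actual dbUtilities.R code; <code>checkDBVersion()</code> is a hypothetical name.

<source lang="R">
# Hypothetical version guard: a function that operates on
# the database first confirms that the database version
# matches the version it was written for.
checkDBVersion <- function(db, expected) {
    if (is.null(db$version) || db$version != expected) {
        stop(paste("Database version",
                   ifelse(is.null(db$version), "<none>", db$version),
                   "does not match expected version", expected,
                   "- migrate your data before continuing."))
    }
    invisible(TRUE)
}

# e.g. at the top of every accessor function:
#   checkDBVersion(myDB, "1.0")
</source>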
  
<div class="mw-collapsible mw-collapsed" data-expandtext="Expand" data-collapsetext="Collapse" style="border:#000000 solid 1px; padding: 10px; margin-left:25px; margin-right:25px;">The NCBI BLAST taxonomy report for hits to Mbp1 homologs in genome-sequenced fungi:
+
;Backup your hard-drive now.
  
<source lang="text">
 
Dikarya
 
. saccharomyceta
 
. . Saccharomycetales
 
</source>
 
<div  class="mw-collapsible-content">
 
<source lang="text">
 
. . . Saccharomycetaceae
 
. . . . Saccharomyces
 
. . . . . Saccharomyces cerevisiae
 
. . . . . . Saccharomyces cerevisiae S288c
 
. . . . . . Saccharomyces cerevisiae CEN.PK113-7D
 
. . . . . . Saccharomyces cerevisiae YJM789
 
. . . . . . Saccharomyces cerevisiae RM11-1a
 
. . . . . . Saccharomyces cerevisiae AWRI1631
 
. . . . . . Saccharomyces cerevisiae JAY291
 
. . . . . . Saccharomyces cerevisiae Lalvin QA23
 
. . . . . . Saccharomyces cerevisiae FostersB
 
. . . . . . Saccharomyces cerevisiae AWRI796
 
. . . . . . Saccharomyces cerevisiae VL3
 
. . . . . . Saccharomyces cerevisiae Vin13
 
. . . . . . Saccharomyces cerevisiae EC1118
 
. . . . . . Saccharomyces cerevisiae FostersO
 
. . . . . Saccharomyces cerevisiae x Saccharomyces kudriavzevii VIN7
 
. . . . mitosporic Nakaseomyces
 
. . . . . Candida glabrata
 
. . . . . . Candida glabrata CBS 138
 
. . . . Tetrapisispora phaffii CBS 4417
 
. . . . Kluyveromyces
 
. . . . . Kluyveromyces lactis
 
. . . . . . Kluyveromyces lactis NRRL Y-1140
 
. . . . Naumovozyma
 
. . . . . Naumovozyma dairenensis CBS 421
 
. . . . . Naumovozyma castellii CBS 4309
 
. . . . Zygosaccharomyces
 
. . . . . Zygosaccharomyces rouxii
 
. . . . . . Zygosaccharomyces rouxii CBS 732
 
. . . . Vanderwaltozyma polyspora DSM 70294
 
. . . . Eremothecium
 
. . . . . Eremothecium cymbalariae DBVPG#7215
 
. . . . . Eremothecium gossypii
 
. . . . . . Ashbya gossypii ATCC 10895
 
. . . . . . Ashbya gossypii FDAG1
 
. . . . Torulaspora delbrueckii
 
. . . . Lachancea thermotolerans CBS 6340
 
. . . . Komagataella pastoris
 
. . . . . Komagataella pastoris GS115
 
. . . . . Komagataella pastoris CBS 7435
 
. . . Candida
 
. . . . Candida dubliniensis CD36
 
. . . . Candida albicans
 
. . . . . Candida albicans WO-1
 
. . . . . Candida albicans SC5314
 
. . . Debaryomycetaceae
 
. . . . Scheffersomyces stipitis CBS 6054
 
. . . . Debaryomyces hansenii CBS767
 
. . . Yarrowia lipolytica CLIB122
 
. . leotiomyceta
 
. . . mitosporic Trichocomaceae
 
. . . . Aspergillus
 
. . . . . Aspergillus niger
 
. . . . . . Aspergillus niger CBS 513.88
 
. . . . . . Aspergillus niger ATCC 1015
 
. . . . . Aspergillus fumigatus
 
. . . . . . Aspergillus fumigatus Af293
 
. . . . . . Aspergillus fumigatus A1163
 
. . . . Penicillium chrysogenum Wisconsin 54-1255
 
. . . Sordariomycetidae
 
. . . . Magnaporthe
 
. . . . . Magnaporthe oryzae 70-15
 
. . . . . Magnaporthe grisea
 
. . . . Chaetomiaceae .
 
. . . . . Myceliophthora thermophila ATCC 42464 .
 
. . . . . Thielavia terrestris NRRL 8126
 
. . . Dothideomycetes .
 
. . . . Zymoseptoria tritici IPO323 .
 
. . . . Phaeosphaeria nodorum SN15
 
. . Schizosaccharomyces
 
. . . Schizosaccharomyces pombe
 
. . . . Schizosaccharomyces pombe 972h-
 
. Basidiomycota .
 
. . Ustilago maydis 521 .
 
. . Filobasidiella/Cryptococcus neoformans species complex
 
. . . Cryptococcus neoformans var. neoformans .
 
. . . . Cryptococcus neoformans var. neoformans JEC21 .
 
. . . . Cryptococcus neoformans var. neoformans B-3501A .
 
. . . Cryptococcus gattii WM276 .
 
</source>
 
</div>
 
</div>
 
  
 +
If your last backup at the time of next week's quiz was less than two days ago, you will receive a 0.5 mark bonus.
  
===New Database===

Here is some sample code to work with the new database: enter new protein data for YFO, save it, and load it again when needed.
  
<source lang="R">
# You don't need to load the reference database refDB. If
# everything is set up correctly, it gets loaded at startup.
# (Just so you know: you can turn off that behaviour if you
# ever should want to...)

# First you need to load the newest version of dbUtilities.R

updateDButilities("7bb32ab3d0861ad81bdcb9294f0f6a737b820bf9")

# If you get an error:
#    Error: could not find function "updateDButilities"
# ... then it seems you didn't do the previous update.

# Try getting the update with the new key but the previous function:
# updateDbUtilities()
#
# If that function is not found either, confirm that your ~/.Rprofile
# actually loads dbUtilities.R from your project directory.

# As a desperate last resort, you could uncomment
# the following piece of code and run the update
# without verification...
#
# URL <- "http://steipe.biochemistry.utoronto.ca/abc/images/f/f9/DbUtilities.R"
# download.file(URL, paste(PROJECTDIR, "dbUtilities.R", sep=""), method="auto")
# source(paste(PROJECTDIR, "dbUtilities.R", sep=""))
#
# But be cautious: there is no verification. You yourself need
# to satisfy yourself that this "file from the internet" is what
# it should be, before source()'ing it...


# After the file has been source()'d, refDB exists.
ls(refDB)


# check the contents of refDB:
refDB$protein$name
refDB$taxonomy


# list refSeqIDs for Saccharomyces cerevisiae genes.
refDB$protein[refDB$protein$taxID == 559292, "refSeqID"]


# To add some genes from YFO, I proceed as follows.
# Obviously, you need to adapt this to your YFO
# and the sequences in YFO that you have found
# with your PSI-BLAST search.

# Let's assume my YFO is the fly agaric (Amanita muscaria)
# and its APSES domain proteins have the following IDs
# (these are not RefSeq IDs, btw., and thus unlikely
# to be found in UniProt) ...
# KIL68212
# KIL69256
# KIL65817
#

# First, I create a copy of the database with a name that
# I will recognize to be associated with my YFO.
amamuDB <- refDB


# Then I fetch my protein data ...
tmp1 <- fetchProteinData("KIL68212")
tmp2 <- fetchProteinData("KIL69256")
tmp3 <- fetchProteinData("KIL65817")


# ... and if I am satisfied that it contains what I
# want, I add it to the database.
amamuDB <- addToDB(amamuDB, tmp1)
amamuDB <- addToDB(amamuDB, tmp2)
amamuDB <- addToDB(amamuDB, tmp3)


# Then I make a local backup copy. Note the filename and
# version number  :-)
save(amamuDB, file="amamuDB.01.RData")


# Now I can explore my new database ...
amamuDB$protein[amamuDB$protein$taxID == 946122, "refSeqID"]


# ... but if anything goes wrong, for example
# if I make a mistake in checking which
# rows contain taxID 946122 ...
amamuDB$protein$taxID = 946122

# Ooops ... what did I just do wrong?
#       ... what happened instead?

amamuDB$protein$taxID


# ... I can simply recover from my backup copy:
load("amamuDB.01.RData")
amamuDB$protein$taxID

</source>

&nbsp;

{{task|1=

;Create your own version of the protein database by adding all the genes from YFO that you have discovered with the PSI-BLAST search for the APSES domain. Save it.

}}
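
In case it helps, a minimal sketch of that workflow, following the <code>amamuDB</code> pattern above; the IDs here are placeholders that you need to replace with your own PSI-BLAST hits.

<source lang="R">
# Placeholder IDs - substitute the identifiers you found for YFO.
myYFODB <- refDB                               # working copy of refDB
for (ID in c("XXX00001", "XXX00002")) {        # your PSI-BLAST hits
    myYFODB <- addToDB(myYFODB, fetchProteinData(ID))
}
save(myYFODB, file="myYFODB.01.RData")         # versioned local backup
</source>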
  
  
&nbsp;


;TBC


== Links and resources ==

<!-- {{#pmid: 19957275}} -->
<!-- {{WWW|WWW_GMOD}} -->
<!-- <div class="reference-box">[http://www.ncbi.nlm.nih.gov]</div> -->
 
  
<!--
 
  
Add to this assignment:
 
- study the BLAST output format, links, tools, scores ...
 
- compare the improvement in E-values to standard BLAST
 
- examine this in terms of sensitivity and specificity
 
  
-->
 
  
 
&nbsp;
&nbsp;

{{#lst:BIO_Assignment_Week_1|assignment_footer}}

<table style="width:100%;"><tr>
<td style="height:30px; vertical-align:middle; text-align:left; font-size:80%;">[[BIO_Assignment_Week_5|&lt;&nbsp;Assignment&nbsp;5]]</td>
<td style="height:30px; vertical-align:middle; text-align:right; font-size:80%;">[[BIO_Assignment_Week_7|Assignment&nbsp;7&nbsp;&gt;]]</td>
</tr></table>
Latest revision as of 05:54, 17 November 2015

Assignment for Week 6
Function

< Assignment 5 Assignment 7 >

Note! This assignment is currently inactive. Major and minor unannounced changes may be made at any time.

 
 

Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.




 

Introduction

 

In this assignment we will first download a number of APSES domain containing sequences into our database - and we will automate the process. Then we will annotate them with domain data. First manually, and then again, we will automate this. Next we will extract the APSES domains from our database according to the annotations. And finally we will align them, and visualize domain conservation in the 3D model to study parts of the protein that are conserved.


 

Downloading Protein Data From the Web

In Assignment 3 we created a schema for a local protein sequence collection, and implemented it as an R list. We added sequences to this database by hand, but since the information should be cross-referenced and available based on a protein's RefSeq ID, we should really have a function that automates this process. It is far too easy to make mistakes and enter erroneous information otherwise.


Task:
Work through the following code examples.

# To begin, we load some libraries with functions
# we need...

# httr sends and receives information via the http
# protocol, just like a Web browser.
if (!require(httr, quietly=TRUE)) { 
	install.packages("httr")
	library(httr)
}

# NCBI's eUtils send information in XML format; we
# need to be able to parse XML.
if (!require(XML, quietly=TRUE)) {
	install.packages("XML")
	library(XML)
}

# stringr has a number of useful utility functions
# to work with strings. E.g. a function that
# strips leading and trailing whitespace from
# strings.
if (!require(stringr, quietly=TRUE)) {
	install.packages("stringr")
	library(stringr)
}


# We will walk through the process with the refSeqID
# of yeast Mbp1
refSeqID <- "NP_010227"


# UniProt.
# The UniProt ID mapping service supports a "RESTful
# API": responses can be obtained simply via a Web-
# browsers request. Such requests are commonly sent
# via the GET or POST verbs that a Webserver responds
# to, when a client asks for data. GET requests are 
# visible in the URL of the request; POST requests
# are not directly visible, they are commonly used
# to send the contents of forms, or when transmitting
# larger, complex data items. The UniProt ID mapping
# sevice can accept long lists of IDs, thus using the
# POST mechanism makes sense.

# R has a POST() function as part of the httr package.

# It's very straightforward to use: just define the URL
# of the server and send a list of items as the 
# body of the request.

# uniProt ID mapping service
URL <- "http://www.uniprot.org/mapping/"
response <- POST(URL, 
                 body = list(from = "P_REFSEQ_AC",
                             to = "ACC",
                             format = "tab",
                             query = refSeqID))

response

# If the query is successful, tabbed text is returned.
# and we capture the fourth element as the requested
# mapped ID.
unlist(strsplit(content(response), "\\s+"))

# If the query can't be fulfilled because of a problem
# with the server, a WebPage is rturned. But the server status
# is also returned and we can check the status code. I have
# lately gotten many "503" status codes: Server Not Available...

if (response$status_code == 200) { # 200: oK
	uniProtID <- unlist(strsplit(content(response), "\\s+"))[4]
	if (is.na(uniProtID)) {
	warning(paste("UniProt ID mapping service returned NA.",
	              "Check your RefSeqID."))
	}
} else {
	uniProtID <- NA
	warning(paste("No uniProt ID mapping available:",
	              "server returned status",
	              response$status_code))
}

uniProtID  # Let's see what we got...
           # This should be "P39678"
           # (or NA if the query failed)


Next, we'll retrieve data from the various NCBI databases.

It is has become unreasonably difficult to screenscrape the NCBI site since the actual page contents are dynamically loaded via AJAX. This may be intentional, or just overengineering. While NCBI offers a subset of their data via the eutils API and that works well enough, some of the data that is available to the Web browser's eyes is not served to a program.

The eutils API returns data in XML format. Have a look at the following URL in your browser to see what that looks like:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=protein&term=NP_010227


# In order to parse such data, we need tools from the 
# XML package. 

# First we build a query URL...
eUtilsBase <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


# Then we assemble an URL that will search for get the
# unique, NCBI internal identifier,  the GI number,
# for our refSeqID...
URL <- paste(eUtilsBase,
             "esearch.fcgi?",     # ...using the esearch program
                                  # that finds an entry in an
                                  # NCBI database
             "db=protein",
             "&term=", refSeqID,
             sep="")
# Copy the URL and paste it into your browser to see
# what the response should look like.
URL

# To fetch a response in R, we use the function htmlParse()
# with our URL as its argument.
response <- htmlParse(URL)
response

# This is XML. We can take the response apart into
# its indvidual components with the xmlToList function.

xmlToList(response)

# Note how the XML "tree" is represented as a list of
# lists of lists ...
# If we know exactly what elelement we are looking for,
# we can extract it from this structure:
xmlToList(response)[["body"]][["esearchresult"]][["idlist"]][["id"]]

# But this is not very robus, it would break with the
# slightest change that the NCBI makes to their response
# and the NCBI changes things A LOT!

# Somewhat more robust is to specify the type of element
# we want - its the text contained in an <id>...</id>
# elelement, and use the XPath XML parsing language to
# retrieve it.

# getNodeSet() lets us fetch tagged contents by 
# applying toString.XMLNode() to it...

node <- getNodeSet(response, "//id/text()")
unlist(lapply(node, toString.XMLNode))  # "6320147 "

# We will be doing this a lot, so we write a function
# for it...
node2string <- function(doc, tag) {
    # an extractor function for the contents of elements
    # between given tags in an XML response.
    # Contents of all matching elements is returned in
    # a vector of strings.
	path <- paste("//", tag, "/text()", sep="")
	nodes <- getNodeSet(doc, path)
	return(unlist(lapply(nodes, toString.XMLNode)))
}

# using node2string() ...
GID <- node2string(response, "id")
GID

# The GI is the pivot for all our data requests at the
# NCBI. 

# Let's first get the associated data for this GI
URL <- paste(eUtilsBase,
             "esummary.fcgi?",
             "db=protein",
             "&id=",
             GID,
             "&version=2.0",
             sep="")
response <- htmlParse(URL)
URL
response

taxID <- node2string(response, "taxid")
organism <- node2string(response, "organism")
taxID
organism


# Next, fetch the actual sequence
URL <- paste(eUtilsBase,
             "efetch.fcgi?",
             "db=protein",
             "&id=",
             GID,
             "&retmode=text&rettype=fasta",
             sep="")
response <- htmlParse(URL)
URL
response

fasta <- node2string(response, "p")
fasta

seq <- unlist(strsplit(fasta, "\\n"))[-1] # Drop the first elelment,
                                          # it is the FASTA header.
seq


# Next, fetch the cross-reference to the NCBI Gene
# database
URL <- paste(eUtilsBase,
             "elink.fcgi?",
             "dbfrom=protein",
             "&db=gene",
             "&id=",
             GID,
             sep="")
response <- htmlParse(URL)
URL
response

geneID <- node2string(response, "linksetdb/id")
geneID

# ... and the actual Gene record:
URL <- paste(eUtilsBase,
             "esummary.fcgi?",
             "db=gene",
             "&id=",
             geneID,
             sep="")
response <- htmlParse(URL)
URL
response

name <- node2string(response, "name")
genome_xref <- node2string(response, "chraccver")
genome_from <- node2string(response, "chrstart")[1]
genome_to <- node2string(response, "chrstop")[1]
name
genome_xref
genome_from
genome_to

# So far so good. But since we need to do this a lot,
# let's roll all of these steps into a single function.

# I have added the function to the dbUtilities code
# so you can update it easily.

# Run:

updateDbUtilities("55ca561e2944af6e9ce5cf2a558d0a3c588ea9af")

# If that is successful, try these three test cases:

myNewDB <- createDB()
tmp <- fetchProteinData("NP_010227") # Mbp1p
tmp
myNewDB <- addToDB(myNewDB, tmp)
myNewDB

tmp <- fetchProteinData("NP_011036") # Swi4p
tmp
myNewDB <- addToDB(myNewDB, tmp)
myNewDB

tmp <- fetchProteinData("NP_012881") # Phd1p
tmp
myNewDB <- addToDB(myNewDB, tmp)
myNewDB


This new fetchProteinData() function seems to be quite convenient. I have compiled a set of APSES domain proteins for ten fungal species and loaded the data for all 48 proteins into an R database in a few minutes. This "reference database" will be automatically loaded for you with the next dbUtilities update. Note that it will be recreated every time you start up R. This means two things: (i) if you break something in the reference database, it's not a problem; (ii) if you store your own data in it, it will be lost. In order to add your own genes, you need to make a working copy for yourself.
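Making that working copy is a single assignment - a minimal sketch; the same idiom is used for amamuDB below:

myDB <- refDB    # work on a copy: refDB itself is recreated
                 # at every startup, so changes to it are lost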


==Computer literacy==

Digression - some musings on computer literacy and code engineering.

It's really useful to get into the habit of giving your files meaningful names. The name should include something that tells you what the file contains, and something that tells you the date or version. I give versions major and minor numbers, and - knowing how much things always change - I write major version numbers with a leading zero, e.g. 04, so that they will be sorted correctly by name in a directory listing. The same goes for dates: always write YYYY-MM-DD to ensure proper sorting.
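For example - a minimal sketch; the file name and version number here are made up for illustration:

version <- 4
fileName <- sprintf("APSESproteins.%02d.%s.RData",
                    version,
                    format(Sys.Date(), "%Y-%m-%d"))
fileName   # e.g. "APSESproteins.04.2015-11-10.RData"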

On the topic of versions: creating the database with its data structures and the functions that operate on them is an ongoing process, and changes in one part of the code may have important consequences for another part. Imagine I made a poor choice of a column name early on: changing that would need to be done in every single function that reads, writes or analyzes the data. Once the code reaches a certain level of complexity, organizing it well is just as important as writing it in the first place. In the new update of dbUtilities.R, a database has a $version element, and every function checks that the database version matches the version for which the function was written. Obviously, this also means the developer must provide tools to migrate contents from an older version to a newer one.

And since migrating can run into trouble and leave all data in an inconsistent and unfixable state, this is a good time to remind you to back up important data frequently. Of course you will want to save your database once you've done any significant work with it. And you will especially want to save the databases you create for your Term Project. But you should also (and perhaps more importantly) save the script that you use to create the database in the first place. And on that note: when was the last time you made a full backup of your computer's hard drive? Too long ago? I thought so.
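To make the idea concrete, here is a hypothetical sketch of such a version guard; checkDBversion() is my own illustration, not necessarily how dbUtilities.R implements the check:

checkDBversion <- function(db, expected) {
    # Stop with an informative message if the database was
    # created for a different version of the code.
    if (is.null(db$version) || db$version != expected) {
        stop("Database version mismatch: expected \"", expected,
             "\" but found \"", db$version, "\".")
    }
}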

Back up your hard drive now.


If your last backup at the time of next week's quiz was less than two days ago, you will receive a 0.5 mark bonus.


==New Database==

Here is some sample code to work with the new database: enter new protein data for YFO, save it, and load it again when needed.


# You don't need to load the reference database refDB. If
# everything is set up correctly, it gets loaded at startup.
# (Just so you know: you can turn off that behaviour if you
# ever should want to...)


# First you need to load the newest version of dbUtilities.R

updateDButilities("7bb32ab3d0861ad81bdcb9294f0f6a737b820bf9")

# If you get an error: 
#    Error: could not find function "updateDButilities"
# ... then it seems you didn't do the previous update.

# Try getting the update with the new key but the previous function:
# updateDbUtilities()
#
# If that function is not found either, confirm that your ~/.Rprofile
# actually loads dbUtilites.R from your project directory. 

# As a desperate last resort, you could uncomment
# the following piece of code and run the update
# without verification...
#
# URL <- "http://steipe.biochemistry.utoronto.ca/abc/images/f/f9/DbUtilities.R"
# download.file(URL, paste(PROJECTDIR, "dbUtilities.R", sep=""), method="auto")
# source(paste(PROJECTDIR, "dbUtilities.R", sep=""))
#
# But be cautious: there is no verification. You need to
# satisfy yourself that this "file from the internet" is what
# it should be, before source()'ing it...


# After the file has been source()'d,  refDB exists.
ls(refDB)


# check the contents of refDB:
refDB$protein$name
refDB$taxonomy


# List the refSeqIDs of the Saccharomyces cerevisiae genes.
refDB$protein[refDB$protein$taxID == 559292, "refSeqID"]


# To add some genes from YFO, I proceed as follows.
# Obviously, you need to adapt this to your YFO
# and the sequences in YFO that you have found
# with your PSI-BLAST search.

# Let's assume my YFO is the fly agaric (Amanita muscaria)
# and its APSES domain proteins have the following IDs
# (these are not RefSeq IDs, by the way, and thus unlikely
# to be found in UniProt) ...
# KIL68212
# KIL69256
# KIL65817
#


# First, I create a copy of the database with a name that
# I will recognize to be associated with my YFO.
amamuDB <- refDB


# Then I fetch my protein data ...
tmp1 <- fetchProteinData("KIL68212")
tmp2 <- fetchProteinData("KIL69256")
tmp3 <- fetchProteinData("KIL65817")


# ... and if I am satisfied that it contains what I
# want, I add it to the database.
amamuDB <- addToDB(amamuDB, tmp1)
amamuDB <- addToDB(amamuDB, tmp2)
amamuDB <- addToDB(amamuDB, tmp3)


# Then I make a local backup copy. Note the filename and
# version number  :-)
save(amamuDB, file="amamuDB.01.RData")
 

# Now I can explore my new database ...
amamuDB$protein[amamuDB$protein$taxID == 946122, "refSeqID"]


# ... but if anything goes wrong, for example 
# if I make a mistake in checking which
# rows contain taxID 946122 ... 
amamuDB$protein$taxID = 946122

# Oops ... what did I just do wrong?
#      ... what happened instead?

amamuDB$protein$taxID


# ... I can simply recover from my backup copy:
load("amamuDB.01.RData")    
amamuDB$protein$taxID


 

Task:

Create your own version of the protein database by adding all the genes from YFO that you have discovered with the PSI-BLAST search for the APSES domain. Save it.
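A minimal sketch of that workflow, assuming the IDs of your PSI-BLAST hits are collected in a character vector; the IDs shown are hypothetical placeholders:

# Replace these placeholders with the IDs you actually found:
myIDs <- c("XXX00001", "XXX00002", "XXX00003")

myDB <- refDB                       # start from a copy of refDB
for (ID in myIDs) {
    tmp <- fetchProteinData(ID)     # fetch ...
    print(tmp)                      # ... inspect ...
    myDB <- addToDB(myDB, tmp)      # ... and add
}

save(myDB, file="myDB.01.RData")    # versioned backup copy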


 


TBC


 



 



==Ask if things don't work for you!==

If anything about the assignment is not clear to you, please ask on the mailing list. You can be certain that others will have had similar problems. Success comes from joining the conversation.


