Difference between revisions of "BIO Assignment Week 3"

From "A B C"
Jump to navigation Jump to search
m
 
(54 intermediate revisions by the same user not shown)
Line 4: Line 4:
 
<span style="font-size: 70%">Sequence Analysis</span>
 
<span style="font-size: 70%">Sequence Analysis</span>
 
</div>
 
</div>
 +
<table style="width:100%;"><tr>
 +
<td style="height:30px; vertical-align:middle; text-align:left; font-size:80%;">[[BIO_Assignment_Week_2|&lt;&nbsp;Assignment&nbsp;2]]</td>
 +
<td style="height:30px; vertical-align:middle; text-align:right; font-size:80%;">[[BIO_Assignment_Week_4|Assignment&nbsp;4&nbsp;&gt;]]</td>
 +
</tr></table>
  
 
{{Template:Inactive}}
 
{{Template:Inactive}}
Line 16: Line 20:
 
==Introduction==
 
==Introduction==
  
 +
==Model organism and YFO==
  
In this assignment we will
+
In this assignment we will set out on our exploration of the system that regulates the G1/S transition by focussing first on the Mbp1 protein in selected species from the {{WP|Kingdom_(biology)|kingdom}} of fungi, whose genome has been completely sequenced; our quest is thus also an exercise in ''model-organism reasoning'': the transfer of knowledge from one, well-studied organism to others.  It's reasonable to hypothesize that such central control machinery is conserved in most if not all fungi. But we don't know. Many of the species that we will be working with have not been characterized in great detail, and some of them are new to our class this year. And while we know a fair bit about Mbp1, we probably don't know very much at all about the related genes in other organisms: whether they exist, whether they have similar functional features and whether they might contribute to the ''G1/S checkpoint system'' in a similar way. Thus we might discover things that are new and interesting. This is a quest of discovery.
* use the sequence search program BLAST to retrieve a sequence similar to yeast Mbp1 in YFO.
+
 
* use a number of tools to annotate the sequence.
+
{{Vspace}}
 +
 
 +
==The BCH441 RStudio Project==
 +
 
 +
I have begun to unify '''R'''-scripts and other resources in an RStudio project that will make it easy to update and distribute code. Conveniently, if I push update material to the github repository, all you need to do is to ''pull'' the updated project to receive all updates and new files on your computer. Version control is really good for this. However, there is one limitation to this approach that you need to be aware of. If you create your own, local files and then '''commit''' them, ''git'' will complain that it would be overwriting such local material. As long as you don't '''commit''' your files then all should be fine. This means you'll need to do your own "versioning" by saving your own scripts under a different name from time to time. Once again: in this context:
 +
*'''saving''' your own files is fine;
 +
*'''committing''' your own files to version control will cause problems;
 +
* changes you make to course material files and save '''under the same filename''' (like adding comments and notes) will not persist, these changes will be overwritten with the next update. You need to "Save As..." with a new filename (e.g. just prefix the original name with "<code>my</code>").
 +
 
 +
{{Vspace}}
 +
 
 +
{{task|1=
 +
* Open RStudio and create a '''New Project...''' cloned from a git version control directory. The repository URL is <code><nowiki>https://github.com/hyginn/BCH441_2016</nowiki></code>. Create this in the same way as you did for the [[R_tutorial#Git_Version_control|'''R'''-tutorial]].
 +
* As requested on the console, type <code>init()</code>. This will create a file called <code>.myProfile.R</code> and ask you for your UofT eMail address and Student ID. You need to enter the correct values because other scripts will make assumptions about that these variables exist and are valid.
 +
* Work through the task: <code>"local script"</code> in the <code>BCH441_2016.R</code> script.
 +
* Then load the script for the current assignment as instructed.
 +
 
 +
}}
 +
 
 +
{{Vspace}}
 +
 
 +
==Choosing YFO (Your Favourite Organism)==
 +
 
 +
Since we were trying to find related proteins in a different species, our next task is to find suitable species.
 +
 
 +
For this purpose we create a lottery to assign species at (pseudo) random to students, so that everyone has a good chance to be working on their own species. The technical details are in the R scripts that implement the species search and distribution. In brief, we define a function that picks one species from a long list at random - but to make sure this process is reproducible, we'll set a seed for the random number generator. Obviously, everyone has to use a different seed, or else everyone would end up getting the same species assigned. Thus we'll use your Student Number as the seed. This is an integer, so it can be used as an argument to '''R''''s <code>set.seed()</code> function, and it's unique to each of you. The choice is then random, reproducible and unique.
 +
 
 +
<small>You may notice that this process does not guarantee that everyone gets a different species, and that all species are chosen at least once. I don't think doing that is possible in a "stateless" way (i.e. I don't want to have to remember who chose what species), given that I don't know all of your student numbers. But if anyone can think of a better solution, that would be neat.</small>
 +
 
 +
Is it possible that all of you end up working on the same species anyway? Indeed. That's the problem with randomness. But it is not very likely.
  
&nbsp;
 
  
==Retrieve==
+
What about the "suitable species" though? Where do they come from? For the purposes of the course "quest", we need species
 +
* that actually have transcription factors that are related to Mbp1;
 +
* whose genomes have been sequenced; and
 +
* for which the sequences have been deposited in the RefSeq database, NCBI's unique sequence collection.
  
  
In [[BIO_Assignment_Week_2#Protein|Assignment 2]] you looked at sequences in YFO that are [http://www.ncbi.nlm.nih.gov/protein?LinkName=protein_protein&from_uid=6320147 related to yeast Mbp1], by following a link from the RefSeq record. I mentioned that there are more principled ways to find related proteins: that principle is to search for similar sequences. Exactly how this works will be the subject of later lectures, but the tool that is most commonly used for this task is called '''BLAST''' (Basic Local Alignment And Search Tool). The task of this assignment is to perform a number of sequence annotations to the sequence from YFO that is '''most similar''' to Mbp1, or, more precisely, that contains an APSES domain that is most similar<ref>As you will see later on in the assignment, Mbp1-related proteins contain "Ankyrin" domains, a very widely distributed protein-protein interaction motif that may give rise to false-positive similarities for full-length sequence searches. Therefore, we search only with the DNA binding domain sequence, since this is the functionality that best characterizes the "function" of the protein we are interested in.</ref>.
+
{{task|
 +
 
 +
* Work through PART ONE: YFO Species of the BCH441_A03.R script.
 +
* There are some deep "rabbitholes" that you are encouraged to follow to explore the code that went into generating the YFO species list. The minimal required result however is that you have picked an '''YFO''', and that it's name got written into your personalized profile file.
 +
* Note down the species name and its five letter label on your Student Wiki user page. '''Use this species whenever this or future assignments refer to YFO'''.
 +
}}
  
&nbsp;
 
===Search input===
 
  
 +
{{Vspace}}
  
First, we need to '''define the sequence''' we will search with, as the search input. I have highlighted the extent of the APSES domain sequence in the previous assignment, but when you explored the corresponding structure in VMD, you have seen that the structured protein domain is larger and the additional secondary structure elements are in fact well integrated into the overall domain. This is not surprising: canonical domain definitions are compiled from many species and examples, and they generally comprise only the common core. Looking up the source of the domain annotations for Mbp1 is very easy:
+
===Selecting the YFO "Mbp1"===
  
 +
{{Vspace}}
  
 
{{task|1=
 
{{task|1=
<ol>
 
<li> Access the [http://www.ncbi.nlm.nih.gov/protein/NP_010227 RefSeq record for yeast Mbp1].</li>
 
<li>  Look for the link '''Related Information''' &rarr; '''CDD Search Results''' and  follow it.</li>
 
</ol>
 
  
 +
# Back at the [http://www.ncbi.nlm.nih.gov/protein/NP_010227 Mbp1 protein page] follow the link to [http://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE=Proteins&PROGRAM=blastp&BLAST_PROGRAMS=blastp&QUERY=NP_010227.1&LINK_LOC=protein&PAGE_TYPE=BlastSearch Run BLAST...] under "Analyze this sequence".
 +
# This allows you to perform a sequence similarity search. You need to set two parameters:
 +
## As '''Database''', select '''Reference proteins (refseq_protein)''' from the drop down menu;
 +
## In the '''Organism''' field, type the species you have selected as YFO and select the corresponding taxonomy ID.
 +
# Click on '''Run BLAST''' to start the search. This should find a handful of genes, all of them in YFO. If you find none, or hundreds, or they are not all in the same species, you did something wrong. Ask on the mailing list and make sure to fix the problem.
 +
# Look at the top "hit" in the '''Descriptions''' section. The rightmost column contains sequence IDs unter the '''Accession''' heading. The alignment and alignment score are shown in the '''Alignments''' section a bit further down the page. Look at the result.
 +
# In the header information for each hit is a link to its database entry, right next to '''Sequence ID'''.  It says something like <code>ref&#124;NP_123456789.1</code> or <code>ref&#124;XP_123456789</code> ... follow that link.
 +
# Note the RefSeq ID, and the search results %ID, E-value, whether one or more similar regions were found etc. in your Journal. We will refer to this sequence as "''YFO'' Mbp1" or similar in the future.
 +
# Finally access the [http://www.uniprot.org/uploadlists/ UniProt ID mapping service] to retrieve the UniProt ID for the protein. Paste the RefSeq ID and choose '''RefSeq Protein''' as the '''From:''' option and '''UniProtKB''' as the '''To:''' option.
  
This is a domain annotation: CDD is the NCBI's '''C'''onserved '''D'''omain '''D'''atabase and the annotation was done by a tool that scanned the sequence of Mbp1 for segments that are similar to any of the domain definitions stored in the CDD. We will return to CDD in the next assignment.
+
:<small>If the mapping works, the UniProt ID will be in the '''Entry:''' column of the table that is being returned. Click the link and have a look at the UniProt entry page while you're there.</small>
  
 +
<!-- What could go wrong? Sometimes the mapping does not work:
 +
I don't know why the mapping for some sequences is not available.
 +
If this happens, you can work around the problem as follows.
  
<ol start="3">
+
1. Load the RefSeq protein page
<li>Click on the blue box labeled Kila-N in the graph to access the CDD entry for this domain.</li>
+
2. View the protein as FASTA and copy the sequence.
<li>Read the abstract. You should understand the relationship between Kila-N and APSES domains. One is a subfamily of the other.</li>
+
3. Open the UniProt BLAST page http://www.uniprot.org/blast/
<li>Confirm that the domain definition &ndash; as applied to the Mbp1 sequence (which is labeled as "query") &ndash; corresponds to the region we highlighted in the last assignment.</li>
+
  (Yes, UniProt runs its own BLAST version, and that searches UniProt databases, not Genbank)
</ol>
+
4. Paste the sequence into the search form and run BLAST.
  
What precisely constitutes an APSES domain however is a matter of definition, as you can explore in the following (optional) task.
+
... if the sequence is in UniProt, you will get the top hit with 100% sequence identity. In your case it is:
 +
  H1VQK3  ( http://www.uniprot.org/uniprot/H1VQK3 )
  
 +
I.e. UniProt contains the sequence, but the mapping service does not know.
 +
-->
  
<div class="mw-collapsible mw-collapsed" data-expandtext="Expand" data-collapsetext="Collapse" style="border:#000000 solid 1px; padding: 10px; margin-left:25px; margin-right:25px;">Optional: Load the structure in VMD, like you did in the last assignment and switch on stereo viewing ... (more) <div  class="mw-collapsible-content">
 
<ol start="6">
 
<li>In the '''Graphics''' &rarr; '''Representations''' window, select <code>NewCartoon</code> as the '''Drawing Method'''.
 
<li>Access the '''Interpro''' information page for Mbp1 at the EBI: http://www.ebi.ac.uk/interpro/protein/P39678
 
<li>In the section '''Domains and repeats''', mouse over the red annotations and note down the residue numbers for the annotated domains. Also follow the links to the respective Interpro domain definition pages.
 
</ol>
 
  
At this point we have definitions for the following regions on the Mbp1 protein ...
 
*The range of structural coordinates, i.e. the "implicit" sequence - from the '''Sequence Viewer''' tool of VMD;
 
*The KilA-N (pfam 04383) domain definition as applied to the Mbp1 protein sequence by CDD;
 
*The InterPro ''KilA, N-terminal/APSES-type HTH, DNA-binding (IPR018004)'' definition annotated on the Mbp1 sequence;
 
*The InterPro ''Transcription regulator HTH, APSES-type DNA-binding domain (IPR003163)'' definition annotated on the Mbp1 sequence;
 
*<small>(... in addition &ndash; without following the source here &ndash; the UniProt record for Mbp1 annotates a "HTH APSES-type" domain from residues 5-111)</small>
 
  
... each with its distinct and partially overlapping sequence range. Back to VMD:
+
}}
  
<!-- Fall 2013:
+
{{Vspace}}
1MB1: 3-100
+
 
2BM8: 4-102
+
==Store Data==
CDD KilA-N: 19-93
+
 
InterPro KilA-N: 23-88
+
{{Vspace}}
InterPro APSES: 3-133
+
 
Uniprot HTH/APSES: 5-111
+
===The ''System'' datamodel===
-->
+
 
 +
{{Vspace}}
 +
 
 +
[[Image:SystemDB_Datamodel.png|frame|Entity-Relationship Diagram (ERD) for a data model that stores data for a systems project. Entity names are at the top of each box, attributes are listed below. If you think of the entity as a table, its attributes are the column names and each row stores the values for one particular instance. Semantically related entities are shaded in similar colours; this helps to make the design-principles visible, but one must be careful not to overdo the use of colour. As with all graphical elements in information design: "less is more".  All relationships link to unique primary keys and are thus 1 to (0, n): i.e. as attributes, the relationship does not have to exist but there could be many, as the primary key, exactly one key must exist. The diagram was drawn as a "Google presentation" slide and you can [https://docs.google.com/presentation/d/1_nYWiwQc-9Z4anUAwOeVqWXgXIvM1Zl_AffMTM_No2Q/edit?usp=sharing view it on the Web] and make a copy for your own purposes.]]
 +
 
 +
Here is a first version of a systems data model, based on what we discussed in class:
 +
* The '''<code>feature</code> table''' is at the centre. This was not intentional, but emerged from iterating the model through a number of revisions. It emphasizes that the main purpose of this model is to integrate and annotate data from various sources. ''Feature'' in the way we understand it here is an abstraction of quite disparate categories of information items. This includes domain annotations, system functions, literature references, and cross-references to other databases. The ''type'' attribute will require some thought: this attribute really needs a "controlled vocabulary" to ensure that the same category is described consistently with the same string ("PubMed"? "Lit."? "reference"?). There are a number of ways to achieve this, the best way<ref>Relational databases like MySQL, PostgresQL, and MariaDB offer the datatype "Enum" for this purpose but this is an inferior solution. Enum types need to be fixed at the time the schema is created, they can't store information about their semantics, i.e. how the keywords are defined and when they should be used, and they can't be used in more than one table, since they are metadata of one particular column.</ref> is to store these types in their own "reference" table '''<code>type</code>''' - and link to that table via a ''foreign key''.
 +
* The '''<code>protein</code> table''' is at the centre. Its ''primary key'' is a unique integer. We store the NCBI RefSeq ID and the Uniprot ID in the table. We would not call these "foreign keys", since the information they reference is not in our schema but at the NCBI resp. EBI. For example, we can't guarantee that they are unique keys.
 +
* The '''<code>taxonomy</code> table''' holds information about species. We use the NCBI taxonomy ID as its ''primary key''. The same key appears in the protein table as the ''foreign key'' that links the protein with its proper taxonomy information. This is an instance where the data model actually does not describe reality well. The problem is that particular proteins that we might retrieve from database searches will often be annotated to a specific ''strain'' of a ''species'' – and there is no easy way to reference strains to species. We'll see whether this turns out to be a problem in practice. But it may be that an additional table may be required that stores parent/child relationships of the taxonomic tree of life.
 +
* The '''<code>protein_feature</code> table''' links a  protein with all the features that we annotate it with. ''start'' and ''end'' coordinates identify the region of the sequence we have annotated.
 +
* The '''<code>...Annotation</code> tables''' link feature-information with annotated entities.
 +
* Should the '''<code>system</code> table''' have its own taxonomy attribute? Or should the species in which the system is observed be inferred from the component protein's '''<code>taxonomy.ID</code>'''? What do you think? I decided not to add a taxonomy attribute to that table. How would you argue for or against this decision?
 +
* The '''<code>component</code> table''' links proteins that collaborate together as a system. There is an implicit assumption in this model that only proteins are system components (and not e.g. RNA), and that components are atomic (i.e. we can't link to subsystems). How would you change the model to accommodate more realistic biological complexity?
 +
 
 +
{{Vspace}}
 +
 
 +
===Implementing the Data Model in R===
  
<ol start="9">
+
{{Vspace}}
<li>In the '''Graphical Representations''' window, restrict the selection to the Interpro KilA-N annotation range and colour this fragment red.
 
<li>Click on '''Create Rep''', select the residue range(s) by which the CDD KilA-N definition is larger, and colour that fragment orange.
 
<li>Click on '''Create Rep''', select the residue range(s) by which the InterPro APSES domain definition is larger, and colour that fragment yellow.
 
<li>If the structure contains residues outside these ranges, colour them white.
 
<li>Study this in a side-by-side stereo view and get a sense for how the ''extra'' sequence beyond the Kil-A N domain(s) is part of the structure, and how the integrity of the folded structure would be affected if these fragments were missing.
 
<li>Orient this stereo-image for a clear view of the domain architecture in question and scale it to fill the available window well. Then render it.
 
<li>Convert the resulting image to jpeg format, resize it to be no larger than 600px across and upload it to your Lab notebook on the Wiki.
 
<li>Also add the category tag <code><nowiki>[[Category:BCH441_2013 Notebook]]</nowiki></code> to your page.
 
<li>When you are done, congratulate yourself on having earned a bonus of 10% on the next quiz.
 
</ol>
 
  
</div>
+
To actually implement the data model in '''R''' we will create the tables as data frames, and we will collect them in a list. We don't '''have''' to keep the tables in a list - we could also keep them as independent objects, but with a table named "protein" we should be worried of inadvertently overwriting the table. A list is neater, more flexible (we might want to have several of these), it reflects our intent about the model better, and doesn't require very much more typing. For now, to keep things simple, we will implement only two tables: '''<code>protein</code>''' and '''<code>taxonomy</code>'''. We'll add the rest when we actually need them.
</div>
 
  
 +
{{task|1 =
  
There is a rather important lesson in this: domain definitions may be fluid, and their boundaries may be computationally derived from sequence comparisons across many families, and do not necessarily correspond to individual structures. Make sure you understand this well.
+
* study the code in the <code>Creating two tables</code> section of the '''R''' script
 
}}
 
}}
  
 +
{{Vspace}}
  
Given this, it seems appropriate to search the sequence database with the sequence of an Mbp1 structure&ndash;this being a structured, stable, subdomain of the whole that presumably contains the protein's most unique and specific function. Let us retrieve this sequence. All PDB structures have their sequences stored in the NCBI protein database. They can be accessed simply via the PDB-ID, which serves as an identifier both for the NCBI and the PDB databases. However there is a small catch (isn't there always?). PDB files can contain more than one protein, e.g. if the crystal structure contains a complex<ref>Think of the [http://www.pdb.org/pdb/101/motm.do?momID=121 ribosome] or [http://www.pdb.org/pdb/101/motm.do?momID=3 DNA-polymerase] as extreme examples.</ref>. Each of the individual proteins gets a so-called '''chain ID'''&ndash;a one letter identifier&ndash; to identify them uniquely. To find their unique sequence in the database, you need to know the PDB ID as well as the chain ID. If the file contains only a single protein (as in our case), the chain ID is always '''<code>A</code>'''<ref>Otherwise, you need to study the PDB Web page for the structure, or the text in the PDB file itself, to identify which part of the complex is labeled with which chain ID. For example, immunoglobulin structures some time label the ''light-'' and ''heavy chain'' fragments as "L" and "H", and sometimes as "A" and "B"&ndash;there are no fixed rules. You can also load the structure in VMD, color "by chain" and use the mouse to click on residues in each chain to identify it.</ref>. make sure you understand the concept of protein chains, and chain IDs.
+
As you see we can edit any component  of our data model directly by assigning new values to the element. But in general that's a really bad idea, since it is far too easy to bring the whole model into an inconsistent state. It's much better to write functions that ''get'' and ''set'' data elements, not only to keep the data consistent, but also because it is much easier, if the model ever changes, to simply edit the function code, rather than having to repeat every single data entry by hand.
 +
 
 +
What would an <code>setData</code> function have to look like? It should
 +
*create a new entry if the requested row of a table does not exist yet;
 +
*update data if the protein exists;
 +
*perform consistency checks (i.e. check that the data has the correct type);
 +
*perform sanity checks (i.e. check that data values fall into the expected range);
 +
*perform completeness checks (i.e. handle  incomplete data)
 +
 
 +
Let's start simple, and create a '''set-''' function to
 +
add new values to existing sequence data. Also, for clarity, we'll forgo many checks. The first thing we should do is to add the actual sequence.
 +
 
 +
We only entered a placeholder for the sequence field.
 +
Sequences come in many different flavours when we copy them from a Webpage:
 +
there can be whitespace, carriage returns, numbers, they can be upper-case, lower-case
 +
mixed-case ...
 +
 
 +
What we want in our sequence data field is one string
 +
that contains the entire sequence, and nothing but
 +
upper-case, amino-acid code letters.
 +
 
 +
We'll need to look at how we work with strings in '''R''', how we identify and work with patterns in strings. This is a great time to introduce regular expressions.
 +
 
 +
{{Vspace}}
 +
 
 +
====A Brief First Encounter of Regular Expressions====
 +
 
 +
{{Vspace}}
 +
 
 +
;Regular expressions are a concise description language to define patterns for pattern-matching in strings.
 +
 
 +
Truth be told, many programmers have a love-hate relationship with regular expressions. The syntax of regular expressions is very powerful and expressive, but also terse, not always intuitive, and sometimes hard to understand. I'll introduce you to a few principles here that are quite straightforward and they will probably cover 99% of the cases you will encounter.
 +
 
 +
Here is our test-case: the sequence of Mbp1, copied from the [https://www.ncbi.nlm.nih.gov/protein/NP_010227 NCBI Protein database page for yeast Mbp1].
 +
 
 +
        1 msnqiysary sgvdvyefih stgsimkrkk ddwvnathil kaanfakakr trilekevlk
 +
      61 ethekvqggf gkyqgtwvpl niakqlaekf svydqlkplf dftqtdgsas pppapkhhha
 +
      121 skvdrkkair sastsaimet krnnkkaeen qfqsskilgn ptaaprkrgr pvgstrgsrr
 +
      181 klgvnlqrsq sdmgfprpai pnssisttql psirstmgpq sptlgileee rhdsrqqqpq
 +
      241 qnnsaqfkei dledglssdv epsqqlqqvf nqntgfvpqq qssliqtqqt esmatsvsss
 +
      301 pslptspgdf adsnpfeerf pgggtspiis miprypvtsr pqtsdindkv nkylsklvdy
 +
      361 fisnemksnk slpqvllhpp phsapyidap idpelhtafh wacsmgnlpi aealyeagts
 +
      421 irstnsqgqt plmrsslfhn sytrrtfpri fqllhetvfd idsqsqtvih hivkrksttp
 +
      481 savyyldvvl skikdfspqy rielllntqd kngdtalhia skngdvvffn tlvkmgaltt
 +
      541 isnkegltan eimnqqyeqm miqngtnqhv nssntdlnih vntnnietkn dvnsmvimsp
 +
      601 vspsdyityp sqiatnisrn ipnvvnsmkq masiyndlhe qhdneikslq ktlksisktk
 +
      661 iqvslktlev lkesskdeng eaqtnddfei lsrlqeqntk klrkrliryk rlikqkleyr
 +
      721 qtvllnklie detqattnnt vekdnntler lelaqeltml qlqrknklss lvkkfednak
 +
      781 ihkyrriire gtemnieevd ssldvilqtl iannnknkga eqiitisnan sha
 +
//
  
  
 
{{task|1=
 
{{task|1=
<ol>
+
 
<li> Back at the [http://www.ncbi.nlm.nih.gov/protein/NP_010227 RefSeq record for yeast Mbp1], enter the '''PDB-ID''', an underscore, and the '''chain ID''' for one of the crystal structures into the search field. You can use <code>1MB1_A</code> or <code>1BM8_A</code>, but don't use <code>1L3G</code>: this NMR structure includes a large stretch of unstructured residues.</li>
+
Navigate to http://regexpal.com and paste the sequence into the '''lower''' box. This site is one of a number of online regular expression testers; their immediate, visual feedback is invaluable when you are developing regular expression patterns.
<li> Click on '''Display settings''' and choose '''FASTA (text)'''. You should get something like:
+
 
<source lang="text">
+
Lets try some expressions:
>gi|157830387|pdb|1BM8|A Chain A, Dna-Binding Domain Of Mbp1
+
 
QIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGFGKY
+
;Most characters are matched literally.
QGTWVPLNIAKQLAEKFSVYDQLKPLFDF
+
:Type "<code>a</code>" in to the '''upper''' box and you will see all "<code>a</code>" characters matched. Then replace <code>a</code> with <code>q</code>.
</source></li>
+
: Now type "<code>aa</code>" instead. Then <code>krnnkk</code>. ''Sequences'' of characters are also matched literally.
<li> Save this sequence in your notebook, in case we need it later.</li>
+
 
</ol>
+
;The pipe character {{pipe}} that symbolizes logical OR can be used to define that more than one character should match:
 +
:<code>i(s{{pipe}}m{{pipe}}q)n</code> matches <code>isn</code> OR <code>imn</code> OR <code>iqn</code>. Note how we can group with parentheses, and try what would happen without them.
 +
 
 +
;We can more conveniently specify more than one character to match if we place it in square brackets.
 +
:<code>[lq]</code> matches <code>l</code> OR <code>q</code>. <code>[familyvw]</code> matches hydrophobic amino acids.
 +
 
 +
;Within square brackets, we can specify "ranges".
 +
:<code>[1-5]</code> matches digits from 1 to 5.
 +
 
 +
;Within square brackets, we can specify characters that should NOT be matched, with the caret, <code>^</code>.
 +
:<code>[^0-9]</code> matches everything EXCEPT digits. <code>[^a-z]</code> matches everything that is not a lower-case letter. That's what we need (try it).
 +
 
 +
}}
 +
 
 +
One of the '''R''' functions that uses regular expressions is the function <code>gsub()</code>. It replaces characters that match a "regex" with other characters. That is useful for our purpose: we can
 +
#match all characters that are NOT a letter, and
 +
#replace them by - nothing: the empty string <code>""</code>.
 +
This deletes them.
 +
 
 +
{{Vspace}}
 +
 
 +
{{task|1 =
 +
* study the code in the <code>An excursion into regular expressions</code> section of the '''R''' script
 +
}}
 +
 
 +
{{Vspace}}
 +
 
 +
===Updating the database===
 +
 
 +
{{Vspace}}
 +
 
 +
{{task|1 =
 +
* study the code in the <code>Updating the database</code> section of the '''R''' script
 
}}
 
}}
  
 +
{{Vspace}}
  
Next, we use this sequence to find its most similar relative in YFO using BLAST
+
===Add "your" YFO Sequence===
  
 +
{{Vspace}}
  
 
{{task|1=
 
{{task|1=
# Navigate to the [http://www.ncbi.nlm.nih.gov '''BLAST''' entry page at the NCBI].
+
 
# Click on '''protein blast''' as the BLAST program to run.
+
;Add the YFO Mbp1 protein data to the database:
# Paste the sequence of the yeast Mbp1 DNA-binding domain into the search field.
+
 
# Set the following parameters:
+
# Copy the '''code template''' to add a new protein and its taxonomy entry into the script file <code>myCode.R</code> that you created at the very beginning.  
## As '''Database''' option choose '''Reference proteins (refseq_protein)'''
+
# Add your protein to the database by editing a copy of the code template in your script file. Ask on the mailing list if you don't know how (but be specific) or if you don't know how to find particular information items.
## As '''Organism''' enter the binomial name of YFO. Make sure you spell it right, the page will try to autocomplete your entry. Species level is detailed enough, you don't have to specify the strain (e.g. I would specify "''Ustilago maydis''" '''not''' "''Ustilago maydis'' 521").
+
# Add the taxonomy entry to the taxonomy table, again simply modifying a copy of the code template in your own script file.
# Then click on the '''BLAST''' button and wait for the result to appear. You will first see a graph of any conserved domains in your query sequence, this is not yet what you are waiting for...
+
 
# Patience.
 
# Patience. The database is large.
 
# Patience. It took about five minutes to complete for me, but execution times vary greatly by time of day.
 
# The top "hit" on the results page is what you are looking for. Its alignment and alignment score are shown in the '''Alignments''' section a bit further down the page. Your hit should have on the order of more than 40% identities to the query and match at least 80 residues or so. <small>If your match seems less and worse than that, please eMail me to troubleshoot.</small>
 
# The first item for each hit is a link to its database entry, right next to the checkbox.  It says something like <code>ref&#124;NP_123456789</code> or <code>ref&#124;XP_123456789</code> ... follow that link.
 
# Note the RefSeq ID, and save the sequence in FASTA format, as above. For simplicity I will refer to this sequence as "''YFO'' Mbp1" in the future.
 
 
}}
 
}}
  
&nbsp;
+
{{Vspace}}
  
==Annotate==
+
==Analyze==
  
Let us perform a few simple sequence annotations with the EMBOSS package, using the ''Saccharomyces'' and ''YFO'' Mbp1 sequences. As I noted in class, a large number of simple but fundamental sequence analysis tools are combined in the EMBOSS package. This package can be [http://emboss.sourceforge.net/download/ installed locally on your own machine], or run via a public Web interface<ref>Google for [http://www.google.ca/search?q=EMBOSS+explorer EMBOSS explorer], public access points include http://emboss.bioinformatics.nl/ and http://bioinfo.unice.fr/emboss/index.html </ref>.
+
Let us perform a few simple sequence analyses using the online EMBOSS tools. EMBOSS (the European Molecular Biology laboratory Open Software Suite) combines a large number of simple but fundamental sequence analysis tools. The tools can be [http://emboss.sourceforge.net/download/ installed locally on your own machine], or run via a public Web interface. Google for [http://www.google.ca/search?q=EMBOSS+explorer EMBOSS explorer], public access points include http://emboss.bioinformatics.nl/ .
  
Access one of the EMBOSS Explorer services and explore some of the tools:
+
Access an EMBOSS Explorer service and explore some of the tools:
  
  
Line 147: Line 271:
 
;Local composition
 
;Local composition
 
# Find <code>pepinfo</code> under the '''PROTEIN COMPOSITION''' heading.
 
# Find <code>pepinfo</code> under the '''PROTEIN COMPOSITION''' heading.
# Paste the YFO Mbp1 sequence in FASTA format into the input field.
+
# Retrieve the YFO Mbp1 related sequence from your '''R''' database, e.g. with something like <br /><code>  cat(db$protein[db$protein$name == "UMAG_1122"), "sequence"]</code>
 +
# Copy and paste the sequence into the input field.
 
# Run with default parameters.
 
# Run with default parameters.
 +
# Scroll to the figures all the way at the bottom.
 
# Do the same in a separate window for yeast Mbp1.
 
# Do the same in a separate window for yeast Mbp1.
# Try to compare ... kind of hard without reference, overlay and expectation, isn't it?
+
# Try to compare ... <small>(kind of hard without reference, overlay and expectation, isn't it?)</small>
 
}}
 
}}
  
Line 157: Line 283:
 
;Global composition
 
;Global composition
 
# Find <code>pepstats</code> under the '''PROTEIN COMPOSITION''' heading.
 
# Find <code>pepstats</code> under the '''PROTEIN COMPOSITION''' heading.
# Paste the YFO Mbp1 sequence in FASTA format into the input field.
+
# Paste the YFO Mbp1 sequence into the input field.
 
# Run with default parameters.
 
# Run with default parameters.
 
# Do the same in a separate window for yeast Mbp1.
 
# Do the same in a separate window for yeast Mbp1.
Line 166: Line 292:
 
{{task|1=
 
{{task|1=
 
;Motifs
 
;Motifs
# Find <code>patmatmotifs</code> and <code>pepcoils</code> .
+
# Find <code>pepcoil</code>, an algorithm to detect {{WP|coiled coil}} motifs.
# Run these with the YFO Mbp1 sequence and yeast Mbp1.
+
# Run this with the YFO Mbp1 sequence and yeast Mbp1.
# Try to compare ... do both sequences have the same motif annotated in approximately comparable regions of the protein?
+
# Try to compare ... do both sequences have coiled-coil motif predictions? Are they annotated in approximately comparable regions of the respective sequence?
 
}}
 
}}
  
Line 175: Line 301:
 
;Transmembrane sequences
 
;Transmembrane sequences
 
# Find <code>tmap</code>. Also find <code>shuffleseq</code>.
 
# Find <code>tmap</code>. Also find <code>shuffleseq</code>.
# Use your YFO Mbp1 sequence to annotate transmembrane helices for it and for a few shuffled sequences. YFO Mbp1 should have neither, and the shuffled sequences work as negative controls in any case.
+
# Use your YFO sequence to annotate transmembrane helices for your protein and for a few shuffled sequences. The YFO is not expected to have TM helices, nor are the shuffled sequences expected to have any. If you '''do''' find some, these are most likely "''false positives''".
# Also compare the following positive contros:
+
 
<div class="mw-collapsible mw-collapsed" style="margin-left:25px; width:80%; border:#001166 solid 1px; padding: 10px; ">
+
# Also compare the following positive control: Gef1 - a yeast chloride channel with 10 trans-membrane helices and outward localized N-terminus:
Gef1 - a yeast chloride channel with 10 trans-membrane helices and outward localized N-terminus:
 
<div class="mw-collapsible-content">
 
 
<source lang="text">
 
<source lang="text">
 
>gi|6322500|ref|NP_012574.1| Gef1p [Saccharomyces cerevisiae S288c]
 
>gi|6322500|ref|NP_012574.1| Gef1p [Saccharomyces cerevisiae S288c]
Line 195: Line 319:
 
TTNRNGNVI
 
TTNRNGNVI
 
</source>
 
</source>
</div>
+
 
</div>
+
# Evaluate the output: does the algorithm (wrongly) predict TM-helices in your protein? In the shuffled sequences? Does it find all ten TM-helices in Gef1?
# Evaluate the output: does the algorithm (wrongly) predict tm-helices in your protein? In the shuffled sequences? Does it find the tm-helices in Gef1?
 
 
}}
 
}}
  
Line 203: Line 326:
 
Try to familiarize yourself with the offerings in the EMBOSS package. I find some of the nucleic acid tools indispensable in the lab, such as restriction-site mapping tools, and I frequently use the alignment tools <code>Needle</code> and <code>Water</code>, but by and large the utility of many of the components&ndash;while fast, efficient and straightforward to use&ndash; suffers from lack of reference and comparison and from terse output. The routines show their conceptual origin in the 1970s and 1980s. We will encounter alternatives in later assignments.
 
Try to familiarize yourself with the offerings in the EMBOSS package. I find some of the nucleic acid tools indispensable in the lab, such as restriction-site mapping tools, and I frequently use the alignment tools <code>Needle</code> and <code>Water</code>, but by and large the utility of many of the components&ndash;while fast, efficient and straightforward to use&ndash; suffers from lack of reference and comparison and from terse output. The routines show their conceptual origin in the 1970s and 1980s. We will encounter alternatives in later assignments.
  
 +
{{Vspace}}
  
=== Plotting sequence composition in R ===
+
=='''R''' Sequence Analysis Tools==
  
 +
It's interesting to see this collection of tools that were carefully designed some twenty years ago, as an open source replacement for a set of software tools - the GCG package - that was indispensable for molecular biology labs in the 80s and 90s, but whose cost had become prohibitive. Fundamentally this is a building block approach, and the field has turned to programming solutions instead.
  
It is rather straightforward to code a simple amino acid composition measurement in R and I have shown you an example in the lecture. You need to familiarize yourself with
+
As for functionality, much more sophisticated functions are available on the Web: do take a few minutes and browse the [http://bioinformatics.ca/links_directory/tag/protein-sequence-analysis curated Web services directory of bioinformatics.ca].
  
{{task|1=
+
As for versatility, '''R''' certainly has the edge. Let's explore some of the functions available in the <code>seqinr</code> package that you already encountered in the introductory [[R tutorial]]. They are comparatively basic - but it is easy to prepare our own analysis.
# While you already have '''R''' open, access the [[R tutorial|'''R tutorial''']] and work through the sections on
 
## [[R tutorial#Scripts|'''Scripts''']];
 
## [[R tutorial#Variables|'''Variables''']];
 
## [[R tutorial#Scalar data|'''Scalar data''']]; and
 
## [[R tutorial#Vectors|'''Vectors''']]. This should help you understand the following code in principle.
 
# Study the following code:
 
 
 
 
 
<source lang="R">
 
# aaFreq.R
 
# plot amino-acid frequency excess
 
# Boris Steipe for BCH441, Sept. 2012
 
 
 
# install.packages("seqinr")
 
library("seqinr")
 
  
#load a fasta-formatted sequence from your working directory
 
mbp1 <- read.fasta("mbp1.fa", seqtype="AA", set.att=FALSE)
 
  
# define reference composition, a database average
+
{{Vspace}}
# (data taken from http://web.expasy.org/docs/relnotes/relstat.html)
 
  
refData <- c(
+
{{task|1 =
    "A"=8.26,
+
* Study the code in the <code>Sequence Analysis</code> section of the '''R''' script
    "Q"=3.93,
 
    "L"=9.66,
 
    "S"=6.56,
 
    "R"=5.53,
 
    "E"=6.75,
 
    "K"=5.84,
 
    "T"=5.34,
 
    "N"=4.06,
 
    "G"=7.08,
 
    "M"=2.42,
 
    "W"=1.08,
 
    "D"=5.45,
 
    "H"=2.27,
 
    "F"=3.86,
 
    "Y"=2.92,
 
    "C"=1.37,
 
    "I"=5.96,
 
    "P"=4.70,
 
    "V"=6.87
 
    )
 
refData = refData / sum(refData) #Normalize
 
 
 
# how many different characters are in the dataset?
 
lref <- length(refData)
 
   
 
# tabulate  the sequence of interest and normalize
 
obsData <- table(mbp1)            # count occurrences
 
obsData = obsData / sum(obsData)  # Normalize
 
 
 
# initialize a vector of zeros, which will hold our results
 
logRatio <- rep(lref, 0)
 
 
 
# loop over all elements of the reference and calculate log-ratios
 
for (i in 1:lref) {
 
aa <- names(refData)[i] # get the name of that amino acid
 
fObs <- obsData[aa]  # retrieve the frequency for that name
 
fRef <- refData[aa]
 
logRatio[aa] <- log(fObs / fRef) / log(2)
 
}
 
logRatio <- sort(logRatio, decreasing = TRUE) # sort values descending
 
 
 
# generate a y-axis label that also contains a subscript
 
# (see text() for details)
 
label <- expression(paste(log[2],"( f(obs) / f(ref) )", sep = ""))
 
 
 
# define colors for the bars
 
chargePlus  <- "#282A3F"
 
chargeMinus <- "#531523"
 
hydrophilic <- "#47707F"
 
hydrophobic <- "#6F7A79"
 
plain      <- "#EDF7F7"
 
 
 
#hydrophilic <- "#194759"
 
#hydrophobic <- "#3E8C84"
 
#plain      <- "#EDF7F7"
 
 
 
# Assign the colors to the different amino acid types
 
barColors <- rep(length(refData), 0)
 
for (i in 1:length(refData)) {
 
if      (names(logRatio[i]) == "A") { barColors[i] <- hydrophobic }
 
else if (names(logRatio[i]) == "C") { barColors[i] <- plain }
 
else if (names(logRatio[i]) == "D") { barColors[i] <- chargeMinus }
 
else if (names(logRatio[i]) == "E") { barColors[i] <- chargeMinus }
 
else if (names(logRatio[i]) == "F") { barColors[i] <- hydrophobic }
 
else if (names(logRatio[i]) == "G") { barColors[i] <- plain }
 
else if (names(logRatio[i]) == "H") { barColors[i] <- chargePlus }
 
else if (names(logRatio[i]) == "I") { barColors[i] <- hydrophobic }
 
else if (names(logRatio[i]) == "K") { barColors[i] <- chargePlus }
 
else if (names(logRatio[i]) == "L") { barColors[i] <- hydrophobic }
 
else if (names(logRatio[i]) == "M") { barColors[i] <- hydrophobic }
 
else if (names(logRatio[i]) == "N") { barColors[i] <- hydrophilic }
 
else if (names(logRatio[i]) == "P") { barColors[i] <- plain }
 
else if (names(logRatio[i]) == "Q") { barColors[i] <- hydrophilic }
 
else if (names(logRatio[i]) == "R") { barColors[i] <- chargePlus }
 
else if (names(logRatio[i]) == "S") { barColors[i] <- hydrophilic }
 
else if (names(logRatio[i]) == "T") { barColors[i] <- hydrophilic }
 
else if (names(logRatio[i]) == "V") { barColors[i] <- hydrophobic }
 
else if (names(logRatio[i]) == "W") { barColors[i] <- hydrophobic }
 
else if (names(logRatio[i]) == "Y") { barColors[i] <- hydrophobic }
 
else                                { barColors[i] <- plain }
 
}
 
 
 
# create a plot
 
barplot(logRatio, ylim=c(-4,1), col = barColors, ylab = label, cex.names=0.9)
 
 
 
# draw a line at 0
 
abline(h=0)
 
 
 
# create a legend that indicates what the colors mean
 
legend (x = 1,
 
        y = -1,
 
        legend = c("charged (+)",
 
                  "charged (-)",
 
                  "hydrophilic",
 
                  "hydrophobic",
 
                  "plain"),
 
        bty = "n",
 
        fill = c(chargePlus,
 
                chargeMinus,
 
                hydrophilic,
 
                hydrophobic,
 
                plain)
 
        )
 
       
 
</source>
 
 
 
 
 
:3. Copy the code and execute it in R. First for yeast Mbp1 (you need the appropriate FASTA file in your working directory), then edit and run for YFO. The yeast protein is enriched in hydrophilic residues and has a slight excess of (+) charged residues. This might be compatible with a partially natively disordered protein with DNA-binding properties. How is YFO Mbp1 similar or different?
 
 
}}
 
}}
  
 
+
{{Vspace}}
  
 
;That is all.
 
;That is all.
  
 +
{{Vspace}}
  
&nbsp;
+
== Links and resources ==
  
== Links and resources ==
+
* [http://regexpal.com RegEx Pal] - a tool for testing and developing regular expressions.
  
 
<!-- {{#pmid: 19957275}} -->
 
<!-- {{#pmid: 19957275}} -->
Line 361: Line 361:
 
{{#lst:BIO_Assignment_Week_1|assignment_footer}}
 
{{#lst:BIO_Assignment_Week_1|assignment_footer}}
  
 +
<table style="width:100%;"><tr>
 +
<td style="height:30px; vertical-align:middle; text-align:left; font-size:80%;">[[BIO_Assignment_Week_2|&lt;&nbsp;Assignment&nbsp;2]]</td>
 +
<td style="height:30px; vertical-align:middle; text-align:right; font-size:80%;">[[BIO_Assignment_Week_4|Assignment&nbsp;4&nbsp;&gt;]]</td>
 +
</tr></table>
  
 
&nbsp;
 
&nbsp;
 
[[Category:Bioinformatics]]
 
[[Category:Bioinformatics]]
 
</div>
 
</div>

Latest revision as of 09:10, 5 October 2016

Assignment for Week 3
Sequence Analysis

< Assignment 2 Assignment 4 >

Note! This assignment is currently inactive. Major and minor unannounced changes may be made at any time.

 
 

Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.



 

Introduction

Model organism and YFO

In this assignment we will set out on our exploration of the system that regulates the G1/S transition by focussing first on the Mbp1 protein in selected species from the kingdom of fungi, whose genome has been completely sequenced; our quest is thus also an exercise in model-organism reasoning: the transfer of knowledge from one, well-studied organism to others. It's reasonable to hypothesize that such central control machinery is conserved in most if not all fungi. But we don't know. Many of the species that we will be working with have not been characterized in great detail, and some of them are new to our class this year. And while we know a fair bit about Mbp1, we probably don't know very much at all about the related genes in other organisms: whether they exist, whether they have similar functional features and whether they might contribute to the G1/S checkpoint system in a similar way. Thus we might discover things that are new and interesting. This is a quest of discovery.


 

The BCH441 RStudio Project

I have begun to unify R-scripts and other resources in an RStudio project that will make it easy to update and distribute code. Conveniently, if I push update material to the github repository, all you need to do is to pull the updated project to receive all updates and new files on your computer. Version control is really good for this. However, there is one limitation to this approach that you need to be aware of. If you create your own, local files and then commit them, git will complain that it would be overwriting such local material. As long as you don't commit your files then all should be fine. This means you'll need to do your own "versioning" by saving your own scripts under a different name from time to time. Once again: in this context:

  • saving your own files is fine;
  • committing your own files to version control will cause problems;
  • changes you make to course material files and save under the same filename (like adding comments and notes) will not persist, these changes will be overwritten with the next update. You need to "Save As..." with a new filename (e.g. just prefix the original name with "my").


 

Task:

  • Open RStudio and create a New Project... cloned from a git version control directory. The repository URL is https://github.com/hyginn/BCH441_2016. Create this in the same way as you did for the R-tutorial.
  • As requested on the console, type init(). This will create a file called .myProfile.R and ask you for your UofT eMail address and Student ID. You need to enter the correct values because other scripts will make assumptions about that these variables exist and are valid.
  • Work through the task: "local script" in the BCH441_2016.R script.
  • Then load the script for the current assignment as instructed.


 

Choosing YFO (Your Favourite Organism)

Since we were trying to find related proteins in a different species, our next task is to find suitable species.

For this purpose we create a lottery to assign species at (pseudo) random to students, so that everyone has a good chance to be working on their own species. The technical details are in the R scripts that implement the species search and distribution. In brief, we define a function that picks one species from a long list at random - but to make sure this process is reproducible, we'll set a seed for the random number generator. Obviously, everyone has to use a different seed, or else everyone would end up getting the same species assigned. Thus we'll use your Student Number as the seed. This is an integer, so it can be used as an argument to R's set.seed() function, and it's unique to each of you. The choice is then random, reproducible and unique.

You may notice that this process does not guarantee that everyone gets a different species, and that all species are chosen at least once. I don't think doing that is possible in a "stateless" way (i.e. I don't want to have to remember who chose what species), given that I don't know all of your student numbers. But if anyone can think of a better solution, that would be neat.

Is it possible that all of you end up working on the same species anyway? Indeed. That's the problem with randomness. But it is not very likely.


What about the "suitable species" though? Where do they come from? For the purposes of the course "quest", we need species

  • that actually have transcription factors that are related to Mbp1;
  • whose genomes have been sequenced; and
  • for which the sequences have been deposited in the RefSeq database, NCBI's unique sequence collection.


Task:


  • Work through PART ONE: YFO Species of the BCH441_A03.R script.
  • There are some deep "rabbitholes" that you are encouraged to follow to explore the code that went into generating the YFO species list. The minimal required result however is that you have picked an YFO, and that it's name got written into your personalized profile file.
  • Note down the species name and its five letter label on your Student Wiki user page. Use this species whenever this or future assignments refer to YFO.


 

Selecting the YFO "Mbp1"

 

Task:

  1. Back at the Mbp1 protein page follow the link to Run BLAST... under "Analyze this sequence".
  2. This allows you to perform a sequence similarity search. You need to set two parameters:
    1. As Database, select Reference proteins (refseq_protein) from the drop down menu;
    2. In the Organism field, type the species you have selected as YFO and select the corresponding taxonomy ID.
  3. Click on Run BLAST to start the search. This should find a handful of genes, all of them in YFO. If you find none, or hundreds, or they are not all in the same species, you did something wrong. Ask on the mailing list and make sure to fix the problem.
  4. Look at the top "hit" in the Descriptions section. The rightmost column contains sequence IDs unter the Accession heading. The alignment and alignment score are shown in the Alignments section a bit further down the page. Look at the result.
  5. In the header information for each hit is a link to its database entry, right next to Sequence ID. It says something like ref|NP_123456789.1 or ref|XP_123456789 ... follow that link.
  6. Note the RefSeq ID, and the search results %ID, E-value, whether one or more similar regions were found etc. in your Journal. We will refer to this sequence as "YFO Mbp1" or similar in the future.
  7. Finally access the UniProt ID mapping service to retrieve the UniProt ID for the protein. Paste the RefSeq ID and choose RefSeq Protein as the From: option and UniProtKB as the To: option.
If the mapping works, the UniProt ID will be in the Entry: column of the table that is being returned. Click the link and have a look at the UniProt entry page while you're there.


 

Store Data

 

The System datamodel

 
Entity-Relationship Diagram (ERD) for a data model that stores data for a systems project. Entity names are at the top of each box, attributes are listed below. If you think of the entity as a table, its attributes are the column names and each row stores the values for one particular instance. Semantically related entities are shaded in similar colours; this helps to make the design-principles visible, but one must be careful not to overdo the use of colour. As with all graphical elements in information design: "less is more". All relationships link to unique primary keys and are thus 1 to (0, n): i.e. as attributes, the relationship does not have to exist but there could be many, as the primary key, exactly one key must exist. The diagram was drawn as a "Google presentation" slide and you can view it on the Web and make a copy for your own purposes.

Here is a first version of a systems data model, based on what we discussed in class:

  • The feature table is at the centre. This was not intentional, but emerged from iterating the model through a number of revisions. It emphasizes that the main purpose of this model is to integrate and annotate data from various sources. Feature in the way we understand it here is an abstraction of quite disparate categories of information items. This includes domain annotations, system functions, literature references, and cross-references to other databases. The type attribute will require some thought: this attribute really needs a "controlled vocabulary" to ensure that the same category is described consistently with the same string ("PubMed"? "Lit."? "reference"?). There are a number of ways to achieve this, the best way[1] is to store these types in their own "reference" table type - and link to that table via a foreign key.
  • The protein table is at the centre. Its primary key is a unique integer. We store the NCBI RefSeq ID and the Uniprot ID in the table. We would not call these "foreign keys", since the information they reference is not in our schema but at the NCBI resp. EBI. For example, we can't guarantee that they are unique keys.
  • The taxonomy table holds information about species. We use the NCBI taxonomy ID as its primary key. The same key appears in the protein table as the foreign key that links the protein with its proper taxonomy information. This is an instance where the data model actually does not describe reality well. The problem is that particular proteins that we might retrieve from database searches will often be annotated to a specific strain of a species – and there is no easy way to reference strains to species. We'll see whether this turns out to be a problem in practice. But it may be that an additional table may be required that stores parent/child relationships of the taxonomic tree of life.
  • The protein_feature table links a protein with all the features that we annotate it with. start and end coordinates identify the region of the sequence we have annotated.
  • The ...Annotation tables link feature-information with annotated entities.
  • Should the system table have its own taxonomy attribute? Or should the species in which the system is observed be inferred from the component protein's taxonomy.ID? What do you think? I decided not to add a taxonomy attribute to that table. How would you argue for or against this decision?
  • The component table links proteins that collaborate together as a system. There is an implicit assumption in this model that only proteins are system components (and not e.g. RNA), and that components are atomic (i.e. we can't link to subsystems). How would you change the model to accommodate more realistic biological complexity?


 

Implementing the Data Model in R

 

To actually implement the data model in R we will create the tables as data frames, and we will collect them in a list. We don't have to keep the tables in a list - we could also keep them as independent objects, but with a table named "protein" we should be worried of inadvertently overwriting the table. A list is neater, more flexible (we might want to have several of these), it reflects our intent about the model better, and doesn't require very much more typing. For now, to keep things simple, we will implement only two tables: protein and taxonomy. We'll add the rest when we actually need them.

Task:

  • study the code in the Creating two tables section of the R script


 

As you see we can edit any component of our data model directly by assigning new values to the element. But in general that's a really bad idea, since it is far too easy to bring the whole model into an inconsistent state. It's much better to write functions that get and set data elements, not only to keep the data consistent, but also because it is much easier, if the model ever changes, to simply edit the function code, rather than having to repeat every single data entry by hand.

What would an setData function have to look like? It should

  • create a new entry if the requested row of a table does not exist yet;
  • update data if the protein exists;
  • perform consistency checks (i.e. check that the data has the correct type);
  • perform sanity checks (i.e. check that data values fall into the expected range);
  • perform completeness checks (i.e. handle incomplete data)

Let's start simple, and create a set- function to add new values to existing sequence data. Also, for clarity, we'll forgo many checks. The first thing we should do is to add the actual sequence.

We only entered a placeholder for the sequence field. Sequences come in many different flavours when we copy them from a Webpage: there can be whitespace, carriage returns, numbers, they can be upper-case, lower-case mixed-case ...

What we want in our sequence data field is one string that contains the entire sequence, and nothing but upper-case, amino-acid code letters.

We'll need to look at how we work with strings in R, how we identify and work with patterns in strings. This is a great time to introduce regular expressions.


 

A Brief First Encounter of Regular Expressions

 
Regular expressions are a concise description language to define patterns for pattern-matching in strings.

Truth be told, many programmers have a love-hate relationship with regular expressions. The syntax of regular expressions is very powerful and expressive, but also terse, not always intuitive, and sometimes hard to understand. I'll introduce you to a few principles here that are quite straightforward and they will probably cover 99% of the cases you will encounter.

Here is our test-case: the sequence of Mbp1, copied from the NCBI Protein database page for yeast Mbp1.

       1 msnqiysary sgvdvyefih stgsimkrkk ddwvnathil kaanfakakr trilekevlk
      61 ethekvqggf gkyqgtwvpl niakqlaekf svydqlkplf dftqtdgsas pppapkhhha
     121 skvdrkkair sastsaimet krnnkkaeen qfqsskilgn ptaaprkrgr pvgstrgsrr
     181 klgvnlqrsq sdmgfprpai pnssisttql psirstmgpq sptlgileee rhdsrqqqpq
     241 qnnsaqfkei dledglssdv epsqqlqqvf nqntgfvpqq qssliqtqqt esmatsvsss
     301 pslptspgdf adsnpfeerf pgggtspiis miprypvtsr pqtsdindkv nkylsklvdy
     361 fisnemksnk slpqvllhpp phsapyidap idpelhtafh wacsmgnlpi aealyeagts
     421 irstnsqgqt plmrsslfhn sytrrtfpri fqllhetvfd idsqsqtvih hivkrksttp
     481 savyyldvvl skikdfspqy rielllntqd kngdtalhia skngdvvffn tlvkmgaltt
     541 isnkegltan eimnqqyeqm miqngtnqhv nssntdlnih vntnnietkn dvnsmvimsp
     601 vspsdyityp sqiatnisrn ipnvvnsmkq masiyndlhe qhdneikslq ktlksisktk
     661 iqvslktlev lkesskdeng eaqtnddfei lsrlqeqntk klrkrliryk rlikqkleyr
     721 qtvllnklie detqattnnt vekdnntler lelaqeltml qlqrknklss lvkkfednak
     781 ihkyrriire gtemnieevd ssldvilqtl iannnknkga eqiitisnan sha
//


Task:
Navigate to http://regexpal.com and paste the sequence into the lower box. This site is one of a number of online regular expression testers; their immediate, visual feedback is invaluable when you are developing regular expression patterns.

Lets try some expressions:

Most characters are matched literally.
Type "a" in to the upper box and you will see all "a" characters matched. Then replace a with q.
Now type "aa" instead. Then krnnkk. Sequences of characters are also matched literally.
The pipe character | that symbolizes logical OR can be used to define that more than one character should match
i(s|m|q)n matches isn OR imn OR iqn. Note how we can group with parentheses, and try what would happen without them.
We can more conveniently specify more than one character to match if we place it in square brackets.
[lq] matches l OR q. [familyvw] matches hydrophobic amino acids.
Within square brackets, we can specify "ranges".
[1-5] matches digits from 1 to 5.
Within square brackets, we can specify characters that should NOT be matched, with the caret, ^.
[^0-9] matches everything EXCEPT digits. [^a-z] matches everything that is not a lower-case letter. That's what we need (try it).

One of the R functions that uses regular expressions is the function gsub(). It replaces characters that match a "regex" with other characters. That is useful for our purpose: we can

  1. match all characters that are NOT a letter, and
  2. replace them by - nothing: the empty string "".

This deletes them.


 

Task:

  • study the code in the An excursion into regular expressions section of the R script


 

Updating the database

 

Task:

  • study the code in the Updating the database section of the R script


 

Add "your" YFO Sequence

 

Task:

Add the YFO Mbp1 protein data to the database
  1. Copy the code template to add a new protein and its taxonomy entry into the script file myCode.R that you created at the very beginning.
  2. Add your protein to the database by editing a copy of the code template in your script file. Ask on the mailing list if you don't know how (but be specific) or if you don't know how to find particular information items.
  3. Add the taxonomy entry to the taxonomy table, again simply modifying a copy of the code template in your own script file.


 

Analyze

Let us perform a few simple sequence analyses using the online EMBOSS tools. EMBOSS (the European Molecular Biology laboratory Open Software Suite) combines a large number of simple but fundamental sequence analysis tools. The tools can be installed locally on your own machine, or run via a public Web interface. Google for EMBOSS explorer, public access points include http://emboss.bioinformatics.nl/ .

Access an EMBOSS Explorer service and explore some of the tools:


Task:

Local composition
  1. Find pepinfo under the PROTEIN COMPOSITION heading.
  2. Retrieve the YFO Mbp1 related sequence from your R database, e.g. with something like
    cat(db$protein[db$protein$name == "UMAG_1122"), "sequence"]
  3. Copy and paste the sequence into the input field.
  4. Run with default parameters.
  5. Scroll to the figures all the way at the bottom.
  6. Do the same in a separate window for yeast Mbp1.
  7. Try to compare ... (kind of hard without reference, overlay and expectation, isn't it?)


Task:

Global composition
  1. Find pepstats under the PROTEIN COMPOSITION heading.
  2. Paste the YFO Mbp1 sequence into the input field.
  3. Run with default parameters.
  4. Do the same in a separate window for yeast Mbp1.
  5. Try to compare ... are there significant and unexpected differences?


Task:

Motifs
  1. Find pepcoil, an algorithm to detect coiled coil motifs.
  2. Run this with the YFO Mbp1 sequence and yeast Mbp1.
  3. Try to compare ... do both sequences have coiled-coil motif predictions? Are they annotated in approximately comparable regions of the respective sequence?


Task:

Transmembrane sequences
  1. Find tmap. Also find shuffleseq.
  2. Use your YFO sequence to annotate transmembrane helices for your protein and for a few shuffled sequences. The YFO is not expected to have TM helices, nor are the shuffled sequences expected to have any. If you do find some, these are most likely "false positives".
  1. Also compare the following positive control: Gef1 - a yeast chloride channel with 10 trans-membrane helices and outward localized N-terminus:
>gi|6322500|ref|NP_012574.1| Gef1p [Saccharomyces cerevisiae S288c]
MPTTYVPINQPIGDGEDVIDTNRFTNIPETQNFDQFVTIDKIAEENRPLSVDSDREFLNSKYRHYREVIW
DRAKTFITLSSTAIVIGCIAGFLQVFTETLVNWKTGHCQRNWLLNKSFCCNGVVNEVTSTSNLLLKRQEF
ECEAQGLWIAWKGHVSPFIIFMLLSVLFALISTLLVKYVAPMATGSGISEIKVWVSGFEYNKEFLGFLTL
VIKSVALPLAISSGLSVGKEGPSVHYATCCGYLLTKWLLRDTLTYSSQYEYITAASGAGVAVAFGAPIGG
VLFGLEEIASANRFNSSTLWKSYYVALVAITTLKYIDPFRNGRVILFNVTYDRDWKVQEIPIFIALGIFG
GLYGKYISKWNINFIHFRKMYLSSWPVQEVLFLATLTALISYFNEFLKLDMTESMGILFHECVKNDNTST
FSHRLCQLDENTHAFEFLKIFTSLCFATVIRALLVVVSYGARVPAGIFVPSMAVGATFGRAVSLLVERFI
SGPSVITPGAYAFLGAAATLSGITNLTLTVVVIMFELTGAFMYIIPLMIVVAITRIILSTSGISGGIADQ
MIMVNGFPYLEDEQDEEEEETLEKYTAEQLMSSKLITINETIYLSELESLLYDSASEYSVHGFPITKDED
KFEKEKRCIGYVLKRHLASKIMMQSVNSTKAQTTLVYFNKSNEELGHRENCIGFKDIMNESPISVKKAVP
VTLLFRMFKELGCKTIIVEESGILKGLVTAKDILRFKRIKYREVHGAKFTYNEALDRRCWSVIHFIIKRF
TTNRNGNVI
  1. Evaluate the output: does the algorithm (wrongly) predict TM-helices in your protein? In the shuffled sequences? Does it find all ten TM-helices in Gef1?


Try to familiarize yourself with the offerings in the EMBOSS package. I find some of the nucleic acid tools indispensable in the lab, such as restriction-site mapping tools, and I frequently use the alignment tools Needle and Water, but by and large the utility of many of the components–while fast, efficient and straightforward to use– suffers from lack of reference and comparison and from terse output. The routines show their conceptual origin in the 1970s and 1980s. We will encounter alternatives in later assignments.


 

R Sequence Analysis Tools

It's interesting to see this collection of tools that were carefully designed some twenty years ago, as an open source replacement for a set of software tools - the GCG package - that was indispensable for molecular biology labs in the 80s and 90s, but whose cost had become prohibitive. Fundamentally this is a building block approach, and the field has turned to programming solutions instead.

As for functionality, much more sophisticated functions are available on the Web: do take a few minutes and browse the curated Web services directory of bioinformatics.ca.

As for versatility, R certainly has the edge. Let's explore some of the functions available in the seqinr package that you already encountered in the introductory R tutorial. They are comparatively basic - but it is easy to prepare our own analysis.


 

Task:

  • Study the code in the Sequence Analysis section of the R script


 
That is all.


 

Links and resources

  • RegEx Pal - a tool for testing and developing regular expressions.


 


Footnotes and references

  1. Relational databases like MySQL, PostgresQL, and MariaDB offer the datatype "Enum" for this purpose but this is an inferior solution. Enum types need to be fixed at the time the schema is created, they can't store information about their semantics, i.e. how the keywords are defined and when they should be used, and they can't be used in more than one table, since they are metadata of one particular column.


 

Ask, if things don't work for you!

If anything about the assignment is not clear to you, please ask on the mailing list. You can be certain that others will have had similar problems. Success comes from joining the conversation.



< Assignment 2 Assignment 4 >