Difference between revisions of "BIO Assignment Week 2"

From "A B C"
m
 
(36 intermediate revisions by the same user not shown)
Line 2: Line 2:
 
<div class="b1">
 
<div class="b1">
 
Assignment for Week 2<br />
 
Assignment for Week 2<br />
<span style="font-size: 70%">Scenario, Labnotes on the Wiki, R-functions, Databases, Sequence in Chimera (and optionally: small molecules)</span>
+
<span style="font-size: 70%">Scenario, Labnotes, R-functions,<br />Databases, Data Modelling</span>
 
</div>
 
</div>
 +
<table style="width:100%;"><tr>
 +
<td style="height:30px; vertical-align:middle; text-align:left; font-size:80%;">[[BIO_Assignment_Week_1|&lt;&nbsp;Assignment&nbsp;1]]</td>
 +
<td style="height:30px; vertical-align:middle; text-align:right; font-size:80%;">[[BIO_Assignment_Week_3|Assignment&nbsp;3&nbsp;&gt;]]</td>
 +
</tr></table>
  
  
{{Template:Active}}
+
{{Inactive}}
 +
 
  
 
Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.  
 
Line 15: Line 20:
  
 
&nbsp;
 
&nbsp;
==The Scenario==
+
==Introduction: Scenario==
  
Baker's yeast, ''Saccharomyces cerevisiae'', is perhaps the most important {{WP|Model_organism|model organism}}. It is a eukaryote that has been studied genetically and biochemically in great detail for many decades, and it is easily manipulated with high-throughput experimental methods. We will use information from this model organism to study the conservation of function and sequence in other fungi whose genomes have been completely sequenced; the assignments are an exercise in model-organism reasoning: the transfer of knowledge from one, well-studied organism to others.
 
  
This and the following assignments will revolve around a {{WP|Transcription factor|transcription factor}} that plays an important role in the regulation of the cell cycle: '''Mbp1''' is a key component of the MBF complex (Mbp1/Swi6). This complex regulates gene expression at the crucial G1/S-phase transition of the mitotic cell cycle and has been shown to bind to the regulatory regions of more than a hundred target genes. It is therefore a DNA binding protein that acts as a control switch for a key cellular process.
+
<div class="colmask doublepage">
 +
  <div class="colleft">
 +
    <div class="col1">
 +
      <!-- Column 1 start -->
 +
I have introduced the concept of "{{WP|cargo cult science}}" in class. The "cargo" in Bioinformatics is to understand biology. This includes understanding how things came to be the way they are, and how they work. Both relate to the concept of '''function''' of biomolecules, and the ''systems''<ref>We have drafted a ''system'' definition in class: '''A system is a collection of collaborating genes that have more significant relationships among each other than to genes that are not system members.'''</ref> they contribute to. But "function" is a rather poorly defined concept; making it rigorous and computable, and exploring it from the perspective of "collaborating" components, will be a major objective of this course. The realm of bioinformatics contains many kingdoms and duchies and shires and hidden glades. To find out how they contribute to the whole, we will proceed on a quest. We will take a relatively well-characterized protein that is part of a relatively well-characterized process, and ask what its function is. We will examine the protein's sequence, its structure, its domain composition, its relationship to and interactions with other proteins. Through that we will paint a picture of the "system" that it contributes to.
  
One would speculate that such central control machinery would be conserved in other fungi and it will be your task in these assignments to collect evidence whether related molecular components are present in some of the newly sequenced fungal genomes. Throughout the assignments we will use freely available tools to conduct bioinformatics investigations of sequences, structures and relationships that may ultimately answer questions such as:
+
Our quest will revolve around a <span id="tf"></span>{{WP|Transcription factor|transcription factor}} that plays an important role in the regulation of the cell cycle. The genetic regulation of budding- and fission yeast cell-cycles has been lucidly described in a highly recommended review by McInerny (2011)<ref>{{#pmid: 21310294}}</ref> (see also the short, recent introduction to cell-cycle regulated transcription by McInerny (2016)<ref>{{#pmid: 27239285}}</ref>). One transcription factor, '''Mbp1''', is a key component of the MBF complex (Mbp1/Swi6) in yeast. This complex regulates gene expression at the crucial G1/S-phase transition of the mitotic cell cycle and has been shown to bind to the regulatory regions of more than a hundred target genes. It is therefore a DNA binding protein that acts as a control switch for a key cellular process, it is highly conserved across species, and it plays a role in human disease. Surely, understanding the mechanisms of this system would be "cargo".
  
*Do related proteins exist in other organisms?
+
      <!-- Column 1 end -->
*What functional features can we detect in the related proteins?
+
    </div>
*Do we have evidence that they may bind to similar sequence motifs?
+
    <div class="col2">
*Do we believe they may function in a similar way?
+
      <!-- Column 2 start -->
 +
We will start our quest by exploring the Mbp1 protein of Baker's yeast, ''Saccharomyces cerevisiae'', one of the most important {{WP|Model_organism|model organisms}}. Baker's yeast is a eukaryote that has been studied genetically and biochemically in great detail for many decades, and it is easily manipulated with high-throughput experimental methods. Then, next week, each of you will use this information to study not Baker's yeast, but a related organism about which we know comparatively little from experiments done in the lab. Our reasoning will rely on computational inference.
  
{{task|1=
 
Access the [http://db.yeastgenome.org/cgi-bin/locus.pl?locus=mbp1 information page on Mbp1] at the ''Saccharomyces'' Genome Database and read the summary paragraph on the protein's function!
 
  
(If you would like to brush up on the concepts mentioned above, you could study the corresponding chapter in [http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=mcb.chapter.3432 Lodish's Molecular Cell Biology] and/or read Nobel laureate {{PDFlink|[http://www.cumc.columbia.edu/dept/eukaryotic/nurse.pdf Paul Nurse's review]}} of the key concepts of the eukaryotic cell cycle. It is not strictly necessary to understand the details of the yeast cell-cycle to complete the assignments, but it's obviously more satisfying to work with concepts that actually make some sense.)
+
Here are the steps of the assignment for this week:
}}
 
  
For reference, this is the FASTA formatted sequence of Mbp1 from ''Saccharomyces cerevisiae'':
+
<div class="emphasis-box">
 +
# Start a "lab journal".
 +
# Explore what kind of data is available for the Mbp1 protein.
 +
# Define what data you want to store and work with, and design a data model.
 +
</div>
  
>gi|6320147|ref|NP_010227.1| Mbp1p [Saccharomyces cerevisiae S288c]
+
      <!-- Column 2 end -->
MSNQIYSARYSGVDVYEF<span style="color:#DD0000;">IHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGF
+
    </div>
GKYQGTWVPLNIAKQLAEKFSVY</span>DQLKPLFDFTQTDGSASPPPAPKHHHASKVDRKKAIRSASTSAIMET
+
  </div>
KRNNKKAEENQFQSSKILGNPTAAPRKRGRPVGSTRGSRRKLGVNLQRSQSDMGFPRPAIPNSSISTTQL
+
</div>
PSIRSTMGPQSPTLGILEEERHDSRQQQPQQNNSAQFKEIDLEDGLSSDVEPSQQLQQVFNQNTGFVPQQ
 
QSSLIQTQQTESMATSVSSSPSLPTSPGDFADSNPFEERFPGGGTSPIISMIPRYPVTSRPQTSDINDKV
 
NKYLSKLVDYFISNEMKSNKSLPQVLLHPPPHSAPYIDAPIDPELHTAFHWACSMGNLPIAEALYEAGTS
 
IRSTNSQGQTPLMRSSLFHNSYTRRTFPRIFQLLHETVFDIDSQSQTVIHHIVKRKSTTPSAVYYLDVVL
 
SKIKDFSPQYRIELLLNTQDKNGDTALHIASKNGDVVFFNTLVKMGALTTISNKEGLTANEIMNQQYEQM
 
MIQNGTNQHVNSSNTDLNIHVNTNNIETKNDVNSMVIMSPVSPSDYITYPSQIATNISRNIPNVVNSMKQ
 
MASIYNDLHEQHDNEIKSLQKTLKSISKTKIQVSLKTLEVLKESSKDENGEAQTNDDFEILSRLQEQNTK
 
KLRKRLIRYKRLIKQKLEYRQTVLLNKLIEDETQATTNNTVEKDNNTLERLELAQELTMLQLQRKNKLSS
 
LVKKFEDNAKIHKYRRIIREGTEMNIEEVDSSLDVILQTLIANNNKNKGAEQIITISNANSHA
 
  
I have highlighted the protein's <span style="color:#DD0000;">'''APSES''' domain</span> (also known as a {{WP|KilA-N domain}}), which is the DNA binding element of the sequence. Of course, such colouring is not part of the actual {{WP|FASTA_format|FASTA}} file which contains only a header and sequence letters. This is the domain we will focus on most in the following assignments.
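To make the "header plus sequence letters" structure concrete, here is a minimal '''R''' sketch that splits a single FASTA record into its two parts. The function name <code>parseFasta()</code> is a hypothetical helper introduced only for illustration, it assumes exactly one record per input, and the toy input contains just the first 18 residues of the sequence shown above.

```r
# Hypothetical helper (not part of the course material): split one FASTA
# record, given as a character vector of lines, into header and sequence.
parseFasta <- function(lines) {
  lines <- lines[nzchar(trimws(lines))]              # drop empty lines
  stopifnot(length(lines) >= 2, startsWith(lines[1], ">"))
  header   <- sub("^>", "", lines[1])                # header without ">"
  sequence <- paste(trimws(lines[-1]), collapse = "") # join sequence lines
  list(header = header, sequence = sequence)
}

# Toy record: real header, but only the first 18 residues of Mbp1,
# deliberately split over two lines.
rec <- parseFasta(c(
  ">gi|6320147|ref|NP_010227.1| Mbp1p [Saccharomyces cerevisiae S288c]",
  "MSNQIYSAR",
  "YSGVDVYEF"))

rec$header
nchar(rec$sequence)
```

A real file could be passed in as <code>parseFasta(readLines("some_file.fa"))</code>, where the file name is a placeholder. Multi-record files would need additional logic that is not sketched here.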
 
  
 +
&nbsp;
  
===Choosing YFO (Your Favourite Organism)===
+
===Keeping Labnotes===
 
 
  
The first task is to choose a species in which to conduct your explorations.
+
<div class="colmask doublepage">
 +
  <div class="colleft">
 +
    <div class="col1">
 +
      <!-- Column 1 start -->
 +
Before we all head off into the Internet: have you thought about how to document your "quest"? How will you keep notes? Obviously, computational research embraces the same ''best-practice'' principles as any wet-lab experiment. We keep notes to document our objectives and activities, we ensure our work is reproducible, and we take great care that our conclusions are supported by data. I think it's pretty obvious that paper notes are not very useful for bioinformatics work. Ideally, you should be able to save results, and link to files, Webpages and other resources.
  
 +
Consider it a part of your assignment to document your activities in electronic form. Here are some applications you might think of - but (!) disclaimer, I myself don't use any of these (yet) <small>(except the Wiki of course)</small>.
  
Many fungal genomes have been sequenced and more are added each year. For the purposes of the course assignments, we need a species
+
*[http://evernote.com '''Evernote'''] - a web hosted, automatically syncing e-notebook.
* that has transcription factors containing APSES domains;
+
*[http://nevernote.sourceforge.net/ '''Nevernote'''] - the Open Source alternative to Evernote.
* whose genome has been completely sequenced;
+
*[https://keep.google.com/ '''Google Keep'''] - if you have a Gmail account, you can simply log in here. Grid-based. Seems a bit awkward for longer notes. But of course you can also use [http://drive.google.com '''Google Docs'''].
* for which records exist in the RefSeq database, NCBI's unique sequence collection.
+
      <!-- Column 1 end -->
 +
    </div>
 +
    <div class="col2">
 +
      <!-- Column 2 start -->
  
 +
*[http://www.onenote.com/ '''Microsoft OneNote'''] - this sounds interesting and if anyone is using this, I'd like to hear from you. Syncing across platforms, being able to format contents and organize them sounds great.
 +
*[http://steipe.biochemistry.utoronto.ca/abc/students '''The Student Wiki'''] - of course. Beginning a project notes page is part of this assignment.
 +
*[https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects '''RStudio projects'''] - for development-focussed work – especially (but not exclusively) – in '''R''', an RStudio project may be the right solution to keep your code, results, notes, manuscript drafts, literature and other assets all in one place. The great benefit is that it can all be under version control and it's super easy to share everything with colleagues on a team through [https://github.com '''GitHub''']<ref>Technically, GitHub documents are all publicly accessible if they are stored in repositories of free accounts - but you can commit binary files, so simply keep sensitive material in password-protected .zip files or otherwise encrypt it.</ref>. The only downside that I can think of is that it's not possible to cross-reference and link to material.<ref>Actually, that's not even literally true. You could write a function to use the [https://support.rstudio.com/hc/en-us/articles/202133558-Extending-RStudio-with-the-Viewer-Pane "Viewer Pane"] for very general cross-referencing.</ref>
  
<div class="mw-collapsible mw-collapsed" data-expandtext="Expand for details" data-collapsetext="Collapse">To prepare such a list of species, I have searched the NCBI's RefSeq database for proteins whose sequences are similar to the APSES domain of Mbp1 and compiled the names of organisms that contain them. 
+
Are you aware of any other solutions? Let us know!
<div class="mw-collapsible-content">
 
  
:(1) Compiled a list of genome-sequenced fungi from information on the [http://www.ncbi.nlm.nih.gov/genome/browse/ NCBI genome browser page] by selecting Eukaryota / Fungi ... and downloading the entire list of species as a text document. An excerpt of the first lines of the document is shown here:
+
'''Keeping such a journal will be helpful, because the assignments are integrated over the entire term''', and later assignments will make use of earlier results. But it is also excellent practice for "real" research. Expand the section below for details - written from a Wiki perspective but generally applicable.
  
#Organism/Name            Kingdom    Group  SubGroup        Size (Mb)
+
       <!-- Column 2 end -->
Aciculosporium take       Eukaryota  Fungi  Ascomycetes    58.8364
+
    </div>
Agaricus bisporus        Eukaryota  Fungi  Basidiomycetes  32.6144
+
  </div>
Ajellomyces capsulatus    Eukaryota  Fungi  Ascomycetes    46.124
 
Ajellomyces dermatitidis  Eukaryota  Fungi  Ascomycetes    75.4047
 
[...]
 
 
 
:(2) Reformatted the document to provide an Entrez species selection command. With this string NCBI search tools can be constrained to a set of species we are interested in. One could type this list by hand, or use search/replace functions of a text editor on the original list. I used the following Perl one-liner which I give here merely for your edification<ref>If you are curious how this works, ask me.</ref>.
 
<br />
 
::<small><code>perl -e 'while(<STDIN>){/^(.+?)\t/;print"\"$1\"[organism] OR \n"}' < genomes_overview.txt
 
</code></small>
 
 
 
... giving me the Entrez selection command (with over 400 species):
 
 
 
"Aciculosporium take"[organism] OR
 
"Agaricus bisporus"[organism] OR
 
"Ajellomyces capsulatus"[organism] OR
 
"Ajellomyces dermatitidis"[organism] ...
 
 
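For comparison, the same reformatting can be sketched in '''R'''. This is not the procedure that was actually used (that was the Perl one-liner); <code>makeEntrezQuery()</code> is a hypothetical helper, and unlike the one-liner it skips the <code>#</code> header line and does not leave a trailing <code>OR</code>. The demo input is the first two species from the excerpt above, assuming tab-separated columns.

```r
# Hypothetical helper: build an Entrez organism filter from the first
# tab-separated column of the genome overview table.
makeEntrezQuery <- function(lines) {
  lines <- lines[nzchar(lines)]                 # drop empty lines
  lines <- lines[!startsWith(lines, "#")]       # skip the column header
  org   <- trimws(sub("\t.*$", "", lines))      # text before the first tab
  paste(sprintf("\"%s\"[organism]", org), collapse = " OR ")
}

# Demo input: two rows from the excerpt, with assumed tab separators.
demo <- c("#Organism/Name\tKingdom\tGroup\tSubGroup\tSize (Mb)",
          "Aciculosporium take\tEukaryota\tFungi\tAscomycetes\t58.8364",
          "Agaricus bisporus\tEukaryota\tFungi\tBasidiomycetes\t32.6144")

query <- makeEntrezQuery(demo)
cat(query, "\n")
```

With the downloaded file, the call would be <code>makeEntrezQuery(readLines("genomes_overview.txt"))</code>. Joining with <code>collapse</code> rather than appending <code>OR</code> to every line means the result can be pasted into the Entrez Query field without trimming the end.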
 
 
 
:(3) Performed a [http://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&BLAST_PROGRAMS=blastp&PAGE_TYPE=BlastSearch&SHOW_DEFAULTS=on&LINK_LOC=blasthome PSI BLAST] search with the Mbp1 APSES domain sequence shown above, the search database restricted to the '''refseq_protein database''' and the '''Entrez Query''' created as explained above. This search was iterated a few times and retrieves all sequence-similar proteins from genome sequenced fungi for which entries exist in the RefSeq database.<ref>Actually, there is a bit of a detour required here: the list of selection commands is too long and had to be broken down into four batches of about 100 species to be processed by the BLAST server.</ref>
 
 
 
:(4) In the header of the BLAST results page, there is a link to '''[Taxonomy reports]'''. This contains a list of all hits, sorted by species. I copied the species names to a separate file - applying a bit of manual editing: removing duplicate genus entries, and the six reference species ''Saccharomyces cerevisiae'', ''Aspergillus nidulans'', ''Candida albicans'', ''Neurospora crassa'', ''Schizosaccharomyces pombe'', and ''Ustilago maydis'' - these are not being assigned to the class.
 
 
 
 
 
:(5) Finally, I extracted a 5 letter code from the binomial names and formatted everything as '''R''' code to be used below. Again, a Perl one-liner. It applies a regular expression to extract the first three characters of the genus name and the first two characters of the species name and combines these into a short, uppercase label.<br/>
 
::<small><br /><code>perl -e 'while(<STDIN>){m/^((...).+?\s(..).*?)\s/;print("\t\t\"$1 (", uc($2.$3), ")\",\n");}' < BLAST_species.txt</code></small>
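The labelling rule described in step (5), the first three characters of the genus plus the first two of the species in uppercase, can also be sketched in '''R'''. The function <code>makeLabel()</code> is a hypothetical helper written for illustration, not the code that produced the list, and it assumes that every name has at least two whitespace-separated words.

```r
# Hypothetical helper: derive the five-letter label from a binomial name
# and return it in the "Genus species (LABEL)" format used in the list.
makeLabel <- function(binomial) {
  parts   <- strsplit(trimws(binomial), "\\s+")
  genus   <- vapply(parts, function(p) p[1], character(1))
  species <- vapply(parts, function(p) p[2], character(1))
  code    <- toupper(paste0(substr(genus, 1, 3), substr(species, 1, 2)))
  sprintf("%s (%s)", binomial, code)
}

labels <- makeLabel(c("Agaricus bisporus",
                      "Melampsora larici-populina",
                      "Paracoccidioides sp."))
print(labels)
```

Because the function is vectorized, a whole file of names can be processed with <code>makeLabel(readLines("BLAST_species.txt"))</code>.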
 
 
 
This process with its mix of Web service, programmed reformatting and manual cleanup, is a fairly typical example of gathering and collating information across different data sources.
 
</div>
 
 
</div>
 
</div>
  
&nbsp;
 
<div class="mw-collapsible mw-collapsed" data-expandtext="Expand for details" data-collapsetext="Collapse">
 
Next, I would like to assign species from this list to each student. This process should be random, but reproducible.
 
  
<div class="mw-collapsible-content"> Here's an idea: we could use the student ID ( a '''unique identifier''') to pick entries from the list! Indeed, the functions provided in '''R''' can easily be used to randomly but reproducibly choose an element from a list. Essentially we can write a function that creates a many-faced die, with a piece of text&mdash;a species' name&mdash; on every face. It will fall differently for each student ID, but will fall the same every time the same ID is encountered.
+
<div class="mw-collapsible mw-collapsed" style="background-color: #DAE9F5;" data-expandtext="Expand for details" data-collapsetext="Collapse">&nbsp;
 
+
<div class="mw-collapsible-content">
This makes use of the fact that "random" numbers generated by a computer algorithm aren't really random: they are "pseudorandom", generated by a deterministic algorithm. Such an algorithm takes a number&mdash;a ''seed''&mdash; and mangles it until the result has no recognizable connection to the seed. The result is indistinguishable from a random number. However if we use the ''same seed'', we will always get the ''same result''. Such a random pick can be programmed with the following steps:
 
# Create a list
 
# Initialize a random number generator with a student ID as a seed
 
# pick a random integer "''i''" in the range from first to last element of the list
 
# return the ''i''-th list element.
 
</div>
 
</div>
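Before reading the full function, it may help to see the seeding idea in isolation. The following is a small sketch with a hypothetical helper <code>drawIndex()</code> and an arbitrary list length of 10. It only demonstrates that the same seed reproduces the same index, and that the index stays within the bounds of the list.

```r
# Hypothetical helper: seed the generator, draw one uniform random number,
# and scale it to an index between 1 and n.
drawIndex <- function(seed, n) {
  set.seed(seed)                 # same seed -> same pseudorandom sequence
  ceiling(n * runif(1, 0, 1))    # map the interval (0, 1) onto 1 .. n
}

n      <- 10                     # arbitrary list length for the demo
first  <- drawIndex(991234567, n)
second <- drawIndex(991234567, n)

identical(first, second)         # the draw is reproducible
first >= 1 && first <= n         # and it is a valid index
```

Note that <code>set.seed()</code> expects a value that fits into an integer, so this sketch assumes the ID is smaller than about 2.1 billion.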
 
 
 
Here is '''R''' code to accomplish this:
 
 
 
{{task|
 
 
 
* Read, try to understand and then execute the following R-code.
 
 
 
<source lang="rsplus">
 
pickSpecies <- function(ID) {
 
# this function randomly picks a fungal species
 
# from a list. It is seeded by a student ID. Therefore
 
# the pick is random, but reproducible.
 
 
# first, define a list of species:
 
Species <- c(
 
"Agaricus bisporus (AGABI)",
 
"Ajellomyces dermatitidis (AJEDE)",
 
"Arthroderma otae (ARTOT)",
 
"Ashbya gossypii (ASHGO)",
 
"Auricularia delicata (AURDE)",
 
"Baudoinia compniacensis (BAUCO)",
 
"Beauveria bassiana (BEABA)",
 
"Bipolaris oryzae (BIPOR)",
 
"Botrytis cinerea (BOTCI)",
 
"Capronia coronata (CAPCO)",
 
"Chaetomium globosum (CHAGL)",
 
"Cladophialophora psammophila (CLAPS)",
 
"Clavispora lusitaniae (CLALU)",
 
"Coccidioides immitis (COCIM)",
 
"Colletotrichum fioriniae (COLFI)",
 
"Coniophora puteana (CONPU)",
 
"Coniosporium apollinis (CONAP)",
 
"Coprinopsis cinerea (COPCI)",
 
"Cryptococcus neoformans (CRYNE)",
 
"Cyphellophora europaea (CYPEU)",
 
"Debaryomyces hansenii (DEBHA)",
 
"Dichomitus squalens (DICSQ)",
 
"Endocarpon pusillum (ENDPU)",
 
"Eutypa lata (EUTLA)",
 
"Exophiala dermatitidis (EXODE)",
 
"Fomitiporia mediterranea (FOMME)",
 
"Fusarium graminearum (FUSGR)",
 
"Glarea lozoyensis (GLALO)",
 
"Gloeophyllum trabeum (GLOTR)",
 
"Kazachstania africana (KAZAF)",
 
"Kluyveromyces lactis (KLULA)",
 
"Komagataella pastoris (KOMPA)",
 
"Laccaria bicolor (LACBI)",
 
"Lachancea thermotolerans (LACTH)",
 
"Leptosphaeria maculans (LEPMA)",
 
"Lodderomyces elongisporus (LODEL)",
 
"Magnaporthe oryzae (MAGOR)",
 
"Malassezia globosa (MALGL)",
 
"Marssonina brunnea (MARBR)",
 
"Melampsora larici-populina (MELLA)",
 
"Metarhizium acridum (METAC)",
 
"Meyerozyma guilliermondii (MEYGU)",
 
"Microsporum gypseum (MICGY)",
 
"Millerozyma farinosa (MILFA)",
 
"Moniliophthora roreri (MONRO)",
 
"Myceliophthora thermophila (MYCTH)",
 
"Naumovozyma castellii (NAUCA)",
 
"Nectria haematococca (NECHA)",
 
"Neofusicoccum parvum (NEOPA)",
 
"Neosartorya fischeri (NEOFI)",
 
"Paracoccidioides sp. (PARSP)",
 
"Pestalotiopsis fici (PESFI)",
 
"Phaeosphaeria nodorum (PHANO)",
 
"Phanerochaete carnosa (PHACA)",
 
"Pneumocystis murina (PNEMU)",
 
"Podospora anserina (PODAN)",
 
"Postia placenta (POSPL)",
 
"Pseudocercospora fijiensis (PSEFI)",
 
"Pseudozyma flocculosa (PSEFL)",
 
"Puccinia graminis (PUCGR)",
 
"Punctularia strigosozonata (PUNST)",
 
"Pyrenophora tritici-repentis (PYRTR)",
 
"Scheffersomyces stipitis (SCHST)",
 
"Schizophyllum commune (SCHCO)",
 
"Sclerotinia sclerotiorum (SCLSC)",
 
"Serpula lacrymans (SERLA)",
 
"Setosphaeria turcica (SETTU)",
 
"Sordaria macrospora (SORMA)",
 
"Spathaspora passalidarum (SPAPA)",
 
"Stereum hirsutum (STEHI)",
 
"Talaromyces marneffei (TALMA)",
 
"Tetrapisispora blattae (TETBL)",
 
"Thielavia terrestris (THITE)",
 
"Togninia minima (TOGMI)",
 
"Torulaspora delbrueckii (TORDE)",
 
"Trametes versicolor (TRAVE)",
 
"Tremella mesenterica (TREME)",
 
"Trichoderma reesei (TRIRE)",
 
"Trichophyton rubrum (TRIRU)",
 
"Tuber melanosporum (TUBME)",
 
"Uncinocarpus reesii (UNCRE)",
 
"Vanderwaltozyma polyspora (VANPO)",
 
"Verticillium alfalfae (VERAL)",
 
"Wallemia sebi (WALSE)",
 
"Yarrowia lipolytica (YARLI)",
 
"Zygosaccharomyces rouxii (ZYGRO)",
 
"Zymoseptoria tritici (ZYMTR)"
 
)
 
l <- length(Species)    # number of elements in the list
 
set.seed(ID)            # seed the random number generator
 
                        # with the student ID
 
i <- runif(1, 0, 1)    # pick one random number between 0 and 1
 
i <- l * i              # multiply with number of elements
 
i <- ceiling(i)        # round up to nearest integer
 
choice <- Species[i]    # pick the i'th element from list
 
return(choice)
 
}
 
</source>
 
 
 
* Execute the function <code>pickSpecies()</code> with your student ID as its parameter. Example:
 
 
 
<source lang="text">
 
> pickSpecies(991234567)
 
[1] "Coccidioides immitis (COCIM)"
 
</source>
 
* Note down the species name and its five letter label on your student Wiki page. '''Use this species whenever this or future assignments refer to YFO'''.
 
}}
 
 
 
 
 
 
 
{{task|
 
* While you already have '''R''' open, access the  [[R tutorial|'''R tutorial''']] and work through the section on [[R tutorial#Simple_commands|'''Simple commands''']]. It is short, and will help you understand the code above.
 
}}
 
  
 +
<div class="colmask doublepage">
 +
  <div class="colleft">
 +
    <div class="col1">
 +
      <!-- Column 1 start -->
 +
Remember you are writing a lab notebook&mdash;not a formal lab report: a point-form record of your actual activities. Write such documentation as notes to your (future) self.
  
  
&nbsp;
+
Create a lab-notes page as a subpage of your User space on [http://steipe.biochemistry.utoronto.ca/abc/students '''the Student Wiki'''].
  
===Keeping a notebook on your Wiki===
 
 
 
<div class="mw-collapsible mw-collapsed" data-expandtext="Expand for details" data-collapsetext="Collapse">Consider it a part of your assignment to document your activities on your Wiki page.
 
<div class="mw-collapsible-content">
 
You should write your documentation like a lab notebook&mdash;not a formal lab report, but a point-form record of your actual activities. Write such documentation as notes to your (future) self. Obviously, since much of the work will be done on the Web, an electronic notebook makes more sense than a paper notebook.
 
  
 
For each task:
 
*;Write a header and give it a unique number.
:: This is useful so you can refer to the header number in later text. Obviously, you should "hard-code" the number and not use the Wiki's automatic section numbering scheme, since the numbers should be stable over time, not change when you add or delete a section.
+
:: This is useful so you can refer to the header number in later text. Obviously, you should "hard-code" the number and not use the Wiki's automatic section numbering scheme, since the numbers should be stable over time, not change when you add or delete a section. It may be useful to add any new contents at the top of the page. If the page is in ''reverse chronological order'', you don't have to scroll to the bottom of the page every time you add new material. The sections do not actually have to be in strict chronological order, like we would have them in a paper notebook. It may be advantageous to give different subprojects their own page, or at least their own section on one page. Just remember that things that are on the same page are easy to find. Incidentally: the material in such a notebook is "permanent", since earlier versions of pages are always available via the history function. The Wiki never forgets. And that's actually a step beyond paper labnotes.
  
 
*;State the objective.
 
Line 263: Line 107:
  
 
*;Document the procedure.
 
:: Note what you have done, as concisely as possible. Give enough information so that anyone could reproduce unambiguously what you have done&mdash; your future project student, or even your future self.
+
:: Note what you have done, as concisely as possible but with sufficient detail. I am often asked: "What is sufficient detail"? The answer is easy: detailed enough so that someone can reproduce what you have done. In practice that guy will often be you, yourself, in the future. I hope that you won't be constantly cursing your past-self because of omissions!
  
 
*;Document your results.
 
: You can distinguish different types of results -
  
**'''Static data''' does not change over time and it may be sufficient to note a '''reference''' to the result. For example, there is no need to copy a genbank record into your documentation, it is sufficient to note the accession number or the GI number.
+
**'''Static data''' does not change over time and it may be sufficient to note a '''reference''' to the result. For example, there is no need to copy a GenBank record into your documentation; it is sufficient to note the accession number, the RefSeq or UniProt ID, or, even better, to link to its page on the database server.
 
**'''Variable data''' can change over time. For example the results of a BLAST search depend on the sequences in the database. A list of similar structures may change as new structures get solved. In principle you want to record such data, to be able to reproduce at a later time what your conclusions were based on. But be '''selective''' in what you record. For example you should not paste the entire set of results of a BLAST search into your document, but only those matches that were important for your conclusions. '''Indiscriminate pasting of irrelevant information will make your notes unusable.'''
 
 
**'''Analysis results'''
 
Line 274: Line 118:
  
 
*;Note your conclusions.
 
::An analysis is not complete unless you conclude something from the results. (Remember what we said about "Cargo Cult Science". If there is no conclusion possible, your activities are quite pointless.) Are two sequences likely homologues, or not? Does your protein contain a signal-sequence or does it not? Is a binding site conserved, or not? The analysis provides the data. In your '''conclusion''' you provide the interpretation of what the data means '''in the context of your objective'''. Sometimes your assignment task will ask you to elaborate on an analysis and conclusion. But this does not mean that when the assignment does not explicitly mention it, you don't need to interpret your data.
+
::'''An analysis is not complete unless you conclude something from the results.''' (Remember what we said about "Cargo Cult Science". If there is no conclusion, your activities are quite pointless.)  
 +
:::*Are two sequences likely homologues, or not? Just pasting the BLAST output is not enough.
 +
:::*Does your protein contain a signal-sequence or does it not? SignalP will give you a probability, but '''you''' must make the final call.
 +
:::*Is a binding site conserved, or not? The programs can only point out sections of similarity or dissimilarity. '''You''' are the one who interprets these numbers in their biological context.
 +
 
 +
::The analysis provides the data. In your '''conclusion''' you provide the interpretation of what the data means '''in the context of your objective'''. Were you expecting a signal-sequence but there isn't one? What could that mean? Sometimes your assignment task in this course will ask you to elaborate on an analysis and conclusion. But this does not mean that when I don't explicitly mention it, you can skip the interpretation.
 +
 
 +
      <!-- Column 1 end -->
 +
    </div>
 +
    <div class="col2">
 +
      <!-- Column 2 start -->
 +
*;Add cross-references.
 +
::Cross-references to other information are super valuable as your documentation grows. It's easy to see how to format a link to a section of your Wiki-page: just look at the link under the Table of Contents at the top. But you can also place "anchors" for linking anywhere on an HTML page, using the following syntax: <code>&lt;span id="{some-label}"&gt;&lt;/span&gt;</code> for the anchor, and append <code>#{some-label}</code> to the page URL.
  
 
*;Use discretion when uploading images
 
Line 284: Line 140:
  
 
*;Prepare your images well
 
::Don't upload uncompressed screendumps. Save images in a compressed file format on your own computer. Then use the '''Special:Upload''' link in the left-hand menu to upload images. The Wiki will only accept <code>.jpeg</code> or <code>.png</code> images.
+
::Don't upload uncompressed screen dumps. Save images in a compressed file format on your own computer. Then use the '''Special:Upload''' link in the left-hand menu to upload images. The Wiki will only accept <code>.jpeg</code> or <code>.png</code> images.
  
 
*;Use the correct image types.
 
::In principle, images can be stored ''uncompressed'' as <code>.tiff</code> or <code>.bmp</code>, or ''compressed'' as <code>.gif</code> or <code>.jpg</code> or <code>.png</code>. {{WP|GIF|<code>.gif</code>}} is useful for images with large, monochrome areas and sharp, high-contrast edges because the LZW compression algorithm it uses works especially well on such data; {{WP|JPEG|'''<code>.jpg</code>'''}} (or <code>.jpeg</code>) is preferred for images with shades and halftones such as the structure views you should prepare for several assignments, '''JPEG''' has excellent application support and is the most versatile general purpose image file format currently in use; {{WP|Tagged_Image_File_Format|'''<code>.tiff</code>'''}} (or <code>.tif</code>) is preferred to archive master copies of images in a lossless fashion, use LZW compression for TIFF files if your system/application supports it; The {{WP|Portable_Network_Graphics|'''<code>.png</code>'''}} format is an {{WP|Open_source|open source}} alternative for lossless, compressed images. Application support is growing but still variable. {{WP|BMP_file_format|'''<code>.bmp</code>'''}} is not preferred for really anything, it is bloated in its (default) uncompressed form and primarily used only because it is simple to code and ubiquitous on Windows computers.
+
::In principle, images can be stored ''uncompressed'' as <code>.tiff</code> or <code>.bmp</code>, or ''compressed'' as <code>.gif</code> or <code>.jpg</code> or <code>.png</code>. {{WP|GIF|<code>.gif</code>}} is useful for images with large, monochrome areas and sharp, high-contrast edges because the LZW compression algorithm it uses works especially well on such data. {{WP|JPEG|'''<code>.jpg</code>'''}} (or <code>.jpeg</code>) is preferred for images with shades and halftones, such as the structure views you should prepare for several assignments; '''JPEG''' has excellent application support and is the most versatile general-purpose image file format currently in use. {{WP|Tagged_Image_File_Format|'''<code>.tiff</code>'''}} (or <code>.tif</code>) is preferred to archive master copies of images in a lossless fashion; use LZW compression for TIFF files if your system/application supports it. The {{WP|Portable_Network_Graphics|'''<code>.png</code>'''}} format is an {{WP|Open_source|open source}} alternative for lossless, compressed images.
 +
{{WP|BMP_file_format|'''<code>.bmp</code>'''}} is not preferred for anything: it is bloated in its (default) uncompressed form and is used primarily because it is simple to code and ubiquitous on Windows computers.
  
 
;Image dimensions and resolution
 
;Image dimensions and resolution
Line 299: Line 156:
 
;Keep your images uncluttered and expressive
 
;Keep your images uncluttered and expressive
 
:Scale the molecular model to fill the available space of your image well. Orient views so they illustrate a point you are trying to make. Emphasize residues that you are writing about with a contrasting colouring scheme. Add labels, where residue identities are not otherwise obvious. Turn off side-chains for residues that are not important. The more you practice these small details, the more efficient you will become in the use of your tools.
 
:Scale the molecular model to fill the available space of your image well. Orient views so they illustrate a point you are trying to make. Emphasize residues that you are writing about with a contrasting colouring scheme. Add labels, where residue identities are not otherwise obvious. Turn off side-chains for residues that are not important. The more you practice these small details, the more efficient you will become in the use of your tools.
 +
 +
:If you have technical difficulties, post your questions to the list and/or contact me.      <!-- Column 2 end -->
 +
    </div>
 +
  </div>
 +
</div>
 +
  
  
:If you have technical difficulties, post your questions to the list and/or contact me.
 
 
</div>
 
</div>
 
</div>
 
</div>
  
Keeping such a journal will be helpful, because the assignment is more or less integrated over the entire term, and later assignments will make use of earlier results. But it is also excellent practice for "real" research.
 
  
 +
&nbsp;
 +
 +
==Data Sources==
 +
 +
 +
===SGD - a Yeast Model Organism Database===
 +
<div class="colmask doublepage">
 +
  <div class="colleft">
 +
    <div class="col1">
 +
      <!-- Column 1 start -->
 +
Yeast happens to have a very well maintained '''model organism database''' - a Web resource dedicated to ''Saccharomyces cerevisiae''. Where such resources are available, they are very useful for the community. For the general case however, we need to work with one of the large, general data providers - the NCBI and the EBI. But in order to get a sense of the type of data that is available, let's visit the SGD database first.
 +
 +
{{task|1=
 +
Access the [http://db.yeastgenome.org/cgi-bin/locus.pl?locus=mbp1 information page on Mbp1] at the ''Saccharomyces'' Genome Database.
 +
 +
<ol>
 +
<li>Browse through the '''Summary''' page and note the available information. You should see:
 +
  <ul>
 +
    <li>information about the gene and the protein;
 +
    <li>Information about its roles in the cell, curated at the Gene Ontology database;
 +
    <li>Information about knock-out phenotypes; <small>(Amazing. Would you have imagined that this is a non-essential gene?)</small>
 +
    <li>Information about protein-protein interactions;
 +
    <li>Regulation and expression;
 +
    <li>'''A curators' summary of our understanding of the protein.''' Mandatory reading.
 +
    <li>And key references.
 +
  </ul>
 +
<li>Access the [http://www.yeastgenome.org/locus/S000002214/protein '''Protein''' tab] and note the much more detailed information.
 +
  <ul>
 +
    <li>Domains and their classification;
 +
    <li>Sequence;
 +
    <li>Shared domains;
 +
    <li>and much more...
 +
  </ul>
 +
 +
</ol>
 +
 +
}}
 +
 +
      <!-- Column 1 end -->
 +
    </div>
 +
    <div class="col2">
 +
      <!-- Column 2 start -->
 +
You will notice that some of this information relates to the molecule itself, and some of it relates to its relationship with other molecules. Some of it is stored at SGD, and some of it is cross-referenced from other databases. And we have textual data, numeric data, and images.
 +
 +
How would you store such data to use it in your project? We will work on this question at the end of the assignment.
  
 
&nbsp;
 
&nbsp;
  
==NCBI databases==
+
<hr style="width:33%; text-align:right; margin-right:0; height:1px;border-width:0;background-color:#999999;">
  
Let us explore some of the offerings of the NCBI that can contribute to our objective of studying a particular gene in a newly sequenced organism.
+
&nbsp;
  
 +
If we were working on yeast, most of the data we need would be right here: curated, kept current and consistent, referenced to the literature and ready to use. But you'll be working on a different species as of the next assignment, and you will need to integrate data yourself, from data sources such as the NCBI or UniProt. The upside is that most information of this kind '''is available''' for many, many species. The downside is that we'll have to integrate information from many different sources essentially "by hand".
 +
 +
      <!-- Column 2 end -->
 +
    </div>
 +
  </div>
 +
</div>
 +
 +
 +
 +
&nbsp;
 +
 +
===NCBI databases===
 +
 +
 +
The [http://www.ncbi.nlm.nih.gov/guide/sitemap/ '''NCBI''' (National Center for Biotechnology Information)] is the largest international provider of data for genomics and molecular biology. With its annual budget of several hundred million dollars, it organizes a challenging program of data management at the largest scale, it makes its data freely and openly available over the Internet, worldwide, and it runs significant in-house research projects.
 +
 +
Let us explore some of the offerings of the NCBI that can contribute to our objective of studying a particular gene in an organism of interest.
 +
 +
 +
&nbsp;
 +
 +
====Entrez====
  
===Entrez===
 
  
 
{{task|1=
 
{{task|1=
<small>Remember to document your activities.</small>
+
<small>Remember to '''document''' your activities as lab-notes on your Wiki.</small>
  
# Access the '''NCBI''' website at http://www.ncbi.nlm.nih.gov/  
+
# Access the '''NCBI''' website at http://www.ncbi.nlm.nih.gov/ <ref>If you find this URL hard to remember, consider the acronyms:<br />
 +
:ncbi.nlm.nih.gov
 +
:NCBI: National Center for Biotechnology Information<br />
 +
:NLM: National Library of Medicine<br />
 +
:NIH: National Institutes of Health<br />
 +
:GOV: the US GOVernment top-level domain<br />
 +
</ref>
 
# In the search bar, enter <code>mbp1</code> and click '''Search'''.
 
# In the search bar, enter <code>mbp1</code> and click '''Search'''.
# On the resulting page, look for the '''Protein''' section and click on it. What do you find?
+
# On the resulting page, look for the '''Protein''' section and click on the link. What do you find?
 
}}
 
}}
  
  
The result page of your search in "All Databases" is the "Global Query Result Page" of the Entrez system. If you follow the "Protein" link, you get taken to the 200 or so sequences in the NCBI Protein database. But looking at that page, you see that the result is quite non-specific: searching only by gene name retrieves an ''Arabidopsis'' protein, a ''Saccharomyces'' protein (presumably one that we might be interested in), Maltose Binding Proteins - and much more. There must be a more specific way to search, and indeed there is. Time to read up on the Entrez system.
+
The result page of your search in "All Databases" is the "Global Query Result Page" of the Entrez system. If you follow the "Protein" link, you get taken to the more than 530 sequences in the NCBI Protein database that contain the keyword "mbp1". But when you look more closely at the results, you see that the result is quite non-specific: searching only by keyword retrieves a multiubiquitin chain binding protein in ''Arabidopsis'', bacterial mannose binding proteins, a ''Saccharomyces'' protein (perhaps one that we are actually interested in), maltose binding proteins, myelin basic proteins - and much more. There must be a more specific way to search, and indeed there is. Time to read up on the Entrez system.
  
  
Line 341: Line 274:
 
## How to restrict a search to a particular organism.
 
## How to restrict a search to a particular organism.
  
Don't skip this part, you don't need to know the options by heart, but you should know they exist and how to find them.
+
Don't skip this part: you should know the more common options and how to find the others. It would be great to have a synopsis of the important fields for reference, wouldn't it? Why don't you go and make one: I have put a template page on the Student Wiki ([http://steipe.biochemistry.utoronto.ca/abc/students/index.php/Entrez '''A synopsis of Entrez codes''']). Contributors and editors welcome!
 
}}
 
}}
  
  
Keyword and organism searches are pretty universal, but apart from that, each NCBI database has its own set of specific fields. You can access them via the Advanced Search interface of any of the database pages.
+
Keyword and organism searches are pretty universal, but apart from that, each NCBI database has its own set of specific fields. You can access the keywords via the '''Advanced Search''' interface of any of the database pages.
  
  
===Protein===
+
&nbsp;
  
 +
====Protein Sequence====
  
 +
<div class="colmask doublepage">
 +
  <div class="colleft">
 +
    <div class="col1">
 +
      <!-- Column 1 start -->
 
&nbsp;
 
&nbsp;
 
{{task|1=
 
{{task|1=
  
Now try the search for Mbp1 in Baker's Yeast alone. Return to the Global Search page and enter:
+
With this knowledge we can restrict the search to proteins called "Mbp1" that occur in Baker's Yeast. Return to the [http://www.ncbi.nlm.nih.gov Global Search page] and in the search field, type:
  
  Mbp1[protein name] AND "Saccharomyces cerevisiae"[organism]
+
  [http://www.ncbi.nlm.nih.gov/gquery/?term=Mbp1%5Bprotein+name%5D+AND+%22Saccharomyces+cerevisiae%22%5Borganism%5D Mbp1&#91;protein name&#93; AND<br /> "Saccharomyces cerevisiae"&#91;organism&#93;]
  
 
}}
 
}}
  
  
This should find one and only one protein. Follow the link into the protein database: since this is only one record, the link takes you directly to the result&mdash;a data record in Genbank Flat File (GFF) format, not to a list of hits, as before. Explore the record and familiarize yourself with the information that is there.
+
This finds two proteins. Follow the link to the result <code>CAA98618.1</code>, a data record in GenBank Flat File (GFF) format<ref>If there is only a single match, you will be taken directly to the page.</ref>. The database identifier <code>CAA98618.1</code> tells you that this is a record in the GenPept database. There are actually several identical versions of this sequence in the NCBI's holdings. A link to [http://www.ncbi.nlm.nih.gov/protein/1431055?report=ipg "Identical Proteins"] near the top of the record shows you what these are:
  
All well and good - but didn't we want to find '''RefSeq''' entries, since that is expected to be the database of unique, curated sequence records? I can't tell you why the RefSeq result was not listed among the search results. But I can at least tell you how to find it:
 
  
 +
Some of the sequences represent duplicate entries of the same gene (Mbp1) in the same strain (S288c) of the same species (''S. cerevisiae''). In particular:
  
{{task|1=
 
  
# In the right-hand margin of the record, you will find a section of '''Identical proteins ...''': click on '''See all...''' to list them all. Among these, find the entry with an accession number like <code>NP_123456</code>. This is a RefSeq ID. Follow the link.
+
* there are seven records for which the source is [http://www.insdc.org/ the INSDC]; these are archival entries, submitted by independent yeast genome research projects;
# Explore the resulting page. You will notice that the information elements are not identical, even though these are sequence records for one and the same yeast gene product, in two similar databases, at the same data provider!
 
# Note down the RefSeq ID, you will probably need it later on.
 
}}
 
  
 +
* there are two entries in the '''RefSeq''' database linking to the same protein: [http://www.ncbi.nlm.nih.gov/protein/NP_010227.1 <code>NP_010227.1</code>]. One is derived from genome sequence, the other from mRNA. This RefSeq entry is the preferred version of the sequence for us to work with. RefSeq is a curated, non-redundant database which solves a number of problems of archival databases. You can recognize RefSeq identifiers by their pattern &ndash; they always look like NP_12345.1, NM_12345.1, XP_12345.1, NC_12345.1 etc. The prefix reflects whether the sequence is protein, mRNA or genomic, and whether it was inferred or obtained through experimental evidence.
  
All well and good, and the Mbp1 protein is going to accompany us throughout the term&mdash;but we were actually trying to find related proteins in YFO. Let's give that a try.
+
* there is a '''SwissProt''' sequence [http://www.ncbi.nlm.nih.gov/protein/P39678.1 <code>P39678.1</code>]<ref>Actually the "real" SwissProt identifier would be patterned like <code>MBP1_YEAST</code>. <code>P39678</code> is the corresponding UniProt identifier.</ref>. This link is kind of a big deal. It's a cross-reference into [http://www.uniprot.org/uniprot/P39678 '''UniProt'''], the huge protein sequence database maintained by the [http://www.ebi.ac.uk/ '''EBI''' (European Bioinformatics Institute)], which is the NCBI's counterpart in Europe. SwissProt entries have the highest annotation standard overall and are expertly curated. Many Web services that we will encounter work with UniProt IDs (e.g. <code>P39678.1</code>), rather than RefSeq IDs. Until recently, however, the two databases did not link to each other, mostly for reasons of funding politics. It's great to see that this divide has now been overcome.
  
 +
      <!-- Column 1 end -->
 +
    </div>
 +
    <div class="col2">
 +
      <!-- Column 2 start -->
  
{{task|1=
 
  
# Again in the right hand margin, find the section on '''Related Information''' and follow the link to '''Related Sequences'''. There are many. More than 21,000 actually<ref>21,000 related, non-identical sequences! What a treasure trove of information, the successful results of millennia of experimentation by nature. Now, if we could only read and understand this information ...</ref>. Definitely more than you would like to browse through to find the sequences in YFO. Let's use a filter on these results.
+
*Note that some of the entries are for the same sequence in different yeast strains. These don't '''have''' to be identical, they just happen to be. Sometimes we find identical sequences in quite divergent species. Therefore I would not actually consider [http://www.ncbi.nlm.nih.gov/protein/EIW11153.1 <code>EIW11153.1</code>], [http://www.ncbi.nlm.nih.gov/protein/AJU86440.1 <code>AJU86440.1</code>], [http://www.ncbi.nlm.nih.gov/protein/AJU58508.1 <code>AJU58508.1</code>], and [http://www.ncbi.nlm.nih.gov/protein/AJU61971.1 <code>AJU61971.1</code>] to be identical proteins, although they have the same sequence.
# Click on the '''Advanced''' link to access the search history that brought you here. Since you have read the Entrez page, you should be able to understand clearly that you can type something like
 
#4 AND "Schizosaccharomyces pombe"[organism]
 
... or whatever your command-history number resp. YFO name suggests.
 
  
You should find a handful of genes, all of them in YFO. If you find none, or hundreds, you did something wrong. Ask on the mailing list and make sure to fix the problem.
 
}}
 
  
 +
Note all the <code>.1</code> suffixes of the sequence identifiers. These are version numbers. Two observations:
 +
# It's great that version numbers are now used throughout the NCBI database. This is good database engineering practice because it's really important for reproducible research that updates to database records are possible, but recognizable. When working with data you always '''must''' provide for the possibility of updates, and manage the changes transparently and explicitly. Proper versioning should be a part of '''all''' data models. In fact, the NCBI is currently phasing out its internal unique identifiers – the GI number – in favour of [https://www.ncbi.nlm.nih.gov/news/03-02-2016-phase-out-of-GI-numbers/ accession-number.version IDs].
 +
# When searching, or for general use, you can (and should) '''omit the version number''', i.e. use <code>NP_010227</code> or <code>P39678</code> rather than <code>NP_010227.1</code> or <code>P39678.1</code>. This way the database system will resolve the identifier to the most current, highest version number (unless you '''want''' the older one, of course).
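
To make the <code>accession.version</code> pattern concrete, here is a small sketch of how such identifiers could be taken apart in code. Everything in it is an assumption for illustration: the function names <code>split_accession</code> and <code>refseq_type</code> are hypothetical, and the prefix table lists only the four RefSeq prefixes mentioned above, not the full set.

```python
# Molecule types for the four RefSeq prefixes mentioned in the text.
# This table is deliberately incomplete; RefSeq defines more prefixes.
REFSEQ_PREFIXES = {
    "NP": "protein",
    "XP": "protein (predicted model)",
    "NM": "mRNA",
    "NC": "genomic",
}


def split_accession(identifier):
    """Split 'NP_010227.1' into ('NP_010227', 1).

    The version is None if the identifier carries no '.version' suffix.
    """
    accession, dot, version = identifier.partition(".")
    return accession, (int(version) if dot else None)


def refseq_type(identifier):
    """Return the molecule type for a listed RefSeq prefix, else None."""
    prefix, underscore, _ = identifier.partition("_")
    if not underscore:
        return None  # no underscore: not a RefSeq-style identifier
    return REFSEQ_PREFIXES.get(prefix)


for example_id in ("NP_010227.1", "P39678.1", "CAA98618.1", "NP_010227"):
    print(example_id, split_accession(example_id), refseq_type(example_id))
```

Storing the accession and the version as separate values is one way to keep updates "possible, but recognizable": a search can use the bare accession, while your notes record exactly which version you worked with.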
  
This is '''one''' way to find related sequences: by accessing precomputed results at the NCBI. We will however explore much more principled approaches in a later assignment. Let's leave the sequence searches for the moment, and explore other information on Yeast Mbp1 that may be useful for annotating the related sequences in YFO.
 
  
===PubMed===
+
{{task|1=
  
 +
# Note down the RefSeq ID and the UniProt (SwissProt) ID of Mbp1 in your journal.
 +
# Follow the link to the RefSeq entry [http://www.ncbi.nlm.nih.gov/protein/NP_010227.1 <code>NP_010227.1</code>].
 +
# Explore the page and follow these links (note the contents in your journal):
 +
## Under "Analyze this Sequence": [http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi?INPUT_TYPE=live&SEQUENCE=NP_010227.1 Identify Conserved Domains]
 +
## Under "Protein 3D Structure": [http://www.ncbi.nlm.nih.gov/protein?Db=structure&DbFrom=protein&Cmd=Link&LinkName=protein_structure&LinkReadableName=Structure&IdsFromResult=6320147 See all 3 structures...]
 +
## Under "Pathways for the MBP1 gene": [http://www.ncbi.nlm.nih.gov/biosystems/958?Sel=geneid:851503#show=genes Cell cycle - yeast]
 +
## Under "Related information": [http://www.ncbi.nlm.nih.gov/Structure/seqr/link.cgi?gi=6320147 Proteins with Similar Sequence]
 +
}}
  
Arguably one of the most important databases in the life sciences is PubMed and this is a good time to look at PubMed in a bit more detail than you might have done previously.
+
As we see, this is a good start page to explore all kinds of databases at the NCBI via cross-references.  
  
 +
      <!-- Column 2 end -->
 +
    </div>
 +
  </div>
 +
</div>
  
{{task|1=
 
  
# Return back to the '''MBP1''' RefSeq record. If you have already closed it, simply enter the RefSeq ID into the search field for a Protein database search and find it again.
 
#  Find the '''PubMed''' links under '''Related information''' in the right-hand margin and explore them. One will take you only to information related to the actual RefSeq record, the others find more broadly related information. PubMed (weighted) appears to give a pretty good overview of systems-biology type, cross-sectional and functional information. But neither of the searches finds '''all''' Mbp1 related literature.
 
# Again, enter the '''Advanced''' query page and study the keywords that apply to PubMed searches. These are actually quite important and useful to remember.  Make yourself familiar with the section on [http://www.ncbi.nlm.nih.gov/books/NBK3827/ '''Search field descriptions and tags'''] in the PubMed help document, (in particular <tt>[DP]</tt>, <tt>[AU]</tt>, <tt>[TI]</tt>, and <tt>[TA]</tt>), how you use the ''History'' to combine searches, and the use of <tt>AND</tt>, <tt>OR</tt>, <tt>NOT</tt> and brackets. Understand how you can restrict a search to ''reviews'' only, and what the link to '''Related citations...''' is useful for.
 
# Now find publications with Mbp1 '''in the title'''. In the result list, follow the links for the two Biochemistry papers by Taylor ''et al.'' (2000) and by Deleeuw ''et al.'' (2008). Download the PDFs, we will need them later.
 
  
}}
+
&nbsp;
  
 +
====PubMed====
  
==Structure search==
 
  
 +
Arguably one of the most important databases in the life sciences is [http://www.ncbi.nlm.nih.gov/pubmed/ '''PubMed'''] and this is a good time to look at PubMed in a bit more detail.
  
The search options in the PDB structure database are as sophisticated as those at the NCBI. For now, we will try a simple keyword search to get us started.
 
  
 +
{{task|1=
  
{{task|
+
# Return to the [http://www.ncbi.nlm.nih.gov/protein/NP_010227.1 '''MBP1''' RefSeq record].
# Visit the RCSB PDB website at http://www.pdb.org/
+
# Find the [http://www.ncbi.nlm.nih.gov/pubmed?LinkName=protein_pubmed_weighted&from_uid=1431055 '''PubMed'''] link under '''Related information''' in the right-hand margin and explore it. "PubMed (Weighted)" applies a weighting algorithm to find broadly relevant information - an example of literature data mining. "PubMed (Weighted)" appears to give a pretty good overview of systems-biology type, cross-sectional and functional information.
# Briefly orient yourself regarding the database contents and its information offerings and services.
 
# Enter <code>Mbp1</code> into the search field.
 
# In your journal, note down the PDB IDs for the three ''Saccharomyces cerevisiae'' Mbp1 transcription factor structures your search has retrieved.
 
# Click on one of the entries and explore the information and services linked from that page.
 
}}
 
  
&nbsp;
+
But it does not find '''all''' Mbp1 related literature.
  
==Chimera==
+
# On any of the PubMed pages, open the '''Advanced''' query page and study the keywords that apply to PubMed searches. These are actually quite important and useful to remember. Make yourself familiar with the section on [http://www.ncbi.nlm.nih.gov/books/NBK3827/ '''Search field descriptions and tags'''] in the PubMed help document (in particular <tt>[DP]</tt>, <tt>[AU]</tt>, <tt>[TI]</tt>, and <tt>[TA]</tt>), how you use the ''History'' to combine searches, and the use of <tt>AND</tt>, <tt>OR</tt>, <tt>NOT</tt> and brackets. Understand how you can restrict a search to ''reviews'' only, and what the link to '''Related citations...''' is useful for<ref>A good way to consolidate your knowledge is to summarize it for everyone on the Entrez page of the Student Wiki, or enhance the information you find there.</ref>.
 +
# Now find publications from anywhere in PubMed with Mbp1 '''in the title'''. In the result list, follow the links for the two ''Biochemistry'' papers, by Taylor ''et al.'' (2000) and by Deleeuw ''et al.'' (2008). Download the PDFs; we will need them later.
  
In this task we will explore the sequence interface of Chimera, use it to select specific parts of a molecule, and colour specific regions (or residues) of a molecule separately.
 
 
&nbsp;
 
{{task|
 
# Open Chimera.
 
# One of the three yeast Mbp1 fragment structures has the PDB ID <code>1BM8</code>. Load it in Chimera (simply enter the ID into the appropriate field of the '''File''' &rarr; '''Fetch by ID...''' window).
 
# Display the protein in '''Presets''' &rarr; '''Interactive&nbsp;1''' mode and familiarize yourself with its topology of helices and strands.
 
# Open the sequence tool: '''Tools''' &rarr; '''Sequence''' &rarr; '''Sequence'''. You will see the sequence for each chain - here there is only one chain. By default, coloured rectangles overlay the secondary structure elements of the sequence.
 
# Hover the mouse over some residues and note that the sequence number and chain is shown at the bottom of the window.
 
# Click/drag one residue to select it. <small>(A simple click won't work, you need to drag a little bit for the selection to catch on.)</small> Note that the residue gets a green overlay in the sequence window, as it also gets selected with a green border in the graphics window.
 
# In the bottom of the sequence window, there are instructions how to select (multiple) regions. Try this: colour the protein white ('''Select''' &rarr; '''Select&nbsp;All'''; '''Actions''' &rarr; '''Color''' &rarr; '''light&nbsp;gray'''). Clear the selection. Now select all the helical regions (pale yellow boxes) by click/dragging and using the shift key. Color them red. Then select all the strands by clicking into any of the pale green boxes and color them green.
 
# Finally, generate a stereo-view that shows the molecule well, in which the domain is coloured dark grey, and the APSES domain residues (as defined in the FASTA listing above, from I19 to Y93) are coloured with a colour ramp ('''Tools''' &rarr; '''Depiction''' &rarr; '''Rainbow''')<ref>The [https://www.cgl.ucsf.edu/chimera/1.2065/docs/ContributedSoftware/rainbow/rainbow.html Rainbow tool] can only create color ramps for an entire molecule. In order to achieve this effect: color the molecule with a color ramp, then select the APSES domain, then '''invert the selection''' and color the new selection dark grey.</ref>
 
# Show the first and last residue's CA atom<ref>See [https://www.cgl.ucsf.edu/chimera/docs/UsersGuide/midas/frameatom_spec.html '''here'''] for details of the specification syntax.</ref> as a sphere and colour the first one blue (to mark the N-terminus) and the last one red. E.g.:
 
##'''Select''' &rarr; '''Atom&nbsp;specifier''' &rarr; <code>:4@CA</code>
 
##'''Actions''' &rarr; '''Ribbon''' &rarr; '''hide'''
 
##'''Actions''' &rarr; '''Atoms/bonds''' &rarr; '''show'''
 
##'''Actions''' &rarr; '''Atoms/bonds''' &rarr; '''sphere'''
 
##'''Actions''' &rarr; '''Color''' &rarr; '''cornflower&nbsp;blue'''
 
##Then click on the selection inspector (the green button with the magnifying glass at the lower right of the graphics window) and set the sphere radius to 1.0Å.
 
# Save the image in your Wiki journal in JPEG format ('''File''' &rarr; '''Save&nbsp;Image''' and upload it to the Student Wiki).
 
 
}}
 
}}
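
The PubMed field tags and Boolean operators from the PubMed task can be combined in the same spirit. This is only a sketch under stated assumptions: the helpers <code>tagged</code>, <code>all_of</code> and <code>any_of</code> are hypothetical names of my own, and the example query is just one plausible way to narrow a search towards the two ''Biochemistry'' papers; you should verify in PubMed what it actually retrieves.

```python
def tagged(value, tag):
    """Attach a PubMed field tag to a value, e.g. Mbp1[TI]."""
    return f"{value}[{tag}]"


def all_of(*clauses):
    """Combine clauses with AND; the brackets keep the grouping explicit."""
    return "(" + " AND ".join(clauses) + ")"


def any_of(*clauses):
    """Combine clauses with OR; the brackets keep the grouping explicit."""
    return "(" + " OR ".join(clauses) + ")"


# [TI] title, [TA] journal title abbreviation, [DP] date of publication.
query = all_of(
    tagged("Mbp1", "TI"),
    tagged("Biochemistry", "TA"),
    any_of(tagged("2000", "DP"), tagged("2008", "DP")),
)
print(query)
```

The resulting string can be pasted into the PubMed search field. Note how the brackets around the <tt>OR</tt> clause keep the two publication years together as one alternative.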
  
Line 449: Line 368:
 
&nbsp;
 
&nbsp;
  
== Stereo vision ==
+
==Data Storage==
  
{{task|
+
Now that we have a better sense of what our data is, we need to consider ways to model and store it. Let's talk about storage first.
Continue with your stereo practice.
+
 
 +
{{#lst:Data modelling|data_storage}}
  
Practice at least ...
 
* two times daily,
 
* for 3-5 minutes each session.
 
  
* Measure your interocular distance and your fusion distance as explained '''[http://biochemistry.utoronto.ca/steipe/abc/students/index.php/Stereo_vision_data here on the Student Wiki]''' and add it to the table.
+
&nbsp;
}}
 
  
Keep up your practice throughout the course. '''Once again: do not go through your practice sessions mechanically. If you are not making constant progress in your practice sessions, contact me so we can help you on the right track.'''
+
==Data modelling==
  
== Modeling small molecules (optional) ==
+
{{#lst:Data modelling|data_modelling}}
  
  
As an optional part of the assignment, here is a small tutorial for modeling and visualizing "small-molecule" structures.
+
Time to put this into practice: design your own data model.
  
 +
{{task|1=
  
 +
*Use your imagination about what kind of data you think should be stored to study a system, such as the collaborating proteins that define the G1/S transition in the cell cycle.
  
=== Defining a molecule ===
+
*Write down what you would like to store.
  
 +
*Sketch a relational data model for that data. Put it on paper, or print it out. '''Bring it to class for Tuesday's quiz.''' Your sketch will be handed in and graded by me. <small>(Probably worth 2 marks.)</small>
  
A number of public repositories make small molecule information available, such as [http://pubchem.ncbi.nlm.nih.gov/ PubChem] at the NCBI, the ligand collection at the [http://pdb.org '''PDB'''], the [http://www.ebi.ac.uk/chebi/ ChEBI] database at the European Bioinformatics Institute, or the [http://cactus.nci.nih.gov/ncidb2.2/ NCI database browser] at the US National Cancer Institute. One general way to export topology information from these services is to use {{WP|SMILES|SMILES strings}}&mdash;a shorthand notation for the composition and topology of chemical compounds.
+
}}
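
If "relational" still feels abstract, the following sketch may help. It is emphatically not a model answer for the task: the two tables, their column names and the use of an in-memory SQLite database are all assumptions I made for illustration, and the only data in it are the Mbp1 identifiers we have already encountered. It shows the basic idea of keeping one kind of entity per table and linking tables through keys.

```python
import sqlite3

# In-memory database: nothing is written to disk.
con = sqlite3.connect(":memory:")

# Hypothetical two-table schema: one row per protein, and one row per
# cross-reference into an external database, linked by protein_id.
con.executescript("""
CREATE TABLE protein (
    id      INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    species TEXT NOT NULL
);
CREATE TABLE xref (
    id         INTEGER PRIMARY KEY,
    protein_id INTEGER NOT NULL REFERENCES protein(id),
    source     TEXT NOT NULL,
    accession  TEXT NOT NULL,
    version    INTEGER NOT NULL
);
""")

con.execute(
    "INSERT INTO protein (id, name, species) VALUES (?, ?, ?)",
    (1, "Mbp1", "Saccharomyces cerevisiae"),
)
con.executemany(
    "INSERT INTO xref (protein_id, source, accession, version) "
    "VALUES (?, ?, ?, ?)",
    [(1, "RefSeq", "NP_010227", 1), (1, "UniProt", "P39678", 1)],
)

# Join the two tables to list every external identifier for each protein.
rows = con.execute(
    "SELECT p.name, x.source, x.accession "
    "FROM protein p JOIN xref x ON x.protein_id = p.id "
    "ORDER BY x.source"
).fetchall()
print(rows)
```

Because the cross-references live in their own table, adding a third database identifier for Mbp1 means adding a row, not a column. That is the kind of design decision your own sketch should make explicit.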
  
  
{{task|
+
{{Vspace}}
# Access each of the databases mentioned above.
 
# Enter "caffeine" as a search term.
 
# Explore the contents of the result, in particular note and copy the SMILES string for the compound.
 
}}
 
  
 +
=='''R'''==
  
Alternatively, you can sketch your own compound. Versions of Peter Ertl's {{WP|JME_editor|Java Molecular Editor (JME)}} are offered on several websites (e.g. click on '''Transfer to Java Editor''' on a NCI results page), and PubChem offers this functionality via its '''Sketcher''' tool.
+
There is still some material left from our introduction to '''R''':
  
 
{{task|
 
{{task|
# Navigate to [http://pubchem.ncbi.nlm.nih.gov/ PubChem].
+
* Access the [[R tutorial|'''R tutorial''']] on this site.
# Follow the link to '''Chemical structure search''' (in the right hand menu).
+
* Work carefully through the following sections:
# Click on the '''3D conformer''' tab and on the '''Launch''' button to launch the molecular editor in its own window.
+
**[[R tutorial#Control_structures|'''Control structures''']];
# Sketch the structure of caffeine. I find the editor quite intuitive but if you need help, just use the '''Help''' button in the editor.
+
**[[R tutorial#Writing your own functions|'''Writing your own functions''']].
# Save the SMILES string of your compound.
 
# Also '''Export''' your result in SMILES format as a file.
 
 
}}
 
}}
  
=== Translating SMILES to structure ===
+
{{Vspace}}
  
 +
;That is all.
  
Online services exist to translate SMILES to (idealized) coordinates.
+
{{Vspace}}
  
{{task|
+
== Links and resources ==
# Access the [http://cactus.nci.nih.gov/translate/ online SMILES translation service] at the NCI.
 
# Paste a caffeine SMILES string into the form, choose the '''PDB''' radio button, click on '''Translate''' and download your file.
 
# Load the molecule in Chimera.
 
}}
 
  
Chimera also has a function to translate SMILES to coordinates.
 
  
{{task|
+
{{#pmid: 27239285}}
# In Chimera:
+
{{#pmid: 21310294}}
##'''File''' &rarr; '''Close&nbsp;Session'''.
+
{{#pmid: 10747782}}
##'''Tools''' &rarr; '''Structure&nbsp;Editing''' &rarr; '''Build&nbsp;Structure'''.
+
{{#pmid: 18491920}}
##Select '''SMILES string''', paste the string and click '''Apply'''.
 
# The caffeine molecule will be generated and visualized in the graphics window.
 
}}
 
  
;That is all.
+
<!-- {{WWW|WWW_GMOD}} -->
 +
<!-- <div class="reference-box">[http://www.ncbi.nlm.nih.gov]</div> -->
 +
* [[Media:02-Data_LectureNotes.pdf|Lecture 02: Annotated Notes]]
  
  
&nbsp;
+
{{Vspace}}
  
== Links and resources ==
+
;Further reading
 +
{{#pmid: 19907790}}
 +
:{{WP|Database normalization}}
  
<!-- {{#pmid: 19957275}} -->
+
{{Vspace}}
<!-- {{WWW|WWW_GMOD}} -->
 
<!-- <div class="reference-box">[http://www.ncbi.nlm.nih.gov]</div> -->
 
  
  
&nbsp;
 
 
{{#lst:BIO_Assignment_Week_1|assignment_footer}}
 
{{#lst:BIO_Assignment_Week_1|assignment_footer}}
  
  
&nbsp;
+
<table style="width:100%;"><tr>
 +
<td style="height:30px; vertical-align:middle; text-align:left; font-size:80%;">[[BIO_Assignment_Week_1|&lt;&nbsp;Assignment&nbsp;1]]</td>
 +
<td style="height:30px; vertical-align:middle; text-align:right; font-size:80%;">[[BIO_Assignment_Week_3|Assignment&nbsp;3&nbsp;&gt;]]</td>
 +
</tr></table>
 +
 
 +
{{Vspace}}
 +
 
 
[[Category:Bioinformatics]]
 
[[Category:Bioinformatics]]
 
</div>
 
</div>

Latest revision as of 11:26, 3 October 2016

Assignment for Week 2
Scenario, Labnotes, R-functions,
Databases, Data Modelling

< Assignment 1 Assignment 3 >


Note! This assignment is currently inactive. Major and minor unannounced changes may be made at any time.

 
 


Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.



 

Introduction: Scenario

I have introduced the concept of "cargo cult science" in class. The "cargo" in Bioinformatics is to understand biology. This includes understanding how things came to be the way they are, and how they work. Both relate to the concept of function of biomolecules, and the systems[1] they contribute to. But "function" is a rather poorly defined concept; finding ways to make it rigorous and computable, and exploring it from the perspective of "collaborating" components, will be a major objective of this course. The realm of bioinformatics contains many kingdoms and duchies and shires and hidden glades. To find out how they contribute to the whole, we will proceed on a quest. We will take a relatively well-characterized protein that is part of a relatively well-characterized process, and ask what its function is. We will examine the protein's sequence, its structure, its domain composition, its relationship to and interactions with other proteins. Through that we will paint a picture of the "system" that it contributes to.

Our quest will revolve around a transcription factor that plays an important role in the regulation of the cell cycle. The genetic regulation of budding- and fission yeast cell-cycles has been lucidly described in a highly recommended review by McInerny (2011)[2] (see also the short, recent introduction to cell-cycle regulated transcription by McInerny (2016)[3]). One transcription factor, Mbp1, is a key component of the MBF complex (Mbp1/Swi6) in yeast. This complex regulates gene expression at the crucial G1/S-phase transition of the mitotic cell cycle and has been shown to bind to the regulatory regions of more than a hundred target genes. It is therefore a DNA binding protein that acts as a control switch for a key cellular process, it is highly conserved across species, and it plays a role in human disease. Surely, understanding the mechanisms of this system would be "cargo".

We will start our quest by exploring the Mbp1 protein of Baker's yeast, Saccharomyces cerevisiae, one of the most important model organisms. Baker's yeast is a eukaryote that has been studied genetically and biochemically in great detail for many decades, and it is easily manipulated with high-throughput experimental methods. Then, next week, each of you will use this information to study not Baker's yeast, but a related organism about which we know comparatively little from experiments done in the lab. Our reasoning will rely on computational inference.


Here are the steps of the assignment for this week:

  1. Start a "lab journal".
  2. Explore what kind of data is available for the Mbp1 protein.
  3. Define what data you want to store and work with, and design a data model.


 

Keeping Labnotes

Before we all head off into the Internet: have you thought about how to document your "quest"? How will you keep notes? Obviously, computational research embraces the same best-practice principles as any wet-lab experiment. We keep notes to document our objectives and activities, we ensure our work is reproducible, and we take great care that our conclusions are supported by data. I think it's pretty obvious that paper notes are not very useful for bioinformatics work. Ideally, you should be able to save results, and link to files, Webpages and other resources.

Consider it a part of your assignment to document your activities in electronic form. Here are some applications you might think of - but (!) disclaimer, I myself don't use any of these (yet) (except the Wiki of course).

  • Evernote - a web hosted, automatically syncing e-notebook.
  • Nevernote - the Open Source alternative to Evernote.
  • Google Keep - if you have a Gmail account, you can simply log in here. Grid-based. Seems a bit awkward for longer notes. But of course you can also use Google Docs.
  • Microsoft OneNote - this sounds interesting and if any one is using this, I'd like to hear from you. Syncing across platforms, being able to format contents and organize it sounds great.
  • The Student Wiki - of course. Beginning a project notes page is part of this assignment.
  • RStudio projects - for development-focussed work – especially (but not exclusively) – in R, an RStudio project may be the right solution to keep your code, results, notes, manuscript drafts, literature and other assets all in one place. The great benefit is that it can all be under version control and it's super easy to share everything with colleagues on a team through GitHub[4]. The only downside that I can think of is that it's not possible to cross-reference and link to material[5].

Are you aware of any other solutions? Let us know!

Keeping such a journal will be helpful, because the assignments are integrated over the entire term, and later assignments will make use of earlier results. But it is also excellent practice for "real" research. Expand the section below for details - written from a Wiki perspective but generally applicable.


 

Remember you are writing a lab notebook—not a formal lab report: a point-form record of your actual activities. Write such documentation as notes to your (future) self.


Create a lab-notes page as a subpage of your User space on the Student Wiki.


For each task:

  • Write a header and give it a unique number.
This is useful so you can refer to the header number in later text. Obviously, you should "hard-code" the number and not use the Wiki's automatic section numbering scheme, since the numbers should be stable over time, not change when you add or delete a section. It may be useful to add any new contents at the top of the page. If the page is in reverse chronological order, you don't have to scroll to the bottom of the page every time you add new material. The sections do not actually have to be in strict chronological order, like we would have them in a paper notebook. It may be advantageous to give different subprojects their own page, or at least their own section on one page. Just remember that things that are on the same page are easy to find. Incidentally: the material in such a notebook is "permanent", since earlier versions of pages are always available via the history function. The Wiki never forgets. And that's actually a step beyond paper labnotes.
  • State the objective.
In one brief sentence, restate what your task is supposed to achieve.
  • Document the procedure.
Note what you have done, as concisely as possible but with sufficient detail. I am often asked: "What is sufficient detail"? The answer is easy: detailed enough so that someone can reproduce what you have done. In practice that guy will often be you, yourself, in the future. I hope that you won't be constantly cursing your past-self because of omissions!
  • Document your results.
You can distinguish different types of results -
    • Static data does not change over time and it may be sufficient to note a reference to the result. For example, there is no need to copy a GenBank record into your documentation; it is sufficient to note the accession number, i.e. the RefSeq or UniProt ID, or even better, to link to its page on the database server.
    • Variable data can change over time. For example the results of a BLAST search depend on the sequences in the database. A list of similar structures may change as new structures get solved. In principle you want to record such data, to be able to reproduce at a later time what your conclusions were based on. But be selective in what you record. For example you should not paste the entire set of results of a BLAST search into your document, but only those matches that were important for your conclusions. Indiscriminate pasting of irrelevant information will make your notes unusable.
    • Analysis results
The results of sequence analyses, alignments etc. in general get recorded in your documentation. Again: be selective. Record what is important.
  • Note your conclusions.
An analysis is not complete unless you conclude something from the results. (Remember what we said about "Cargo Cult Science". If there is no conclusion, your activities are quite pointless.)
  • Are two sequences likely homologues, or not? Just pasting the BLAST output is not enough.
  • Does your protein contain a signal-sequence or does it not? SignalP will give you a probability, but you must make the final call.
  • Is a binding site conserved, or not? The programs can only point out sections of similarity or dissimilarity. You are the one who interprets these numbers in their biological context.
The analysis provides the data. In your conclusion you provide the interpretation of what the data means in the context of your objective. Were you expecting a signal-sequence but there isn't one? What could that mean? Sometimes your assignment task in this course will ask you to elaborate on an analysis and conclusion. But this does not mean that when I don't explicitly mention it, you can skip the interpretation.
  • Add cross-references.
Cross-references to other information are super valuable as your documentation grows. It's easy to see how to format a link to a section of your Wiki-page: just look at the link under the Table of Contents at the top. But you can also place "anchors" for linking anywhere on an HTML page: just use the following syntax: <span id="{some-label}"></span> for the anchor, and append #{some-label} to the page URL.
  • Use discretion when uploading images
I have enabled image uploading with some reservations, we'll see how it goes. You must not:
  • upload images that are irrelevant for this course;
  • upload copyrighted images;
  • upload any images that are larger than 500 kb. I may silently remove large images when I encounter them.
Moreover, understand that any of your uploaded images may be deleted at any time. If they are valuable to you, keep backups on your own machine.
  • Prepare your images well
Don't upload uncompressed screen dumps. Save images in a compressed file format on your own computer. Then use the Special:Upload link in the left-hand menu to upload images. The Wiki will only accept .jpeg or .png images.
  • Use the correct image types.
In principle, images can be stored uncompressed as .tiff or .bmp, or compressed as .gif or .jpg or .png. .gif is useful for images with large, monochrome areas and sharp, high-contrast edges because the LZW compression algorithm it uses works especially well on such data. .jpg (or .jpeg) is preferred for images with shades and halftones such as the structure views you should prepare for several assignments; JPEG has excellent application support and is the most versatile general purpose image file format currently in use. .tiff (or .tif) is preferred to archive master copies of images in a lossless fashion; use LZW compression for TIFF files if your system/application supports it. The .png format is an open source alternative for lossless, compressed images.

.bmp is not preferred for really anything, it is bloated in its (default) uncompressed form and primarily used only because it is simple to code and ubiquitous on Windows computers.

Image dimensions and resolution
Stereo images should have equivalent points approximately 6 cm apart. It depends on your monitor how many pixels this corresponds to. The dimensions of an image are stated in pixels (width x height). My notebook screen has a native display resolution of 1440 x 900 pixels on 33.5 x 21 cm. Therefore a 6 cm separation on my notebook corresponds to approximately 260 pixels. However on my desktop monitor, 260 pixels is 6.7 cm across. And on a high-resolution iPad display, at 227 ppi (pixels per inch), 260 pixels are just 2.9 cm across. For the assignments: adjust your stereo images so they are approximately at the right separation and are approximately 500 to 600 pixels wide. Also, scale your molecules so they fill the available window and - if you have depth cueing enabled - move them close to the front clipping plane so the molecule is not just a dim blob, lost in murky shadows.
Considerations for print (manuscripts etc.) are slightly different: for print output you can specify the output resolution in dpi (dots per inch). A typical print resolution is about 300 dpi: 6 cm separation at 300dpi is about 700 pixels. Print images should therefore be about three times as large in width and height as screen images.
Preparation of stereo views
When assignments ask you to create molecular images, always create stereo views.
Keep your images uncluttered and expressive
Scale the molecular model to fill the available space of your image well. Orient views so they illustrate a point you are trying to make. Emphasize residues that you are writing about with a contrasting colouring scheme. Add labels, where residue identities are not otherwise obvious. Turn off side-chains for residues that are not important. The more you practice these small details, the more efficient you will become in the use of your tools.
If you have technical difficulties, post your questions to the list and/or contact me.
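The pixel arithmetic in the notes on image dimensions above is easy to check in R. This is only a minimal sketch: cm2px() is a made-up helper name, not a function from any package, and the resolutions are the ones quoted in the text.

```r
# Convert a physical distance (cm) to a pixel count at a given resolution.
# cm2px() is a hypothetical helper for illustration only.
cm2px <- function(cm, ppi) {
    cm / 2.54 * ppi    # there are 2.54 cm in one inch
}

cm2px(6, 300)          # 6 cm stereo separation at 300 dpi print: about 709 pixels
cm2px(6, 227)          # the same separation on a 227 ppi display: about 536 pixels
260 / 227 * 2.54       # width of 260 pixels at 227 ppi: about 2.9 cm
```

The same function tells you how many pixels to aim for on your own screen once you know its resolution.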



 

Data Sources

SGD - a Yeast Model Organism Database

Yeast happens to have a very well maintained model organism database - a Web resource dedicated to Saccharomyces cerevisiae. Where such resources are available, they are very useful for the community. For the general case however, we need to work with one of the large, general data providers - the NCBI and the EBI. But in order to get a sense of the type of data that is available, let's visit the SGD database first.

Task:
Access the information page on Mbp1 at the Saccharomyces Genome Database.

  1. Browse through the Summary page and note the available information: you should see:
    • information about the gene and the protein;
    • Information about its roles in the cell curated at the Gene Ontology database;
    • Information about knock-out phenotypes; (Amazing. Would you have imagined that this is a non-essential gene?)
    • Information about protein-protein interactions;
    • Regulation and expression;
    • A curators' summary of our understanding of the protein. Mandatory reading.
    • And key references.
  2. Access the Protein tab and note the much more detailed information.
    • Domains and their classification;
    • Sequence;
    • Shared domains;
    • and much more...

You will notice that some of this information relates to the molecule itself, and some of it relates to its relationship with other molecules. Some of it is stored at SGD, and some of it is cross-referenced from other databases. And we have textual data, numeric data, and images.

How would you store such data to use it in your project? We will work on this question at the end of the assignment.

 


 

If we were working on yeast, most data we need would be right here: curated, kept current and consistent, referenced to the literature and ready to use. But you'll be working on a different species as of the next assignment, and you will need to integrate data yourself, from data sources such as the NCBI, or UniProt. The upside is that most information like this is available for many, many species. The downside is that we'll have to integrate information from many different sources essentially "by hand".


 

NCBI databases

The NCBI (National Center for Biotechnology Information) is the largest international provider of data for genomics and molecular biology. With its annual budget of several hundred million dollars, it organizes a challenging program of data management at the largest scale, it makes its data freely and openly available over the Internet, worldwide, and it runs significant in-house research projects.

Let us explore some of the offerings of the NCBI that can contribute to our objective of studying a particular gene in an organism of interest.


 

Entrez

Task:
Remember to document your activities as lab-notes on your Wiki.

  1. Access the NCBI website at http://www.ncbi.nlm.nih.gov/ [6]
  2. In the search bar, enter mbp1 and click Search.
  3. On the resulting page, look for the Protein section and click on the link. What do you find?


The result page of your search in "All Databases" is the "Global Query Result Page" of the Entrez system. If you follow the "Protein" link, you get taken to the more than 530 sequences in the NCBI Protein database that contain the keyword "mbp1". But when you look more closely at the results, you see that the result is quite non-specific: searching only by keyword retrieves a multiubiquitin chain binding protein in Arabidopsis, bacterial mannose binding proteins, a Saccharomyces protein (perhaps one that we are actually interested in), maltose binding proteins, myelin basic proteins - and much more. There must be a more specific way to search, and indeed there is. Time to read up on the Entrez system.


Task:

  1. Navigate to the Entrez Help Page and read about the Entrez system, especially about:
    1. Boolean operators,
    2. wildcards,
    3. limits, and
    4. filters.
  2. You should minimally understand:
    1. How to search by keyword;
    2. How to search by gene or protein name;
    3. How to restrict a search to a particular organism.

Don't skip this part, you should know the more common options and how to find the others. It would be great to have a synopsis of the important fields for reference, wouldn't it? Why don't you go and make one: I have put a template page on the Student Wiki (A synopsis of Entrez codes). Contributors and editors welcome!


Keyword and organism searches are pretty universal, but apart from that, each NCBI database has its own set of specific fields. You can access the keywords via the Advanced Search interface of any of the database pages.
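Fielded Entrez queries can also be assembled in code, which helps when the same search has to be repeated for many proteins or organisms. The following is a sketch under assumptions: entrezQuery() and esearchURL() are hypothetical helper names, the field names are the ones used on this page, and the URL assumes the esearch endpoint of the NCBI E-utilities. The code only builds the strings; it does not contact the server.

```r
# Assemble an Entrez query from fielded terms and build an esearch URL.
# entrezQuery() and esearchURL() are hypothetical helpers for illustration.
entrezQuery <- function(terms, fields, op = "AND") {
    # quote terms that contain spaces so they are searched as phrases
    quoted <- ifelse(grepl(" ", terms), paste0('"', terms, '"'), terms)
    paste(paste0(quoted, "[", fields, "]"),
          collapse = paste0(" ", op, " "))
}

esearchURL <- function(query, db = "protein") {
    # assumption: the E-utilities esearch endpoint; the query is URL-encoded
    paste0("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
           "?db=", db,
           "&term=", utils::URLencode(query, reserved = TRUE))
}

q <- entrezQuery(terms  = c("Mbp1", "Saccharomyces cerevisiae"),
                 fields = c("protein name", "organism"))
q                # Mbp1[protein name] AND "Saccharomyces cerevisiae"[organism]
esearchURL(q)    # the same query as a URL
```

Typing the query into the search field by hand gives the same result; the point is only that the query is an ordinary string with a regular structure.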


 

Protein Sequence

 

Task:
With this knowledge we can restrict the search to proteins called "Mbp1" that occur in Baker's Yeast. Return to the Global Search page and in the search field, type:

Mbp1[protein name] AND
"Saccharomyces cerevisiae"[organism]


This finds two proteins. Follow the link to the result CAA98618.1, a data record in GenBank Flat File (GFF) format[7]. The database identifier CAA98618.1 tells you that this is a record in the GenPept database. There are actually several identical versions of this sequence in the NCBI's holdings. A link to "Identical Proteins" near the top of the record shows you what these are:


Some of the sequences represent duplicate entries of the same gene (Mbp1) in the same strain (S288c) of the same species (S. cerevisiae). In particular:


  • there are seven records for which the source is the INSDC, these are archival entries, submitted by independent yeast genome research projects;
  • there are two entries in the RefSeq database linking to the same protein: NP_010227.1. One is derived from genome sequence, the other from mRNA. This RefSeq entry is the preferred version of the sequence for us to work with. RefSeq is a curated, non-redundant database which solves a number of problems of archival databases. You can recognize RefSeq identifiers – they always look like NP_12345.1, NM_12345.1, XP_12345.1, NC_12345.1 etc. This reflects whether the sequence is protein, mRNA or genomic, and inferred or obtained through experimental evidence. The RefSeq ID NP_010227.1 actually appears twice, once linked to its genomic sequence, and once to its mRNA.
  • there is a SwissProt sequence P39678.1[8]. This link is kind of a big deal. It's a cross-reference into UniProt, the huge protein sequence database maintained by the EBI (European Bioinformatics Institute), which is the NCBI's counterpart in Europe. SwissProt entries have the highest annotation standard overall and are expertly curated. Many Web services that we will encounter work with UniProt IDs (e.g. P39678.1) rather than RefSeq IDs. Until recently the two databases did not link to each other, mostly for reasons of funding politics. It's great to see that this divide has now been overcome.


  • Note the entries for the same sequence in different yeast strains. These don't have to be identical, they just happen to be. Sometimes we find identical sequences in quite divergent species. Therefore I would not actually consider EIW11153.1, AJU86440.1, AJU58508.1, and AJU61971.1 to be identical proteins, although they have the same sequence.


Note all the .1 suffixes of the sequence identifiers. These are version numbers. Two observations:

  1. It's great that version numbers are now used throughout the NCBI database. This is good database engineering practice because it's really important for reproducible research that updates to database records are possible, but recognizable. When working with data you always must provide for the possibility of updates, and manage the changes transparently and explicitly. Proper versioning should be a part of all data models. In fact, the NCBI is currently phasing out its internal unique identifiers – the GI number – in favour of accession-number.version IDs.
  2. When searching, or for general use, you can (and should) omit the version number, i.e. use NP_010227 or P39678 not NP_010227.1 resp. P39678.1. This way the database system will resolve the identifier to the most current, highest version number (unless you want the older one, of course).
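If identifiers are copied from database records with their version suffix, the suffix is easy to separate again in R. This is a minimal sketch with base R regular expressions; the identifiers are the ones mentioned above, and the last one is deliberately given without a version.

```r
# Split accession.version identifiers into accession and version number.
ids <- c("NP_010227.1", "P39678.1", "CAA98618.1", "NP_010227")

# the accession is everything before a trailing ".<digits>"
accession <- sub("\\.[0-9]+$", "", ids)

# the version is the trailing number, or NA if no version was given
hasVersion <- grepl("\\.[0-9]+$", ids)
version <- rep(NA_integer_, length(ids))
version[hasVersion] <- as.integer(sub("^.*\\.", "", ids[hasVersion]))

accession   # "NP_010227" "P39678" "CAA98618" "NP_010227"
version     # 1 1 1 NA
```

Storing accession and version in separate fields keeps the version information without getting in the way of searches.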


Task:

  1. Note down the RefSeq ID and the UniProt (SwissProt) ID of Mbp1 in your journal.
  2. Follow the link to the RefSeq entry NP_010227.1.
  3. Explore the page and follow these links (note the contents in your journal):
    1. Under "Analyze this Sequence": Identify Conserved Domains
    2. Under "Protein 3D Structure": See all 3 structures...
    3. Under "Pathways for the MBP1 gene": Cell cycle - yeast
    4. Under "Related information": Proteins with Similar Sequence

As we see, this is a good start page to explore all kinds of databases at the NCBI via cross-references.


 

PubMed

Arguably one of the most important databases in the life sciences is PubMed and this is a good time to look at PubMed in a bit more detail.


Task:

  1. Return to the MBP1 RefSeq record.
  2. Find the PubMed link under Related information in the right-hand margin and explore it. "PubMed (Weighted)" applies a weighting algorithm to find broadly relevant information - an example of literature data mining. PubMed(weighted) appears to give a pretty good overview of systems-biology type, cross-sectional and functional information.

But it does not find all Mbp1 related literature.

  1. On any of the PubMed pages open the Advanced query page and study the keywords that apply to PubMed searches. These are actually quite important and useful to remember. Make yourself familiar with the section on Search field descriptions and tags in the PubMed help document, (in particular [DP], [AU], [TI], and [TA]), how you use the History to combine searches, and the use of AND, OR, NOT and brackets. Understand how you can restrict a search to reviews only, and what the link to Related citations... is useful for[9].
  2. Now find publications from anywhere in PubMed with Mbp1 in the title. In the result list, follow the links for the two Biochemistry papers, by Taylor et al. (2000) and by Deleeuw et al. (2008). Download the PDFs, we will need them later.
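The way field tags, Boolean operators and brackets combine in a PubMed query can be sketched as string construction in R. tagTerm() is a hypothetical helper; the tags are the title, journal and publication date fields from the PubMed help document named above. Whether this particular query retrieves exactly the two papers is not guaranteed: treat it as an illustration of the syntax, not as the solution to the task.

```r
# Combine fielded PubMed terms with Boolean operators and brackets.
# tagTerm() is a hypothetical helper for illustration only.
tagTerm <- function(term, tag) {
    paste0(term, "[", tag, "]")
}

# either of two publication years, grouped with brackets
years <- paste(tagTerm(c("2000", "2008"), "DP"), collapse = " OR ")

query <- paste(tagTerm("Mbp1", "TI"),
               tagTerm("Biochemistry", "TA"),
               paste0("(", years, ")"),
               sep = " AND ")
query   # "Mbp1[TI] AND Biochemistry[TA] AND (2000[DP] OR 2008[DP])"
```

The resulting string can be pasted into the PubMed search field.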


 

Data Storage

Now that we have a better sense of what our data is, we need to consider ways to model and store it. Let's talk about storage first.


Any software project requires modelling on many levels - data-flow models, logic models, user interaction models and more. But all of these ultimately rely on a data model that defines how the world is going to be represented in the computer for the project's purpose. The process of abstraction of data entities and defining their relationships can (and should) take up a major part of the project definition, often taking several iterations until you get it right. Whether your data can be completely described, consistently stored and efficiently retrieved is determined to a large part by your data model.

Databases can take many forms, from memories in your brain, to shoe-cartons under your bed, to software applications on your computer, or warehouse-sized data centres. Fundamentally, these all do the same thing: collect information and make it available.

Let us consider collecting information on APSES-domain transcription factors in various fungi, with the goal of being able to compare them. Let's specify this as follows:

Store data on APSES-domain proteins so that we can
  • cross reference the source databases;
  • study if they have the same features (e.g. domains);
  • and compare the features.

The underlying information can easily be retrieved for a protein from its RefSeq or UniProt entry.


Text files

A first attempt to organize the data might be simply to write it down in a large text file:

name: Mbp1
refseq ID: NP_010227
uniprot ID: P39678
species: Saccharomyces cerevisiae
taxonomy ID: 4932
sequence:
MSNQIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKR 
TRILEKEVLKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLF 
DFTQTDGSASPPPAPKHHHASKVDRKKAIRSASTSAIMETKRNNKKAEEN 
QFQSSKILGNPTAAPRKRGRPVGSTRGSRRKLGVNLQRSQSDMGFPRPAI 
PNSSISTTQLPSIRSTMGPQSPTLGILEEERHDSRQQQPQQNNSAQFKEI 
DLEDGLSSDVEPSQQLQQVFNQNTGFVPQQQSSLIQTQQTESMATSVSSS 
PSLPTSPGDFADSNPFEERFPGGGTSPIISMIPRYPVTSRPQTSDINDKV 
NKYLSKLVDYFISNEMKSNKSLPQVLLHPPPHSAPYIDAPIDPELHTAFH 
WACSMGNLPIAEALYEAGTSIRSTNSQGQTPLMRSSLFHNSYTRRTFPRI 
FQLLHETVFDIDSQSQTVIHHIVKRKSTTPSAVYYLDVVLSKIKDFSPQY 
RIELLLNTQDKNGDTALHIASKNGDVVFFNTLVKMGALTTISNKEGLTAN 
EIMNQQYEQMMIQNGTNQHVNSSNTDLNIHVNTNNIETKNDVNSMVIMSP 
VSPSDYITYPSQIATNISRNIPNVVNSMKQMASIYNDLHEQHDNEIKSLQ 
KTLKSISKTKIQVSLKTLEVLKESSKDENGEAQTNDDFEILSRLQEQNTK 
KLRKRLIRYKRLIKQKLEYRQTVLLNKLIEDETQATTNNTVEKDNNTLER 
LELAQELTMLQLQRKNKLSSLVKKFEDNAKIHKYRRIIREGTEMNIEEVD 
SSLDVILQTLIANNNKNKGAEQIITISNANSHA    
length: 833
KilA-N domain: 21-93
Ankyrin domains: 369-455, 505-549

...

... and save it all in one large text file and whenever you need to look something up, you just open the file, look for e.g. the name of the protein and read what's there. Or - for a more structured approach, you could put this into several files in a folder.[10] This is a perfectly valid approach and for some applications it might not be worth the effort to think more deeply about how to structure the data, and store it in a way that it is robust and scales easily to large datasets. Alas, small projects have a tendency to grow into large projects and if you work in this way, it's almost guaranteed that you will end up doing many things by hand that could easily be automated. Imagine asking questions like:

  • How many proteins do I have?
  • What's the sequence of the KilA-N domain?
  • What percentage of my proteins have an Ankyrin domain?
  • Or two ...?

Answering these questions "by hand" is possible, but tedious.
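To see why structure matters, here is a sketch of what it takes to answer such questions from "key: value" text in R. It assumes that every line has exactly the layout shown above and that domain boundaries are written as "start-end"; the sequence is left out for brevity, and getDomain() is a hypothetical helper.

```r
# A record in "key: value" layout (sequence omitted for brevity)
record <- c("name: Mbp1",
            "refseq ID: NP_010227",
            "uniprot ID: P39678",
            "length: 833",
            "Ankyrin domains: 369-455, 505-549")

# split each line at the first colon into a key and a value
keys    <- sub(":.*$", "", record)
values  <- trimws(sub("^[^:]*:", "", record))
protein <- as.list(values)
names(protein) <- keys

protein[["refseq ID"]]                      # "NP_010227"

# how many Ankyrin domains are annotated?
ranges <- strsplit(protein[["Ankyrin domains"]], ",\\s*")[[1]]
length(ranges)                              # 2

# hypothetical helper: cut a domain out of a sequence given "start-end"
getDomain <- function(sequence, range) {
    pos <- as.integer(strsplit(range, "-")[[1]])
    substr(sequence, pos[1], pos[2])
}
getDomain("ABCDEFGHIJ", "3-6")              # "CDEF"
```

Every one of these steps depends on the text being formatted consistently, which is exactly what a free-form text file does not guarantee.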

Spreadsheets

Data for three yeast APSES domain proteins in an Excel spreadsheet.


Many serious researchers keep their project data in spreadsheets. Often they use Excel, or an alternative like the free OpenOffice Calc, or Google Sheets, both of which are compatible with Excel and have some interesting advantages. Here, all your data is in one place, easy to edit. You can even do simple calculations - although you should never use Excel for statistics[11]. You could answer What percentage of my proteins have an Ankyrin domain? quite easily[12].

There are two major downsides to spreadsheets. For one, complex queries need programming. There is no way around this. You can program inside Excel with Visual Basic. But you might as well export your data so you can work on it with a "real" programming language. The other thing is that Excel does not scale very well. Once you have more than a hundred proteins in your spreadsheet, you can see how finding anything can become tedious.

However, just because it was built for business applications, and designed for use by office assistants, does not mean it is intrinsically unsuitable for our domain. It's important to be pragmatic, not dogmatic, when choosing tools: choose according to your real requirements. Sometimes "quick and dirty" is just fine, because quick.
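Exporting a spreadsheet as comma-separated values makes its contents available to R. The sketch below writes a small CSV file to a temporary location to stand in for such an export and then answers the percentage question. The column names are assumptions, and the third row ("ProteinX") is a made-up placeholder without an Ankyrin domain, included only so that the answer is not 100%.

```r
# Stand-in for a spreadsheet exported as CSV.
# "ProteinX" is a made-up placeholder row, not real data.
csvFile <- tempfile(fileext = ".csv")
writeLines(c('name,refSeq,seqLen,Ankyrin',
             'Mbp1,NP_010227,833,"369-455, 505-549"',
             'Swi4,NP_011036,1093,516-662',
             'ProteinX,,,'),
           csvFile)

sheet <- read.csv(csvFile, stringsAsFactors = FALSE)

# a protein has an Ankyrin domain if the field is neither missing nor empty
hasAnkyrin <- !is.na(sheet$Ankyrin) & sheet$Ankyrin != ""

# What percentage of my proteins have an Ankyrin domain?
100 * mean(hasAnkyrin)    # about 67
```

Once the data is in a data frame, questions like this take one line each, no matter how many rows the table has.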


 

R

R can keep complex data in data frames and lists. If we do data analysis with R, we have to load the data first. We can use any of the read.table() functions for structured data, read lines of raw text with readLines(), or slurp in entire files with scan(). But we could also keep the data in an R object in the first place that we can read from disk, analyze, modify, and write back. In this case, R becomes our database engine.

# Sample construction of an R database table as a dataframe

# Data for the Mbp1 protein
proteins <- data.frame(  
    name     = "Mbp1",
    refSeq   = "NP_010227",
    uniProt  = "P39678",
    species  = "Saccharomyces cerevisiae",
    taxId    = "4932",
    sequence = paste(
                    "MSNQIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKR",
                    "TRILEKEVLKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLF",
                    "DFTQTDGSASPPPAPKHHHASKVDRKKAIRSASTSAIMETKRNNKKAEEN",
                    "QFQSSKILGNPTAAPRKRGRPVGSTRGSRRKLGVNLQRSQSDMGFPRPAI",
                    "PNSSISTTQLPSIRSTMGPQSPTLGILEEERHDSRQQQPQQNNSAQFKEI",
                    "DLEDGLSSDVEPSQQLQQVFNQNTGFVPQQQSSLIQTQQTESMATSVSSS",
                    "PSLPTSPGDFADSNPFEERFPGGGTSPIISMIPRYPVTSRPQTSDINDKV",
                    "NKYLSKLVDYFISNEMKSNKSLPQVLLHPPPHSAPYIDAPIDPELHTAFH",
                    "WACSMGNLPIAEALYEAGTSIRSTNSQGQTPLMRSSLFHNSYTRRTFPRI",
                    "FQLLHETVFDIDSQSQTVIHHIVKRKSTTPSAVYYLDVVLSKIKDFSPQY",
                    "RIELLLNTQDKNGDTALHIASKNGDVVFFNTLVKMGALTTISNKEGLTAN",
                    "EIMNQQYEQMMIQNGTNQHVNSSNTDLNIHVNTNNIETKNDVNSMVIMSP",
                    "VSPSDYITYPSQIATNISRNIPNVVNSMKQMASIYNDLHEQHDNEIKSLQ",
                    "KTLKSISKTKIQVSLKTLEVLKESSKDENGEAQTNDDFEILSRLQEQNTK",
                    "KLRKRLIRYKRLIKQKLEYRQTVLLNKLIEDETQATTNNTVEKDNNTLER",
                    "LELAQELTMLQLQRKNKLSSLVKKFEDNAKIHKYRRIIREGTEMNIEEVD",
                    "SSLDVILQTLIANNNKNKGAEQIITISNANSHA",
                    sep=""),
    seqLen   = 833,
    KilAN    = "21-93",  
    Ankyrin  = "369-455, 505-549",
    stringsAsFactors = FALSE)

# add data for the Swi4 protein
proteins <- rbind(proteins,
                  data.frame(  
    name     = "Swi4",
    refSeq   = "NP_011036",
    uniProt  = "P25302",
    species  = "Saccharomyces cerevisiae",
    taxId    = "4932",
    sequence = paste(
                    "MPFDVLISNQKDNTNHQNITPISKSVLLAPHSNHPVIEIATYSETDVYEC",
                    "YIRGFETKIVMRRTKDDWINITQVFKIAQFSKTKRTKILEKESNDMQHEK",
                    "VQGGYGRFQGTWIPLDSAKFLVNKYEIIDPVVNSILTFQFDPNNPPPKRS",
                    "KNSILRKTSPGTKITSPSSYNKTPRKKNSSSSTSATTTAANKKGKKNASI",
                    "NQPNPSPLQNLVFQTPQQFQVNSSMNIMNNNDNHTTMNFNNDTRHNLINN",
                    "ISNNSNQSTIIQQQKSIHENSFNNNYSATQKPLQFFPIPTNLQNKNVALN",
                    "NPNNNDSNSYSHNIDNVINSSNNNNNGNNNNLIIVPDGPMQSQQQQQHHH",
                    "EYLTNNFNHSMMDSITNGNSKKRRKKLNQSNEQQFYNQQEKIQRHFKLMK",
                    "QPLLWQSFQNPNDHHNEYCDSNGSNNNNNTVASNGSSIEVFSSNENDNSM",
                    "NMSSRSMTPFSAGNTSSQNKLENKMTDQEYKQTILTILSSERSSDVDQAL",
                    "LATLYPAPKNFNINFEIDDQGHTPLHWATAMANIPLIKMLITLNANALQC",
                    "NKLGFNCITKSIFYNNCYKENAFDEIISILKICLITPDVNGRLPFHYLIE",
                    "LSVNKSKNPMIIKSYMDSIILSLGQQDYNLLKICLNYQDNIGNTPLHLSA",
                    "LNLNFEVYNRLVYLGASTDILNLDNESPASIMNKFNTPAGGSNSRNNNTK",
                    "ADRKLARNLPQKNYYQQQQQQQQPQNNVKIPKIIKTQHPDKEDSTADVNI",
                    "AKTDSEVNESQYLHSNQPNSTNMNTIMEDLSNINSFVTSSVIKDIKSTPS",
                    "KILENSPILYRRRSQSISDEKEKAKDNENQVEKKKDPLNSVKTAMPSLES",
                    "PSSLLPIQMSPLGKYSKPLSQQINKLNTKVSSLQRIMGEEIKNLDNEVVE",
                    "TESSISNNKKRLITIAHQIEDAFDSVSNKTPINSISDLQSRIKETSSKLN",
                    "SEKQNFIQSLEKSQALKLATIVQDEESKVDMNTNSSSHPEKQEDEEPIPK",
                    "STSETSSPKNTKADAKFSNTVQESYDVNETLRLATELTILQFKRRMTTLK",
                    "ISEAKSKINSSVKLDKYRNLIGITIENIDSKLDDIEKDLRANA",
                    sep=""),
    seqLen   = 1093,
    KilAN    = "56-122",  
    Ankyrin  = "516-662",
    stringsAsFactors = FALSE)
    )

# how many proteins?
nrow(proteins)

# what are their names?
proteins[,"name"]

# how many do not have an Ankyrin domain?
sum(proteins[,"Ankyrin"] == "")
    
# save it to file
save(proteins, file="proteinData.Rda")

# delete it from memory
rm(proteins)

# check...
proteins  # Error: object 'proteins' not found ... yes, it's gone


# read it back in:
load("proteinData.Rda")

# did this work?
sum(proteins[,"seqLen"])   # 1926 amino acids

# add another protein: Phd1
proteins <- rbind(proteins,
                  data.frame(  
    name     = "Phd1",
    refSeq   = "NP_012881",
    uniProt  = "P39678",
    species  = "Saccharomyces cerevisiae",
    taxId    = "4392",
    sequence = paste(
                    "MYHVPEMRLHYPLVNTQSNAAITPTRSYDNTLPSFNELSHQSTINLPFVQ",
                    "RETPNAYANVAQLATSPTQAKSGYYCRYYAVPFPTYPQQPQSPYQQAVLP",
                    "YATIPNSNFQPSSFPVMAVMPPEVQFDGSFLNTLHPHTELPPIIQNTNDT",
                    "SVARPNNLKSIAAASPTVTATTRTPGVSSTSVLKPRVITTMWEDENTICY",
                    "QVEANGISVVRRADNNMINGTKLLNVTKMTRGRRDGILRSEKVREVVKIG",
                    "SMHLKGVWIPFERAYILAQREQILDHLYPLFVKDIESIVDARKPSNKASL",
                    "TPKSSPAPIKQEPSDNKHEIATEIKPKSIDALSNGASTQGAGELPHLKIN",
                    "HIDTEAQTSRAKNELS",
                    sep=""),
    seqLen   = 366,
    KilAN    = "209-285",  
    Ankyrin  = "",    # No ankyrin domains annotated here
    stringsAsFactors = FALSE)
    )

# check:
proteins[,"name"]                # "Mbp1" "Swi4" "Phd1"
sum(proteins[,"Ankyrin"] == "")  # Now there is one...
sum(proteins[,"seqLen"])         # 2292 amino acids

# [END]


 

The third way to use R for data is to connect it to a "real" database:

  • a relational database like MySQL, MariaDB, or PostgreSQL;
  • an object/document database like MongoDB;
  • or even a graph-database like Neo4j.

R "drivers" are available for all of these. However, all of these require installing extra software on your computer: the actual database, which runs as an independent application. If you need a rock-solid database with guaranteed integrity, industry-standard performance, and scalability to even very large datasets and hordes of concurrent users, don't think of rolling your own solution. One of the above is the way to go.


 

MySQL and friends

A "Schema" for a table that stores data for APSES domain proteins. This is a screenshot of the free MySQL Workbench application.

MySQL is a free, open relational database that powers some of the largest corporations as well as some of the smallest laboratories. It is based on a client-server model. The database engine runs as a daemon in the background and waits for connection attempts. When a connection is established, the server process establishes a communication session with the client. The client sends requests, and the server responds. One can do this interactively, by running the client program /usr/local/mysql/bin/mysql (on Unix systems). Or, when you are using a program written in R, Python, Perl, etc., you use the appropriate method calls or functions (the driver) to establish the connection.

These types of databases use their own language to describe actions: SQL (Structured Query Language), which handles data definition, data manipulation, and data control.
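As a rough illustration of the request/response cycle through a driver, the sketch below uses the DBI package together with RSQLite. This is an assumption made for convenience, not the setup described here: SQLite runs inside the R process rather than as a server daemon, so only the two R packages need to be installed. With a MySQL, MariaDB, or PostgreSQL server, mainly the dbConnect() call would differ (another driver, plus host and credentials). The table is a cut-down version of the protein data above.

```r
# Assumes that install.packages(c("DBI", "RSQLite")) has been run.
library(DBI)

# Open a connection. ":memory:" creates a temporary database
# (SQLite is a stand-in for a real database server here).
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Data definition: create a table.
dbExecute(con, "CREATE TABLE proteins (
                  refSeq  TEXT PRIMARY KEY,
                  name    TEXT,
                  seqLen  INTEGER)")

# Data manipulation: append rows from a data frame.
newRows <- data.frame(refSeq = c("NP_011036", "NP_012881"),
                      name   = c("Swi4", "Phd1"),
                      seqLen = c(1093L, 366L),
                      stringsAsFactors = FALSE)
dbWriteTable(con, "proteins", newRows, append = TRUE)

# Query: the request is an SQL string, the response is a data frame.
res <- dbGetQuery(con, "SELECT name, seqLen FROM proteins ORDER BY seqLen")
print(res)

dbDisconnect(con)
```

The result of dbGetQuery() is an ordinary data frame, so everything after the query is plain R again.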

Just for illustration, the Figure above shows a table for our APSES domain protein data, built as a table in the MySQL Workbench application and presented as an Entity Relationship Diagram (ERD). There is only one entity though: the protein "table". The application can generate the actual code that implements this model on an SQL-compliant database:


CREATE TABLE IF NOT EXISTS `mydb`.`proteins` (
  `name` VARCHAR(20) NULL,
  `refSeq` VARCHAR(20) NOT NULL,
  `uniProt` VARCHAR(20) NULL,
  `species` VARCHAR(45) NOT NULL,
  `taxId` VARCHAR(10) NULL,
  `sequence` BLOB NULL,
  `seqLen` INT NULL,
  `KilA-N` VARCHAR(45) NULL,
  `Ankyrin` VARCHAR(45) NULL,
  PRIMARY KEY (`refSeq`))
ENGINE = InnoDB;


This looks at least as complicated as putting the model into R in the first place. Why then would we do this, if we need to load the data into R for analysis anyway? There are several important reasons.

  • Scalability: these systems are built to work with very large datasets and optimized for performance. In theory R has very good performance with large data objects, but not so when the data becomes larger than what the computer can keep in memory all at once.
  • Concurrency: when several users need to access the data potentially at the same time, you must use a "real" database system. Handling problems of concurrent access is what they are made for.
  • ACID compliance: ACID describes four aspects that make a database robust. These are crucial for situations in which you have only partial control over your system or its input, and they would be quite laborious to implement for your hand-built R data model:
    • Atomicity: Atomicity requires that each transaction is handled "indivisibly": it either succeeds fully, with all requested elements, or not at all.
    • Consistency: Consistency requires that any transaction will bring the database from one valid state to another. In particular any data-validation rules have to be enforced.
    • Isolation: Isolation ensures that any concurrent execution of transactions results in the exact same database state as if the transactions had been executed serially, one after the other.
    • Durability: The Durability requirement ensures that a committed transaction remains permanently committed, even in the event that the database crashes or later errors occur. You can think of this like an "autosave" function on every operation.

All the database systems I have mentioned above are ACID compliant[13].
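To make atomicity and consistency a little more concrete, here is a minimal sketch, again assuming the DBI and RSQLite packages as a stand-in for a server-based system. Two inserts are wrapped in one transaction. The second violates the primary-key rule, so the database rejects it and the first, perfectly valid insert is undone as well. The row labelled "duplicate" is made up for the demonstration.

```r
# Assumes that install.packages(c("DBI", "RSQLite")) has been run.
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbExecute(con, "CREATE TABLE proteins (
                  refSeq TEXT PRIMARY KEY,
                  name   TEXT NOT NULL)")
dbExecute(con, "INSERT INTO proteins VALUES ('NP_011036', 'Swi4')")

# Both statements belong to ONE transaction. The second one re-uses an
# existing refSeq and fails; dbWithTransaction() then rolls back everything.
outcome <- tryCatch(
  dbWithTransaction(con, {
    dbExecute(con, "INSERT INTO proteins VALUES ('NP_012881', 'Phd1')")
    dbExecute(con, "INSERT INTO proteins VALUES ('NP_011036', 'duplicate')")
    "committed"
  }),
  error = function(e) "rolled back")

# Only the original row is left: the transaction happened "not at all".
nRows <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM proteins")$n
print(outcome)
print(nRows)

dbDisconnect(con)
```

Getting the same guarantee for a data frame in memory would mean writing and testing this bookkeeping yourself.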



 

Data modelling

As you have seen above, the actual specification of a data model in R or as a sequence of SQL statements is quite technical, and not well suited to giving us the overview of the model's main features that we need for its design. We'll thus introduce a modelling convention: the Entity-Relationship Diagram (ERD). These are semi-formal diagrams that show the key features of the model. Currently we have only a single table defined, with a number of attributes.

If we think a bit about our model and its intended use, it should become clear that there are a number of problems. They have to do with efficiency, and internal consistency.

Problems include:

  • We don't have a unique identifier. The name "Mbp1" could appear more than once in our table. The database IDs for RefSeq and UniProt are unique (up to versions) but they mean something else and that can be very confusing.
  • The relationship between species name and taxonomy ID does not depend on the gene. In fact, we could record it differently in different gene records. This would make our database inconsistent.
  • We can't guarantee that the length of the sequence is correct; we might have made an error while updating. Since seqLen depends on the contents of sequence, it is redundant to store it separately.
  • A major issue is that we can't easily accommodate other features in our model. What about AT-hooks, disordered segments, phosphorylation sites, coiled-coil domains? ... (It may be that I know something you don't know yet.) And what about different versions of the KilA-N domain? Different databases annotate slightly different boundaries.
  • The way we have treated the Ankyrin domain ranges so far is really awkward. We should be able to represent more than one domain in our model.
An ERD for our data model so far. The "attributes" of our Protein table are shown.
Problems with our data model.


A first set of changes.
Unique identifier
Every entity in our data model should have its own, unique identifier. Typically this will simply be an integer that we should automatically increment with every new entry. Automatically. We have to be sure we don't make a mistake.
Move species/tax_id to separate table
If the relationship between two attributes does not actually depend on our protein, we move them to their own table. One identifier remains in our protein table. We call this a "foreign key". The relationship between the two tables is drawn as a line, and the cardinalities of the relationship are identified. "Cardinalities" means: how many entities of one table can be associated with one entity of the other table. Here, 0, n on the left side means: a given tax_ID does not have to actually occur in the protein table i.e. we can put species in the table for which we actually have no proteins. There could also be many ("n") proteins for one species in our database. On the right hand side 1 means: there is exactly one species annotated for each protein. No more, no less.
Remove redundant data

This is almost always a good idea. It's usually better just to compute seqLen or similar values from the data. The exception is if something is expensive to compute and/or used often. Then we may store the result in our data model, while making our procedures watertight so that we store the correct values.
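The three repairs above can be sketched with plain R data frames. This is only one possible layout, not a prescribed implementation: the helper names nextID() and seqLen(), the column name ID, and the shortened sequences are assumptions made for the example. The taxonomy ID is copied as it was entered in the records above.

```r
# A cut-down copy of the flat table (sequences shortened for display).
flat <- data.frame(
  name     = c("Swi4", "Phd1"),
  species  = "Saccharomyces cerevisiae",
  taxId    = "4392",               # as entered in the records above
  sequence = c("MPFDVLISNQ", "MYHVPEMRLH"),
  stringsAsFactors = FALSE)

# 1. Unique identifier: an integer key that a function assigns, never by hand.
nextID <- function(ids) {
  if (length(ids) == 0) 1L else max(ids) + 1L
}

protein <- data.frame(ID = integer(0), name = character(0),
                      sequence = character(0), taxId = character(0),
                      stringsAsFactors = FALSE)
for (i in seq_len(nrow(flat))) {
  protein <- rbind(protein,
                   data.frame(ID       = nextID(protein$ID),
                              name     = flat$name[i],
                              sequence = flat$sequence[i],
                              taxId    = flat$taxId[i],   # foreign key
                              stringsAsFactors = FALSE))
}

# 2. Species in its own table: each taxId/species pair is stored only once.
species <- unique(flat[, c("taxId", "species")])

# 3. No stored seqLen: compute it from the sequence whenever it is needed.
seqLen <- function(s) nchar(s)

# Joining the two tables on the foreign key recovers the flat view on demand.
merged <- merge(protein, species, by = "taxId")
merged$seqLen <- seqLen(merged$sequence)
print(merged[, c("ID", "name", "species", "seqLen")])
```

Because the species name now lives in exactly one row, two protein records can no longer disagree about it.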

These are relatively easy repairs. Treating the domain annotations correctly requires a bit more surgery.

An improved model of protein features. We store features in a table where we can describe them in general terms. We then create a separate table that annotates the start and end amino acid of each feature for a specific protein.

It's already awkward to work with a string like "21-93" when we need integer start and end values. We can parse them out, but it would be much more convenient if we could store them directly. But something like "369-455, 505-549" is really terrible. First of all, it takes effort to tell how many domains there are in the first place, and secondly, the parsing code becomes quite involved. That creates opportunities for errors in our logic and bugs in our code. And finally, what if we have more features that we want to annotate? Should we have attributes like KilA-N start, KilA-N end, Ankyrin 01 start, Ankyrin 01 end, Ankyrin 02 start, Ankyrin 02 end, Ankyrin 03 start, Ankyrin 03 end... No, that would be absurdly complicated and error prone.

There is a much better approach that solves all three problems at the same time. Just like with our species, we create a table that describes features. We can put any number of features there, even slightly different representations of canonical features from different data sources. Then we create a table that stores every feature occurrence in every protein. We call this a junction table, and it is an extremely common pattern in data models. Each entry in this table links exactly one protein with exactly one feature. Each protein can have 0, n features. And each feature can be found in 0, n proteins.
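A feature table and a junction table might look like this in R. Treat it as a sketch under assumptions: the table and column names, the integer keys, and the helper parseRanges() are invented for the illustration. The helper converts the old range strings into rows, which also shows how much code the string representation needs just to be read.

```r
# Feature table: one row per kind of feature, described in general terms.
feature <- data.frame(ID   = 1:2,
                      name = c("KilA-N", "Ankyrin"),
                      stringsAsFactors = FALSE)

# Convert a legacy string such as "369-455, 505-549" into one row per range,
# with integer start and end values. An empty string yields zero rows.
parseRanges <- function(s) {
  parts <- trimws(strsplit(s, ",")[[1]])
  parts <- parts[parts != ""]
  if (length(parts) == 0) {
    return(data.frame(start = integer(0), end = integer(0)))
  }
  bounds <- strsplit(parts, "-")
  data.frame(start = as.integer(sapply(bounds, `[`, 1)),
             end   = as.integer(sapply(bounds, `[`, 2)))
}

# Junction table: each row links exactly one protein with exactly one feature.
# proteinID 1 stands for Mbp1 here (a hypothetical key).
kil <- parseRanges("21-93")
ank <- parseRanges("369-455, 505-549")
proteinFeature <- rbind(
  data.frame(proteinID = 1L, featureID = 1L, kil),
  data.frame(proteinID = 1L, featureID = 2L, ank))
print(proteinFeature)

# How many Ankyrin annotations does protein 1 have? Just count rows.
nAnk <- sum(proteinFeature$proteinID == 1L & proteinFeature$featureID == 2L)
print(nAnk)
```

Adding a third Ankyrin domain, or a new kind of feature, now means adding rows rather than columns.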

With this simple schematic, we obtain an excellent overview about the logical structure of our data and how to represent it in code. Such models are essential for the design and documentation of any software project.



Time to put this into practice: design your own data model.

Task:

  • Use your imagination about what kind of data you think should be stored to study a system, such as the collaborating proteins that define the G1/S transition in the cell cycle.
  • Write down what you would like to store.
  • Sketch a relational data model for that data. Put it on paper, or print it out. Bring it to class for Tuesday's quiz. Your sketch will be handed in and graded by me. (Probably worth 2 marks.)


 

R

There is still some material left from our introduction to R:

Task:


 
That is all.


 

Links and resources

McInerny (2016) Cell cycle regulated transcription: from yeast to cancer. F1000Res 5:. (pmid: 27239285)

Recent studies have revealed exciting new functions for forkhead transcription factors in cell proliferation and development. Cell proliferation is a fundamental process controlled by multiple overlapping mechanisms, and the control of gene expression plays a major role in the orderly and timely division of cells. This occurs through transcription factors regulating the expression of groups of genes at particular phases of the cell division cycle. In this way, the encoded gene products are present when they are required. This review outlines recent advances in our understanding of this process in yeast model systems and describes how this knowledge has informed analysis in more developmentally complex eukaryotes, particularly where it is relevant to human disease.

McInerny (2011) Cell cycle regulated gene expression in yeasts. Adv Genet 73:51-85. (pmid: 21310294)

The regulation of gene expression through the mitotic cell cycle, so that genes are transcribed at particular cell cycle times, is widespread among eukaryotes. In some cases, it appears to be important for control mechanisms, as deregulated expression results in uncontrolled cell divisions, which can cause cell death, disease, and malignancy. In this review, I describe the current understanding of such regulated gene expression in two established simple eukaryotic model organisms, the budding yeast Saccharomyces cerevisiae and the fission yeast Schizosaccharomyces pombe. In these two yeasts, the global pattern of cell cycle gene expression has been well described, and most of the transcription factors that control the various waves of gene expression, and how they are in turn themselves regulated, have been characterized. As related mechanisms occur in all other eukaryotes, including humans, yeasts offer an excellent paradigm to understand this important molecular process.

Taylor et al. (2000) Characterization of the DNA-binding domains from the yeast cell-cycle transcription factors Mbp1 and Swi4. Biochemistry 39:3943-54. (pmid: 10747782)

The minimal DNA-binding domains of the Saccharomyces cerevisiae transcription factors Mbp1 and Swi4 have been identified and their DNA binding properties have been investigated by a combination of methods. An approximately 100 residue region of sequence homology at the N-termini of Mbp1 and Swi4 is necessary but not sufficient for full DNA binding activity. Unexpectedly, nonconserved residues C-terminal to the core domain are essential for DNA binding. Proteolysis of Mbp1 and Swi4 DNA-protein complexes has revealed the extent of these sequences, and C-terminally extended molecules with substantially enhanced DNA binding activity compared to the core domains alone have been produced. The extended Mbp1 and Swi4 proteins bind to their cognate sites with similar affinity [K(A) approximately (1-4) x 10(6) M(-)(1)] and with a 1:1 stoichiometry. However, alanine substitution of two lysine residues (116 and 122) within the C-terminal extension (tail) of Mbp1 considerably reduces the apparent affinity for an MCB (MluI cell-cycle box) containing oligonucleotide. Both Mbp1 and Swi4 are specific for their cognate sites with respect to nonspecific DNA but exhibit similar affinities for the SCB (Swi4/Swi6 cell-cycle box) and MCB consensus elements. Circular dichroism and (1)H NMR spectroscopy reveal that complex formation results in substantial perturbations of base stacking interactions upon DNA binding. These are localized to a central 5'-d(C-A/G-CG)-3' region common to both MCB and SCB sequences consistent with the observed pattern of specificity. Changes in the backbone amide proton and nitrogen chemical shifts upon DNA binding have enabled us to experimentally define a DNA-binding surface on the core N-terminal domain of Mbp1 that is associated with a putative winged helix-turn-helix motif. Furthermore, significant chemical shift differences occur within the C-terminal tail of Mbp1, supporting the notion of two structurally distinct DNA-binding regions within these proteins.

Deleeuw et al. (2008) Thermodynamics and specificity of the Mbp1-DNA interaction. Biochemistry 47:6378-85. (pmid: 18491920)

The DNA binding domain of the yeast transcription factor Mbp1 is a winged helix-turn-helix structure, with an extended DNA binding site involving C-terminal "tail" residues. The thermodynamics of the interaction of the DNA binding domain with its target DNA sequence have been determined using fluorescence anisotropy and calorimetry. The dissociation constant was determined as a function of pH and ionic strength in assessing the relative importance of specific and nonspecific ionic interactions. Mutational analysis of the residues in the binding site was used to determine their contributions to binding. The three tail histidine residues and His 63 in the recognition helix accounted for most of the pH dependence of the DNA binding. The tail histidine residues, along with two previously identified lysine residues, account for a major part of the polyelectrolyte contribution to binding and for the nonspecific affinity of Mbp1 for DNA. Gln67 was shown to be a very important residue, which interacts in the minor groove of the target DNA. Systematic mutations of the DNA consensus binding sites showed that the CGCG core contributes most to recognition. Isothermal titration calorimetry revealed a strong temperature-dependent enthalpy change, with a Delta Cp of -1.3kJ mol(-1) K(-1), consistent with a specific binding mode and burial of surface area. Parsing the free energy contributions demonstrates that polyelectrolyte effects account for half of the total free energy at the physiological pH and salt concentration. We present a model for the origin of the sequence specificity and overall affinity of the protein that accounts for the observed thermodynamics.


 
Further reading
Chernatynskaya et al. (2009) Structural analysis of the DNA target site and its interaction with Mbp1. Org Biomol Chem 7:4981-91. (pmid: 19907790)

The solution structure of a 14 base-pair non-self complementary DNA duplex containing the consensus-binding site of the yeast transcription factor Mbp1 has been determined by NMR using a combination of scalar coupling analysis, time-dependent NOEs, residual dipolar couplings and 13C-edited NMR spectroscopy of a duplex prepared with one strand uniformly labeled with 13C-nucleotides. As expected, the free DNA duplex is within the B-family of structures, and within experimental limits is straight. However, there are clear local structural variations associated with the consensus CGCG element in the binding sequence that are important for sequence recognition. In the complex, the DNA bends around the protein, which also undergoes some conformational rearrangement in the C-terminal region. Structural constraints derived from paramagnetic perturbation experiments with spin-labeled DNA, chemical shift perturbation experiments of the DNA, previous cross-saturation, chemical shift perturbation experiments on the protein, information from mutational analysis, and electrostatics calculations have been used to produce a detailed docked structure using the known solution conformation of the free protein and other spectroscopic information about the Mbp1:DNA complex. A Monte Carlo-based docking procedure with restrained MD in a fully solvated system subjected to available experimental constraints produced models that account for the available structural data, and can rationalize the extensive thermodynamic data about the Mbp1:DNA complex. The protein:DNA interface is closely packed and is associated with a small number of specific contacts. The structure shows an extensive positively charged surface that accounts for the high polyelectrolyte contribution to binding.

Database normalization


 



Footnotes and references

  1. We have drafted a system definition in class: A system is a collection of collaborating genes that have more significant relationships among each other than to genes that are not system members.
  2. McInerny (2011) Cell cycle regulated gene expression in yeasts. Adv Genet 73:51-85. (pmid: 21310294)


  3. McInerny (2016) Cell cycle regulated transcription: from yeast to cancer. F1000Res 5:. (pmid: 27239285)


  4. Technically, GitHub documents are all publicly accessible if they are stored in repositories of free accounts - but you can commit binary files, so simply keep sensitive material in password-protected .zip files or otherwise encrypt it.
  5. Actually, that's not even literally true. You could write a function to use the "Viewer Pane" for very general cross-referencing.
  6. If you find this URL hard to remember, consider the acronyms:
    ncbi.nlm.nih.gov
    NCBI: National Center for Biotechnology Information
    NLM: National Library of Medicine
    NIH: National Institutes of Health
    GOV: the US GOVernment top-level domain
  7. If there is only a single match, you will be taken directly to the page.
  8. Actually the "real" SwissProt identifier would be patterned like MBP1_YEAST. P39678 is the corresponding UniProt identifier.
  9. A good way to consolidate your knowledge is to summarize it for everyone on the Entrez page of the Student Wiki, or enhance the information you find there.
  10. Your operating system can help you keep the files organized. The "file system" is a database.
  11. For real: Excel is miserable and often wrong on statistics, and it makes horrible, ugly plots. See here and here why Excel problems are not merely cosmetic.
  12. At the bottom of the window there is a menu that says "sum = ..." by default. This provides simple calculations on the selected range. Set the choice to "count", select all Ankyrin domain entries, and the count shows you how many cells actually have a value.
  13. For a list of relational Database Management Systems, see here.


 

Ask, if things don't work for you!

If anything about the assignment is not clear to you, please ask on the mailing list. You can be certain that others will have had similar problems. Success comes from joining the conversation.



