Difference between revisions of "BIO Assignment Week 2"

From "A B C"
Line 2: Line 2:
 
<div class="b1">
 
<div class="b1">
 
Assignment for Week 2<br />
 
Assignment for Week 2<br />
<span style="font-size: 70%">Scenario, Labnotes on the Wiki, R-functions, Databases, Sequence in Chimera (and optionally: small molecules)</span>
+
<span style="font-size: 70%">Scenario, Labnotes, R-functions, Databases, Data Modeling</span>
 
</div>
 
</div>
  
  
{{Template:Inactive}}
+
{{Template:active}}
  
 
Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.  
 
Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.  
Line 16: Line 16:
 
&nbsp;
 
&nbsp;
 
==The Scenario==
 
==The Scenario==
 +
<div class="colmask doublepage">
 +
  <div class="colleft">
 +
    <div class="col1">
 +
      <!-- Column 1 start -->
 +
I have introduced the concept of "{{WP|cargo cult science}}" in class. The "cargo" in Bioinformatics is to understand biology. This includes understanding how things came to be the way they are, and how they work. Both relate to the concept of '''function''' of biomolecules, and the systems they contribute to. But "function" is a rather poorly defined concept and exploring ways to make it rigorous and computable will be the major objective of this course. The realm of bioinformatics contains many kingdoms and duchies and shires and hidden glades. To find out how they contribute to the whole, we will proceed on a quest. We will take a relatively well-characterized protein that is part of a relatively well-characterized process, and ask what its function is. We will examine the protein's sequence, its structure, its domain composition, its relationship to and interactions with other proteins, and through that paint a picture of a "system" that it contributes to.
  
Baker's yeast, ''Saccharomyces cerevisiae'', is perhaps the most important {{WP|Model_organism|model organism}}. It is a eukaryote that has been studied genetically and biochemically in great detail for many decades, and it is easily manipulated with high-throughput experimental methods. We will use information from this model organism to study the conservation of function and sequence in other fungi whose genomes have been completely sequenced; the assignments are an exercise in model-organism reasoning: the transfer of knowledge from one, well-studied organism to others.  
+
Our quest will revolve around a <span id="tf"></span>{{WP|Transcription factor|transcription factor}} that plays an important role in the regulation of the cell cycle: '''Mbp1''' is a key component of the MBF complex (Mbp1/Swi6) in yeast. This complex regulates gene expression at the crucial G1/S-phase transition of the mitotic cell cycle and has been shown to bind to the regulatory regions of more than a hundred target genes. It is therefore a DNA binding protein that acts as a control switch for a key cellular process.
  
This and the following assignments will revolve around a {{WP|Transcription factor|transcription factor}} that plays an important role in the regulation of the cell cycle: '''Mbp1''' is a key component of the MBF complex (Mbp1/Swi6). This complex regulates gene expression at the crucial G1/S-phase transition of the mitotic cell cycle and has been shown to bind to the regulatory regions of more than a hundred target genes. It is therefore a DNA binding protein that acts as a control switch for a key cellular process.
+
We will start our quest with information about the Mbp1 protein of Baker's yeast, ''Saccharomyces cerevisiae'', one of the most important {{WP|Model_organism|model organisms}}. Baker's yeast is a eukaryote that has been studied genetically and biochemically in great detail for many decades, and it is easily manipulated with high-throughput experimental methods. But each of you will use this information to study not Baker's yeast, but a related organism. You will explore the function of the Mbp1 protein in some other species from the {{WP|Kingdom_(biology)|kingdom}} of fungi, whose genome has been completely sequenced; thus our quest is also an exercise in ''model-organism reasoning'': the transfer of knowledge from one well-studied organism to others.
 +
      <!-- Column 1 end -->
 +
    </div>
 +
    <div class="col2">
 +
      <!-- Column 2 start -->
 +
It's reasonable to hypothesize that such central control machinery is conserved in most if not all fungi. But we don't know. Many of the species that we will be working with have not been characterized in great detail, and some of them are new to our class this year. And while we know a fair bit about Mbp1, we probably don't know very much at all about the related genes in other organisms: whether they exist, whether they have similar functional features and whether they might contribute to the ''G1/S checkpoint system'' in a similar way. Thus we might discover things that are new and interesting. This is a quest of discovery.  
  
One would speculate that such central control machinery would be conserved in other fungi and it will be your task in these assignments to collect evidence whether related molecular components are present in some of the newly sequenced fungal genomes. Throughout the assignments we will use freely available tools to conduct bioinformatics investigations of sequences, structures and relationships that may ultimately answer questions such as:
+
Here are the steps of the assignment for this week:
  
*Do related proteins exist in other organisms?
+
<div class="task">
*What functional features can we detect in the related proteins?
+
# We'll need to explore what data is available for the Mbp1 protein.
*Do we have evidence that they may bind to similar sequence motifs?
+
# We'll need to pick a species to adopt for exploration.
*Do we believe they may function in a similar way?
+
# We'll need to define what data we want to store and design a datamodel.  
 
 
{{task|1=
 
Access the [http://db.yeastgenome.org/cgi-bin/locus.pl?locus=mbp1 information page on Mbp1] at the ''Saccharomyces'' Genome Database and read the summary paragraph on the protein's function!
 
 
 
(If you would like to brush up on the concepts mentioned above, you could study the corresponding chapter in [http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=mcb.chapter.3432 Lodish's Molecular Cell Biology] and/or read Nobel laureate {{PDFlink|[http://www.cumc.columbia.edu/dept/eukaryotic/nurse.pdf Paul Nurse's review]}} of the key concepts of the eukaryotic cell cycle. It is not strictly necessary to understand the details of the yeast cell cycle to complete the assignments, but it's obviously more satisfying to work with concepts that actually make some sense.)
 
}}
 
 
 
For reference, this is the FASTA-formatted sequence of Mbp1 from ''Saccharomyces cerevisiae'':
 
 
 
>gi|6320147|ref|NP_010227.1| Mbp1p [Saccharomyces cerevisiae S288c]
 
MSNQIYSARYSGVDVYEF<span style="color:#DD0000;">IHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGF
 
GKYQGTWVPLNIAKQLAEKFSVY</span>DQLKPLFDFTQTDGSASPPPAPKHHHASKVDRKKAIRSASTSAIMET
 
KRNNKKAEENQFQSSKILGNPTAAPRKRGRPVGSTRGSRRKLGVNLQRSQSDMGFPRPAIPNSSISTTQL
 
PSIRSTMGPQSPTLGILEEERHDSRQQQPQQNNSAQFKEIDLEDGLSSDVEPSQQLQQVFNQNTGFVPQQ
 
QSSLIQTQQTESMATSVSSSPSLPTSPGDFADSNPFEERFPGGGTSPIISMIPRYPVTSRPQTSDINDKV
 
NKYLSKLVDYFISNEMKSNKSLPQVLLHPPPHSAPYIDAPIDPELHTAFHWACSMGNLPIAEALYEAGTS
 
IRSTNSQGQTPLMRSSLFHNSYTRRTFPRIFQLLHETVFDIDSQSQTVIHHIVKRKSTTPSAVYYLDVVL
 
SKIKDFSPQYRIELLLNTQDKNGDTALHIASKNGDVVFFNTLVKMGALTTISNKEGLTANEIMNQQYEQM
 
MIQNGTNQHVNSSNTDLNIHVNTNNIETKNDVNSMVIMSPVSPSDYITYPSQIATNISRNIPNVVNSMKQ
 
MASIYNDLHEQHDNEIKSLQKTLKSISKTKIQVSLKTLEVLKESSKDENGEAQTNDDFEILSRLQEQNTK
 
KLRKRLIRYKRLIKQKLEYRQTVLLNKLIEDETQATTNNTVEKDNNTLERLELAQELTMLQLQRKNKLSS
 
LVKKFEDNAKIHKYRRIIREGTEMNIEEVDSSLDVILQTLIANNNKNKGAEQIITISNANSHA
 
 
 
I have highlighted the protein's <span style="color:#DD0000;">'''APSES''' domain</span> (also known as a {{WP|KilA-N domain}}), which is the DNA binding element of the sequence. Of course, such colouring is not part of the actual {{WP|FASTA_format|FASTA}} file which contains only a header and sequence letters. This is the domain we will focus on most in the following assignments.
 
 
 
 
 
===Choosing YFO (Your Favourite Organism)===
 
 
 
 
 
The first task is to choose a species in which to conduct your explorations.
 
 
 
 
 
Many fungal genomes have been sequenced and more are added each year. For the purposes of the course assignments, we need a species
 
* that has transcription factors containing APSES domains;
 
* whose genome has been completely sequenced;
 
* for which records exist in the RefSeq database, NCBI's unique sequence collection.
 
 
 
 
 
<div class="mw-collapsible mw-collapsed" data-expandtext="Expand for details" data-collapsetext="Collapse">To prepare such a list of species, I have searched the NCBI's RefSeq database for proteins whose sequences are similar to the APSES domain of Mbp1 and compiled the names of organisms that contain them.
 
<div class="mw-collapsible-content">
 
 
 
:(1) Compiled a list of genome-sequenced fungi from information on the [http://www.ncbi.nlm.nih.gov/genome/browse/ NCBI genome browser page] by selecting Eukaryota / Fungi ... and downloading the entire list of species as a text document. An excerpt of the first lines of the document is shown here:
 
 
 
#Organism/Name            Kingdom    Group  SubGroup        Size (Mb)
 
Aciculosporium take      Eukaryota  Fungi  Ascomycetes    58.8364
 
Agaricus bisporus        Eukaryota  Fungi  Basidiomycetes  32.6144
 
Ajellomyces capsulatus    Eukaryota  Fungi  Ascomycetes    46.124
 
Ajellomyces dermatitidis  Eukaryota  Fungi  Ascomycetes    75.4047
 
[...]
 
 
 
:(2) Reformatted the document to provide an Entrez species selection command. With this string NCBI search tools can be constrained to a set of species we are interested in. One could type this list by hand, or use search/replace functions of a text editor on the original list. I used the following Perl one-liner which I give here merely for your edification<ref>If you are curious how this works, ask me.</ref>.
 
<br />
 
::<small><code>perl -e 'while(<STDIN>){/^(.+?)\t/;print"\"$1\"[organism] OR \n"}' < genomes_overview.txt
 
</code></small>
 
 
 
... giving me the Entrez selection command (with over 400 species):
 
 
 
"Aciculosporium take"[organism] OR
 
"Agaricus bisporus"[organism] OR
 
"Ajellomyces capsulatus"[organism] OR
 
"Ajellomyces dermatitidis"[organism] ...
 
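For those who prefer to stay in '''R''', the same reformatting can be sketched as follows. This is only an illustration, not the command I actually ran: the three sample lines stand in for the downloaded file (assumed to be tab-delimited, as the Perl expression implies), and the variable names are made up for this example. Unlike the Perl one-liner, this version drops the header line and does not leave a trailing <code>OR</code>.

<source lang="rsplus">
# Hypothetical sample standing in for the tab-delimited genomes_overview.txt
genomeLines <- c(
  "#Organism/Name\tKingdom\tGroup\tSubGroup\tSize (Mb)",
  "Aciculosporium take\tEukaryota\tFungi\tAscomycetes\t58.8364",
  "Agaricus bisporus\tEukaryota\tFungi\tBasidiomycetes\t32.6144"
)

# drop the header line, which starts with "#"
genomeLines <- genomeLines[!grepl("^#", genomeLines)]

# keep only the first column: remove everything from the first tab onward
organisms <- sub("\t.*$", "", genomeLines)

# quote each name, append the [organism] field tag, and join with " OR "
entrezQuery <- paste0('"', organisms, '"[organism]', collapse = " OR ")

cat(entrezQuery, "\n")
</source>

With a real file, the sample vector would be replaced by something like <code>readLines("genomes_overview.txt")</code>.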
 
 
 
 
:(3) Performed a [http://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&BLAST_PROGRAMS=blastp&PAGE_TYPE=BlastSearch&SHOW_DEFAULTS=on&LINK_LOC=blasthome PSI-BLAST] search with the Mbp1 APSES domain sequence shown above, the search database restricted to the '''refseq_protein database''' and an '''Entrez Query''' created as explained above. This search was iterated a few times and retrieves all sequence-similar proteins from genome-sequenced fungi for which entries exist in the RefSeq database.<ref>Actually, there is a bit of a detour required here: the list of selection commands is too long and had to be broken down into four batches of about 100 species to be processed by the BLAST server.</ref>
 
 
 
:(4) In the header of the BLAST results page, there is a link to '''[Taxonomy reports]'''. This contains a list of all hits, sorted by species. I copied the species names to a separate file and applied a bit of manual editing: removing duplicate genus entries, and the six reference species ''Saccharomyces cerevisiae'', ''Aspergillus nidulans'', ''Candida albicans'', ''Neurospora crassa'', ''Schizosaccharomyces pombe'', and ''Ustilago maydis'', which are not being assigned to the class.
 
 
 
 
 
:(5) Finally, I extracted a 5 letter code from the binomial names and formatted everything as '''R''' code to be used below. Again, a Perl one-liner. It applies a regular expression to extract the first three characters of the genus name and the first two characters of the species name and combines these into a short, uppercase label.<br/>
 
::<small><br /><code>perl -e 'while(<STDIN>){m/^((...).+?\s(..).*?)\s/;print("\t\t\"$1 (", uc($2.$3), ")\",\n");}' < BLAST_species.txt</code></small>
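The label construction described in step (5) can also be sketched in '''R'''. This is an illustrative equivalent, not the code I used; the function name <code>makeLabel</code> is invented for this example, and it assumes that every input is a genus name and a species name separated by whitespace.

<source lang="rsplus">
# Build a five-letter label from a binomial name: the first three letters
# of the genus plus the first two letters of the species, in uppercase.
# (makeLabel is a hypothetical helper written for this illustration.)
makeLabel <- function(binomial) {
  parts <- strsplit(binomial, "\\s+")      # one character vector per name
  vapply(parts,
         function(p) toupper(paste0(substr(p[1], 1, 3), substr(p[2], 1, 2))),
         character(1),
         USE.NAMES = FALSE)
}

binomials <- c("Coccidioides immitis", "Wallemia sebi")
labels <- makeLabel(binomials)

# format as in the species list: "Genus species (LABEL)"
sprintf("%s (%s)", binomials, labels)
</source>

Note that nothing here guarantees the labels are unique; as with the Perl version, the resulting list would still need to be checked by eye.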
 
 
 
This process, with its mix of Web services, programmed reformatting and manual cleanup, is a fairly typical example of gathering and collating information across different data sources.
 
</div>
 
 
</div>
 
</div>
  
&nbsp;
+
However, before we head off into the Internet: have you thought about how to document such a "quest"? How will you keep notes? Obviously, computational research proceeds with the same best-practice principles as any wet-lab experiment. We have to keep notes, ensure our work is reproducible, and that our conclusions are supported by data. I think it's pretty obvious that paper notes are not very useful for bioinformatics work. Ideally, you should be able to save results, and link to files and Webpages.
<div class="mw-collapsible mw-collapsed" data-expandtext="Expand for details" data-collapsetext="Collapse">
+
      <!-- Column 2 end -->
Next, I would like to assign species from this list to each student. This process should be random, but reproducible.
+
    </div>
 
+
  </div>
<div class="mw-collapsible-content"> Here's an idea: we could use the student ID (a '''unique identifier''') to pick entries from the list! Indeed, the functions provided in '''R''' can easily be used to randomly but reproducibly choose an element from a list. Essentially we can write a function that creates a many-faced die, with a piece of text&mdash;a species' name&mdash;on every face. It will generally fall differently for different student IDs, but will fall the same every time the same ID is encountered.
 
 
 
This makes use of the fact that "random" numbers generated by a computer algorithm aren't really random: they are "pseudorandom", generated by a deterministic algorithm. Such an algorithm takes a number&mdash;a ''seed''&mdash;and mangles it until the result has no recognizable connection to the seed. For practical purposes, the result is indistinguishable from a random number. However, if we use the ''same seed'', we will always get the ''same result''. Such a random pick can be programmed with the following steps:
 
# Create a list
 
# Initialize a random number generator with a student ID as a seed
 
# pick a random integer "''i''" in the range from first to last element of the list
 
# return the ''i''-th list element.
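
Before looking at the full function, it may help to see the seeding behaviour on its own. The following is a minimal sketch; the seed values are arbitrary numbers chosen for this illustration, not student IDs.

<source lang="rsplus">
# Seeding twice with the same value reproduces the same "random" numbers.
set.seed(112358)
first <- runif(3)      # three pseudorandom numbers between 0 and 1

set.seed(112358)       # re-seed with the same value ...
second <- runif(3)     # ... and draw again

identical(first, second)   # TRUE: same seed, same sequence

# A different seed starts the generator in a different state.
set.seed(112359)
third <- runif(3)
print(rbind(first, second, third))
</source>

Comparing the rows of the printed matrix shows what the seed does: the first two rows are the same, and the third row comes from a different starting state.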
 
</div>
 
 
</div>
 
</div>
  
Here is '''R''' code to accomplish this:
 
  
{{task|
+
&nbsp;
  
* Read, try to understand and then execute the following R-code.
+
===Keeping Labnotes===
  
<source lang="rsplus">
+
<div class="colmask doublepage">
pickSpecies <- function(ID) {
+
  <div class="colleft">
# this function randomly picks a fungal species
+
    <div class="col1">
# from a list. It is seeded by a student ID. Therefore
+
      <!-- Column 1 start -->
# the pick is random, but reproducible.
+
Consider it a part of your assignment to document your activities in electronic form. Here are some applications you might think of - but (!) disclaimer, I myself don't use any of these (yet) <small>(except the Wiki of course)</small>.
 
# first, define a list of species:
 
Species <- c(
 
"Agaricus bisporus (AGABI)",
 
"Ajellomyces dermatitidis (AJEDE)",
 
"Arthroderma otae (ARTOT)",
 
"Ashbya gossypii (ASHGO)",
 
"Auricularia delicata (AURDE)",
 
"Baudoinia compniacensis (BAUCO)",
 
"Beauveria bassiana (BEABA)",
 
"Bipolaris oryzae (BIPOR)",
 
"Botrytis cinerea (BOTCI)",
 
"Capronia coronata (CAPCO)",
 
"Chaetomium globosum (CHAGL)",
 
"Cladophialophora psammophila (CLAPS)",
 
"Clavispora lusitaniae (CLALU)",
 
"Coccidioides immitis (COCIM)",
 
"Colletotrichum fioriniae (COLFI)",
 
"Coniophora puteana (CONPU)",
 
"Coniosporium apollinis (CONAP)",
 
"Coprinopsis cinerea (COPCI)",
 
"Cryptococcus neoformans (CRYNE)",
 
"Cyphellophora europaea (CYPEU)",
 
"Debaryomyces hansenii (DEBHA)",
 
"Dichomitus squalens (DICSQ)",
 
"Endocarpon pusillum (ENDPU)",
 
"Eutypa lata (EUTLA)",
 
"Exophiala dermatitidis (EXODE)",
 
"Fomitiporia mediterranea (FOMME)",
 
"Fusarium graminearum (FUSGR)",
 
"Glarea lozoyensis (GLALO)",
 
"Gloeophyllum trabeum (GLOTR)",
 
"Kazachstania africana (KAZAF)",
 
"Kluyveromyces lactis (KLULA)",
 
"Komagataella pastoris (KOMPA)",
 
"Laccaria bicolor (LACBI)",
 
"Lachancea thermotolerans (LACTH)",
 
"Leptosphaeria maculans (LEPMA)",
 
"Lodderomyces elongisporus (LODEL)",
 
"Magnaporthe oryzae (MAGOR)",
 
"Malassezia globosa (MALGL)",
 
"Marssonina brunnea (MARBR)",
 
"Melampsora larici-populina (MELLA)",
 
"Metarhizium acridum (METAC)",
 
"Meyerozyma guilliermondii (MEYGU)",
 
"Microsporum gypseum (MICGY)",
 
"Millerozyma farinosa (MILFA)",
 
"Moniliophthora roreri (MONRO)",
 
"Myceliophthora thermophila (MYCTH)",
 
"Naumovozyma castellii (NAUCA)",
 
"Nectria haematococca (NECHA)",
 
"Neofusicoccum parvum (NEOPA)",
 
"Neosartorya fischeri (NEOFI)",
 
"Paracoccidioides sp. (PARSP)",
 
"Pestalotiopsis fici (PESFI)",
 
"Phaeosphaeria nodorum (PHANO)",
 
"Phanerochaete carnosa (PHACA)",
 
"Pneumocystis murina (PNEMU)",
 
"Podospora anserina (PODAN)",
 
"Postia placenta (POSPL)",
 
"Pseudocercospora fijiensis (PSEFI)",
 
"Pseudozyma flocculosa (PSEFL)",
 
"Puccinia graminis (PUCGR)",
 
"Punctularia strigosozonata (PUNST)",
 
"Pyrenophora tritici-repentis (PYRTR)",
 
"Scheffersomyces stipitis (SCHST)",
 
"Schizophyllum commune (SCHCO)",
 
"Sclerotinia sclerotiorum (SCLSC)",
 
"Serpula lacrymans (SERLA)",
 
"Setosphaeria turcica (SETTU)",
 
"Sordaria macrospora (SORMA)",
 
"Spathaspora passalidarum (SPAPA)",
 
"Stereum hirsutum (STEHI)",
 
"Talaromyces marneffei (TALMA)",
 
"Tetrapisispora blattae (TETBL)",
 
"Thielavia terrestris (THITE)",
 
"Togninia minima (TOGMI)",
 
"Torulaspora delbrueckii (TORDE)",
 
"Trametes versicolor (TRAVE)",
 
"Tremella mesenterica (TREME)",
 
"Trichoderma reesei (TRIRE)",
 
"Trichophyton rubrum (TRIRU)",
 
"Tuber melanosporum (TUBME)",
 
"Uncinocarpus reesii (UNCRE)",
 
"Vanderwaltozyma polyspora (VANPO)",
 
"Verticillium alfalfae (VERAL)",
 
"Wallemia sebi (WALSE)",
 
"Yarrowia lipolytica (YARLI)",
 
"Zygosaccharomyces rouxii (ZYGRO)",
 
"Zymoseptoria tritici (ZYMTR)"
 
)
 
l <- length(Species)    # number of elements in the list
 
set.seed(ID)           # seed the random number generator
 
                        # with the student ID
 
i <- runif(1, 0, 1)    # pick one random number between 0 and 1
 
i <- l * i              # multiply with number of elements
 
i <- ceiling(i)         # round up to nearest integer
 
choice <- Species[i]    # pick the i'th element from list
 
return(choice)
 
}
 
</source>
 
  
* Execute the function <code>pickSpecies()</code> with your student ID as its parameter. Example:
+
*[http://evernote.com '''Evernote'''] - a web hosted, automatically syncing e-notebook.
 +
*[http://nevernote.sourceforge.net/ '''Nevernote'''] - the Open Source alternative to Evernote.
 +
*[https://keep.google.com/ '''Google Keep'''] - if you have a Gmail account, you can simply log in here. Grid-based. Seems a bit awkward for longer notes. But of course you can also use [http://drive.google.com '''Google Docs'''].
 +
*[http://www.onenote.com/ '''Microsoft OneNote'''] - this sounds interesting and even though I have had my share of problems with Microsoft products, I'll probably give this a try. Syncing across platforms, being able to format contents and organize it sounds great.
 +
*[http://steipe.biochemistry.utoronto.ca/abc/students '''The Student Wiki'''] - of course. You can keep your course notes with your User pages.
  
<source lang="text">
+
Are you aware of any other solutions? Let us know!
> pickSpecies(991234567)
 
[1] "Coccidioides immitis (COCIM)"
 
</source>
 
* Note down the species name and its five letter label on your student Wiki page. '''Use this species whenever this or future assignments refer to YFO'''.
 
}}
 
  
 +
      <!-- Column 1 end -->
 +
    </div>
 +
    <div class="col2">
 +
      <!-- Column 2 start -->
  
 +
'''Keeping such a journal will be helpful, because the assignments are integrated over the entire term''', and later assignments will make use of earlier results. But it is also excellent practice for "real" research. Expand the section below for details - written from a Wiki perspective but generally applicable.
  
{{task|
+
      <!-- Column 2 end -->
* While you already have '''R''' open, access the  [[R tutorial|'''R tutorial''']] and work through the section on [[R tutorial#Simple_commands|'''Simple commands''']]. It is short, and will help you understand the code above.
+
    </div>
}}
+
  </div>
 +
</div>
  
  
 +
<div class="mw-collapsible mw-collapsed" style="background-color: #DAE9F5;" data-expandtext="Expand for details" data-collapsetext="Collapse">&nbsp;
 +
<div class="mw-collapsible-content">
  
&nbsp;
+
<div class="colmask doublepage">
 +
  <div class="colleft">
 +
    <div class="col1">
 +
      <!-- Column 1 start -->
 +
Remember you are writing a lab notebook&mdash;not a formal lab report: a point-form record of your actual activities. Write such documentation as notes to your (future) self.
  
===Keeping a notebook on your Wiki===
 
  
 +
Create a lab-notes page as a subpage of your User space on [http://steipe.biochemistry.utoronto.ca/abc/students '''the Student Wiki'''].
  
<div class="mw-collapsible mw-collapsed" data-expandtext="Expand for details" data-collapsetext="Collapse">Consider it a part of your assignment to document your activities on your Wiki page.
 
<div class="mw-collapsible-content">
 
You should write your documentation like a lab notebook&mdash;not a formal lab report, but a point-form record of your actual activities. Write such documentation as notes to your (future) self. Obviously, since much of the work will be done on the Web, an electronic notebook makes more sense than a paper notebook.
 
  
 
For each task:
 
For each task:
 
*;Write a header and give it a unique number.
 
*;Write a header and give it a unique number.
:: This is useful so you can refer to the header number in later text. Obviously, you should "hard-code" the number and not use the Wiki's automatic section numbering scheme, since the numbers should be stable over time, not change when you add or delete a section.
+
:: This is useful so you can refer to the header number in later text. Obviously, you should "hard-code" the number and not use the Wiki's automatic section numbering scheme, since the numbers should be stable over time, not change when you add or delete a section. It may be useful to add new contents at the top, so you don't have to scroll to the bottom of the page every time you add new material. This does not have to be in strict chronological order, like we would have it in a paper notebook. It may be advantageous to give different subprojects their own page, or at least order them on one page. Just remember that things that are on the same page are easy to find.
  
 
*;State the objective.
 
*;State the objective.
Line 263: Line 98:
  
 
*;Document the procedure.
 
*;Document the procedure.
:: Note what you have done, as concisely as possible. Give enough information so that anyone could reproduce unambiguously what you have done&mdash; your future project student, or even your future self.
+
:: Note what you have done, as concisely as possible but with sufficient detail. I am often asked: "What is sufficient detail"? The answer is easy: detailed enough so that someone can reproduce what you have done. In practice that guy will often be you, yourself, in the future. I hope that you won't be constantly cursing your past-self because of omissions!
  
 
*;Document your results.
 
*;Document your results.
 
: You can distinguish different types of results -
 
: You can distinguish different types of results -
  
**'''Static data''' does not change over time and it may be sufficient to note a '''reference''' to the result. For example, there is no need to copy a GenBank record into your documentation; it is sufficient to note the accession number or the GI number.
+
**'''Static data''' does not change over time and it may be sufficient to note a '''reference''' to the result. For example, there is no need to copy a GenBank record into your documentation; it is sufficient to note the accession number or the GI number, or better, to link to it.
 
**'''Variable data''' can change over time. For example the results of a BLAST search depend on the sequences in the database. A list of similar structures may change as new structures get solved. In principle you want to record such data, to be able to reproduce at a later time what your conclusions were based on. But be '''selective''' in what you record. For example you should not paste the entire set of results of a BLAST search into your document, but only those matches that were important for your conclusions. '''Indiscriminate pasting of irrelevant information will make your notes unusable.'''
 
**'''Variable data''' can change over time. For example the results of a BLAST search depend on the sequences in the database. A list of similar structures may change as new structures get solved. In principle you want to record such data, to be able to reproduce at a later time what your conclusions were based on. But be '''selective''' in what you record. For example you should not paste the entire set of results of a BLAST search into your document, but only those matches that were important for your conclusions. '''Indiscriminate pasting of irrelevant information will make your notes unusable.'''
 
**'''Analysis results'''
 
**'''Analysis results'''
Line 274: Line 109:
  
 
*;Note your conclusions.
 
*;Note your conclusions.
::An analysis is not complete unless you conclude something from the results. (Remember what we said about "Cargo Cult Science". If there is no conclusion possible, your activities are quite pointless.) Are two sequences likely homologues, or not? Does your protein contain a signal-sequence or does it not? Is a binding site conserved, or not? The analysis provides the data. In your '''conclusion''' you provide the interpretation of what the data means '''in the context of your objective'''. Sometimes your assignment task will ask you to elaborate on an analysis and conclusion. But this does not mean that when the assignment does not explicitly mention it, you don't need to interpret your data.
+
::'''An analysis is not complete unless you conclude something from the results.''' (Remember what we said about "Cargo Cult Science". If there is no conclusion, your activities are quite pointless.) Are two sequences likely homologues, or not? Does your protein contain a signal-sequence or does it not? Is a binding site conserved, or not? The analysis provides the data. In your '''conclusion''' you provide the interpretation of what the data means '''in the context of your objective'''. Were you expecting a signal-sequence but there isn't one? What could that mean? Sometimes your assignment task in this course will ask you to elaborate on an analysis and conclusion. But this does not mean that when I don't explicitly mention it, you can skip the interpretation.
  
 +
*;Add cross-references.
 +
::Cross-references to other information are super valuable as your documentation grows. It's easy to see how to format a link to a section of your Wiki-page: just look at the link under the Table of Contents at the top. But you can also place "anchors" for linking anywhere on an HTML page: just use the following syntax: <code>&lt;span id="{some-label}"&gt;&lt;/span&gt;</code> for the anchor, and append <code>#{some-label}</code> to the page URL. Try this here: <small>(http://steipe.biochemistry.utoronto.ca/abc/Assignment_2#tf) </small>.
 +
      <!-- Column 1 end -->
 +
    </div>
 +
    <div class="col2">
 +
      <!-- Column 2 start -->
 
*;Use discretion when uploading images
 
*;Use discretion when uploading images
 
::I have enabled image uploading with some reservations, we'll see how it goes. You must '''not''':
 
::I have enabled image uploading with some reservations, we'll see how it goes. You must '''not''':
Line 284: Line 125:
  
 
*;Prepare your images well
 
*;Prepare your images well
::Don't upload uncompressed screendumps. Save images in a compressed file format on your own computer. Then use the '''Special:Upload''' link in the left-hand menu to upload images. The Wiki will only accept <code>.jpeg</code> or <code>.png</code> images.
+
::Don't upload uncompressed screen dumps. Save images in a compressed file format on your own computer. Then use the '''Special:Upload''' link in the left-hand menu to upload images. The Wiki will only accept <code>.jpeg</code> or <code>.png</code> images.
  
 
*;Use the correct image types.
 
*;Use the correct image types.
Line 299: Line 140:
 
;Keep your images uncluttered and expressive
 
;Keep your images uncluttered and expressive
 
:Scale the molecular model to fill the available space of your image well. Orient views so they illustrate a point you are trying to make. Emphasize residues that you are writing about with a contrasting colouring scheme. Add labels, where residue identities are not otherwise obvious. Turn off side-chains for residues that are not important. The more you practice these small details, the more efficient you will become in the use of your tools.
 
:Scale the molecular model to fill the available space of your image well. Orient views so they illustrate a point you are trying to make. Emphasize residues that you are writing about with a contrasting colouring scheme. Add labels, where residue identities are not otherwise obvious. Turn off side-chains for residues that are not important. The more you practice these small details, the more efficient you will become in the use of your tools.
 +
 +
:If you have technical difficulties, post your questions to the list and/or contact me.      <!-- Column 2 end -->
 +
    </div>
 +
  </div>
 +
</div>
 +
  
  
:If you have technical difficulties, post your questions to the list and/or contact me.
 
 
</div>
 
</div>
 
</div>
 
</div>
 
Keeping such a journal will be helpful, because the assignment is more or less integrated over the entire term, and later assignments will make use of earlier results. But it is also excellent practice for "real" research.
 
  
  
 
&nbsp;
 
&nbsp;
  
==NCBI databases==
+
==Data Sources==
 
 
Let us explore some of the offerings of the NCBI that can contribute to our objective of studying a particular gene in a newly sequenced organism.
 
  
  
===Entrez===
+
===SGD - a Yeast Model Organism Database===
 +
<div class="colmask doublepage">
 +
  <div class="colleft">
 +
    <div class="col1">
 +
      <!-- Column 1 start -->
 +
Yeast happens to have a very well maintained '''model organism database''' - a Web resource dedicated to ''Saccharomyces cerevisiae''. Where such resources are available, they are very useful for the community. For the general case however, we need to work with one of the large, general data providers - the NCBI and the EBI. But in order to get a sense of the type of data that is available, let's visit the SGD database first.
  
 
{{task|1=
 
{{task|1=
<small>Remember to document your activities.</small>
+
Access the [http://db.yeastgenome.org/cgi-bin/locus.pl?locus=mbp1 information page on Mbp1] at the ''Saccharomyces'' Genome Database.
 
 
# Access the '''NCBI''' website at http://www.ncbi.nlm.nih.gov/  
 
# In the search bar, enter <code>mbp1</code> and click '''Search'''.
 
# On the resulting page, look for the '''Protein''' section and click on it. What do you find?
 
}}
 
 
 
 
 
The result page of your search in "All Databases" is the "Global Query Result Page" of the Entrez system. If you follow the "Protein" link, you get taken to the 200 or so sequences in the NCBI Protein database. But looking at that page, you see that the result is quite non-specific: searching only by gene name retrieves an ''Arabidopsis'' protein, a ''Saccharomyces'' protein (presumably one that we might be interested in), Maltose Binding Proteins - and much more. There must be a more specific way to search, and indeed there is. Time to read up on the Entrez system.
 
 
 
  
{{task|1=
+
<ol>
 +
<li>Browse through the '''Summary''' page and note the available information: you should see:
 +
  <ul>
 +
    <li>information about the gene and the protein;
 +
    <li>Information about its roles in the cell curated at the Gene Ontology database;
 +
    <li>Information about knock-out phenotypes; <small>(Amazing. Would you have imagined that this is a non-essential gene?)</small>
 +
    <li>Information about protein-protein interactions;
 +
    <li>Regulation and expression;
 +
    <li>'''A curators' summary of our understanding of the protein.''' Mandatory reading.
 +
    <li>And key references.
 +
  </ul>
 +
<li>Access the [http://www.yeastgenome.org/locus/S000002214/protein '''Protein''' tab] and note the much more detailed information.
 +
  <ul>
 +
    <li>Domains and their classification;
 +
    <li>Sequence;
 +
    <li>Shared domains;
 +
    <li>and much more...
 +
  </ul>
  
# Navigate to the [http://www.ncbi.nlm.nih.gov/books/NBK3837/ Entrez Help Page] and read about the Entrez system, especially about:
+
</ol>
##Boolean operators,
 
##wildcards,
 
##limits, and
 
##filters.
 
# You should minimally understand:
 
## How to search by keyword;
 
## How to search by gene or protein name;
 
## How to restrict a search to a particular organism.
 
  
Don't skip this part: you don't need to know the options by heart, but you should know they exist and how to find them.
 
 
}}
 
}}
  
 +
      <!-- Column 1 end -->
 +
    </div>
 +
    <div class="col2">
 +
      <!-- Column 2 start -->
 +
You will notice that some of this information relates to the molecule itself, and some of it relates to its relationship with other molecules. Some of it is stored at SGD, and some of it is cross-referenced from other databases. And we have textual data, numeric data, and images.
  
Keyword and organism searches are pretty universal, but apart from that, each NCBI database has its own set of specific fields. You can access them via the Advanced Search interface of any of the database pages.
+
How would you store such data to use it in your project? We will work on this question at the end of the assignment.
 
 
 
 
===Protein===
 
 
 
  
 
&nbsp;
 
&nbsp;
{{task|1=
 
 
Now try the search for Mbp1 in Baker's Yeast alone. Return to the Global Search page and enter:
 
 
Mbp1[protein name] AND "Saccharomyces cerevisiae"[organism]
 
 
}}
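The fielded query above can also be assembled programmatically. The following is a minimal sketch, assuming you want to send the same search to the NCBI E-utilities <code>esearch</code> endpoint instead of typing it into the Web form; the helper names <code>fielded</code>, <code>entrez_term</code> and <code>esearch_url</code> are made up for this illustration, and the code only builds the URL without contacting the NCBI.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities search service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def fielded(term, field):
    """Attach an Entrez field tag; multi-word terms are quoted as a phrase."""
    if " " in term:
        term = f'"{term}"'
    return f"{term}[{field}]"


def entrez_term(*clauses):
    """Combine fielded clauses with the Boolean operator AND."""
    return " AND ".join(clauses)


def esearch_url(db, term):
    """Build (but do not send) an esearch request URL for one database."""
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": term})


# The same query as in the task above (hypothetical variable names).
term = entrez_term(
    fielded("Mbp1", "protein name"),
    fielded("Saccharomyces cerevisiae", "organism"),
)
query_url = esearch_url("protein", term)
print(term)
print(query_url)
```

Writing the query as a string first makes it easy to paste the identical text into your lab notes, which keeps the search reproducible.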
 
 
 
This should find one and only one protein. Follow the link into the protein database: since this is only one record, the link takes you directly to the result, a data record in GenBank flat file format, rather than to a list of hits as before. Explore the record and familiarize yourself with the information that is there.
 
 
All well and good - but didn't we want to find '''RefSeq''' entries, since that is expected to be the database of unique, curated sequence records? I can't tell you why the RefSeq result was not listed among the search results. But I can at least tell you how to find it:
 
 
 
{{task|1=
 
 
# In the right-hand margin of the record, you will find a section of '''Identical proteins ...''': click on '''See all...''' to list them all. Among these, find the entry with an accession number like <code>NP_123456</code>. This is a RefSeq ID. Follow the link.
 
# Explore the resulting page. You will notice that the information elements are not identical, even though these are sequence records for one and the same yeast gene product, in two similar databases, at the same data provider!
 
# Note down the RefSeq ID, you will probably need it later on.
 
}}
 
 
 
All well and good, and the Mbp1 protein is going to accompany us throughout the term&mdash;but we were actually trying to find related proteins in YFO. Let's give that a try.
 
 
 
{{task|1=
 
 
# Again in the right hand margin, find the section on '''Related Information''' and follow the link to '''Related Sequences'''. There are many. More than 21,000 actually<ref>21,000 related, non-identical sequences! What a treasure trove of information, the successful results of millennia of experimentation by nature. Now, if we could only read and understand this information ...</ref>. Definitely more than you would like to browse through to find the sequences in YFO. Let's use a filter on these results.
 
# Click on the '''Advanced''' link to access the search history that brought you here. Since you have read the Entrez page, you should be able to understand clearly that you can type something like
 
#4 AND "Schizosaccharomyces pombe"[organism]
 
... or whatever your command-history number and YFO name suggest.
 
 
You should find a handful of genes, all of them in YFO. If you find none, or hundreds, you did something wrong. Ask on the mailing list and make sure to fix the problem.
 
}}
 
 
 
This is '''one''' way to find related sequences: by accessing precomputed results at the NCBI. We will however explore much more principled approaches in a later assignment. Let's leave the sequence searches for the moment, and explore other information on Yeast Mbp1 that may be useful for annotating the related sequences in YFO.
 
 
===PubMed===
 
 
 
Arguably one of the most important databases in the life sciences is PubMed and this is a good time to look at PubMed in a bit more detail than you might have done previously.
 
 
 
{{task|1=
 
 
# Return to the '''MBP1''' RefSeq record. If you have already closed it, simply enter the RefSeq ID into the search field for a Protein database search and find it again.
 
# Find the '''PubMed''' links under '''Related information''' in the right-hand margin and explore them. One will take you only to information related to the actual RefSeq record; the others find more broadly related information. PubMed(weighted) appears to give a pretty good overview of systems-biology type, cross-sectional and functional information. But none of the searches finds '''all''' Mbp1 related literature.
 
# Again, enter the '''Advanced''' query page and study the keywords that apply to PubMed searches. These are actually quite important and useful to remember. Make yourself familiar with the section on [http://www.ncbi.nlm.nih.gov/books/NBK3827/ '''Search field descriptions and tags'''] in the PubMed help document (in particular <tt>[DP]</tt>, <tt>[AU]</tt>, <tt>[TI]</tt>, and <tt>[TA]</tt>), with how you use the ''History'' to combine searches, and with the use of <tt>AND</tt>, <tt>OR</tt>, <tt>NOT</tt> and brackets. Understand how you can restrict a search to ''reviews'' only, and what the link to '''Related citations...''' is useful for.
 
# Now find publications with Mbp1 '''in the title'''. In the result list, follow the links for the two Biochemistry papers by Taylor ''et al.'' (2000) and by Deleeuw ''et al.'' (2008). Download the PDFs, we will need them later.
 
 
}}
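The field tags and Boolean operators from the task above combine into a single query string. Here is a small sketch of how such a string could be composed; the helper functions <code>tagged</code>, <code>all_of</code> and <code>any_of</code> are hypothetical names introduced only for this example, and the resulting query has not been checked against the live PubMed index, so treat it as a starting point for your own experiments.

```python
def tagged(term, tag):
    """Attach a PubMed field tag such as TI, AU, DP or TA to a term."""
    return f"{term}[{tag}]"


def all_of(*clauses):
    """Require every clause (Boolean AND)."""
    return " AND ".join(clauses)


def any_of(*clauses):
    """Accept any clause (Boolean OR); brackets keep the grouping explicit."""
    return "(" + " OR ".join(f"({clause})" for clause in clauses) + ")"


# Author and publication year for the two papers named in the task.
taylor_2000 = all_of(tagged("Taylor", "AU"), tagged("2000", "DP"))
deleeuw_2008 = all_of(tagged("Deleeuw", "AU"), tagged("2008", "DP"))

# Mbp1 in the title, published in the journal Biochemistry, by either group.
query = all_of(
    tagged("Mbp1", "TI"),
    tagged("Biochemistry", "TA"),
    any_of(taylor_2000, deleeuw_2008),
)
print(query)
```

The brackets matter: without them it is no longer obvious which clauses the <tt>OR</tt> applies to.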
 
 
 
==Structure search==
 
  
 
+
<hr style="width:33%; text-align:right; margin-right:0; height:1px;border-width:0;background-color:#999999;">
The search options in the PDB structure database are as sophisticated as those at the NCBI. For now, we will try a simple keyword search to get us started.
 
 
 
 
 
{{task|
 
# Visit the RCSB PDB website at http://www.pdb.org/
 
# Briefly orient yourself regarding the database contents and its information offerings and services.
 
# Enter <code>Mbp1</code> into the search field.
 
# In your journal, note down the PDB IDs for the three ''Saccharomyces cerevisiae'' Mbp1 transcription factor structures your search has retrieved.
 
# Click on one of the entries and explore the information and services linked from that page.
 
}}
 
  
 
&nbsp;
 
&nbsp;
  
==Chimera==
+
If we were working on yeast, most data we need is right here: curated, kept current and consistent, referenced to the literature and ready to use. But you'll be working on a different species and we'll explore the much, much larger databases at the NCBI for this. The upside is that most of the information like this '''is available''' for your species. The downside is that we'll have to integrate information from many different sources "by hand".
  
In this task we will explore the sequence interface of Chimera, use it to select specific parts of a molecule, and colour specific regions (or residues) of a molecule separately.
+
      <!-- Column 2 end -->
 +
    </div>
 +
  </div>
 +
</div>
  
&nbsp;
 
{{task|
 
# Open Chimera.
 
# One of the three yeast Mbp1 fragment structures has the PDB ID <code>1BM8</code>. Load it in Chimera (simply enter the ID into the appropriate field of the '''File''' &rarr; '''Fetch by ID...''' window).
 
# Display the protein in '''Presets''' &rarr; '''Interactive&nbsp;1''' mode and familiarize yourself with its topology of helices and strands.
 
# Open the sequence tool: '''Tools''' &rarr; '''Sequence''' &rarr; '''Sequence'''. You will see the sequence for each chain - here there is only one chain. By default, coloured rectangles overlay the secondary structure elements of the sequence.
 
# Hover the mouse over some residues and note that the sequence number and chain are shown at the bottom of the window.
 
# Click/drag one residue to select it. <small>(A simple click won't work; you need to drag a little for the selection to catch on.)</small> Note that the residue gets a green overlay in the sequence window, as it also gets selected with a green border in the graphics window.
 
# At the bottom of the sequence window, there are instructions on how to select (multiple) regions. Try this: colour the protein light grey ('''Select''' &rarr; '''Select&nbsp;All'''; '''Actions''' &rarr; '''Color''' &rarr; '''light&nbsp;gray'''). Clear the selection. Now select all the helical regions (pale yellow boxes) by click/dragging and using the shift key. Color them red. Then select all the strands by clicking into any of the pale green boxes and color them green.
 
# Finally, generate a stereo-view that shows the molecule well, in which the protein is coloured dark grey, and the APSES domain residues (as defined in the FASTA listing above, from I19 to Y93) are coloured with a colour ramp ('''Tools''' &rarr; '''Depiction''' &rarr; '''Rainbow''')<ref>The [https://www.cgl.ucsf.edu/chimera/1.2065/docs/ContributedSoftware/rainbow/rainbow.html Rainbow tool] can only create color ramps for an entire molecule. In order to achieve this effect: color the molecule with a color ramp, then select the APSES domain, then '''invert the selection''' and color the new selection dark grey.</ref>
 
# Show the first and last residue's CA atom<ref>See [https://www.cgl.ucsf.edu/chimera/docs/UsersGuide/midas/frameatom_spec.html '''here'''] for details of the specification syntax.</ref> as a sphere and colour the first one blue (to mark the N-terminus) and the last one red. E.g.:
 
##'''Select''' &rarr; '''Atom&nbsp;specifier''' &rarr; <code>:4@CA</code>
 
##'''Actions''' &rarr; '''Ribbon''' &rarr; '''hide'''
 
##'''Actions''' &rarr; '''Atoms/bonds''' &rarr; '''show'''
 
##'''Actions''' &rarr; '''Atoms/bonds''' &rarr; '''sphere'''
 
##'''Actions''' &rarr; '''Color''' &rarr; '''cornflower&nbsp;blue'''
 
##Then click on the selection inspector (the green button with the magnifying glass at the lower right of the graphics window) and set the sphere radius to 1.0Å.
 
# Save the image in JPEG format ('''File''' &rarr; '''Save&nbsp;Image''') and upload it to your journal on the Student Wiki.
 
}}
 
  
  
 
&nbsp;
 
&nbsp;
  
== Stereo vision ==
+
===NCBI===
  
{{task|
 
Continue with your stereo practice.
 
  
Practice at least ...
+
'''TBC'''
* two times daily,
 
* for 3-5 minutes each session.
 
  
* Measure your interocular distance and your fusion distance as explained '''[http://biochemistry.utoronto.ca/steipe/abc/students/index.php/Stereo_vision_data here on the Student Wiki]''' and add it to the table.
 
}}
 
  
Keep up your practice throughout the course. '''Once again: do not go through your practice sessions mechanically. If you are not making constant progress in your practice sessions, contact me so we can help you on the right track.'''
+
==Choosing YFO (Your Favourite Organism)==
  
== Modeling small molecules (optional) ==
 
  
 +
'''TBC'''
  
As an optional part of the assignment, here is a small tutorial for modeling and visualizing "small-molecule" structures.
 
  
 +
==Data modelling==
  
 +
'''TBC'''
  
=== Defining a molecule ===
 
  
 
+
&nbsp;
A number of public repositories make small molecule information available, such as [http://pubchem.ncbi.nlm.nih.gov/ PubChem] at the NCBI, the ligand collection at the [http://pdb.org '''PDB'''], the [http://www.ebi.ac.uk/chebi/ ChEBI] database at the European Bioinformatics Institute, or the [http://cactus.nci.nih.gov/ncidb2.2/ NCI database browser] at the US National Cancer Institute. One general way to export topology information from these services is to use {{WP|SMILES|SMILES strings}}&mdash;a shorthand notation for the composition and topology of chemical compounds.
 
 
 
 
 
{{task|
 
# Access each of the databases mentioned above.
 
# Enter "caffeine" as a search term.
 
# Explore the contents of the result, in particular note and copy the SMILES string for the compound.
 
}}
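A SMILES string is plain text, so you can sanity-check what you copied before you pass it to another tool. The sketch below counts the non-hydrogen atoms in a SMILES string with a deliberately simple regular expression; it is a toy tokenizer written for this illustration (the function name <code>count_heavy_atoms</code> is made up), not a full SMILES parser, and it makes no attempt to check valence, ring closures or stereochemistry. The caffeine string used here is the commonly quoted one; compare it with what the databases gave you.

```python
import re
from collections import Counter

# One atom per match: a bracket atom like [nH] or [Na+], an unbracketed
# "organic subset" atom (two-letter symbols first), or an aromatic atom.
ATOM_PATTERN = re.compile(
    r"\[\d*([A-Z][a-z]?|[a-z]{1,2})[^\]]*\]"  # bracket atom, optional isotope
    r"|(Cl|Br|[BCNOPSFI])"                    # aliphatic organic subset
    r"|([bcnops])"                            # aromatic organic subset
)


def count_heavy_atoms(smiles):
    """Count non-hydrogen atoms per element symbol in a SMILES string."""
    counts = Counter()
    for bracket, organic, aromatic in ATOM_PATTERN.findall(smiles):
        if bracket:
            symbol = bracket.capitalize()
        else:
            symbol = organic or aromatic.upper()
        if symbol != "H":
            counts[symbol] += 1
    return counts


caffeine = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
caffeine_atoms = count_heavy_atoms(caffeine)
print(dict(caffeine_atoms))
```

For caffeine the counts should agree with the molecular formula C8H10N4O2, ignoring the hydrogens, which are implicit in the SMILES string.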
 
 
 
 
 
Alternatively, you can sketch your own compound. Versions of Peter Ertl's {{WP|JME_editor|Java Molecular Editor (JME)}} are offered on several websites (e.g. click on '''Transfer to Java Editor''' on a NCI results page), and PubChem offers this functionality via its '''Sketcher''' tool.
 
 
 
{{task|
 
# Navigate to [http://pubchem.ncbi.nlm.nih.gov/ PubChem].
 
# Follow the link to '''Chemical structure search''' (in the right hand menu).
 
# Click on the '''3D conformer''' tab and on the '''Launch''' button to launch the molecular editor in its own window.
 
# Sketch the structure of caffeine. I find the editor quite intuitive but if you need help, just use the '''Help''' button in the editor.
 
# Save the SMILES string of your compound.
 
# Also '''Export''' your result in SMILES format as a file.
 
}}
 
 
 
=== Translating SMILES to structure ===
 
 
 
 
 
Online services exist to translate SMILES to (idealized) coordinates.
 
 
 
{{task|
 
# Access the [http://cactus.nci.nih.gov/translate/ online SMILES translation service] at the NCI.
 
# Paste a caffeine SMILES string into the form, choose the '''PDB''' radio button, click on '''Translate''' and download your file.
 
# Load the molecule in Chimera.
 
}}
 
 
 
Chimera also has a function to translate SMILES to coordinates.
 
 
 
{{task|
 
# In Chimera:
 
##'''File''' &rarr; '''Close&nbsp;Session'''.
 
##'''Tools''' &rarr; '''Structure&nbsp;Editing''' &rarr; '''Build&nbsp;Structure'''.
 
##Select '''SMILES string''', paste the string and click '''Apply'''.
 
# The caffeine molecule will be generated and visualized in the graphics window.
 
}}
 
  
 
;That is all.
 

Revision as of 19:29, 25 September 2015

Assignment for Week 2
Scenario, Labnotes, R-functions, Databases, Data Modeling


Note! This assignment is currently active. All significant changes will be announced on the mailing list.

 
 

Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.



 

The Scenario

I have introduced the concept of "cargo cult science" in class. The "cargo" in Bioinformatics is to understand biology. This includes understanding how things came to be the way they are, and how they work. Both relate to the concept of function of biomolecules, and the systems they contribute to. But "function" is a rather poorly defined concept and exploring ways to make it rigorous and computable will be the major objective of this course. The realm of bioinformatics contains many kingdoms and duchies and shires and hidden glades. To find out how they contribute to the whole, we will proceed on a quest. We will take a relatively well-characterized protein that is part of a relatively well-characterized process, and ask what its function is. We will examine the protein's sequence, its structure, its domain composition, its relationship to and interactions with other proteins, and through that paint a picture of a "system" that it contributes to.

Our quest will revolve around a transcription factor that plays an important role in the regulation of the cell cycle: Mbp1 is a key component of the MBF complex (Mbp1/Swi6) in yeast. This complex regulates gene expression at the crucial G1/S-phase transition of the mitotic cell cycle and has been shown to bind to the regulatory regions of more than a hundred target genes. It is therefore a DNA binding protein that acts as a control switch for a key cellular process.

We will start our quest with information about the Mbp1 protein of Baker's yeast, Saccharomyces cerevisiae, one of the most important model organisms. Baker's yeast is a eukaryote that has been studied genetically and biochemically in great detail for many decades, and it is easily manipulated with high-throughput experimental methods. But each of you will use this information to study not Baker's yeast, but a related organism. You will explore the function of the Mbp1 protein in some other species from the kingdom of fungi, whose genome has been completely sequenced; thus our quest is also an exercise in model-organism reasoning: the transfer of knowledge from one, well-studied organism to others.

It's reasonable to hypothesize that such central control machinery is conserved in most if not all fungi. But we don't know. Many of the species that we will be working with have not been characterized in great detail, and some of them are new to our class this year. And while we know a fair bit about Mbp1, we probably don't know very much at all about the related genes in other organisms: whether they exist, whether they have similar functional features and whether they might contribute to the G1/S checkpoint system in a similar way. Thus we might discover things that are new and interesting. This is a quest of discovery.

Here are the steps of the assignment for this week:

  1. We'll need to explore what data is available for the Mbp1 protein.
  2. We'll need to pick a species to adopt for exploration.
  3. We'll need to define what data we want to store and design a data model.

However, before we head off into the Internet: have you thought about how to document such a "quest"? How will you keep notes? Obviously, computational research proceeds with the same best-practice principles as any wet-lab experiment. We have to keep notes, ensure our work is reproducible, and that our conclusions are supported by data. I think it's pretty obvious that paper notes are not very useful for bioinformatics work. Ideally, you should be able to save results, and link to files and Webpages.


 

Keeping Labnotes

Consider it a part of your assignment to document your activities in electronic form. Here are some applications you might think of - but (!) disclaimer, I myself don't use any of these (yet) (except the Wiki of course).

  • Evernote - a web hosted, automatically syncing e-notebook.
  • Nevernote - the Open Source alternative to Evernote.
  • Google Keep - if you have a Gmail account, you can simply log in here. Grid-based. Seems a bit awkward for longer notes. But of course you can also use Google Docs.
  • Microsoft OneNote - this sounds interesting and even though I have had my share of problems with Microsoft products, I'll probably give this a try. Syncing across platforms, being able to format contents and organize it sounds great.
  • The Student Wiki - of course. You can keep your course notes with your User pages.

Are you aware of any other solutions? Let us know!

Keeping such a journal will be helpful, because the assignments are integrated over the entire term, and later assignments will make use of earlier results. But it is also excellent practice for "real" research. Expand the section below for details - written from a Wiki perspective but generally applicable.


 

Remember you are writing a lab notebook—not a formal lab report: a point-form record of your actual activities. Write such documentation as notes to your (future) self.


Create a lab-notes page as a subpage of your User space on the Student Wiki.


For each task:

  • Write a header and give it a unique number.
This is useful so you can refer to the header number in later text. Obviously, you should "hard-code" the number and not use the Wiki's automatic section numbering scheme, since the numbers should be stable over time, not change when you add or delete a section. It may be useful to add new contents at the top, so you don't have to scroll to the bottom of the page every time you add new material. This does not have to be in strict chronological order, like we would have it in a paper notebook. It may be advantageous to give different subprojects their own page, or at least order them on one page. Just remember that things that are on the same page are easy to find.
  • State the objective.
In one brief sentence, restate what your task is supposed to achieve.
  • Document the procedure.
Note what you have done, as concisely as possible but with sufficient detail. I am often asked: "What is sufficient detail"? The answer is easy: detailed enough so that someone can reproduce what you have done. In practice that guy will often be you, yourself, in the future. I hope that you won't be constantly cursing your past-self because of omissions!
  • Document your results.
You can distinguish different types of results -
    • Static data does not change over time and it may be sufficient to note a reference to the result. For example, there is no need to copy a GenBank record into your documentation, it is sufficient to note the accession number or the GI number, or better, to link to it.
    • Variable data can change over time. For example the results of a BLAST search depend on the sequences in the database. A list of similar structures may change as new structures get solved. In principle you want to record such data, to be able to reproduce at a later time what your conclusions were based on. But be selective in what you record. For example you should not paste the entire set of results of a BLAST search into your document, but only those matches that were important for your conclusions. Indiscriminate pasting of irrelevant information will make your notes unusable.
    • Analysis results
The results of sequence analyses, alignments etc. in general get recorded in your documentation. Again: be selective. Record what is important.
  • Note your conclusions.
An analysis is not complete unless you conclude something from the results. (Remember what we said about "Cargo Cult Science". If there is no conclusion, your activities are quite pointless.) Are two sequences likely homologues, or not? Does your protein contain a signal-sequence or does it not? Is a binding site conserved, or not? The analysis provides the data. In your conclusion you provide the interpretation of what the data means in the context of your objective. Were you expecting a signal-sequence but there isn't one? What could that mean? Sometimes your assignment task in this course will ask you to elaborate on an analysis and conclusion. But this does not mean that when I don't explicitly mention it, you can skip the interpretation.
  • Add cross-references.
Cross-references to other information are super valuable as your documentation grows. It's easy to see how to format a link to a section of your Wiki-page: just look at the link under the Table of Contents at the top. But you can also place "anchors" for linking anywhere on an HTML page: just use the following syntax: <span id="{some-label}"></span> for the anchor, and append #{some-label} to the page URL. Try this here: (http://steipe.biochemistry.utoronto.ca/abc/Assignment_2#tf) .
  • Use discretion when uploading images
I have enabled image uploading with some reservations, we'll see how it goes. You must not:
  • upload images that are irrelevant for this course;
  • upload copyrighted images;
  • upload any images that are larger than 500 kb. I may silently remove large images when I encounter them.
Moreover, understand that any of your uploaded images may be deleted at any time. If they are valuable to you, keep backups on your own machine.
  • Prepare your images well
Don't upload uncompressed screen dumps. Save images in a compressed file format on your own computer. Then use the Special:Upload link in the left-hand menu to upload images. The Wiki will only accept .jpeg or .png images.
  • Use the correct image types.
In principle, images can be stored uncompressed as .tiff or .bmp, or compressed as .gif, .jpg or .png.
    • .gif is useful for images with large, monochrome areas and sharp, high-contrast edges because the LZW compression algorithm it uses works especially well on such data.
    • .jpg (or .jpeg) is preferred for images with shades and halftones, such as the structure views you should prepare for several assignments. JPEG has excellent application support and is the most versatile general-purpose image file format currently in use.
    • .tiff (or .tif) is preferred to archive master copies of images in a lossless fashion. Use LZW compression for TIFF files if your system/application supports it.
    • .png is an open source alternative for lossless, compressed images. Application support is growing but still variable.
    • .bmp is not preferred for really anything: it is bloated in its (default) uncompressed form and primarily used only because it is simple to code and ubiquitous on Windows computers.
Image dimensions and resolution
Stereo images should have equivalent points approximately 6cm apart. It depends on your monitor how many pixels this corresponds to. The dimensions of an image are stated in pixels (width x height). My notebook screen has a native display resolution of 1440 x 900 pixels/23.5 x 21 cm. Therefore a 6cm separation on my notebook corresponds to approximately 260 pixels. However on my desktop monitor, 260 pixels is 6.7 cm across. And on a high-resolution iPad display, at 227 ppi (pixels per inch), 260 pixels are just 2.9 cm across. For the assignments: adjust your stereo images so they are approximately at the right separation and are approximately 500 to 600 pixels wide. Also, scale your molecules so they fill the available window and - if you have depth cueing enabled - move them close to the front clipping plane so the molecule is not just a dim blob, lost in murky shadows.
Considerations for print (manuscripts etc.) are slightly different: for print output you can specify the output resolution in dpi (dots per inch). A typical print resolution is about 300 dpi: 6 cm separation at 300dpi is about 700 pixels. Print images should therefore be about three times as large in width and height as screen images.
Preparation of stereo views
When assignments ask you to create molecular images, always create stereo views.
Keep your images uncluttered and expressive
Scale the molecular model to fill the available space of your image well. Orient views so they illustrate a point you are trying to make. Emphasize residues that you are writing about with a contrasting colouring scheme. Add labels, where residue identities are not otherwise obvious. Turn off side-chains for residues that are not important. The more you practice these small details, the more efficient you will become in the use of your tools.
If you have technical difficulties, post your questions to the list and/or contact me.
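The pixel arithmetic for stereo image separation described above is easy to redo for your own display. This is a minimal sketch with hypothetical function names; the notebook density is derived here from the 900-pixel, 21 cm screen height quoted above, which is an assumption about how the 260-pixel figure was obtained.

```python
CM_PER_INCH = 2.54


def ppi_from_screen(pixels, length_cm):
    """Pixel density (pixels per inch) from a screen dimension."""
    return pixels / (length_cm / CM_PER_INCH)


def pixels_for(length_cm, ppi):
    """Number of pixels that span a physical length at a given density."""
    return length_cm / CM_PER_INCH * ppi


def cm_for(pixels, ppi):
    """Physical length covered by a number of pixels at a given density."""
    return pixels / ppi * CM_PER_INCH


# Notebook screen: 900 pixels over 21 cm of height (values from the text).
notebook_ppi = ppi_from_screen(900, 21)
notebook_px = pixels_for(6, notebook_ppi)   # 6 cm stereo separation
ipad_cm = cm_for(260, 227)                  # 260 pixels at 227 ppi
print_px = pixels_for(6, 300)               # 6 cm at 300 dpi for print

print(round(notebook_px), round(ipad_cm, 1), round(print_px))
```

Measure your own screen with a ruler and substitute your numbers; the point is only that the same 6 cm corresponds to very different pixel counts on different devices.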



 

Data Sources

SGD - a Yeast Model Organism Database

Yeast happens to have a very well maintained model organism database - a Web resource dedicated to Saccharomyces cerevisiae. Where such resources are available, they are very useful for the community. For the general case however, we need to work with one of the large, general data providers - the NCBI and the EBI. But in order to get a sense of the type of data that is available, let's visit the SGD database first.

Task:
Access the information page on Mbp1 at the Saccharomyces Genome Database.

  1. Browse through the Summary page and note the available information: you should see:
    • information about the gene and the protein;
    • Information about its roles in the cell curated at the Gene Ontology database;
    • Information about knock-out phenotypes; (Amazing. Would you have imagined that this is a non-essential gene?)
    • Information about protein-protein interactions;
    • Regulation and expression;
    • A curators' summary of our understanding of the protein. Mandatory reading.
    • And key references.
  2. Access the Protein tab and note the much more detailed information.
    • Domains and their classification;
    • Sequence;
    • Shared domains;
    • and much more...

You will notice that some of this information relates to the molecule itself, and some of it relates to its relationship with other molecules. Some of it is stored at SGD, and some of it is cross-referenced from other databases. And we have textual data, numeric data, and images.

How would you store such data to use it in your project? We will work on this question at the end of the assignment.
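To make the question concrete, here is one possible, deliberately incomplete sketch of a record that keeps a protein's own attributes apart from its cross-references to other databases. The class and field names are hypothetical and this is not the data model the course will arrive at; the two identifiers are the SGD locus ID from the link above and the PDB ID 1BM8 mentioned elsewhere on this page.

```python
from dataclasses import dataclass, field


@dataclass
class CrossReference:
    """A pointer to a record held in some other database."""
    database: str
    identifier: str


@dataclass
class ProteinRecord:
    """Data about the molecule itself, plus links to external records."""
    name: str
    organism: str
    cross_references: list = field(default_factory=list)

    def add_reference(self, database, identifier):
        """Store a link instead of copying the external record."""
        self.cross_references.append(CrossReference(database, identifier))

    def ids_for(self, database):
        """Return all identifiers recorded for one database."""
        return [ref.identifier for ref in self.cross_references
                if ref.database == database]


mbp1 = ProteinRecord(name="Mbp1", organism="Saccharomyces cerevisiae")
mbp1.add_reference("SGD", "S000002214")
mbp1.add_reference("PDB", "1BM8")
print(mbp1.ids_for("PDB"))
```

Storing identifiers rather than copies follows the same reasoning as the advice on static data in your lab notes: a reference is enough when the record itself does not change.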

 


 

If we were working on yeast, most data we need is right here: curated, kept current and consistent, referenced to the literature and ready to use. But you'll be working on a different species and we'll explore the much, much larger databases at the NCBI for this. The upside is that most of the information like this is available for your species. The downside is that we'll have to integrate information from many different sources "by hand".


 

NCBI

TBC


Choosing YFO (Your Favourite Organism)

TBC


Data modelling

TBC


 

That is all.


 

Links and resources

 


Footnotes and references


 

Ask, if things don't work for you!

If anything about the assignment is not clear to you, please ask on the mailing list. You can be certain that others will have had similar problems. Success comes from joining the conversation.