Difference between revisions of "BIO Assignment Week 2"
Revision as of 14:07, 20 September 2012
Assignment for Week 2
Scenario, Databases, Search and Retrieve
Note! This assignment is currently inactive. Major and minor unannounced changes may be made at any time.
Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.
The Scenario
In this assignment you will install the molecular graphics viewer VMD on your own computer, work through a tutorial on its use and begin practicing the skill of viewing split-screen stereographic scenes without aids. You will also install the statistics workbench R, and work through selected parts of an introductory tutorial.
Choosing YFO (Your Favourite Organism)
Task:
- Execute the following R-code.
pickSpecies <- function(ID) {
    # this function randomly picks a fungal species
    # from a list. It is seeded by a student ID. Therefore
    # the pick is random, but reproducible.

    # first, define a list of species:
    Species <- c(
        "Ajellomyces dermatitidis (AJEDE)",
        "Arthroderma gypseum (ARTGY)",
        "Ashbya gossypii (ASHGO)",
        "Aspergillus clavatus (ASPCL)",
        "Aspergillus flavus (ASPFL)",
        "Botryotinia fuckeliana (BOTFU)",
        "Candida glabrata (CANGL)",
        "Chaetomium globosum (CHAGL)",
        "Clavispora lusitaniae (CLALU)",
        "Coccidioides immitis (COCIM)",
        "Coprinopsis cinerea (COPCI)",
        "Debaryomyces hansenii (DEBHA)",
        "Gibberella zeae (GIBZE)",
        "Kluyveromyces lactis (KLULA)",
        "Komagataella pastoris (KOMPA)",
        "Laccaria bicolor (LACBI)",
        "Lachancea thermotolerans (LACTH)",
        "Lodderomyces elongisporus (LODEL)",
        "Magnaporthe oryzae (MAGOR)",
        "Malassezia globosa (MALGL)",
        "Meyerozyma guilliermondii (MEYGU)",
        "Nectria haematococca (NECHA)",
        "Neosartorya fischeri (NEOFI)",
        "Paracoccidioides brasiliensis (PARBR)",
        "Penicillium chrysogenum (PENCH)",
        "Puccinia graminis (PUCGR)",
        "Pyrenophora teres (PYRTE)",
        "Scheffersomyces stipitis (SCHST)",
        "Schizophyllum commune (SCHCO)",
        "Phaeosphaeria nodorum (PHANO)",
        "Schizosaccharomyces japonicus (SCHJA)",
        "Sclerotinia sclerotiorum (SCLSC)",
        "Talaromyces stipitatus (TALST)",
        "Trichophyton rubrum (TRIRU)",
        "Uncinocarpus reesii (UNCRE)",
        "Vanderwaltozyma polyspora (VANPO)",
        "Verticillium albo-atrum (VERAL)",
        "Yarrowia lipolytica (YARLI)",
        "Zygosaccharomyces rouxii (ZYGRO)"
    )
    l <- length(Species)   # number of elements in the list
    set.seed(ID)           # seed the random number generator
                           # with the student ID
    i <- runif(1, 0, 1)    # pick one random number between 0 and 1
    i <- l * i             # multiply with number of elements
    i <- ceiling(i)        # round up to nearest integer
    choice <- Species[i]   # pick the i'th element from list
    return(choice)
}
- Execute the function pickSpecies() with your student ID as its parameter. Example:
> pickSpecies(991234567)
[1] "Candida glabrata (CANGL)"
- Note down the species name and its five-letter abbreviation. Use this species whenever this or future assignments refer to YFO.
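To see why the pick is random but reproducible, it may help to isolate the arithmetic. The following is only a sketch: pickIndex is a hypothetical helper that is not part of the assignment. It repeats the seed, draw and round-up steps of pickSpecies() and returns just the list index. The last lines show one possible way to separate the species name from its five-letter code.

```r
# Hypothetical helper (not part of the assignment): same arithmetic as
# pickSpecies(), but it returns only the index into a list of n elements.
pickIndex <- function(seed, n) {
    set.seed(seed)                 # same seed -> same random number
    ceiling(n * runif(1, 0, 1))    # map the interval (0, 1) onto 1..n
}

first  <- pickIndex(991234567, 39)   # the species list has 39 entries
second <- pickIndex(991234567, 39)
identical(first, second)             # TRUE: the pick is reproducible

# Split a label such as "Candida glabrata (CANGL)" into name and code.
label <- "Candida glabrata (CANGL)"
speciesName <- sub(" \\(.*\\)$", "", label)          # drop the " (CODE)" suffix
speciesCode <- sub("^.*\\((.*)\\)$", "\\1", label)   # keep only the text in brackets
```

Because the seed fully determines the random number, everyone who enters the same ID gets the same species, while different IDs are spread over the whole list.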
Keeping a notebook
Documentation: Lab-Notebook Style
As one part of your assignment, you should submit documentation of your activities. Do this like you would write a lab notebook. This is not intended to be a formal lab report, but a point-form record of your actual activities. Write such documentation as notes to yourself.
For each task:
- Write a header.
- Please use the same header number and text of the assignment and do not change the sequence of tasks given in the assignment. Keep distinct tasks in separate paragraphs.
- State the objective.
- In one brief sentence, restate what this task is to achieve.
- Document the procedure.
- Note what you have done, as concisely as possible. Give enough information so that anyone could reproduce unambiguously what you have done.
- Document your results.
- You can distinguish different types of results:
- Static data does not change over time and it may be sufficient to note a reference to the result. For example, there is no need to copy a GenBank record into your documentation; it is sufficient to note the accession number or the GI number.
- Variable data can change over time. For example the results of a BLAST search depend on the sequences in the database. A list of similar structures may change as new structures get solved. In principle you want to record such data, to be able to reproduce at a later time what your conclusions were based on. But be selective in what you record. For example you should not paste the entire set of results of a BLAST search into your assignment, but only those matches that were important for your conclusions. Indiscriminate pasting of irrelevant information may cause deduction of marks.
- Analysis results
- The results of sequence analyses, alignments etc. in general get recorded in your documentation. Again: be selective. Record what is important.
- Note your conclusions.
- An analysis is not complete unless you conclude something from the results. Are two sequences likely homologues, or not? Does your protein contain a signal-sequence or does it not? Is a binding site conserved, or not? The analysis gives you the data, in your conclusion you provide the interpretation of what the data means in the context of your objective. Sometimes we will ask you to elaborate on an analysis and conclusion. But this does not mean that when we do not ask, you don't need to interpret your data.
Preparation of images
- Don't paste uncompressed screendumps into your assignment. Save images separately in a compressed file format. Then use the Insert → Picture → From File ... function of MSWord to insert the image into your file.
- Image types.
- In principle, images can be stored uncompressed as .tiff or .bmp, or compressed as .gif or .jpg or .png.
- .gif is useful for images with large, monochrome areas and sharp, high-contrast edges due to the LZW compression algorithm it uses.
- .jpg or .jpeg is preferred for images with shades and halftones such as the structure views required in the course assignments; it has excellent application support and is the most versatile general purpose image file format currently in use.
- .tiff or .tif is preferred to archive master copies of images in a lossless fashion; use LZW compression for TIFF files if your system/application supports it.
- The .png format is an open source alternative for lossless, compressed images. Application support is growing but still variable.
- .bmp is not preferred for really anything; it is bloated in its (default) uncompressed form and primarily used only because it is simple to code.
- Image dimensions and resolution
- Stereo images should have equivalent points approximately 6 cm apart. It depends on your monitor how many pixels this corresponds to. The dimensions of an image are stated in pixels (width x height). My notebook screen has a native display resolution of 1440 x 900 pixels on 33.5 x 21 cm. Therefore a 6 cm separation on my notebook corresponds to ~260 pixels. However on my desktop monitor, 260 pixels is 6.7 cm across. For the assignments: adjust your stereo images so they are approximately at the right separation and approximately 500 to 600 pixels across. Also, scale your molecules so they fill the available window.
- Considerations for print (manuscripts etc.) are slightly different: for print output you can specify the output resolution in dpi (dots per inch). A typical print resolution is about 300 dpi: 6 cm separation at 300 dpi is about 700 pixels. Print images should therefore be about three times as large in width and height as screen images.
- Keep the overall size of your submission below 1.5 MB.
- We will deduct marks for larger submissions, or we may reject the submission outright.
- Preparation of stereo views
- When molecular images are required, always submit stereo views, even if this was not explicitly required in the text of the assignment. All required stereo views are to be presented as divergent (parallel, side-by-side) stereo frames (left eye's view in the left frame), even if you use cross-eyed views for yourself (three-panel views are acceptable).
- Keep your images uncluttered and expressive
- Turn off the axes if they are not needed and scale the molecular model to fill the available space of your image well. Orient views so they illustrate a point you are trying to make. Emphasize residues that you are writing about with a contrasting coloring scheme. Add labels, where residue identities are not otherwise obvious. Turn off side-chains for residues that are not important.
- If you have technical difficulties, post your questions to the list and/or contact me.
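The pixel arithmetic for screen and print images given above can be checked with a few lines of R. This is only a sketch: cmToPixels is a name made up for this example, and the 260 pixel and 6.7 cm values are simply the figures quoted above for two particular displays. Your own monitor will differ.

```r
# Hypothetical helper: convert a length in cm to pixels at a given
# resolution in dots per inch (1 inch = 2.54 cm).
cmToPixels <- function(cm, dpi) {
    cm / 2.54 * dpi
}

printPx  <- cmToPixels(6, 300)   # 6 cm at 300 dpi: about 709 pixels
screenPx <- 260                  # 6 cm on the notebook screen (value from the text)
ratio    <- printPx / screenPx   # about 2.7, i.e. roughly three times larger

# Effective resolution of the desktop monitor, where 260 pixels span 6.7 cm:
desktopDpi <- 260 / 6.7 * 2.54   # a little under 100 dpi
```

To adapt this to your own display, measure the width of the screen in cm, divide the horizontal pixel count by that width, and multiply by 6 to get the separation in pixels.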
Sequence search
Key databases
Entrez and the NCBI (1 mark)
The NCBI administers some of the world's most important databases, such as GenBank. In this section you should
- Explore the NCBI Web site, familiarize yourself with its key databases and explore the resources to become confident that you will find information that you are looking for.
- Follow a protein's annotations into PubMed and familiarize yourself with PubMed's query syntax.
- Explore the Entrez search page, and learn how to limit queries and restrict searches.
- Access the NCBI website at http://www.ncbi.nlm.nih.gov/ . Look for the site-map and browse the contents of this large site; find which databases and services are hosted here. Expect to spend at least half an hour to familiarize yourself with the site.
- Access the Map viewer (under the Genomes section of the Databases division). Click on the link under Saccharomyces cerevisiae (Build 2.1) for a whole genome view, then click on the icon for chromosome IV for a more detailed view. Enter the region between 340,000 and 380,000 in the "Region shown" fields on the left. How many genes does this region contain? How many of these are protein genes?
- Click on MBP1 to follow the link to its Entrez Gene page. Study the contents of the page. If you are not clear what the sections show you, click on one of the question marks. If you are still not clear, ask on our mailing list.
- Follow the link to PubMed for this gene. You should find (at least) 27 publications. Click on the History tab to find the index of the query that got you here (e.g. "#4"). Now search for those papers in your query that were published in 2008: enter #4 AND 2008[DP] into the search field and click "Go". Make yourself familiar with the Search field descriptions and tags (in particular [DP], [AU], [TI], and [TA]), how you use the History to combine searches, and the use of AND, OR, NOT and brackets.
- Back at the MapViewer page, click on pr in the same row as the MBP1 gene to find a list of GenPept (protein) records for this gene. Follow the link to the RefSeq record for this protein: NP_010227. This is a flat-file record for the Mbp1 protein. Study the fields and the format. Then use the "Display" option in the header to show this protein sequence in FASTA format; choose "send to ... Text" to get only the FASTA format. Make sure you understand the difference between GenBank/GenPept and RefSeq, and between GI number, accession and locus (refer to the lecture slides as soon as they are posted).
- In the header bar of the MapViewer, click on the link to Entrez. Enter mbp1 into the search field of the Entrez page and click "GO".
- Increase the relevance of returned items by restricting your search to a particular organism. Access and read the Help pages for Entrez and make sure you understand how to use limits and how to search in search field indexes. You will already have encountered similar concepts when you visited PubMed.
- Enter: mbp1 AND "saccharomyces cerevisiae"[organism] into the Entrez search field and click "GO". Click on the CoreNucleotide link of the results.
- The RefSeq record listed in the results contains the entire yeast chromosome IV (1.5 Mbp), which is probably far more than you want to explore. The result is correct, since mbp1 is one of the 787 genes annotated on that chromosome, but it is perhaps not what we had in mind when we queried for a nucleotide sequence of the mbp1 gene. Check the results for a different record that contains only the mbp1 gene's (full-length) nucleotide sequence. There are (as of this writing) two such records. Explore either one of the two; these are nucleotide sequences in the GenBank flat file format.
- Document your activities in point form.
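The query syntax used in the steps above (a term, a field tag in square brackets, and Boolean operators in capitals) can be put together programmatically. The following R sketch only builds the two query strings from the text; it does not contact the NCBI. tagTerm is a hypothetical helper written for this example, not an NCBI function.

```r
# Hypothetical helper: attach an Entrez field tag such as [DP] or
# [organism] to a search term.
tagTerm <- function(term, field) {
    # quote multi-word terms so that they are searched as a phrase
    if (grepl(" ", term, fixed = TRUE)) {
        term <- paste0('"', term, '"')
    }
    paste0(term, "[", field, "]")
}

# Restrict a search to an organism, as in the Entrez step above:
organismQuery <- paste("mbp1", "AND", tagTerm("saccharomyces cerevisiae", "organism"))

# Combine a History index with a publication date, as in the PubMed step:
yearQuery <- paste("#4", "AND", tagTerm("2008", "DP"))

cat(organismQuery, "\n")
cat(yearQuery, "\n")
```

The printed strings can be pasted into the Entrez or PubMed search field as they are. Note that the History index (#4 here) depends on your own session.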
The EBI (1 mark)
In many ways the European EBI is complementary to the US NCBI. A data-sharing agreement for instance guarantees that the contents of the EMBL Nucleotide Sequence Database, GenBank and the Japanese DDBJ are synchronized on a daily basis. But there are of course also unique and uniquely valuable resources at the EBI. In this part of the assignment
- you should explore the EBI Web site, familiarize yourself with its contents and services and explore the resources to become confident you will find information that you are looking for.
- you should read the 2can tutorial on database browsing and the UniProt knowledgebase.
- you should compare a UniProt record with the corresponding GenPept record and use the Ensembl browser to access a gene report.
- Enter the EBI Website at http://www.ebi.ac.uk/ . Look for the site-map and explore the contents of this site, the databases, the services and its other offerings. Spend some time getting an idea of what is being offered here.
- Visit the 2can education support portal at http://www.ebi.ac.uk/2can/home.html . Explore its offerings, in particular, follow the links Bioinformatics tutorials → Database browsing and read the section on the different interface systems. You have encountered Entrez previously, now find out more about SRS, BioMart and UniProt Search.
- To learn more about the UniProt database: access the UniProt user manual at http://ca.expasy.org/sprot/userman.html and read through sections 1 and 2 of the manual.
- Contrast the contents of a UniProt record with a GenPept record: for example MBP1_YEAST and NP_010227.
- Follow the link to Ensembl, click on Saccharomyces cerevisiae and then on chromosome IV. Access the region from basepair 340000 to 380000; contrast the display with the NCBI MapViewer. Identify the Mbp1 gene and click on it to retrieve its Gene report (under the systematic name: YDL056W). Find your way from this Gene report to the expressed protein sequence and list the steps you have gone through.
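A UniProt entry name such as MBP1_YEAST consists of a protein mnemonic and a species code, joined by an underscore. The species code plays the same role as the five-letter abbreviation you noted for YFO. As a small illustration, the following R sketch takes such a name apart. parseEntryName is a hypothetical helper for this example only, and it assumes that the name contains exactly one underscore.

```r
# Hypothetical helper: split a UniProt entry name of the form
# PROTEIN_SPECIES into its two parts (assumes exactly one underscore).
parseEntryName <- function(entryName) {
    parts <- strsplit(entryName, "_", fixed = TRUE)[[1]]
    list(protein = parts[1], species = parts[2])
}

parsed <- parseEntryName("MBP1_YEAST")
parsed$protein   # "MBP1"
parsed$species   # "YEAST"
```

Going the other way, pasting a protein mnemonic and the species code of YFO together gives a candidate entry name to search for, although there is no guarantee that such an entry exists.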
Sequence retrieval
Cross-reference
Structure search
Visit the RCSB PDB website at http://www.pdb.org/ , explore the database and familiarize yourself with its contents.
- Look for the "Getting started" page and explore the page.
- Explore the links on the "Education" page to see where you might fill in gaps in your knowledge of structural molecular biology, such as the Biological Units tutorial; read up on one or two of the excellent Molecule of the Month articles, such as the TATA binding protein (July 2005).
- From the homepage, search for the yeast Mbp1 protein (by keyword) and explore the information that is available in one of the entries that was retrieved.
Structure retrieval
Visualize in VMD
VMD
Task:
- Access the VMD page.
- Install the program as per the instructions in the section: "Installing VMD".
- In the tutorial section work through
- Part 1 (Introduction), and
- Part 2 (Working with a single molecule).
Stereo vision (1 mark)
Task:
Access the Stereo Vision tutorial and practice viewing molecular structures in stereo.
Practice at least ...
- two times daily,
- for 3-5 minutes each session.
Keep up your practice throughout the course. Stereo viewing will be required in the final exam, but more importantly, it is a wonderful skill that will greatly support any activity of yours related to structural molecular biology. Practice with different molecules and try out different colours and renderings.
Note: do not go through your practice sessions mechanically. If you are not making any progress with stereo vision, contact me so we can help you on the right track.
R
The R statistics environment and programming language is an exceptionally well engineered, free (as in free speech) and free (as in free beer) platform for data manipulation and analysis. A large number of functions is included by default, a very large number of additional, community-generated analysis modules can simply be imported from dedicated sites (e.g. the Bioconductor project for molecular biology data) or via the CRAN network, and whatever function is not available can be easily programmed. The ability to filter and manipulate data to prepare it for analysis is an absolute requirement in research-centric fields such as ours, where the strategies for analysis are constantly shifting and prepackaged solutions become obsolete almost faster than they can be developed. Besides numerical analysis, R has very powerful and flexible functions for plotting graphical output.
R is not a main focus of the course, but an important tool I would like you to pick up "on the side".
Task:
- Access the R tutorial on this site.
- Work through the sections Installation, User interface, and Packages.