BIO Assignment Week 2

Assignment for Week 2
Scenario, Labnotes, R-functions, Databases, Data Modeling

Note! This assignment is currently active. All significant changes will be announced on the mailing list.

Parts labelled as "TBC" are in progress and will be made available as they are being completed.

Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.

The Scenario

I have introduced the concept of "cargo cult science" in class. The "cargo" in Bioinformatics is to understand biology. This includes understanding how things came to be the way they are, and how they work. Both relate to the concept of function of biomolecules, and the systems they contribute to. But "function" is a rather poorly defined concept and exploring ways to make it rigorous and computable will be the major objective of this course. The realm of bioinformatics contains many kingdoms and duchies and shires and hidden glades. To find out how they contribute to the whole, we will proceed on a quest. We will take a relatively well-characterized protein that is part of a relatively well-characterized process, and ask what its function is. We will examine the protein's sequence, its structure, its domain composition, its relationship to and interactions with other proteins, and through that paint a picture of a "system" that it contributes to.

Our quest will revolve around a transcription factor that plays an important role in the regulation of the cell cycle: Mbp1 is a key component of the MBF complex (Mbp1/Swi6) in yeast. This complex regulates gene expression at the crucial G1/S-phase transition of the mitotic cell cycle and has been shown to bind to the regulatory regions of more than a hundred target genes. It is therefore a DNA binding protein that acts as a control switch for a key cellular process.

We will start our quest with information about the Mbp1 protein of Baker's yeast, Saccharomyces cerevisiae, one of the most important model organisms. Baker's yeast is a eukaryote that has been studied genetically and biochemically in great detail for many decades, and it is easily manipulated with high-throughput experimental methods. But each of you will use this information to study not Baker's yeast, but a related organism. You will explore the function of the Mbp1 protein in some other species from the kingdom of fungi, whose genome has been completely sequenced; thus our quest is also an exercise in model-organism reasoning: the transfer of knowledge from one well-studied organism to others.

It's reasonable to hypothesize that such central control machinery is conserved in most if not all fungi. But we don't know. Many of the species that we will be working with have not been characterized in great detail, and some of them are new to our class this year. And while we know a fair bit about Mbp1, we probably don't know very much at all about the related genes in other organisms: whether they exist, whether they have similar functional features and whether they might contribute to the G1/S checkpoint system in a similar way. Thus we might discover things that are new and interesting. This is a quest of discovery.

Here are the steps of the assignment for this week:

We'll need to explore what data is available for the Mbp1 protein.
We'll need to pick a species to adopt for exploration.
We'll need to define what data we want to store and design a data model.

However, before we head off into the Internet: have you thought about how to document such a "quest"? How will you keep notes? Obviously, computational research proceeds with the same best-practice principles as any wet-lab experiment. We have to keep notes, ensure our work is reproducible, and that our conclusions are supported by data. I think it's pretty obvious that paper notes are not very useful for bioinformatics work. Ideally, you should be able to save results, and link to files and Webpages.

Keeping Labnotes

Consider it a part of your assignment to document your activities in electronic form. Here are some applications you might consider. Disclaimer: I don't use any of these myself (yet), except the Wiki of course.

Evernote - a web hosted, automatically syncing e-notebook.
Nevernote - the Open Source alternative to Evernote.
Google Keep - if you have a Gmail account, you can simply log in here. Grid-based. Seems a bit awkward for longer notes. But of course you can also use Google Docs.
Microsoft OneNote - this sounds interesting and even though I have had my share of problems with Microsoft products, I'll probably give this a try. Syncing across platforms, being able to format contents and organize it sounds great.
The Student Wiki - of course. You can keep your course notes with your User pages.

Are you aware of any other solutions? Let us know!

Keeping such a journal will be helpful, because the assignments are integrated over the entire term, and later assignments will make use of earlier results. But it is also excellent practice for "real" research. The details below are written from a Wiki perspective but are generally applicable.

Remember that you are writing a lab notebook, not a formal lab report: a point-form record of your actual activities. Write such documentation as notes to your (future) self.

Create a lab-notes page as a subpage of your User space on the Student Wiki.

For each task:

Write a header and give it a unique number.

This is useful so you can refer to the header number in later text. You should "hard-code" the number and not use the Wiki's automatic section numbering scheme, since the numbers should be stable over time and not change when you add or delete a section. It may be useful to add new contents at the top, so you don't have to scroll to the bottom of the page every time you add new material. Your notes do not have to be in strict chronological order, as they would be in a paper notebook. It may be advantageous to give different subprojects their own page, or at least to group them on one page. Just remember that things that are on the same page are easy to find.

State the objective.

In one brief sentence, restate what your task is supposed to achieve.

Document the procedure.

Note what you have done, as concisely as possible but with sufficient detail. I am often asked: "What is sufficient detail?" The answer is easy: detailed enough so that someone can reproduce what you have done. In practice that someone will often be you, yourself, in the future. I hope that you won't be constantly cursing your past self because of omissions!

Document your results.

You can distinguish different types of results:

- Static data does not change over time and it may be sufficient to note a reference to the result. For example, there is no need to copy a GenBank record into your documentation; it is sufficient to note the accession number or the GI number, or better, to link to it.
- Variable data can change over time. For example the results of a BLAST search depend on the sequences in the database. A list of similar structures may change as new structures get solved. In principle you want to record such data, to be able to reproduce at a later time what your conclusions were based on. But be selective in what you record. For example you should not paste the entire set of results of a BLAST search into your document, but only those matches that were important for your conclusions. Indiscriminate pasting of irrelevant information will make your notes unusable.
- Analysis results, such as the results of sequence analyses, alignments etc., in general get recorded in your documentation. Again: be selective. Record what is important.

Note your conclusions.

An analysis is not complete unless you conclude something from the results. (Remember what we said about "Cargo Cult Science". If there is no conclusion, your activities are quite pointless.) Are two sequences likely homologues, or not? Does your protein contain a signal-sequence or does it not? Is a binding site conserved, or not? The analysis provides the data. In your conclusion you provide the interpretation of what the data means in the context of your objective. Were you expecting a signal-sequence but there isn't one? What could that mean? Sometimes your assignment task in this course will ask you to elaborate on an analysis and conclusion. But this does not mean that when I don't explicitly mention it, you can skip the interpretation.

Add cross-references.

Cross-references to other information become very valuable as your documentation grows. It's easy to see how to format a link to a section of your Wiki page: just look at the link under the Table of Contents at the top. But you can also place "anchors" for linking anywhere on an HTML page. Use <span id="{some-label}"></span> for the anchor, and append #{some-label} to the page URL. Try this here: (http://steipe.biochemistry.utoronto.ca/abc/Assignment_2#tf).
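
If you like to start each entry from the same skeleton, the steps above can be sketched in a few lines of Python. This is only an illustration, not a required format: the function name `note_skeleton`, the label format `note-007`, the `==` heading markup and the example URL are all assumptions made for this sketch.

```python
def note_skeleton(number, title, page_url):
    """Return (wikitext, link) for one lab-note entry.

    number   -- hard-coded, stable task number (never renumbered)
    title    -- short task title
    page_url -- URL of the page that holds your notes (hypothetical here)
    """
    label = f"note-{number:03d}"              # assumed label format
    anchor = f'<span id="{label}"></span>'    # anchor placed just above the header
    lines = [
        anchor,
        f"== {number}. {title} ==",           # assumed MediaWiki-style heading
        "Objective: ",
        "Procedure: ",
        "Results: ",
        "Conclusions: ",
        "Cross-references: ",
    ]
    link = f"{page_url}#{label}"              # page URL plus "#" plus the label
    return "\n".join(lines), link


if __name__ == "__main__":
    # The URL below is a placeholder, not a real notes page.
    text, link = note_skeleton(7, "Mbp1 at SGD", "http://example.org/wiki/Notes")
    print(text)
    print(link)
```

Because the number is passed in explicitly rather than generated, the label and any links that point to it stay valid when you add or delete other entries.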

Use discretion when uploading images

I have enabled image uploading with some reservations; we'll see how it goes. You must not:

upload images that are irrelevant for this course;
upload copyrighted images;
upload any images that are larger than 500 kb. I may silently remove large images when I encounter them.

Moreover, understand that any of your uploaded images may be deleted at any time. If they are valuable to you, keep backups on your own machine.

Prepare your images well

Don't upload uncompressed screen dumps. Save images in a compressed file format on your own computer. Then use the Special:Upload link in the left-hand menu to upload images. The Wiki will only accept .jpeg or .png images.
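
If you want to check a file before uploading it, a small Python sketch like the following could help. It only tests the two technical rules stated above (file type and size); the function name `upload_problems` is made up for this example, and reading "500 kb" as 500,000 bytes is an assumption.

```python
import os

ALLOWED_EXTENSIONS = {".jpeg", ".png"}   # the file types the Wiki accepts, as stated above
MAX_BYTES = 500 * 1000                   # assumption: "500 kb" read as 500,000 bytes


def upload_problems(path):
    """Return a list of reasons why the file at `path` should not be uploaded."""
    problems = []
    ext = os.path.splitext(path)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        problems.append(f"file type '{ext}' is not .jpeg or .png")
    size = os.path.getsize(path)
    if size > MAX_BYTES:
        problems.append(f"file is {size} bytes, the limit is {MAX_BYTES}")
    return problems
```

An empty list means the file passes both checks. Whether the image is relevant and free of copyright restrictions is still for you to judge.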

Use the correct image types.

In principle, images can be stored uncompressed as .tiff or .bmp, or compressed as .gif, .jpg or .png.

- .gif is useful for images with large, monochrome areas and sharp, high-contrast edges, because the LZW compression algorithm it uses works especially well on such data.
- .jpg (or .jpeg) is preferred for images with shades and halftones, such as the structure views you should prepare for several assignments. JPEG has excellent application support and is the most versatile general-purpose image file format currently in use.
- .tiff (or .tif) is preferred to archive master copies of images in a lossless fashion. Use LZW compression for TIFF files if your system or application supports it.
- .png is an open source alternative for lossless, compressed images. Application support is growing but still variable.
- .bmp is not preferred for really anything. It is bloated in its (default) uncompressed form and is primarily used only because it is simple to code and ubiquitous on Windows computers.

Image dimensions and resolution: Stereo images should have equivalent points approximately 6 cm apart. It depends on your monitor how many pixels this corresponds to. The dimensions of an image are stated in pixels (width x height). My notebook screen has a native display resolution of 1440 x 900 pixels on 33.5 x 21 cm. Therefore a 6 cm separation on my notebook corresponds to approximately 260 pixels. However on my desktop monitor, 260 pixels is 6.7 cm across. And on a high-resolution iPad display, at 227 ppi (pixels per inch), 260 pixels are just 2.9 cm across. For the assignments: adjust your stereo images so they are approximately at the right separation and are approximately 500 to 600 pixels wide. Also, scale your molecules so they fill the available window and, if you have depth cueing enabled, move them close to the front clipping plane so the molecule is not just a dim blob, lost in murky shadows.

Considerations for print (manuscripts etc.) are slightly different: for print output you can specify the output resolution in dpi (dots per inch). A typical print resolution is about 300 dpi: 6 cm separation at 300dpi is about 700 pixels. Print images should therefore be about three times as large in width and height as screen images.
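
The arithmetic behind these numbers is easy to repeat for your own display. The following Python sketch is one way to do it; the function names are made up for this example, and the screen calculation uses the 900 pixel, 21 cm screen height quoted above.

```python
CM_PER_INCH = 2.54


def pixels_for_screen(separation_cm, screen_px, screen_cm):
    """Pixels spanning `separation_cm` on a screen that shows
    `screen_px` pixels along a side that is `screen_cm` long."""
    return separation_cm * screen_px / screen_cm


def pixels_for_density(separation_cm, per_inch):
    """Pixels (or dots) spanning `separation_cm` at a given ppi or dpi."""
    return separation_cm / CM_PER_INCH * per_inch


def cm_on_display(pixels, ppi):
    """Physical width in cm of `pixels` on a display with `ppi` pixels per inch."""
    return pixels / ppi * CM_PER_INCH


if __name__ == "__main__":
    print(f"{pixels_for_screen(6, 900, 21):.0f}")   # 257: notebook screen, about 260 pixels
    print(f"{pixels_for_density(6, 300):.0f}")      # 709: print at 300 dpi, about 700 pixels
    print(f"{cm_on_display(260, 227):.1f}")         # 2.9: cm on a 227 ppi display
```

To use it, measure the visible height (or width) of your own screen in cm and pass it together with the pixel count along the same side.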

Preparation of stereo views: When assignments ask you to create molecular images, always create stereo views.

Keep your images uncluttered and expressive: Scale the molecular model to fill the available space of your image well. Orient views so they illustrate a point you are trying to make. Emphasize residues that you are writing about with a contrasting colouring scheme. Add labels, where residue identities are not otherwise obvious. Turn off side-chains for residues that are not important. The more you practice these small details, the more efficient you will become in the use of your tools.

If you have technical difficulties, post your questions to the list and/or contact me.

Data Sources

SGD - a Yeast Model Organism Database

Yeast happens to have a very well maintained model organism database - a Web resource dedicated to Saccharomyces cerevisiae. Where such resources are available, they are very useful for the community. For the general case however, we need to work with one of the large, general data providers - the NCBI and the EBI. But in order to get a sense of the type of data that is available, let's visit the SGD database first.

Task:
Access the information page on Mbp1 at the Saccharomyces Genome Database.

Browse through the Summary page and note the available information. You should see:
- information about the gene and the protein;
- information about its roles in the cell, curated at the Gene Ontology database;
- information about knock-out phenotypes (amazing: would you have imagined that this is a non-essential gene?);
- information about protein-protein interactions;
- regulation and expression;
- a curators' summary of our understanding of the protein (mandatory reading);
- key references.
Access the Protein tab and note the much more detailed information:
- domains and their classification;
- sequence;
- shared domains;
- and much more...

You will notice that some of this information relates to the molecule itself, and some of it relates to its relationship with other molecules. Some of it is stored at SGD, and some of it is cross-referenced from other databases. And we have textual data, numeric data, and images.

How would you store such data to use it in your project? We will work on this question at the end of the assignment.

If we were working on yeast, most of the data we need would be right here: curated, kept current and consistent, referenced to the literature and ready to use. But you'll be working on a different species, and we'll explore the much, much larger databases at the NCBI for this. The upside is that most of this kind of information is also available for your species. The downside is that we'll have to integrate information from many different sources "by hand".

NCBI

TBC

Choosing YFO (Your Favourite Organism)

TBC

Data modelling

TBC

That is all.

Links and resources

Footnotes and references

Ask, if things don't work for you!

If anything about the assignment is not clear to you, please ask on the mailing list. You can be certain that others will have had similar problems. Success comes from joining the conversation.

Do consider how to ask your questions so that a meaningful answer is possible:
- How to create a Minimal, Complete, and Verifiable example on stackoverflow and ...
- How to make a great R reproducible example are required reading.
