BIO Assignment Week 2
Assignment for Week 2
Scenario, Labnotes, R-functions, Databases, Data Modeling
Note! This assignment is currently active. All significant changes will be announced on the mailing list.
- Parts labelled as "TBC" are in progress and will be made available as they are being completed.
Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.
The Scenario
I have introduced the concept of "cargo cult science" in class. The "cargo" in Bioinformatics is to understand biology. This includes understanding how things came to be the way they are, and how they work. Both relate to the concept of function of biomolecules, and the systems they contribute to. But "function" is a rather poorly defined concept and exploring ways to make it rigorous and computable will be the major objective of this course. The realm of bioinformatics contains many kingdoms and duchies and shires and hidden glades. To find out how they contribute to the whole, we will proceed on a quest. We will take a relatively well-characterized protein that is part of a relatively well-characterized process, and ask what its function is. We will examine the protein's sequence, its structure, its domain composition, its relationship to and interactions with other proteins, and through that paint a picture of a "system" that it contributes to.
Our quest will revolve around a transcription factor that plays an important role in the regulation of the cell cycle: Mbp1 is a key component of the MBF complex (Mbp1/Swi6) in yeast. This complex regulates gene expression at the crucial G1/S-phase transition of the mitotic cell cycle and has been shown to bind to the regulatory regions of more than a hundred target genes. It is therefore a DNA binding protein that acts as a control switch for a key cellular process.
We will start our quest with information about the Mbp1 protein of Baker's yeast, Saccharomyces cerevisiae, one of the most important model organisms. Baker's yeast is a eukaryote that has been studied genetically and biochemically in great detail for many decades, and it is easily manipulated with high-throughput experimental methods. But each of you will use this information to study not Baker's yeast, but a related organism. You will explore the function of the Mbp1 protein in some other species from the kingdom of fungi, whose genome has been completely sequenced; thus our quest is also an exercise in model-organism reasoning: the transfer of knowledge from one, well-studied organism to others.
It's reasonable to hypothesize that such central control machinery is conserved in most if not all fungi. But we don't know. Many of the species that we will be working with have not been characterized in great detail, and some of them are new to our class this year. And while we know a fair bit about Mbp1, we probably don't know very much at all about the related genes in other organisms: whether they exist, whether they have similar functional features and whether they might contribute to the G1/S checkpoint system in a similar way. Thus we might discover things that are new and interesting. This is a quest of discovery.
Here are the steps of the assignment for this week:
- We'll need to explore what data is available for the Mbp1 protein.
- We'll need to pick a species to adopt for exploration.
- We'll need to define what data we want to store and design a datamodel.
However, before we head off into the Internet: have you thought about how to document such a "quest"? How will you keep notes? Obviously, computational research proceeds with the same best-practice principles as any wet-lab experiment. We have to keep notes, ensure our work is reproducible, and that our conclusions are supported by data. I think it's pretty obvious that paper notes are not very useful for bioinformatics work. Ideally, you should be able to save results, and link to files and Webpages.
Keeping Labnotes
Consider it a part of your assignment to document your activities in electronic form. Here are some applications you might think of - but (!) disclaimer, I myself don't use any of these (yet) (except the Wiki of course).
- Evernote - a web hosted, automatically syncing e-notebook.
- Nevernote - the Open Source alternative to Evernote.
- Google Keep - if you have a Gmail account, you can simply log in here. Grid-based. Seems a bit awkward for longer notes. But of course you can also use Google Docs.
- Microsoft OneNote - this sounds interesting and even though I have had my share of problems with Microsoft products, I'll probably give this a try. Syncing across platforms, being able to format contents and organize it sounds great.
- The Student Wiki - of course. You can keep your course notes with your User pages.
Are you aware of any other solutions? Let us know!
Keeping such a journal will be helpful, because the assignments are integrated over the entire term, and later assignments will make use of earlier results. But it is also excellent practice for "real" research.
Data Sources
SGD - a Yeast Model Organism Database
Yeast happens to have a very well maintained model organism database - a Web resource dedicated to Saccharomyces cerevisiae. Where such resources are available, they are very useful for the community. For the general case however, we need to work with one of the large, general data providers - the NCBI and the EBI. But in order to get a sense of the type of data that is available, let's visit the SGD database first.
Task:
Access the information page on Mbp1 at the Saccharomyces Genome Database.
- Browse through the Summary page and note the available information: you should see:
- information about the gene and the protein;
- Information about its roles in the cell, curated at the Gene Ontology database;
- Information about knock-out phenotypes; (Amazing. Would you have imagined that this is a non-essential gene?)
- Information about protein-protein interactions;
- Regulation and expression;
- A curators' summary of our understanding of the protein. Mandatory reading.
- And key references.
- Access the Protein tab and note the much more detailed information.
- Domains and their classification;
- Sequence;
- Shared domains;
- and much more...
You will notice that some of this information relates to the molecule itself, and some of it relates to its relationship with other molecules. Some of it is stored at SGD, and some of it is cross-referenced from other databases. And we have textual data, numeric data, and images.
How would you store such data to use it in your project? We will work on this question at the end of the assignment.
If we were working on yeast, most of the data we need would be right here: curated, kept current and consistent, referenced to the literature and ready to use. But you'll be working on a different species and we'll explore the much, much larger databases at the NCBI for this. The upside is that most information of this kind is available for your species. The downside is that we'll have to integrate information from many different sources "by hand".
NCBI databases
The NCBI (National Center for Biotechnology Information) is the largest international provider of data for genomics and molecular biology. With its annual budget of several hundred million dollars, it organizes a challenging program of data management at the largest scale, it makes its data freely and openly available over the Internet, worldwide, and it runs significant in-house research projects.
Let us explore some of the offerings of the NCBI that can contribute to our objective of studying a particular gene in an organism of interest.
Entrez
Task:
Remember to document your activities.
- Access the NCBI website at http://www.ncbi.nlm.nih.gov/ [1]
- In the search bar, enter
mbp1
and click Search.
- On the resulting page, look for the Protein section and click on the link. What do you find?
The result page of your search in "All Databases" is the "Global Query Result Page" of the Entrez system. If you follow the "Protein" link, you get taken to the more than 450 sequences in the NCBI Protein database that contain the keyword "mbp1". But when you look more closely at the results, you see that the result is quite non-specific: searching only by keyword retrieves an Arabidopsis protein, bacterial proteins, a Saccharomyces protein (perhaps one that we are actually interested in), Maltose Binding Proteins - and much more. There must be a more specific way to search, and indeed there is. Time to read up on the Entrez system.
Task:
- Navigate to the Entrez Help Page and read about the Entrez system, especially about:
- Boolean operators,
- wildcards,
- limits, and
- filters.
- You should minimally understand:
- How to search by keyword;
- How to search by gene or protein name;
- How to restrict a search to a particular organism.
Don't skip this part. You don't need to know the options by heart, but you should know they exist and how to find them. It would be great to have a synopsis of the important fields for reference, wouldn't it? Why don't you go and create one: I have put a template page on the Student Wiki (A synopsis of Entrez codes). Editors welcome!
Keyword and organism searches are pretty universal, but apart from that, each NCBI database has its own set of specific fields. You can access the keywords via the Advanced Search interface of any of the database pages.
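If you want to keep your search terms in your notes in a reproducible form, you could build them in R. The following is only a minimal sketch: entrezTerm() and entrezQuery() are hypothetical helper names, not part of any package, and the two field names are the ones used in the Protein Sequence task below. The sketch only assembles the text of a query; it does not contact the NCBI.

```r
# Wrap a search value in an Entrez field qualifier, e.g. Mbp1[protein name].
# (Hypothetical helper, defined here for illustration only.)
entrezTerm <- function(value, field) {
  # Values containing whitespace are quoted so they are searched as a phrase.
  if (grepl("[[:space:]]", value)) {
    value <- paste0('"', value, '"')
  }
  paste0(value, "[", field, "]")
}

# Join several terms with one Boolean operator (AND, OR or NOT).
entrezQuery <- function(terms, op = "AND") {
  stopifnot(op %in% c("AND", "OR", "NOT"))
  paste(terms, collapse = paste0(" ", op, " "))
}

query <- entrezQuery(c(
  entrezTerm("Mbp1", "protein name"),
  entrezTerm("Saccharomyces cerevisiae", "organism")
))
cat(query, "\n")
```

The resulting string can be pasted into the search field, and a copy in your journal documents exactly what you searched for.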
Protein Sequence
Task:
With this knowledge we can restrict the search to proteins called "Mbp1" that occur in Baker's Yeast. Return to the Global Search page and in the search field, type:
Mbp1[protein name] AND "Saccharomyces cerevisiae"[organism]
This should find one and only one protein. Follow the link into the protein database: since this is only one record, the link takes you directly to the result: CAA98618.1, a data record in GenBank flat file format[2]. The database identifier CAA98618.1 tells you that this is a record in the GenPept database. There are actually several identical versions of this sequence in the NCBI's holdings. A link to "Identical Proteins" near the top of the record shows you what these are:
Some of the sequences represent duplicate entries of the same gene (Mbp1) in the same strain (S288c) of the same species (S. cerevisiae). In particular:
- there are three GenBank records (CAA52271.1, CAA98618.1 and DAA11800.1); these are archival entries, submitted by independent yeast genome research projects;
- there is an entry in the RefSeq database: NP_010227.1. This is the preferred entry for us to work with. RefSeq is a curated, non-redundant database which solves a number of problems of archival databases. You can recognize RefSeq identifiers: they always look like NP_12345.1, NM_12345.1, XP_12345.1, NC_12345.1 etc. The prefix reflects whether the sequence is protein, mRNA or genomic, and whether it was inferred or obtained through experimental evidence. The RefSeq ID NP_010227.1 actually appears twice, once linked to its genomic sequence, and once to its mRNA.
- there is a SwissProt sequence P39678.1[3]. This link is kind of a big deal. It's a cross-reference into UniProt, the huge protein sequence database maintained by the EBI (European Bioinformatics Institute), which is the NCBI's counterpart in Europe. SwissProt entries have the highest annotation standard overall and are expertly curated. Many Webservices that we will encounter work with UniProt IDs (e.g. P39678.1), rather than RefSeq IDs. Until recently, the two databases did not link to each other, mostly for reasons of funding politics. It's great to see that this divide has now been overcome.
- Finally, there are four entries of the same sequence in different yeast strains. These don't have to be identical, they just happen to be. Sometimes we find identical sequences in quite divergent species. Therefore I would not actually consider EIW11153.1, AJU86440.1, AJU58508.1, and AJU61971.1 to be identical proteins, although they have the same sequence.
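The RefSeq prefix convention mentioned in the list above is easy to check in code. Here is a minimal sketch, assuming only the four prefixes named in this section; RefSeq defines more prefixes than these, and both refSeqPrefixes and refSeqType() are hypothetical names made up for this illustration.

```r
# Meaning of the four RefSeq prefixes mentioned in this section.
# (Assumption: this small table is illustrative, not a complete list.)
refSeqPrefixes <- c(
  NP = "protein",
  XP = "protein (predicted model)",
  NM = "mRNA",
  NC = "genomic (complete molecule)"
)

# Return the sequence type for a RefSeq-style identifier, or NA otherwise.
refSeqType <- function(id) {
  # RefSeq accessions start with two capital letters and an underscore,
  # followed by digits and an optional ".version" suffix.
  if (!grepl("^[A-Z]{2}_[0-9]+(\\.[0-9]+)?$", id)) {
    return(NA_character_)
  }
  prefix <- substr(id, 1, 2)
  unname(refSeqPrefixes[prefix])   # NA for prefixes not in the table
}

refSeqType("NP_010227.1")   # a RefSeq protein identifier
refSeqType("CAA98618.1")    # a GenBank identifier: not RefSeq, gives NA
```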
Note all the .1 suffixes of the sequence identifiers. These are version numbers. Two observations:
- It's great that version numbers are now used throughout the NCBI database. This is good database engineering practice because it's really important for reproducible research that updates to database records are possible, but recognizable. When working with data you always must provide for the possibility of updates, and manage the changes transparently and explicitly. Proper versioning should be a part of all datamodels.
- When searching, or for general use, you should omit the version number, i.e. use NP_010227 or P39678, not NP_010227.1 or P39678.1. This way the database system will resolve the identifier to the most current, highest version number (unless you want the older one, of course).
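If you collect identifiers in R, separating the accession from its version takes one regular expression. This is a sketch under the assumption that the version is always a trailing dot followed by digits, as in the identifiers above; stripVersion() and getVersion() are hypothetical helper names.

```r
# Remove a trailing ".<digits>" version suffix from sequence identifiers.
stripVersion <- function(ids) {
  sub("\\.[0-9]+$", "", ids)
}

# Extract the version as an integer; NA where an identifier has no version.
getVersion <- function(ids) {
  hasVersion <- grepl("\\.[0-9]+$", ids)
  v <- rep(NA_integer_, length(ids))
  v[hasVersion] <- as.integer(sub("^.*\\.", "", ids[hasVersion]))
  v
}

ids <- c("NP_010227.1", "P39678.1", "NP_010227")
stripVersion(ids)
getVersion(ids)
```

Storing accession and version separately lets you search with the unversioned form while still recording which version you actually worked with.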
Task:
- Note down the RefSeq ID and the UniProt (SwissProt) ID in your journal.
- Follow the link to the RefSeq entry NP_010227.1.
- Explore the page and follow these links (note the contents in your journal):
- Under "Analyze this Sequence": Identify Conserved Domains
- Under "Protein 3D Structure": See all 3 structures...
- Under "Pathways for the MBP1 gene": Cell cycle - yeast
- Under "Related information": Proteins with Similar Sequence
As we see, this is a good start page to explore all kinds of databases at the NCBI via cross-references.
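One possible way to note down the identifiers in a form you can compute with later is a small R data frame. This is only a sketch of one option, not the data model of the course (that part of the assignment is still to come); the column names are assumptions made up for this example, and the values are the ones found above.

```r
# A one-row table for the yeast Mbp1 protein.
# (Column names are hypothetical; choose whatever suits your own notes.)
proteins <- data.frame(
  name      = "Mbp1",
  refSeqID  = "NP_010227",   # accession without version
  refSeqVer = 1L,            # version stored separately
  uniProtID = "P39678",
  species   = "Saccharomyces cerevisiae",
  stringsAsFactors = FALSE
)

# Rebuild the versioned identifier when you need to cite the exact record.
versionedID <- paste0(proteins$refSeqID, ".", proteins$refSeqVer)
print(proteins)
print(versionedID)
```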
PubMed
Arguably one of the most important databases in the life sciences is PubMed and this is a good time to look at PubMed in a bit more detail.
Task:
- Return to the MBP1 RefSeq record.
- Find the PubMed links under Related information in the right-hand margin and explore them.
- The first one (PubMed) will take you to records that cite the sequence record;
- The second one (PubMed (RefSeq)) will take you to articles that relate to the Mbp1 gene or protein;
- The third one (PubMed (Weighted)) applies a weighting algorithm to find broadly relevant information - an example of literature data mining. PubMed(weighted) appears to give a pretty good overview of systems-biology type, cross-sectional and functional information.
But none of these searches finds all Mbp1-related literature.
- On any of the PubMed pages open the Advanced query page and study the keywords that apply to PubMed searches. These are actually quite important and useful to remember. Make yourself familiar with the section on Search field descriptions and tags in the PubMed help document (in particular [DP], [AU], [TI], and [TA]), how you use the History to combine searches, and the use of AND, OR, NOT and brackets. Understand how you can restrict a search to reviews only, and what the link to Related citations... is useful for.
- Now find publications with Mbp1 in the title. In the result list, follow the links for the two Biochemistry papers, by Taylor et al. (2000) and by Deleeuw et al. (2008). Download the PDFs; we will need them later.
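The field tags from the task above combine in the same way as the Entrez fields. As a sketch, here is how the searches could be written down as strings in R. pubmedTerm() is a hypothetical helper, and the sketch assumes the usual reading of the tags ([TI] title, [AU] author, [DP] date of publication, [TA] journal title abbreviation). It only builds the query text; whether a given query finds what you expect is for you to check in PubMed.

```r
# Attach a PubMed search field tag to a value, e.g. Mbp1[TI].
# (Hypothetical helper, defined here for illustration only.)
pubmedTerm <- function(value, tag) {
  paste0(value, "[", tag, "]")
}

# Publications with Mbp1 in the title.
titleOnly <- pubmedTerm("Mbp1", "TI")

# Narrowed to one author and publication year.
taylor2000 <- paste(pubmedTerm("Mbp1", "TI"),
                    pubmedTerm("Taylor", "AU"),
                    pubmedTerm("2000", "DP"),
                    sep = " AND ")

# Narrowed to one journal.
inBiochemistry <- paste(titleOnly, pubmedTerm("Biochemistry", "TA"),
                        sep = " AND ")

cat(titleOnly, taylor2000, inBiochemistry, sep = "\n")
```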
Now, we were actually trying to find related proteins in a different species. Our next task is therefore to decide what that species should be.
Choosing YFO (Your Favourite Organism)
In this section we create a lottery to assign species at (pseudo) random to students. We'll try the following procedure.
- First, I create a list of suitable species.
- Then, we put this list into the body of an R function.
- The function picks one of the species at random - but to make sure this process is reproducible, we'll set a seed for the random number generator. Obviously, everyone has to use a different seed, or else everyone would end up getting the same species assigned.
- Thus we'll use your Student Number as the seed. This is an integer, so it can be used, and it's unique to each of you. The choice is then random, reproducible and unique.
You may notice that this process does not guarantee that everyone gets a different species, and that all species are chosen at least once. I don't think doing that is possible in a "stateless" way (i.e. I don't want to have to remember who chose what species), given that I don't know all of your student numbers. But if anyone can think of a better solution, that would be neat.
Is it possible that all of you end up working on the same species anyway? Indeed. That's the problem with randomness. But it is not very likely.
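To put a number on "not very likely", here is a back-of-the-envelope sketch. It assumes that every student's pick is independent and uniform over the species list, which is only an approximation of what a seeded generator does, and it uses a hypothetical class size of 30. The list in the function below has 102 entries.

```r
nSpecies  <- 102   # number of entries in the species list
nStudents <- 30    # hypothetical class size, an assumption for illustration

# Probability that n students all draw the same one of k species:
# k possible shared species, each with probability (1/k)^n.
pAllSame <- function(n, k) {
  k^(1 - n)
}

# Probability that n students all draw different species
# (the "birthday problem").
pAllDistinct <- function(n, k) {
  if (n > k) {
    return(0)
  }
  prod((k - (seq_len(n) - 1)) / k)
}

# Expected number of species that nobody draws.
expectedUnpicked <- function(n, k) {
  k * (1 - 1 / k)^n
}

pAllSame(nStudents, nSpecies)
pAllDistinct(nStudents, nSpecies)
expectedUnpicked(nStudents, nSpecies)
```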
What about the "suitable species" though? Where do they come from? For the purposes of the course "quest", we need species
- that actually have transcription factors that are related to Mbp1;
- whose genomes have been sequenced; and
- for which the sequences have been deposited in the RefSeq database, NCBI's curated, non-redundant sequence collection.
Task:
- Access the R tutorial and work through the section on Writing your own functions. It is short, and crucial for your work.
Here is R code to assign the species:
Task:
- Read, try to understand and then execute the following R-code.
pickSpecies <- function(ID) {
  # this function randomly picks a fungal species
  # from a list. It is seeded by a student ID. Therefore
  # the pick is random, but reproducible.
  # first, define a list of species:
  Species <- c(
    "Agaricus bisporus (AGABI)",
    "Arthrobotrys oligospora (ARTOL)",
    "Arthroderma benhamiae (ARTBE)",
    "Aureobasidium subglaciale (AURSU)",
    "Auricularia delicata (AURDE)",
    "Batrachochytrium dendrobatidis (BATDE)",
    "Baudoinia panamericana (BAUPA)",
    "Beauveria bassiana (BEABA)",
    "Bipolaris sorokiniana (BIPSO)",
    "Blastomyces dermatitidis (BLADE)",
    "Botrytis cinerea (BOTCI)",
    "Capronia epimyces (CAPEP)",
    "Chaetomium thermophilum (CHATH)",
    "Cladophialophora yegresii (CLAYE)",
    "Clavispora lusitaniae (CLALU)",
    "Coccidioides immitis (COCIM)",
    "Colletotrichum graminicola (COLGR)",
    "Coniophora puteana (CONPU)",
    "Coniosporium apollinis (CONAP)",
    "Coprinopsis cinerea (COPCI)",
    "Cordyceps militaris (CORMI)",
    "Cryptococcus neoformans (CRYNE)",
    "Cyphellophora europaea (CYPEU)",
    "Dactylellina haptotyla (DACHA)",
    "Debaryomyces hansenii (DEBHA)",
    "Dichomitus squalens (DICSQ)",
    "Endocarpon pusillum (ENDPU)",
    "Eremothecium gossypii (EREGO)",
    "Eutypa lata (EUTLA)",
    "Exophiala aquamarina (EXOAQ)",
    "Fibroporia radiculosa (FIBRA)",
    "Fomitiporia mediterranea (FOMME)",
    "Fonsecaea pedrosoi (FONPE)",
    "Fusarium pseudograminearum (FUSPS)",
    "Gaeumannomyces graminis (GAEGR)",
    "Glarea lozoyensis (GLALO)",
    "Gloeophyllum trabeum (GLOTR)",
    "Heterobasidion irregulare (HETIR)",
    "Histoplasma capsulatum (HISCA)",
    "Kazachstania africana (KAZAF)",
    "Kluyveromyces lactis (KLULA)",
    "Komagataella pastoris (KOMPA)",
    "Laccaria bicolor (LACBI)",
    "Lachancea thermotolerans (LACTH)",
    "Leptosphaeria maculans (LEPMA)",
    "Lodderomyces elongisporus (LODEL)",
    "Magnaporthe oryzae (MAGOR)",
    "Malassezia globosa (MALGL)",
    "Marssonina brunnea (MARBR)",
    "Metarhizium robertsii (METRO)",
    "Meyerozyma guilliermondii (MEYGU)",
    "Microsporum gypseum (MICGY)",
    "Millerozyma farinosa (MILFA)",
    "Moniliophthora roreri (MONRO)",
    "Myceliophthora thermophila (MYCTH)",
    "Naumovozyma dairenensis (NAUDA)",
    "Nectria haematococca (NECHA)",
    "Neofusicoccum parvum (NEOPA)",
    "Neosartorya fischeri (NEOFI)",
    "Ogataea parapolymorpha (OGAPA)",
    "Paracoccidioides brasiliensis (PARBR)",
    "Penicillium rubens (PENRU)",
    "Pestalotiopsis fici (PESFI)",
    "Phanerochaete carnosa (PHACA)",
    "Pneumocystis murina (PNEMU)",
    "Podospora anserina (PODAN)",
    "Postia placenta (POSPL)",
    "Pseudocercospora fijiensis (PSEFI)",
    "Pseudogymnoascus destructans (PSEDE)",
    "Pseudozyma hubeiensis (PSEHU)",
    "Puccinia graminis (PUCGR)",
    "Punctularia strigosozonata (PUNST)",
    "Pyrenophora teres (PYRTE)",
    "Rasamsonia emersonii (RASEM)",
    "Rhinocladiella mackenziei (RHIMA)",
    "Scheffersomyces stipitis (SCHST)",
    "Schizophyllum commune (SCHCO)",
    "Sclerotinia sclerotiorum (SCLSC)",
    "Serpula lacrymans (SERLA)",
    "Setosphaeria turcica (SETTU)",
    "Sordaria macrospora (SORMA)",
    "Spathaspora passalidarum (SPAPA)",
    "Stereum hirsutum (STEHI)",
    "Talaromyces marneffei (TALMA)",
    "Tetrapisispora phaffii (TETPH)",
    "Thielavia terrestris (THITE)",
    "Tilletiaria anomala (TILAN)",
    "Togninia minima (TOGMI)",
    "Torulaspora delbrueckii (TORDE)",
    "Trametes versicolor (TRAVE)",
    "Tremella mesenterica (TREME)",
    "Trichoderma virens (TRIVI)",
    "Trichophyton rubrum (TRIRU)",
    "Tuber melanosporum (TUBME)",
    "Uncinocarpus reesii (UNCRE)",
    "Vanderwaltozyma polyspora (VANPO)",
    "Verticillium alfalfae (VERAL)",
    "Wallemia mellicola (WALME)",
    "Wickerhamomyces ciferrii (WICCI)",
    "Yarrowia lipolytica (YARLI)",
    "Zygosaccharomyces rouxii (ZYGRO)",
    "Zymoseptoria tritici (ZYMTR)"
  )
  set.seed(ID)                  # seed the random number generator
  choice <- sample(Species, 1)  # pick a random element
  return(choice)
}
- Execute the function pickSpecies() with your student ID as its argument. Example:
> pickSpecies(991234567)
[1] "Coccidioides immitis (COCIM)"
- Note down the species name and its five-letter label on your Student Wiki user page. Use this species whenever this or future assignments refer to YFO.
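If you want to convince yourself that seeding really makes the pick reproducible, you can try the idea on a small scale. The following sketch uses a made-up three-element list and a hypothetical helper pickOne() in place of the full function. The range check is an extra precaution I have added for this example: set.seed() expects a value that fits into an R integer.

```r
# A miniature version of the lottery: seed, then sample one element.
# (pickOne and demoItems are hypothetical names for this illustration.)
pickOne <- function(items, seed) {
  # the seed must fit into R's integer range
  stopifnot(seed <= .Machine$integer.max)
  set.seed(seed)
  sample(items, 1)
}

demoItems <- c("alpha", "beta", "gamma")

first  <- pickOne(demoItems, 991234567)
second <- pickOne(demoItems, 991234567)

# Same seed, same pick: this comparison is TRUE.
identical(first, second)
```

Note that set.seed() changes the state of the random number generator for the whole R session, so any random numbers you draw afterwards are affected too.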
Selecting "your" gene
Task:
- Back at the Mbp1 protein page follow the link to Run BLAST... under "Analyze this sequence".
- This allows you to perform a sequence similarity search. You need to set two parameters:
- As Database, select Reference proteins (refseq_protein) from the drop down menu;
- In the Organism field, type the species you have selected as YFO and select the corresponding taxonomy ID.
- Click on Run BLAST to start the search.
This should find a handful of genes, all of them in YFO. If you find none, or hundreds, or they are not all in the same species, you did something wrong. Ask on the mailing list and make sure to fix the problem.
- Note the results in your Journal.
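The sanity check described above can also be written down as a small function, for example if you copy the species names of your hits into R. This is a sketch only: checkHits() is a hypothetical helper, and the cutoff of ten hits is my assumption for what "a handful" might mean, not a value set by the assignment.

```r
# Check a vector of species names (one per BLAST hit) against YFO.
# maxHits is an assumed cutoff for "a handful"; adjust as needed.
checkHits <- function(hitSpecies, yfo, maxHits = 10) {
  n <- length(hitSpecies)
  if (n == 0) {
    return("no hits: check the Database and Organism settings")
  }
  if (n > maxHits) {
    return("too many hits: was the Organism restriction applied?")
  }
  if (!all(hitSpecies == yfo)) {
    return("hits from other species: check the Organism field")
  }
  "ok"
}

# Example with made-up hits for the species from the example above.
checkHits(rep("Coccidioides immitis", 4), "Coccidioides immitis")
```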
Data modelling
TBC
- That is all.
Links and resources
Footnotes and references
- ↑ If you find this URL hard to remember, consider the acronyms:
- ncbi.nlm.nih.gov
- NCBI: National Center for Biotechnology Information
- NLM: National Library of Medicine
- NIH: National Institutes of Health
- GOV: the US GOVernment top-level domain
- ↑ If there had been more than one match, you would have gotten a list of results, as before.
- ↑ Actually the "real" SwissProt identifier would look like MBP1_YEAST. P39678 is the corresponding UniProt identifier.
Ask, if things don't work for you!
- If anything about the assignment is not clear to you, please ask on the mailing list. You can be certain that others will have had similar problems. Success comes from joining the conversation.
- Do consider how to ask your questions so that a meaningful answer is possible:
- How to create a Minimal, Complete, and Verifiable example on stackoverflow and ...
- How to make a great R reproducible example are required reading.