Difference between revisions of "BIO Assignment Week 3"

From "A B C"
Jump to navigation Jump to search
m
Line 22: Line 22:
 
==Model organism and YFO==
 
==Model organism and YFO==
  
You will explore the function of the Mbp1 protein in some other species from the {{WP|Kingdom_(biology)|kingdom}} of fungi, whose genome has been completely sequenced; thus our quest is also an exercise in ''model-organism reasoning'': the transfer of knowledge from one, well-studied organism to others.  It's reasonable to hypothesize that such central control machinery is conserved in most if not all fungi. But we don't know. Many of the species that we will be working with have not been characterized in great detail, and some of them are new to our class this year. And while we know a fair bit about Mbp1, we probably don't know very much at all about the related genes in other organisms: whether they exist, whether they have similar functional features and whether they might contribute to the ''G1/S checkpoint system'' in a similar way. Thus we might discover things that are new and interesting. This is a quest of discovery.  
+
In this assignment we will set out on our exploration of the system that regulates the G1/S transition by focussing first on the Mbp1 protein in selected species from the {{WP|Kingdom_(biology)|kingdom}} of fungi, whose genome has been completely sequenced; our quest is thus also an exercise in ''model-organism reasoning'': the transfer of knowledge from one, well-studied organism to others.  It's reasonable to hypothesize that such central control machinery is conserved in most if not all fungi. But we don't know. Many of the species that we will be working with have not been characterized in great detail, and some of them are new to our class this year. And while we know a fair bit about Mbp1, we probably don't know very much at all about the related genes in other organisms: whether they exist, whether they have similar functional features and whether they might contribute to the ''G1/S checkpoint system'' in a similar way. Thus we might discover things that are new and interesting. This is a quest of discovery.
# Pick a species to adopt for exploration.
+
 
 +
{{Vspace}}
 +
 
 +
==The BCH441 RStudio Project==
 +
 
 +
I have begun to unify '''R'''-scripts and other resources in an RStudio project that will make it easy to update and distribute code. Conveniently, if I push update material to the github repository, all you need to do is to ''pull'' the updated project to receive all updates and new files on your computer. Version control is really good for this. However, there is one limitation to this approach that you need to be aware of. If you create your own, local files and then '''commit''' them, ''git'' will complain that it would be overwriting such local material. As long as you don't '''commit''' your files then all should be fine. This means you'll need to do your own "versioning" by saving your own scripts under a different name from time to time. Once again: in this context:
 +
*'''saving''' your own files is fine;
 +
*'''committing''' your own files to version control will cause problems;
 +
* changes you make to course material files and save '''under the same filename''' (like adding comments and notes) will not persist, these changes will be overwritten with the next update. You need to "Save As..." with a new filename (e.g. just prefix the original name with "<code>my</code>").
  
 +
{{Vspace}}
  
 +
{{task|1=
 +
* Open RStudio and create a '''New Project...''' cloned from a git version control directory. The repository URL is <code><nowiki>https://github.com/hyginn/BCH441_2016</nowiki></code>. Create this in the same way as you did for the [[R_tutorial#Git_Version_control|'''R'''-tutorial]].
 +
* As requested on the console, type <code>init()</code>. This will create a file called <code>.myProfile.R</code> and ask you for your UofT eMail address and Student ID. You need to enter the correct values because other scripts will make assumptions about that these variables exist and are valid.
 +
* Work through the task: <code>"local script"</code> in the <code>BCH441_2016.R</code> script.
 +
}}
  
Now, we were actually trying to find related proteins in a different species. Our next task is therefore to decide what that species should be.
 
  
 +
==
  
  
&nbsp;
+
{{Vspace}}
  
 
==Choosing YFO (Your Favourite Organism)==
 
==Choosing YFO (Your Favourite Organism)==
 +
 +
 +
# Pick a species to adopt for exploration.
 +
 +
 +
 +
Now, we were actually trying to find related proteins in a different species. Our next task is therefore to decide what that species should be.
  
  
Line 52: Line 73:
 
* whose genomes have been sequenced; and
 
* whose genomes have been sequenced; and
 
* for which the sequences have been deposited in the RefSeq database, NCBI's unique sequence collection.
 
* for which the sequences have been deposited in the RefSeq database, NCBI's unique sequence collection.
 
 
{{task|
 
* Access the  [[R tutorial|'''R tutorial''']] and work through the section on [[R tutorial#Writing_your_own_functions|'''Writing your own functions''']]. It is short, and crucial for your work.
 
}}
 
  
  

Revision as of 20:08, 29 September 2016

Assignment for Week 3
Sequence Analysis

< Assignment 2 Assignment 4 >

Note! This assignment is currently inactive. Major and minor unannounced changes may be made at any time.

 
 

Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.



 

Introduction

Model organism and YFO

In this assignment we will set out on our exploration of the system that regulates the G1/S transition by focussing first on the Mbp1 protein in selected species from the kingdom of fungi, whose genome has been completely sequenced; our quest is thus also an exercise in model-organism reasoning: the transfer of knowledge from one, well-studied organism to others. It's reasonable to hypothesize that such central control machinery is conserved in most if not all fungi. But we don't know. Many of the species that we will be working with have not been characterized in great detail, and some of them are new to our class this year. And while we know a fair bit about Mbp1, we probably don't know very much at all about the related genes in other organisms: whether they exist, whether they have similar functional features and whether they might contribute to the G1/S checkpoint system in a similar way. Thus we might discover things that are new and interesting. This is a quest of discovery.


 

The BCH441 RStudio Project

I have begun to unify R-scripts and other resources in an RStudio project that will make it easy to update and distribute code. Conveniently, if I push update material to the github repository, all you need to do is to pull the updated project to receive all updates and new files on your computer. Version control is really good for this. However, there is one limitation to this approach that you need to be aware of. If you create your own, local files and then commit them, git will complain that it would be overwriting such local material. As long as you don't commit your files then all should be fine. This means you'll need to do your own "versioning" by saving your own scripts under a different name from time to time. Once again: in this context:

  • saving your own files is fine;
  • committing your own files to version control will cause problems;
  • changes you make to course material files and save under the same filename (like adding comments and notes) will not persist, these changes will be overwritten with the next update. You need to "Save As..." with a new filename (e.g. just prefix the original name with "my").


 

Task:

  • Open RStudio and create a New Project... cloned from a git version control directory. The repository URL is https://github.com/hyginn/BCH441_2016. Create this in the same way as you did for the R-tutorial.
  • As requested on the console, type init(). This will create a file called .myProfile.R and ask you for your UofT eMail address and Student ID. You need to enter the correct values because other scripts will make assumptions about that these variables exist and are valid.
  • Work through the task: "local script" in the BCH441_2016.R script.


==


 

Choosing YFO (Your Favourite Organism)

  1. Pick a species to adopt for exploration.


Now, we were actually trying to find related proteins in a different species. Our next task is therefore to decide what that species should be.


In this section we create a lottery to assign species at (pseudo) random to students. We'll try the following procedure.

  1. First, I create a list of suitable species.
  2. Then, we put this list into the body of an R function.
  3. The function picks one of the species at random - but to make sure this process is reproducible, we'll set a seed for the random number generator. Obviously, everyone has to use a different seed, or else everyone would end up getting the same species assigned.
  4. Thus we'll use your Student Number as the seed. This is an integer, so it can be used, and it's unique to each of you. The choice is then random, reproducible and unique.

You may notice that this process does not guarantee that everyone gets a different species, and that all species are chosen at least once. I don't think doing that is possible in a "stateless" way (i.e. I don't want to have to remember who chose what species), given that I don't know all of your student numbers. But if anyone can think of a better solution, that would be neat.

Is it possible that all of you end up working on the same species anyway? Indeed. That's the problem with randomness. But it is not very likely.


What about the "suitable species" though? Where do they come from? For the purposes of the course "quest", we need species

  • that actually have transcription factors that are related to Mbp1;
  • whose genomes have been sequenced; and
  • for which the sequences have been deposited in the RefSeq database, NCBI's unique sequence collection.


To prepare such a list of species, I have searched the NCBI's RefSeq database for proteins whose sequences are similar to the KilA-N domain of Mbp1 (this is the most highly conserved part of the protein, which binds DNA) and compiled the names of organisms that contain them...

Thess are the steps of the process:


(1) Compiled a list of genome-sequenced fungi from information on the NCBI genome browser page by selecting Eukaryota / Fungi ... and downloading the entire list of species as a text document. An excerpt of the first lines of the document is shown here:
#Organism/Name            Kingdom    Group  SubGroup        Size (Mb)
Aciculosporium take       Eukaryota  Fungi  Ascomycetes     58.8364
Agaricus bisporus         Eukaryota  Fungi  Basidiomycetes  32.6144
Ajellomyces capsulatus    Eukaryota  Fungi  Ascomycetes     46.124 
Ajellomyces dermatitidis  Eukaryota  Fungi  Ascomycetes     75.4047
[...]


(2) Performed a BLAST search with the Mbp1 KilA-N domain sequence, the search database restricted to the refseq_protein database and an organism seleciton of "Fungi".
(3) In the header of the BLAST results page, there is a link to [Taxonomy reports] This contains a list of all hits, sorted by species. I copied the lineage report to a separate file.


(4) Ran the following R script:
# prepareSpeciesList.R
#
# From a list of genome-sequenced species, find those that
# have KilA-N domain containing proteins in RefSeq.
# 

setwd(BCH441DIR)  

if (!require(ore, quietly=TRUE)) { # Oniguruma Regular Expressions
    install.packages("ore")
    library(ore)
}

# read output of search at NCBI genomes
#Organism/Name,Kingdom,Group,SubGroup,Size (Mb),Chrs,Organelles,Plasmids,Assemblies
#"Absidia idahoensis",Eukaryota,"Fungi","Other Fungi",-,-,-,-,-
#"Aciculosporium take",Eukaryota,"Fungi","Ascomycetes",58.8364,-,-,-,1
#"Acremonium chrysogenum",Eukaryota,"Fungi","Ascomycetes",28.5905,-,1,-,1
#"Agaricus bisporus",Eukaryota,"Fungi","Basidiomycetes",32.6144,-,-,-,2
# etc.

sequencedFungi <- read.csv("genomes_overview.csv",
                            stringsAsFactors=FALSE)
head(sequencedFungi)  # Check what we have!
tail(sequencedFungi)


# Run BLASTp on RefSeq with KilA-N:                      
# >Yeast Mbp1 KilA-N domain (AA 19..93 of NP_010227)
# IHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQG 
# GFGKYQGTWVPLNIAKQLAEKFSVY
# E value cutoff 0.01
# Copy and save Lineage Report from Taxonomy report of results.
#. . . . . . Vanderwaltozyma polyspora DSM 70294 ............  143   4 hits
#. . . . . . Candida glabrata CBS 138 .......................  140   3 hits
#. . . . . . Tetrapisispora phaffii CBS 4417 ................  140   4 hits
#. . . . . . Naumovozyma dairenensis CBS 421 ................  139   3 hits
#. . . . . . Kazachstania africana CBS 2517 .................  133   5 hits
# etc.


hits <- readLines("lineageReport.txt", n = -1)

for (i in 1:length(hits)) { # get first two words from line
    hits[i] <- ore.search("(\\w+\\s+\\w+)", hits[i])$match	
}

# choose only hits from sequenced fungi
fungi <- unique(hits[hits %in% sequencedFungi[,1]])

# extract genus and keep only one representative of genus
fungi <- cbind(fungi, rep("", length(fungi)))
for (i in 1:nrow(fungi)) {
    fungi[i,2] <- ore.search("^\\w+", fungi[i,1])$match	
}
selected <- fungi[!duplicated(fungi[, 2]), 1]

selected <- selected[order(selected)]
selected

# remove reference species
reference <- c("Saccharomyces cerevisiae",
               "Aspergillus nidulans",
               "Candida glabrata",               # ref.: albicans
               "Neurospora tetrasperma",         # ref.: crassa
               "Schizosaccharomyces japonicus",  # ref.: pombe
               "Ustilago maydis")
               
selected <- selected[!(selected  %in% reference)]             
                
# add binomial codes
biCode <- function(s) { 
	substr(s, 4, 5) <- substr(strsplit(s,"\\s+")[[1]][2], 1, 2)
	return (toupper(substr(s, 1, 5)))
}

for (i in 1:length(selected)) {
    selected[i] <- paste(selected[i],
                         " (",
                         biCode(selected[i]),
                         ")",
                         sep="")	
}

# print output
cat(paste(selected, collapse="\n"))

# [End]


This process with its mix of Web service, programmed post-processing and a bit of manual cleanup, is a fairly typical example of gathering and collating information across different data sources.

Here is R code to assign the species:

Task:


  • Read, try to understand and then execute the following R-code.
pickSpecies <- function(ID) {
	# this function randomly picks a fungal species
	# from a list. It is seeded by a student ID. Therefore
	# the pick is random, but reproducible.
	
	# first, define a list of species:
	Species <- c(
"Agaricus bisporus (AGABI)",
"Arthrobotrys oligospora (ARTOL)",
"Arthroderma benhamiae (ARTBE)",
"Aureobasidium subglaciale (AURSU)",
"Auricularia delicata (AURDE)",
"Batrachochytrium dendrobatidis (BATDE)",
"Baudoinia panamericana (BAUPA)",
"Beauveria bassiana (BEABA)",
"Bipolaris sorokiniana (BIPSO)",
"Blastomyces dermatitidis (BLADE)",
"Botrytis cinerea (BOTCI)",
"Capronia epimyces (CAPEP)",
"Chaetomium thermophilum (CHATH)",
"Cladophialophora yegresii (CLAYE)",
"Clavispora lusitaniae (CLALU)",
"Coccidioides immitis (COCIM)",
"Colletotrichum graminicola (COLGR)",
"Coniophora puteana (CONPU)",
"Coniosporium apollinis (CONAP)",
"Coprinopsis cinerea (COPCI)",
"Cordyceps militaris (CORMI)",
"Cryptococcus neoformans (CRYNE)",
"Cyphellophora europaea (CYPEU)",
"Dactylellina haptotyla (DACHA)",
"Debaryomyces hansenii (DEBHA)",
"Dichomitus squalens (DICSQ)",
"Endocarpon pusillum (ENDPU)",
"Eremothecium gossypii (EREGO)",
"Eutypa lata (EUTLA)",
"Exophiala aquamarina (EXOAQ)",
"Fibroporia radiculosa (FIBRA)",
"Fomitiporia mediterranea (FOMME)",
"Fonsecaea pedrosoi (FONPE)",
"Fusarium pseudograminearum (FUSPS)",
"Gaeumannomyces graminis (GAEGR)",
"Glarea lozoyensis (GLALO)",
"Gloeophyllum trabeum (GLOTR)",
"Heterobasidion irregulare (HETIR)",
"Histoplasma capsulatum (HISCA)",
"Kazachstania africana (KAZAF)",
"Kluyveromyces lactis (KLULA)",
"Komagataella pastoris (KOMPA)",
"Laccaria bicolor (LACBI)",
"Lachancea thermotolerans (LACTH)",
"Leptosphaeria maculans (LEPMA)",
"Lodderomyces elongisporus (LODEL)",
"Magnaporthe oryzae (MAGOR)",
"Malassezia globosa (MALGL)",
"Marssonina brunnea (MARBR)",
"Metarhizium robertsii (METRO)",
"Meyerozyma guilliermondii (MEYGU)",
"Microsporum gypseum (MICGY)",
"Millerozyma farinosa (MILFA)",
"Moniliophthora roreri (MONRO)",
"Myceliophthora thermophila (MYCTH)",
"Naumovozyma dairenensis (NAUDA)",
"Nectria haematococca (NECHA)",
"Neofusicoccum parvum (NEOPA)",
"Neosartorya fischeri (NEOFI)",
"Ogataea parapolymorpha (OGAPA)",
"Paracoccidioides brasiliensis (PARBR)",
"Penicillium rubens (PENRU)",
"Pestalotiopsis fici (PESFI)",
"Phanerochaete carnosa (PHACA)",
"Pneumocystis murina (PNEMU)",
"Podospora anserina (PODAN)",
"Postia placenta (POSPL)",
"Pseudocercospora fijiensis (PSEFI)",
"Pseudogymnoascus destructans (PSEDE)",
"Pseudozyma hubeiensis (PSEHU)",
"Puccinia graminis (PUCGR)",
"Punctularia strigosozonata (PUNST)",
"Pyrenophora teres (PYRTE)",
"Rasamsonia emersonii (RASEM)",
"Rhinocladiella mackenziei (RHIMA)",
"Scheffersomyces stipitis (SCHST)",
"Schizophyllum commune (SCHCO)",
"Sclerotinia sclerotiorum (SCLSC)",
"Serpula lacrymans (SERLA)",
"Setosphaeria turcica (SETTU)",
"Sordaria macrospora (SORMA)",
"Spathaspora passalidarum (SPAPA)",
"Stereum hirsutum (STEHI)",
"Talaromyces marneffei (TALMA)",
"Tetrapisispora phaffii (TETPH)",
"Thielavia terrestris (THITE)",
"Tilletiaria anomala (TILAN)",
"Togninia minima (TOGMI)",
"Torulaspora delbrueckii (TORDE)",
"Trametes versicolor (TRAVE)",
"Tremella mesenterica (TREME)",
"Trichoderma virens (TRIVI)",
"Trichophyton rubrum (TRIRU)",
"Tuber melanosporum (TUBME)",
"Uncinocarpus reesii (UNCRE)",
"Vanderwaltozyma polyspora (VANPO)",
"Verticillium alfalfae (VERAL)",
"Wallemia mellicola (WALME)",
"Wickerhamomyces ciferrii (WICCI)",
"Yarrowia lipolytica (YARLI)",
"Zygosaccharomyces rouxii (ZYGRO)",
"Zymoseptoria tritici (ZYMTR)"
		)
	set.seed(ID)                  # seed the random number generator
	choice <- sample(Species, 1)  # pick a random element
	return(choice)
}
  • Execute the function pickSpecies() with your student ID as its argument. Example:
 > pickSpecies(991234567)
[1] "Coccidioides immitis (COCIM)"
  • Note down the species name and its five letter label on your Student Wiki user page. Use this species whenever this or future assignments refer to YFO.


 

Selecting "your" gene

Task:

  1. Back at the Mbp1 protein page follow the link to Run BLAST... under "Analyze this sequence".
  2. This allows you to perform a sequence similarity search. You need to set two parameters:
    1. As Database, select Reference proteins (refseq_protein) from the drop down menu;
    2. In the Organism field, type the species you have selected as YFO and select the corresponding taxonomy ID.
  3. Click on Run BLAST to start the search. This should find a handful of genes, all of them in YFO. If you find none, or hundreds, or they are not all in the same species, you did something wrong. Ask on the mailing list and make sure to fix the problem.
  4. The top "hit" on the results page is what you are looking for. Its alignment and alignment score are shown in the Alignments section a bit further down the page. Your hit should have on the order of more than 40% identities to the query
  5. The first item for each hit is a link to its database entry, right next to the checkbox. It says something like ref|NP_123456789 or ref|XP_123456789 ... follow that link.
  6. Note the RefSeq ID, %ID, E-value etc. in your Journal. We will refer to this sequence as "YFO Mbp1" or similar in the future.


Store

The System datamodel

Entity-Relationship Diagram (ERD) for a data model that stores data for a systems project. Entity names are at the top of each box, attributes are listed below. If you think of the entity as a table, its attributes are the column names and each row stores the values for one particular instance. Semantically related entities are shaded in similar colours; this helps to make the design-principles visible, but one must be careful not to overdo the use of colour. As with all graphical elements in information design: "less is more". All relationships link to unique primary keys and are thus 1 to (0, n): i.e. as attributes, the relationship does not have to exist but there could be many, as the primary key, exactly one key must exist. The diagram was drawn as a "Google presentation" slide and you can view it on the Web and make a copy for your own purposes.

Here is a first version of a systems data model, based on what we discussed in class:

  • The feature table is at the centre. This was not intentional, but emerged from iterating the model through a number of revisions. It emphasizes that the main purpose of this model is to integrate and annotate data from various sources. Feature in the way we understand it here is an abstraction of quite disparate categories of information items. This includes domain annotations, system functions, literature references, and cross-references to other databases. The type attribute will require some thought: this attribute really needs a "controlled vocabulary" to ensure that the same category is described consistently with the same string ("PubMed"? "Lit."? "reference"?). There are a number of ways to achieve this, the best way[1] is to store these types in their own "reference" table type - and link to that table via a foreign key.
  • The protein table is at the centre. Its primary key is a unique integer. We store the NCBI RefSeq ID and the Uniprot ID in the table. We would not call these "foreign keys", since the information they reference is not in our schema but at the NCBI resp. EBI. For example, we can't guarantee that they are unique keys.
  • The taxonomy table holds information about species. We use the NCBI taxonomy ID as its primary key. The same key appears in the protein table as the foreign key that links the protein with its proper taxonomy information. This is an instance where the data model actually does not describe reality well. The problem is that particular proteins that we might retrieve from database searches will often be annotated to a specific strain of a species – and there is no easy way to reference strains to species. We'll see whether this turns out to be a problem in practice. But it may be that an additional table may be required that stores parent/child relationships of the taxonomic tree of life.
  • The protein_feature table links a protein with all the features that we annotate it with. start and end coordinates identify the region of the sequence we have annotated.
  • The ...Annotation tables link feature-information with annotated entities.
  • Should the system table have its own taxonomy attribute? Or should the species in which the system is observed be inferred from the component protein's taxonomy.ID? What do you think? I decided not to add a taxonomy attribute to that table. How would you argue for or against this decision?
  • The component table links proteins that collaborate together as a system. There is an implicit assumption in this model that only proteins are system components (and not e.g. RNA), and that components are atomic (i.e. we can't link to subsystems). How would you change the model to accommodate more realistic biological complexity?


 

Implementing the Data Model in R

To actually implement the model in R we will create the tables as data frames, and we will collect them in a list. We don't have to keep the tables in a list - we could also keep them as independent objects, but with a table named "protein" we should be worried of inadvertently overwriting the table. A list is neater, more flexible (we might want to have several of these), it reflects our intent about the model better, and doesn't require very much more typing. For now, to keep things simple, we will implement only two tables: protein and taxonomy. We'll add the rest when we actually need them.

db <- list()      # we'll call our database "db" and start with an empty list

db$protein <- data.frame(id = 1,
                  name = "Mbp1",
                  refseq_id = "NP_010227",
                  uniprot_id = "P39678",
                  taxonomy_id = 4932,
                  genome_xref = "NC_001136.10",
                  genome_from = 352877,
                  genome_to = 355378,
                  sequence = "...",
                  stringsAsFactors = FALSE)
                      
db$taxonomy <- data.frame(id = 4932, 
                  species_name = "Saccharomyces cerevisiae",
                  stringsAsFactors = FALSE)

# Let's add a second protein:
db$protein <- rbind(db$protein,
                data.frame(id = 2,
                  name = "Res2",
                  refseq_id = "NP_593032",
                  uniprot_id = "P41412",
                  taxonomy_id = 4896,
                  genome_xref = "NC_003424.3",
                  genome_from = 686543,
                  genome_to = 689179,
                  sequence = "...",
                  stringsAsFactors = FALSE))

db$taxonomy <- rbind(db$taxonomy,
                 data.frame(id = 4896,
                  species_name = "Schizosaccharomyces pombe",
                  stringsAsFactors = FALSE))

str(db)

# you can look at the contents of the tables in the usual way we would access
# elements from lists and dataframes. Here are some examples:
db$protein
db$protein$refseq_id
db$protein[,"name"]
db$taxonomy
db$taxonomy[1,]


# Here is an example to look up information in one table,
# based on a condition in another table:
# what is the species name for the protein
# whose name is "Mbp1"?

# First, get the taxonomy_id for the Mbp1 protein. This is
# the key we need for the taxonomy table. We find it in a cell in the 
# table: db$protein[<row>, <column>]
# <row> is that row for which the value in
# the "name" column is Mbp1:

db$protein$name == "Mbp1" # TRUE FALSE

# The <column> is called "taxonomy_id". Simply
# insert these two expressions in the square
# brackets.

db$protein[db$protein$name == "Mbp1", "taxonomy_id"]

# Assign the taxonomy_id value ...
x <- db$protein[db$protein$name == "Mbp1", "taxonomy_id"]

# ... and fetch the species_name value from db$taxonomy

db$taxonomy[db$taxonomy$id == x, "species_name"]

As you see we can edit any component of our data model directly by assigning new values to the element. But in general that's a really bad idea, since it is far too easy to bring the whole model into an inconsistent state. It's much better to write functions that get and set data elements, not only to keep the data consistent, but also because it is much easier, if the model ever changes, to simply edit the function code, rather than having to repeat every single data entry by hand.

What would an setData function have to look like? It should

  • create a new entry if the requested row of a table does not exist yet;
  • update data if the protein exists;
  • perform consistency checks (i.e. check that the data has the correct type);
  • perform sanity checks (i.e. check that data values fall into the expected range);
  • perform completeness checks (i.e. handle incomplete data)

Let's start simple, and create a set- function to add new values to existing sequence data. Also, for clarity, we'll forgo many checks. The first thing we should do is to add the actual sequence.

We only entered a placeholder for the sequence field. Sequences come in many different flavours when we copy them from a Webpage: there can be whitespace, carriage returns, numbers, they can be upper-case, lower-case mixed-case ...

What we want in our sequence data field is one string that contains the entire sequence, and nothing but upper-case, amino-acid code letters.

We'll need to look at how we work with strings in R, how we identify and work with patterns in strings. This is a great time to introduce regular expressions.


A Brief First Encounter of Regular Expressions

Regular expressions are a concise description language to define patterns for pattern-matching in strings.

Truth be told, many programmers have a love-hate relationship with regular expressions. The syntax of regular expressions is very powerful and expressive, but also terse, not always intuitive, and sometimes hard to understand. I'll introduce you to a few principles here that are quite straightforward and they will probably cover 99% of the cases you will encounter.

Here is our test-case: the sequence of Mbp1, copied from the Genbank Webpage.

       1 msnqiysary sgvdvyefih stgsimkrkk ddwvnathil kaanfakakr trilekevlk
      61 ethekvqggf gkyqgtwvpl niakqlaekf svydqlkplf dftqtdgsas pppapkhhha
     121 skvdrkkair sastsaimet krnnkkaeen qfqsskilgn ptaaprkrgr pvgstrgsrr
     181 klgvnlqrsq sdmgfprpai pnssisttql psirstmgpq sptlgileee rhdsrqqqpq
     241 qnnsaqfkei dledglssdv epsqqlqqvf nqntgfvpqq qssliqtqqt esmatsvsss
     301 pslptspgdf adsnpfeerf pgggtspiis miprypvtsr pqtsdindkv nkylsklvdy
     361 fisnemksnk slpqvllhpp phsapyidap idpelhtafh wacsmgnlpi aealyeagts
     421 irstnsqgqt plmrsslfhn sytrrtfpri fqllhetvfd idsqsqtvih hivkrksttp
     481 savyyldvvl skikdfspqy rielllntqd kngdtalhia skngdvvffn tlvkmgaltt
     541 isnkegltan eimnqqyeqm miqngtnqhv nssntdlnih vntnnietkn dvnsmvimsp
     601 vspsdyityp sqiatnisrn ipnvvnsmkq masiyndlhe qhdneikslq ktlksisktk
     661 iqvslktlev lkesskdeng eaqtnddfei lsrlqeqntk klrkrliryk rlikqkleyr
     721 qtvllnklie detqattnnt vekdnntler lelaqeltml qlqrknklss lvkkfednak
     781 ihkyrriire gtemnieevd ssldvilqtl iannnknkga eqiitisnan sha
//


Task:
Navigate to http://regexpal.com and paste the sequence into the lower box. This site is one of a number of online regular expression testers; their immediate, visual feedback is invaluable when you are developing regular expression patterns.

Lets try some expressions:

Most characters are matched literally.
Type "a" in to the upper box and you will see all "a" characters matched. Then replace a with q.
Now type "aa" instead. Then krnnkk. Sequences of characters are also matched literally.
We can specify more than one character to match if we place it in square brackets.
[lq] matches l OR q. [familyvw] matches hydrophobic amino acids.
Within square brackets, we can specify "ranges".
[1-5] matches digits from 1 to 5.
Within square brackets, we can specify characters that should NOT be matched, with the caret, ^.
[^0-9] matches everything EXCEPT digits. [^a-z] matches everything that is not a lower-case letter. That's what we need (try it).

One of the R functions that uses regular expressions is the function gsub(). It replaces characters that match a "regex" with other characters. That is useful for our purpose: we can

  1. match all characters that are NOT a letter, and
  2. replace them by - nothing: the empty string "".

This deletes them.

Task:
Try this code:

seq <- "
        1 msnqiysary sgvdvyefih stgsimkrkk ddwvnathil kaanfakakr trilekevlk
       61 ethekvqggf gkyqgtwvpl niakqlaekf svydqlkplf dftqtdgsas pppapkhhha
      121 skvdrkkair sastsaimet krnnkkaeen qfqsskilgn ptaaprkrgr pvgstrgsrr
      181 klgvnlqrsq sdmgfprpai pnssisttql psirstmgpq sptlgileee rhdsrqqqpq
      241 qnnsaqfkei dledglssdv epsqqlqqvf nqntgfvpqq qssliqtqqt esmatsvsss
      301 pslptspgdf adsnpfeerf pgggtspiis miprypvtsr pqtsdindkv nkylsklvdy
      361 fisnemksnk slpqvllhpp phsapyidap idpelhtafh wacsmgnlpi aealyeagts
      421 irstnsqgqt plmrsslfhn sytrrtfpri fqllhetvfd idsqsqtvih hivkrksttp
      481 savyyldvvl skikdfspqy rielllntqd kngdtalhia skngdvvffn tlvkmgaltt
      541 isnkegltan eimnqqyeqm miqngtnqhv nssntdlnih vntnnietkn dvnsmvimsp
      601 vspsdyityp sqiatnisrn ipnvvnsmkq masiyndlhe qhdneikslq ktlksisktk
      661 iqvslktlev lkesskdeng eaqtnddfei lsrlqeqntk klrkrliryk rlikqkleyr
      721 qtvllnklie detqattnnt vekdnntler lelaqeltml qlqrknklss lvkkfednak
      781 ihkyrriire gtemnieevd ssldvilqtl iannnknkga eqiitisnan sha
 //
"

seq     # "\n" means: line-break

seq <- gsub("[^a-zA-Z]", "", seq) #replace all non-letters with ""

seq

# Now to change the sequence to upper-case. R has toupper()
# and tolower().

toupper(seq)

# Lets put this in a function:

sanitizeSequence <- function(s) {
	return(toupper(gsub("[^a-zA-Z]", "", s)))
}

# try it:

test <- "f123  a$^&&*)m. i	l马马虎虎 é yßv+w"
sanitizeSequence(test)

A Function to Set Values

Building on this function, we can write a function to set values in our protein table.

setDB <- function(database,
                  table,
                  id,
                  name = "",
                  refseq_id = "",
                  uniprot_id = "",
                  taxonomy_id = "",
                  genome_xref = "",
                  genome_from = "",
                  genome_to = "",
                  sequence = "",
                  species_name = "") {
    if (missing(database) | missing(table) | missing(id)) {
    	stop("database, table and id have to be specified")
    }
    if (table == "protein") {
    	row <- which(database$protein$id == id)
    	if (name != "") { database$protein[row, "name"] <- name } 
    	if (refseq_id != "") { database$protein[row, "refseq_id"] <- refseq_id} 
    	if (uniprot_id != "") { database$protein[row, "uniprot_id"] <- uniprot_id} 
    	if (taxonomy_id != "") { database$protein[row, "taxonomy_id"] <- taxonomy_id} 
    	if (genome_xref != "") { database$protein[row, "genome_xref"] <- genome_xref} 
    	if (genome_from != "") { database$protein[row, "genome_from"] <- genome_from} 
    	if (genome_to != "") { database$protein[row, "genome_to"] <- genome_to} 
    	if (sequence != "") { database$protein[row, "sequence"] <- sanitizeSequence(sequence)} 
    }
    if (table == "taxonomy") {
    	row <- which(database$taxonomy$id == id)
    	if (species_name != "") { database$taxonomy[row, "species_name"] <- species_name } 
    }
    return(database)
}

# Note that this code could be significantly improved: we are
# not doing any tests that the values are sane.
# But for now I just want to keep the code simple.



db$protein[1,]

db <- setDB(database = db, table="protein", id=1, name="Armadillo")
db$protein[1,]
db <- setDB(database = db, table="protein", id=1, name="Mbp1")

# With this, we can update our entries to set the sequence to its actual value.

seq

seq <- "
        1 msnqiysary sgvdvyefih stgsimkrkk ddwvnathil kaanfakakr trilekevlk
       61 ethekvqggf gkyqgtwvpl niakqlaekf svydqlkplf dftqtdgsas pppapkhhha
      121 skvdrkkair sastsaimet krnnkkaeen qfqsskilgn ptaaprkrgr pvgstrgsrr
      181 klgvnlqrsq sdmgfprpai pnssisttql psirstmgpq sptlgileee rhdsrqqqpq
      241 qnnsaqfkei dledglssdv epsqqlqqvf nqntgfvpqq qssliqtqqt esmatsvsss
      301 pslptspgdf adsnpfeerf pgggtspiis miprypvtsr pqtsdindkv nkylsklvdy
      361 fisnemksnk slpqvllhpp phsapyidap idpelhtafh wacsmgnlpi aealyeagts
      421 irstnsqgqt plmrsslfhn sytrrtfpri fqllhetvfd idsqsqtvih hivkrksttp
      481 savyyldvvl skikdfspqy rielllntqd kngdtalhia skngdvvffn tlvkmgaltt
      541 isnkegltan eimnqqyeqm miqngtnqhv nssntdlnih vntnnietkn dvnsmvimsp
      601 vspsdyityp sqiatnisrn ipnvvnsmkq masiyndlhe qhdneikslq ktlksisktk
      661 iqvslktlev lkesskdeng eaqtnddfei lsrlqeqntk klrkrliryk rlikqkleyr
      721 qtvllnklie detqattnnt vekdnntler lelaqeltml qlqrknklss lvkkfednak
      781 ihkyrriire gtemnieevd ssldvilqtl iannnknkga eqiitisnan sha
 //
"

db <- setDB(database = db, table="protein", id=1, sequence=seq)

seq <- "
        1 maprssavhv avysgvevye cfikgvsvmr rrrdswlnat qilkvadfdk pqrtrvlerq
       61 vqigahekvq ggygkyqgtw vpfqrgvdla tkykvdgims pilsldideg kaiapkkkqt
      121 kqkkpsvrgr rgrkpsslss stlhsvnekq pnssisptie ssmnkvnlpg aeeqvsatpl
      181 paspnallsp ndntikpvee lgmleapldk yeeslldffl hpeegripsf lyspppdfqv
      241 nsvidddght slhwacsmgh iemiklllra nadigvcnrl sqtplmrsvi ftnnydcqtf
      301 gqvlellqst iyavdtngqs ifhhivqsts tpskvaaaky yldcilekli siqpfenvvr
      361 lvnlqdsngd tslliaarng amdcvnslls ynanpsipnr qrrtaseyll eadkkphsll
      421 qsnsnashsa fsfsgispai ispscsshaf vkaipsissk fsqlaeeyes qlrekeedli
      481 ranrlkqdtl neisrtyqel tflqknnpty sqsmenlire aqetyqqlsk rlliwlearq
      541 ifdlerslkp htslsisfps dflkkedgls lnndfkkpac nnvtnsdeye qlinkltslq
      601 asrkkdtlyi rklyeelgid dtvnsyrrli amscginped lsleildave ealtrek
"
db <- setDB(database = db, table="protein", id=2, sequence=seq)


A Function to Add New Entries

The function to add a new entry to the database looks quite similar. But we have to add a test whether the species name has been entered when we add a species with a new taxonomy ID.

addToDB <- function(database,
                    name = "",
                    refseq_id = "",
                    uniprot_id = "",
                    taxonomy_id = "",
                    genome_xref = "",
                    genome_from = "",
                    genome_to = "",
                    sequence = "",
                    species_name = "") {
    if (missing(database) | missing(taxonomy_id)) {
    	stop("database and taxonomy_id have to be specified")
    }
    # handle taxonomy
    if (! any(database$taxonomy$id == taxonomy_id)) {  # new taxonomy_id
        if (missing(species_name)) {
    		stop("taxonomy_id is new: species_name has to be specified")
    	}
    	else {
    		# add this species to the taxonomy table
            database$taxonomy <- rbind(database$taxonomy,
                 data.frame(id = taxonomy_id,
                 species_name = species_name,
                 stringsAsFactors = FALSE))
    		
    	}
    }
    # handle protein
    # get a new protein id
    
    pid <- max(database$protein$id) + 1
    
    database$protein <- rbind(database$protein,
      data.frame(id = pid,
        name = name,
        refseq_id = refseq_id,
        uniprot_id = uniprot_id,
        taxonomy_id = taxonomy_id,
        genome_xref = genome_xref,
        genome_from = genome_from,
        genome_to = genome_to,
        sequence = sanitizeSequence(sequence),
        stringsAsFactors = FALSE))
        
    return(database)
}

# Test this.

newSeq <- "
        1 msgdktifka tysgvpvyec iinnvavmrr rsddwlnatq ilkvvgldkp qrtrvlerei
       61 qkgihekvqg gygkyqgtwi pldvaielae ryniqgllqp itsyvpsaad spppapkhti
      121 stsnrskkii padpgalgrs rratsietes evigaapnnv segsmspsps dissssrtps
      181 plpadrahpl hanhalagyn grdannhary adiildyfvt enttvpslli npppdfnpdm
      241 sidddehtal hwacamgrir vvklllsaga difrvnsnqq talmratmfs nnydlrkfpe
      301 lfellhrsil nidrndrtvf hhvvdlalsr gkphaaryym etminrlady gdqladilnf
      361 qddegetplt maararskrl vrlllehgad pkirnkegkn aedyiieder frsspsrtgp
      421 agielgadgl pvlptsslht seagqrtagr avtlmsnllh sladsydsei ntaekkltqa
      481 hgllkqiqte iedsakvaea lhheaqgvde erkrvdslql alkhainkra rddlerrwse
      541 gkqaikrarl qaglepgals tsnatnapat gdqkskddak sliealpagt nvktaiaelr
      601 kqlsqvqank telvdkfvar areqgtgrtm aayrrliaag cggiapdevd avvgvlcell
      661 qeshtgarag aggerddrar dvammlkgag aaalaanaga p
"

db2 <- addToDB(db,
            name = "UMAG_1122",
            refseq_id = "XP_011392621",
            uniprot_id = "A0A0D1DP35",
            taxonomy_id = 5270,
            genome_xref = "NC_026499.1",
            genome_from = 150555,
            genome_to = 152739,
            sequence = newSeq,
            species_name = "Ustilago maydis")

As you can see, adding a new protein to our database is now rather compact.

Add Your YFO Sequence

Task:
In the last assignment, you discovered a protein via BLAST that is related to yeast Mbp1 in YFO. (If you haven't noted its ID in your journal, you'll have to redo the BLAST search.)

Add that protein to the database. Ask on the mailing list if you don't know how (but be specific) or if you don't know how to find particular information items.

Check that the protein has been properly added to all columns.


 

Analyze

Let us perform a few simple sequence analyses using the online EMBOSS tools. EMBOSS (the European Molecular Biology laboratory Open Software Suite) combines a large number of simple but fundamental sequence analysis tools. The tools can be installed locally on your own machine, or run via a public Web interface. Google for EMBOSS explorer, public access points include http://emboss.bioinformatics.nl/ .

Access an EMBOSS Explorer services and explore some of the tools:


Task:

Local composition
  1. Find pepinfo under the PROTEIN COMPOSITION heading.
  2. Retrieve the YFO Mbp1 related sequence from your R database, e.g. with something like db$protein[db$protein$name == "UMAG_1122", "sequence"]
  3. Copy and paste the sequence into the input field.
  4. Run with default parameters.
  5. Scroll to the figures all the way at the bottom.
  6. Do the same in a separate window for yeast Mbp1.
  7. Try to compare ... (kind of hard without reference, overlay and expectation, isn't it?)


Task:

Global composition
  1. Find pepstats under the PROTEIN COMPOSITION heading.
  2. Paste the YFO Mbp1 sequence in FASTA format into the input field.
  3. Run with default parameters.
  4. Do the same in a separate window for yeast Mbp1.
  5. Try to compare ... are there significant and unexpected differences?


Task:

Motifs
  1. Find pepcoils, an algorithm to detect coiled coil motifs.
  2. Run this with the YFO Mbp1 sequence and yeast Mbp1.
  3. Try to compare ... do both sequences have coiled-coil motif predictions? Are they annotated in approximately comparable regions of the respective sequence?


Task:

Transmembrane sequences
  1. Find tmap. Also find shuffleseq.
  2. Use your YFO sequence to annotate transmembrane helices for your protein and for a few shuffled sequences. The YFO is not expected to have TM helices, nor are the shuffled sequences expected to have any. If you do find some, these are most likely "false positives".
  1. Also compare the following positive control: Gef1 - a yeast chloride channel with 10 trans-membrane helices and outward localized N-terminus:
>gi|6322500|ref|NP_012574.1| Gef1p [Saccharomyces cerevisiae S288c]
MPTTYVPINQPIGDGEDVIDTNRFTNIPETQNFDQFVTIDKIAEENRPLSVDSDREFLNSKYRHYREVIW
DRAKTFITLSSTAIVIGCIAGFLQVFTETLVNWKTGHCQRNWLLNKSFCCNGVVNEVTSTSNLLLKRQEF
ECEAQGLWIAWKGHVSPFIIFMLLSVLFALISTLLVKYVAPMATGSGISEIKVWVSGFEYNKEFLGFLTL
VIKSVALPLAISSGLSVGKEGPSVHYATCCGYLLTKWLLRDTLTYSSQYEYITAASGAGVAVAFGAPIGG
VLFGLEEIASANRFNSSTLWKSYYVALVAITTLKYIDPFRNGRVILFNVTYDRDWKVQEIPIFIALGIFG
GLYGKYISKWNINFIHFRKMYLSSWPVQEVLFLATLTALISYFNEFLKLDMTESMGILFHECVKNDNTST
FSHRLCQLDENTHAFEFLKIFTSLCFATVIRALLVVVSYGARVPAGIFVPSMAVGATFGRAVSLLVERFI
SGPSVITPGAYAFLGAAATLSGITNLTLTVVVIMFELTGAFMYIIPLMIVVAITRIILSTSGISGGIADQ
MIMVNGFPYLEDEQDEEEEETLEKYTAEQLMSSKLITINETIYLSELESLLYDSASEYSVHGFPITKDED
KFEKEKRCIGYVLKRHLASKIMMQSVNSTKAQTTLVYFNKSNEELGHRENCIGFKDIMNESPISVKKAVP
VTLLFRMFKELGCKTIIVEESGILKGLVTAKDILRFKRIKYREVHGAKFTYNEALDRRCWSVIHFIIKRF
TTNRNGNVI
  1. Evaluate the output: does the algorithm (wrongly) predict TM-helices in your protein? In the shuffled sequences? Does it find all ten TM-helices in Gef1?


Try to familiarize yourself with the offerings in the EMBOSS package. I find some of the nucleic acid tools indispensable in the lab, such as restriction-site mapping tools, and I frequently use the alignment tools Needle and Water, but by and large the utility of many of the components–while fast, efficient and straightforward to use– suffers from lack of reference and comparison and from terse output. The routines show their conceptual origin in the 1970s and 1980s. We will encounter alternatives in later assignments.


 

R Sequence Analysis Tools

It's interesting to see this collection of tools that were carefully designed some twenty years ago, as an open source replacement for a set of software tools - the GCG package - that was indispensable for molecular biology labs in the 80s and 90s, but whose cost had become prohibitive. Fundamentally this is a building block approach, and the field has turned to programming solutions instead.

As for functionality, much more sophisticated functions are available on the Web: do take a few minutes and browse the curated Web services directory of bioinformatics.ca.

As for versatility, R certainly has the edge. Let's explore some of the functions available in the seqinr package that you already encountered in the introductory R tutorial. They are comparatively basic - but it is easy to prepare our own analysis.

if (!require(seqinr, quietly=TRUE)) {
	install.packages("seqinr")
	library(seqinr)
}

help(package = seqinr) # shows the available functions

# Let's try a simple function
?computePI

# This takes as input a vector of upper-case AA codes
# Let's retrieve the YFO sequence from our datamodel
# (assuming it is the last one that was added):

db$protein[nrow(db$protein), "sequence"]

# We can use the function strsplit() to split the string
# into single characters, but we transform the sequence
# into uppercase first.

s <- db$protein[nrow(db$protein), "sequence"]
s <- toupper(s)
s <- strsplit(s, "") # splitting on the empty spring
                     # splits into single characters
s <- unlist(s)

# Alternatively, seqinr provides
# the function s2c() to convert strings into
# character vectors (and c2s to convert them back).

s <- s2c(toupper(db$protein[nrow(db$protein), "sequence"]))
s

computePI(s)  # isoelectric point
pmw(s)        # molecular weight
AAstat(s)     # This also plots the distribution of 
              # values along the sequence

# A true Labor of Love has gone into the
# compilation of the "aaindex" data:

?aaindex
data(aaindex)

length(aaindex)

# Here are all the index descriptions
for (i in 1:length(aaindex)) {
	cat(paste(i, ": ", aaindex[[i]]$D, "\n", sep=""))
}


# Lets use one of the indices to calculate and plot amino-acid
# composition enrichment: 
aaindex[[459]]

<source>


=== Sequence Composition Enrichment ===

We will plot an enrichment plot to compare one of the amino acid indices with the
situation in our sequence.

<source lang="R">

refData <- aaindex[[459]]$I   # reference frequencies in %
names(refData) <- a(names(refData))  # change names to single-letter
                                     # code using seqinr's "a()" function
refData


# tabulate our sequence of interest and normalize
obsData <- table(s)                # count occurrences
obsData = 100 * (obsData / sum(obsData))   # Normalize
obsData

len <- length(refData)

logRatio <- numeric() # create an empty vector

# loop over all elements of the reference, calculate log-ratios
# and store them in the vector
for (i in 1:len) {
	aa <- names(refData)[i] # get the name of that amino acid
	fObs <- obsData[aa]  # retrieve the frequency for that name
	fRef <- refData[aa]
	logRatio[aa] <- log(fObs / fRef) / log(2)
}

barplot(logRatio)



# Sort by frequency, descending
logRatio <- sort(logRatio, decreasing = TRUE)  
 
barplot(logRatio)



# label the y-axis
# (see text() for details)
label <- expression(paste(log[2],"( f(obs) / f(ref) )", sep = ""))

barplot(logRatio, 
        main = paste("AA composition enrichment"),
        ylab = label, 
        cex.names=0.9)



# color the bars by type 
# define colors
chargePlus  <- "#282A3F"
chargeMinus <- "#531523"
hydrophilic <- "#47707F"
hydrophobic <- "#6F7A79"
plain       <- "#EDF7F7"
  
# Assign the colors to the different amino acid names
barColors <- character(len)

for (i in 1:length(refData)) {
	AA <- names(logRatio[i])
         if (grepl("[HKR]",      AA)) {barColors[i] <- chargePlus }
    else if (grepl("[DE]",       AA)) {barColors[i] <- chargeMinus}
    else if (grepl("[NQST]",     AA)) {barColors[i] <- hydrophilic}
    else if (grepl("[FAMILYVW]", AA)) {barColors[i] <- hydrophobic}
    else                               barColors[i] <- plain
}

barplot(logRatio, 
        main = paste("AA composition enrichment"),
        ylab = label, 
        col = barColors, 
        cex.names=0.9)


# draw a line at 0
abline(h=0)
 
# add a legend that indicates what the colours mean
legend (x = 1,
        y = -1,
        legend = c("charged (+)",
                   "charged (-)",
                   "hydrophilic",
                   "hydrophobic",
                   "plain"),
        bty = "n",
        fill = c(chargePlus,
                 chargeMinus,
                 hydrophilic,
                 hydrophobic, 
                 plain)
        )

Task:
Interpret this plot. (Can you?)


 


That is all.


 

Links and resources

  • RegEx Pal - a tool for testing and developing regular expressions.


 


Footnotes and references

  1. Relational databases like MySQL, PostgresQL, and MariaDB offer the datatype "Enum" for this purpose but this is an inferior solution. Enum types need to be fixed at the time the schema is created, they can't store information about their semantics, i.e. how the keywords are defined and when they should be used, and they can't be used in more than one table, since they are metadata of one particular column.


 

Ask, if things don't work for you!

If anything about the assignment is not clear to you, please ask on the mailing list. You can be certain that others will have had similar problems. Success comes from joining the conversation.



< Assignment 2 Assignment 4 >