Difference between revisions of "User:Boris/Temp/APB"

From "A B C"
 
(125 intermediate revisions by the same user not shown)
Line 1: Line 1:
{{Template:Active}}
+
<div id="APB">
<!-- {{Template:Inactive}} -->
 
  
 +
<table width="40%"><tr><td class="l1">&nbsp;</td><td>
  
&nbsp;
+
===Hardware===
&nbsp;
+
<table width="100%">
 +
<tr class="s1"><td class="l1">High performance computing <!-- (... at the bench: GPUs, FPGAs, Clusters) --></td></tr>
 +
<tr class="s2"><td class="l1">Cloud computing</td></tr>
 +
<tr><td class="sp">&nbsp;</td></tr>
 +
</table>
  
 +
===Systems and Tools===
 +
<table width="100%">
  
__TOC__
+
<tr class="s1"><td class="l1 mw-collapsible mw-collapsed" data-expandtext="Expand subtopics" data-collapsetext="Collapse">[[Unix]]
&nbsp;
+
<div class="mw-collapsible-content">
&nbsp;
+
<table width="100%"><tr class="s2"><td class="l2">[[Unix system administration]]</td></tr></table>
 
+
<table width="100%"><tr class="s1"><td class="l2">[[Unix automation]]</td></tr></table>
<div style="padding: 5px; background: #A6AFD0;  border:solid 1px #AAAAAA; font-size:200%;font-weight:bold;">
+
<table width="100%"><tr class="s2"><td class="l2">[[Program installation]]</td></tr></table>
Assignment 2 - Search, retrieve and annotate
+
<table width="100%"><tr class="s1"><td class="l2">[[wget]]</td></tr></table>
</div>
 
 
 
&nbsp;
 
&nbsp;
 
 
 
 
 
{{Template:Preparation|
 
care=Be sure you have understood all parts of the assignment and cover all questions in your answers! Sadly, we always get assignments back in which important aspects have simply been overlooked and marks are unnecessarily lost. Sadly, we always get assignments back in which important aspects have simply been overlooked and marks are unnecessarily lost. If you did not notice that the above sentence was repeated, you are not reading carefully enough.|
 
num=2|
 
ord=second|
 
due = Thursday, October 24 at 12:00 noon (before the quiz)}}
 
 
 
 
 
;Your documentation for the procedures you follow in this assignment will be worth 1 mark.
 
 
 
 
 
&nbsp;
 
&nbsp;
 
 
 
 
 
 
 
<div style="padding: 2px; background: #F0F1F7;  border:solid 1px #AAAAAA; font-size:125%;color:#444444">
 
Introduction
 
</div>
 
Baker's yeast, ''Saccharomyces cerevisiae'', is perhaps the most important [http://en.wikipedia.org/wiki/Model_organism model organism]. It is a eukaryote that has been studied genetically and biochemically in great detail for many decades, and it is easily manipulated with high-throughput experimental methods. We will use information from this model organism to study the conservation of function and sequence in other fungi whose genomes have been completely sequenced. The assignments are an exercise in model-organism reasoning: the transfer of knowledge from one well-studied organism to others.
 
 
 
This and the following assignments will revolve around a transcription factor that plays an important role in the regulation of the cell cycle: '''Mbp1''' is a key component of the MBF complex (Mbp1/Swi6). This complex regulates gene expression at the crucial G1/S-phase transition of the mitotic cell cycle and has been shown to bind to the regulatory regions of more than a hundred target genes.
 
 
 
One would speculate that such central control machinery would be conserved in other fungi, and it will be your task in these assignments to collect evidence on whether related molecular machinery is present in some of the newly sequenced fungal genomes. Throughout the assignments we will use freely available tools to conduct bioinformatics investigations of questions such as:
 
*What functional features can we detect in Mbp1?
 
*Do homologous proteins exist in other organisms?
 
*Do we believe these homologues may bind to similar sequence motifs?
 
*Do we believe they may function in a similar way?
 
*Do other organisms appear to have related cell-cycle control systems?
 
 
 
 
 
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
 
*Access the [http://db.yeastgenome.org/cgi-bin/locus.pl?locus=mbp1 information page on Mbp1] at the ''Saccharomyces'' Genome Database and read the summary paragraph on the protein's function!
 
</div>
 
 
 
(If you would like to brush up on the concepts mentioned above, you could study the corresponding chapter in [http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=mcb.chapter.3432 Lodish's Molecular Cell Biology] and/or read Nobel laureate [http://www.cumc.columbia.edu/dept/eukaryotic/nurse.pdf Paul Nurse's review (pdf)] of the key concepts of the eukaryotic cell cycle. It is not strictly necessary to understand the details of the yeast cell cycle to complete the assignments, but it is recommended, since it's obviously more fun to work with concepts that actually make some sense.)
 
 
 
In this particular assignment you will go on a search and retrieve mission for information on yeast Mbp1, using common public databases and Web resources.
 
 
 
 
 
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
 
==Retrieve==
 
</div>
 
 
 
 
 
Much useful information on yeast Mbp1 is compiled at the [http://db.yeastgenome.org/cgi-bin/locus.pl?locus=mbp1 SGD information page on Mbp1]. However, we don't always have the luxury of such precompiled information. Let's look at the protein and its features "the traditional way".
 
 
 
 
 
<div style="padding: 5px; background: #EEEEEE;  border:solid 1px #AAAAAA;">
 
*Navigate to the NCBI homepage (you probably have bookmarked it anyway) and enter <code>Mbp1 AND "saccharomyces cerevisiae"[organism]</code> as an Entrez query.
 
*Click on '''Protein''' and find the RefSeq record for the protein sequence.
 
*From the NCBI RefSeq record, obtain a FASTA sequence of the protein and paste it into your assignment.
 
</div>
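The same query and retrieval can also be scripted. The following is a minimal sketch, assuming the NCBI E-utilities endpoints <code>esearch.fcgi</code> and <code>efetch.fcgi</code>; the helper names (<code>esearch_url</code>, <code>efetch_fasta_url</code>, <code>parse_fasta</code>, <code>fetch_fasta</code>) are made up for illustration, and the accession you pass to <code>fetch_fasta</code> is whatever RefSeq identifier you found in the steps above.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL of the NCBI E-utilities (assumed to be reachable from your network).
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, db="protein"):
    """Build an ESearch URL for an Entrez query string."""
    return f"{EUTILS}/esearch.fcgi?" + urlencode({"db": db, "term": term})


def efetch_fasta_url(accession, db="protein"):
    """Build an EFetch URL that returns one record as FASTA text."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return f"{EUTILS}/efetch.fcgi?" + urlencode(params)


def parse_fasta(text):
    """Split a single-record FASTA string into (header, sequence)."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("not a FASTA record")
    return lines[0][1:], "".join(lines[1:])


def fetch_fasta(accession):
    """Download one protein record; this call needs network access."""
    with urlopen(efetch_fasta_url(accession)) as response:
        return parse_fasta(response.read().decode("utf-8"))


if __name__ == "__main__":
    # The Entrez query from the task list, turned into a search URL.
    print(esearch_url('Mbp1 AND "saccharomyces cerevisiae"[organism]'))
```

Opening the printed URL in a browser returns an XML list of matching record IDs. You still need to pick the RefSeq record yourself, and you should document this in your assignment if you use a script instead of the Web interface.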
 
 
 
 
 
There are several sources for functional domain annotations of proteins. The NCBI has the [http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml Conserved Domain Database]; in Europe, the [http://smart.embl-heidelberg.de/ SMART database] provides such annotations. In terms of domains, the two resources are very comparable, but SMART also analyses more general features such as low-complexity sequences and coiled coils. In order to use SMART, however, we need the '''UniProt accession number''' that corresponds to the RefSeq identifier. In a rational world, one would wish that such important cross-references would simply be provided by the NCBI ... well, we have been wishing this for many years now. Fortunately, ID-mapping services exist.
 
 
 
 
 
<div style="padding: 5px; background: #EEEEEE;">
 
*Navigate to the [http://www.uniprot.org/?tab=mapping UniProt ID-Mapping service]. Enter the RefSeq identifier for the yeast Mbp1 protein and retrieve the corresponding UniProtKB accession number. If this does not work, try the same mapping at the [http://pir.georgetown.edu/pirwww/search/idmapping.shtml PIR ID-mapping service]. Note the UniProt accession number you find. (Should this work equally on both sites?)
 
 
</div>
 
</div>
 +
</td></tr>
  
 +
<tr class="s2"><td class="l1">[[Network Configuration]]</td></tr>
 +
<tr class="s1"><td class="l1">[[Apache]]</td></tr>
 +
<tr class="s2"><td class="l1">[[MySQL]]</td></tr>
 +
<tr class="s1"><td class="l1">[[Tools for the bioinformatics lab]]</td></tr>
 +
<tr class="s2"><td class="l1">[[GBrowse|GBrowse and LDAS]]</td></tr>
 +
<tr><td class="sp">&nbsp;</td></tr>
 +
</table>
  
Now navigate to [http://www.uniprot.org '''UniProt'''], enter the ID you have found into the search field and select [Sequence Clusters(UniRef)] as the database to search in. There should be two sequences in the '''[UniRef100 ... (100% identical)]''' cluster. Compare them. One of them is a highly annotated Swiss-Prot record; the other is practically unannotated data that has been imported into UniProt from a "third party". Unfortunately, that one is the sequence that the ID mapping service had found. No cross-references to the NCBI are included with Swiss-Prot records, nor do NCBI RefSeq records cross-reference UniProt holdings. I consider this a sorry state of affairs. As a result, most of us actually run BLAST searches to find equivalent sequences in other databases, which is the most wasteful way imaginable to address the problem.
+
===Programming===
 +
<table width="100%" >
 +
<tr class="s1"><td class="l1">[[IDE|IDE (Integrated Development Environment)]]</td></tr>
 +
<tr class="s2"><td class="l1">[[Regular Expressions]]</td></tr>
 +
<tr class="s1"><td class="l1">[[Screenscraping]]</td></tr>
  
 
+
<tr class="s2"><td class="l1 mw-collapsible mw-collapsed" data-expandtext="Expand subtopics" data-collapsetext="Collapse">[[Perl]]
<div style="padding: 5px; background: #EEEEEE;">
+
<div class="mw-collapsible-content">
*Note down the Swiss-Prot ID and the UniProtKB accession number for yeast Mbp1.
+
<table width="100%"><tr class="s1"><td class="l2">[[Perl basic programming]]</td></tr></table>
 +
<table width="100%"><tr class="s2"><td class="l2">[[Perl hash example]]</td></tr></table>
 +
<table width="100%"><tr class="s1"><td class="l2">[[Perl LWP example]]</td></tr></table>
 +
<table width="100%"><tr class="s2"><td class="l2">[[Perl MySQL introduction]]</td></tr></table>
 +
<table width="100%"><tr class="s1"><td class="l2">[[Perl OBO parser]]</td></tr></table>
 +
<table width="100%"><tr class="s2"><td class="l2">[[Perl basic programming]]</td></tr></table>
 +
<table width="100%"><tr class="s1"><td class="l2">[[Perl programming exercises 1]]</td></tr></table>
 +
<table width="100%"><tr class="s2"><td class="l2">[[Perl programming exercises 2]]</td></tr></table>
 +
<table width="100%"><tr class="s1"><td class="l2">[[Perl programming Data Structures]]</td></tr></table>
 +
<table width="100%"><tr class="s2"><td class="l2">[[Perl references]]</td></tr></table>
 +
<table width="100%"><tr class="s1"><td class="l2">[[Perl simulation]]</td></tr></table>
 +
<table width="100%"><tr class="s2"><td class="l2">[[Perl: Object oriented programming]]</td></tr></table>
 +
<table width="100%"><tr class="s1"><td class="l2">[[Perl: Ugly programming]]</td></tr></table>
 
</div>
 
</div>
 +
</td></tr>
  
 +
<tr class="s1"><td class="l1">[[BioPerl]]</td></tr>
 +
<tr class="s2"><td class="l1">[[PHP]]</td></tr>
 +
<tr class="s1"><td class="l1">[[Data modelling]]</td></tr>
 +
<tr class="s2"><td class="l1">BioPython <!-- (scope, highlights, installation, use, support) --></td></tr>
 +
<tr class="s1"><td class="l1">Graphical output <!-- (PNG and SVG) --></td></tr>
 +
<tr class="s2"><td class="l1">[[Autonomous agents]]</td></tr>
 +
</table>
  
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
+
===Algorithms===
 
+
<table width="100%" >
==Analyse==
+
<tr class="sh"><td class="l1">Algorithms on Sequences</td></tr>
</div>
+
<tr class="s1"><td class="l2">[[Dynamic Programming]]</td></tr>
&nbsp;
+
<tr class="s2"><td class="l2">[[Multiple Sequence Alignment]]</td></tr>
&nbsp;
+
<tr class="s1"><td class="l2">[[Genome Assembly]]</td></tr>
  
 +
<tr><td class="sp">&nbsp;</td></tr>
  
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
+
<tr class="sh"><td class="l1">Algorithms on Structures</td></tr>
 +
<tr class="s1"><td class="l2">[[Docking]]</td></tr>
 +
<tr class="s2"><td class="l2">Protein Structure Prediction <!-- ''ab initio'' --></td></tr>
  
=== ''Saccharomyces cerevisiae'' Mbp1 - domain annotations ===
+
<tr><td class="sp">&nbsp;</td></tr>
</div>
 
  
Now we can analyse Mbp1's domains in SMART, and use this information to annotate the sequence in detail.
+
<tr class="sh"><td class="l1">Algorithms on Trees</td></tr>
 +
<tr class="s1"><td class="l2">Computing with trees <!-- Bayesian approaches for phylogenetic trees, tree comparison) --></td></tr>
  
<div style="padding: 5px; background: #EEEEEE;">
+
<tr><td class="sp">&nbsp;</td></tr>
*Navigate to the [http://smart.embl-heidelberg.de/ SMART database], enter the yeast Mbp1 accession number and review the domain features of the protein.
 
*In your assignment, highlight the annotated features in the actual sequence by using the SMART annotations.
 
</div>
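One simple way to highlight features in plain text is to print annotated ranges in upper case and everything else in lower case. The sketch below assumes you have copied the feature boundaries from SMART by hand as 1-based, inclusive coordinates; the function name <code>annotate</code> and the toy sequence in the usage note are made up for illustration and are not Mbp1 data.

```python
def annotate(sequence, features, width=60):
    """Return the sequence in lower case with annotated ranges in upper case.

    features: iterable of (start, end, label) tuples with 1-based,
    inclusive residue coordinates. The label is only for your own notes.
    """
    marked = list(sequence.lower())
    for start, end, _label in features:
        if not 1 <= start <= end <= len(sequence):
            raise ValueError(f"feature {start}-{end} is outside the sequence")
        for i in range(start - 1, end):
            marked[i] = marked[i].upper()
    text = "".join(marked)
    # Break the result into lines, each prefixed with its first residue number.
    lines = []
    for offset in range(0, len(text), width):
        lines.append(f"{offset + 1:>5} {text[offset:offset + width]}")
    return "\n".join(lines)
```

For example, <code>annotate("ACDEFGHIKL", [(3, 5, "toy domain")], width=10)</code> marks residues 3 to 5 of a ten-residue toy sequence. Colour or bold type in your assignment document works just as well; the point is that the boundaries come from the SMART annotation, not from guesswork.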
 
  
&nbsp;
+
<tr class="sh"><td class="l1">Algorithms on Networks</td></tr>
 +
<tr class="s1"><td class="l2">Network metrics <!-- (Degree distributions, Centrality metrics, other metrics on topology, small-world- vs. random-geometric controversy) --></td></tr>
 +
<tr class="s2"><td class="l3">[[Dijkstras Algorithm]]</td></tr>
 +
<tr class="s1"><td class="l3">[[Floyd Warshall Algorithm]]</td></tr>
 +
</table>
  
&nbsp;
 
  
 +
===Communication and collaboration===
 +
<table width="100%" >
 +
<tr class="s1"><td class="l1">[[MediaWiki]]</td></tr>
 +
<tr class="s2"><td class="l1">[[HTML essentials]]</td></tr>
 +
<tr class="s1"><td class="l1">[[HTML 5]]</td></tr>
 +
<tr class="s2"><td class="l1">[[SADI|SADI Semantic Automated Discovery and Integration]]</td></tr>
 +
<tr class="s1"><td class="l1">[[CGI]]</td></tr>
 +
<tr><td class="sp">&nbsp;</td></tr>
 +
</table>
  
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
+
===Statistics===
 +
<table width="100%" >
 +
<tr class="s1"><td class="l1">[[Pattern discovery]]</td></tr>
 +
<tr class="s2"><td class="l1">Correlation <!-- (Covariance matrices and their interpretation, application to large problems, collaborative filtering, MIC and MINE) --></td></tr>
 +
<tr class="s1"><td class="l1">Clustering methods <!-- (Algorithms and choice (including: hierarchical, model-based and partition clustering, graphical methods (MCL), flow based methods (RRW) and spectral methods). Implementation in R if possible) --></td></tr>
 +
<tr class="s2"><td class="l1">Cluster metrics <!-- (Cluster quality metrics (Akaike, BIC)–when and how) --></td></tr>
 +
<tr class="s1"><td class="l1">[[Map equation|The Map Equation]] </td></tr>
 +
<tr class="s2"><td class="l1">Machine learning <!-- (Classification problems: Neural Networks, HMMs, SVM..) --></td></tr>
  
=== APSES domains ===
+
<tr class="s1"><td class="l1 mw-collapsible mw-collapsed" data-expandtext="Expand subtopics" data-collapsetext="Collapse">[[R]]
 +
<div class="mw-collapsible-content">
 +
<table width="100%"><tr class="s2"><td class="l2">R plotting</td></tr></table>
 +
<table width="100%"><tr class="s1"><td class="l2">[[R programming]]</td></tr></table>
 +
<table width="100%"><tr class="s2"><td class="l2">R EDA</td></tr></table>
 +
<table width="100%"><tr class="s1"><td class="l2">R regression</td></tr></table>
 +
<table width="100%"><tr class="s2"><td class="l2">R PCA</td></tr></table>
 +
<table width="100%"><tr class="s1"><td class="l2">R Clustering</td></tr></table>
 +
<table width="100%"><tr class="s2"><td class="l2">R Classification <!-- Phrasing inquiry as a classification problem, dealing with noisy data, machine learning approaches to classification, implementation in R) --></td></tr></table>
 +
<table width="100%"><tr class="s1"><td class="l2">R hypothesis testing</td></tr></table>
 +
<table width="100%"><tr class="s2"><td class="l2">[[Bioconductor]]</td></tr></table>
 
</div>
 
</div>
 +
</td></tr>
  
As you see from the annotations, Mbp1 is a large multidomain protein. It binds DNA through a small domain called the APSES domain, and many organisms have more than one transcription factor with a domain homologous to other APSES domains. Since we are interested in related proteins, and all functional relatives would be expected to share such a DNA binding domain, we should define this domain in more detail in order to be able to use it later to search for homologous proteins in each target organism.
+
<tr><td class="sp">&nbsp;</td></tr>
 +
</table>
  
&nbsp;<br>
+
===Applications===
Use the NCBI Entrez system to search for the string "apses" in the "Conserved Domains" database and access the entry for the APSES domain. You should find a number of aligned sequences on that page, each with their own GI identifier.
+
<table width="100%" >
 +
<tr class="s1"><td class="l1">[[Data integration]] <!-- Add BioMart: Biodata integration, and data-mining of complex, related, descriptive data --></td></tr>
 +
<tr class="s2"><td class="l1">Text mining <!-- (Use cases, tasks and metrics, taggers, vocabulary mapping, Practicals: R-support, Python/Perl support, others...) --></td></tr>
 +
<tr class="s1"><td class="l1">[[HMMER]]</td></tr>
 +
<tr class="s2"><td class="l1">High-throughput sequencing</td></tr>
 +
<tr class="s1"><td class="l1">Functional annotation <!-- GFF --></td></tr>
 +
<tr class="s2"><td class="l1">Microarray analysis <!-- (... in R: differential expression and multiple testing; Loading and normalizing data, calculating differential expression, LOWESS, the question of significance, FWERs: Bonferroni and FDR; SAM and LIMMA) --></td></tr>
 +
<tr><td class="sp">&nbsp;</td></tr>
 +
</table>
 +
</td></tr></table>
  
<div style="padding: 5px; background: #EEEEEE;">
 
*Identify the two sequences that come from ''Saccharomyces cerevisiae'' (the Mbp1 and Swi4 APSES domains).
 
*Check whether the NCBI and SMART definitions of the APSES domain in Mbp1 coincide.
 
*Make sure you understand how the sequences displayed on the CDD page and the actual domain sequences differ. <small>Hint: not all sequences are displayed at full length.</small>
 
 
</div>
 
</div>
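Whether two domain definitions "coincide" can be made concrete by comparing their residue ranges. The following is a small sketch, assuming each definition is given as a 1-based, inclusive (start, end) pair that you have read off the CDD and SMART pages; the function names are made up for illustration.

```python
def overlap(a, b):
    """Return the shared range of two 1-based inclusive intervals, or None."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start <= end else None


def jaccard(a, b):
    """Fraction of residues shared by both intervals out of all residues covered."""
    shared = overlap(a, b)
    if shared is None:
        return 0.0
    intersection = shared[1] - shared[0] + 1
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - intersection
    return intersection / union
```

A value of 1.0 means the two definitions are identical; lower values mean that one resource draws the domain boundaries more generously than the other. The number alone does not tell you which definition is more appropriate.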
 
&nbsp;
 
 
&nbsp;
 
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
 
 
=== APSES domain structure ===
 
</div>
 
 
We can expect that the structures of all homologous APSES domains should be similar, i.e. if the structure of one is known, we should be able to conclude the approximate three-dimensional structure of any APSES domain. Indeed, structural information ''is'' available for APSES domains!
 
 
Identify and download the most appropriate coordinate file to study the structure, function and conservation of APSES domains from the PDB. Your choice could be based on:
 
* experimental method (X-ray or NMR)
 
* quality of the structure (resolution, refinement)
 
* size of the structure (number of amino acids for which structure has been determined)
 
 
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
 
*Record how you have identified the file, what criteria you have used to define whether it is better suited for analysis than others, and paste the <tt>HEADER</tt>,  <tt>TITLE</tt>,  <tt>COMPND</tt> and  <tt>SOURCE</tt> records from the file into your assignment.
 
</div>
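The requested records can be copied from the file in a text editor, or pulled out with a few lines of code. The sketch below relies on the PDB file format convention that the record name occupies the first six columns of each line; the function name <code>header_records</code> is made up for illustration, and the file name in the usage note is a placeholder for whichever coordinate file you chose.

```python
WANTED = ("HEADER", "TITLE", "COMPND", "SOURCE")


def header_records(pdb_text, wanted=WANTED):
    """Return the lines whose record name (columns 1-6) is in `wanted`."""
    return [
        line.rstrip()
        for line in pdb_text.splitlines()
        if line[:6].strip() in wanted
    ]
```

Usage, assuming the coordinate file has been saved locally as <code>structure.pdb</code> (a placeholder name): read it with <code>open("structure.pdb").read()</code>, pass the text to <code>header_records</code>, and print the returned lines one per line.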
 
 
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
 
 
=== DNA binding site ===
 
</div>
 
 
The Mbp1 APSES domain has been shown to bind to DNA and the residues involved in DNA binding have been characterized ([http://www.ncbi.nlm.nih.gov/pubmed/10747782 Taylor ''et al.'' (2000) ''Biochemistry'' '''39''':3943-3954] and [http://www.ncbi.nlm.nih.gov/pubmed/18491920 Deleeuw ''et al.'' (2008) ''Biochemistry'' '''47''':6378-6385]). In particular, residues 50-74 have been proposed to comprise the DNA recognition domain.
 
 
&nbsp;<br><div style="padding: 5px; background: #FFCC99;">
 
;Analysis (1 mark)
 
 
* Using VMD, generate a parallel stereo view of the protein structure that clearly shows the proposed Mbp1 DNA recognition domain, coloured distinctly from the rest of the protein. Use a representation that includes the sidechains.
 
 
* Generate a second VMD stereo image as above, but use a representation that emphasizes the secondary structure of the protein (tube or cartoon representation, colouring by structure).
 
 
* Generate a third VMD stereo image that shows three representations combined: (1) the backbone, (2) the sidechains of residues that presumably contact DNA, distinctly coloured, and (3) a transparent surface of the entire protein. This image should show whether residues annotated as DNA binding form a contiguous binding interface.
 
 
Paste the images into your assignment in a compressed format. Briefly(!) summarize the VMD forms and parameters you have used.
 
</div>
 
 
 
DNA binding interfaces are expected to comprise a number of positively charged amino acids that might form salt bridges with the phosphate backbone.
 
 
&nbsp;<br><div style="padding: 5px; background: #FFCC99;">
 
;Analysis (2 marks)
 
 
*Report whether this is the case here and which residues might be included.
 
 
*Do the DNA binding residues form a contiguous surface that is compatible with a binding interface? Justify your conclusions.
 
 
</div>
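A first pass at the question of positive charges can be made on the sequence alone, before looking at the structure. The sketch below lists lysine and arginine residues within a residue range; the function name <code>positive_residues</code> and the <code>first_residue</code> parameter are made up for illustration, and histidine is deliberately left out on the assumption that it is only partly protonated near neutral pH.

```python
# Lysine and arginine carry a positive charge at physiological pH.
POSITIVE = set("KR")


def positive_residues(sequence, start, end, first_residue=1):
    """List (residue_number, one_letter_code) for K/R within start..end.

    start and end are inclusive residue numbers. first_residue is the
    residue number of sequence[0], which matters if your sequence is a
    fragment that does not begin at residue 1.
    """
    hits = []
    for number in range(start, end + 1):
        index = number - first_residue
        if 0 <= index < len(sequence) and sequence[index] in POSITIVE:
            hits.append((number, sequence[index]))
    return hits
```

With the Mbp1 sequence you retrieved earlier, <code>positive_residues(sequence, 50, 74)</code> would list the candidates in the proposed DNA recognition region. A sequence-based count cannot show whether these sidechains point toward the DNA or form a contiguous surface; that part of the argument has to come from the three-dimensional structure.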
 
 
&nbsp;
 
 
&nbsp;
 
 
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
 
== Onward: the Genome of Interest ==
 
</div>
 
 
Up to now, we have looked at the model-organism gene to obtain a baseline of information we are interested in. To move on, we need to access the genome of an organism we are interested in. In this course, the organism of interest is assigned to you.
 
 
The systematic name and strain of a fungus is listed with the [[Group project|project group]] that you have been assigned to. Navigate to the NCBI homepage &rarr; "Genomic Biology" &rarr; "Fungal Genomes Central" &rarr; "Genome Sequencing Projects". This should take you to a tabular view of ongoing and completed fungal genome sequencing projects. Find your organism name in this table. There may be one or more sequencing projects associated with the organism, but there should be only one project for the specific strain.
 
 
Click on the organism name to navigate to the Genome Project information page.
 
 
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
 
*Review the status of the data you are working with. For example:

**Is the entire genome available, or only a partial sequence?
 
**How many chromosomes does this genome have?
 
**What is the status of its genome assembly and annotation?
 
**Has the mitochondrial genome been sequenced as well?
 
**Why is this organism deemed important enough to be sequenced?
 
</div>
 
 
&nbsp;
 
 
 
<div style="padding: 5px; background: #D3D8E8;  border:solid 1px #AAAAAA;">
 
[End of assignment]
 
</div>
 
 
If you have any questions at all, don't hesitate to mail me at [mailto:boris.steipe@utoronto.ca boris.steipe@utoronto.ca] or post your question to the [mailto:bch441_2011@googlegroups.com Course Mailing List].
 

Latest revision as of 12:44, 27 September 2015

 
