Difference between revisions of "User:Boris/Temp/APB"

From "A B C"
Line 1: Line 1:
<!-- div style="padding: 5px; background: #FF4560;  border:solid 2px #000000;">
+
<!-- {{Template:Active}} -->
'''Note!'''
+
{{Template:Inactive}}
This assignment is currently inactive. Major and minor unannounced changes may be made at any time.
 
</div>
 
&nbsp;
 
 
 
&nbsp; -->
 
  
  
Line 13: Line 8:
  
 
<div style="padding: 5px; background: #A6AFD0;  border:solid 1px #AAAAAA; font-size:200%;font-weight:bold;">
 
<div style="padding: 5px; background: #A6AFD0;  border:solid 1px #AAAAAA; font-size:200%;font-weight:bold;">
Assignment 3 - Multiple Sequence Alignment
+
Assignment 4 - Homology modeling
 
</div>
 
</div>
  
<!--Please note: This assignment is currently inactive. Unannounced changes may be made at any time.
+
<div style="padding: 15px; background: #F0F1F7;  border:solid 1px #AAAAAA; font-size:125%;color:#444444">
&nbsp;-->
+
;How could the search for ultimate truth have revealed so hideous and visceral-looking an object?
'''Please note: This assignment is currently active. All significant changes will be announced on the course mailing list.'''
+
::''<small>Max Perutz (on his first glimpse of the Hemoglobin structure)</small>''
 +
</div>
 +
&nbsp;
 
&nbsp;
 
&nbsp;
  
 +
Where is the hidden beauty in structure, and where the "ultimate truth"? In the previous assignments we have studied sequence conservation in APSES family domains and looked at how these domains have evolved over time. We have seen that this is an ancient family that already had several members in the cenancestor of all fungi, an organism that lived in the [http://www.ucmp.berkeley.edu/fungi/fungifr.html Vendian period] of the Proterozoic era of Precambrian times, more than 600,000,000 years ago.
  
<div style="padding: 2px; background: #F0F1F7; border:solid 1px #AAAAAA; font-size:125%;color:#444444">
+
In order to understand how particular residues in the sequence contribute to the putative function of the protein, and why and how they are conserved throughout evolution, we would need to consider an explicit molecular model of an APSES domain protein, bound to its cognate DNA sequence. In particular, it would be interesting to correlate the conservation patterns we have observed in the MSAs with specific DNA binding interactions. Unfortunately, the 1MB1 structure does not have DNA bound and the evidence we have considered in Assignment 2 ([http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=10747782 Taylor ''et al.'', 2000]) is not sufficient to define the details of how a DNA double helix might be bound. These details would require the structure of a complex that contains protein as well as DNA. No such complex of an APSES domain has yet been crystallized.
Introduction
+
 
 +
''In this assignment you will construct a molecular model of the Mbp1 orthologue in your assigned organism, identify similar structures of distantly related domains for which protein-DNA complexes are known, define whether the available evidence allows you to distinguish between different modes of ligand binding, and assemble a hypothetical complex structure.''
 +
 
 +
For what follows, please remember this terminology:
 +
 
 +
;Target
 +
:The protein that you are planning to model.
 +
;Template
 +
:The protein whose structure you are using as a guide to build the model.
 +
;Model
 +
:The structure that results from the modeling process. It has the '''Target sequence''' and is similar to the '''Template structure'''.
 
&nbsp;
 
&nbsp;
  
;Take care of things, and they will take care of you.
+
A brief overview article on the construction and use of homology models is linked to the resource section at the bottom of this page. That section also contains links to other sites and resources you might require.
:''Shunryu Suzuki''
 
</div>
 
  
A carefully done multiple sequence alignment (MSA) is a cornerstone for the annotation of a gene or protein. MSAs combine the information from several related proteins, allowing us to study their essential, shared and conserved properties. They are useful to resolve ambiguities in the precise placement of gaps and to ensure that columns in alignments actually contain amino acids that evolve in a similar context. Therefore we need MSAs as input for
+
{{Template:Preparation|
* protein homology modeling,
+
care=Be sure you have understood all parts of the assignment and cover all questions in your answers! Sadly, we see too many assignments which, arduously effected, nevertheless intimate nescience of elementary tenets of molecular biology. If the sentence above did not trigger an urge to open a dictionary, you are trying to guess, rather than confirm possibly important information.|
* phylogenetic analyses, and
+
num=4|
* sensitive homology searches in databases.
+
ord=fourth|
 +
due = Monday, November 5 at 10:00 in the morning}}
  
In addition, conservation - or the lack of conservation - is a consequence of selection under the constraints imposed by the structural or functional features of a protein. Conservation patterns emphasize domain boundaries in multi-domain proteins, and amino acid propensities are powerful predictors for protein engineering and design.
 
  
Given the ubiquitous importance of multiple sequence alignment, it is remarkable that by far the most frequently used algorithm is CLUSTAL, a procedure that was first published for the microprocessors of the late 1980s, surpassed in performance many times, and shown to be significantly inferior to more modern approaches when aligning sequences with about 30% identity or less.
+
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
 +
==(1) Preparation==
 +
</div>
  
In this assignment we will explore MSAs of fungal proteins that are orthologous to yeast Mbp1, and of the APSES domains they contain, and compare several approaches to alignment:
 
  
* A model-based approach (based on the [[Glossary#PSSM| PSSM]] that PSI-BLAST generates)
+
<!--
* A progressive alignment - the CLUSTAL algorithm
 
* A consistency based alignment - T-Coffee, MUSCLE or Probcons
 
  
 +
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
 +
===Choosing a template (1 mark)===
 +
</div>
 +
&nbsp;<br>
 +
Often more than one related structure can be found in the PDB. We have discussed principles of selecting template structures in the lecture. Interestingly, the PDB itself cannot be searched for the contents of its holdings by structural or sequence similarity, but there is always BLAST, since the NCBI conveniently allows you to search against all sequences in PDB files.
  
<div style="padding: 2px; background: #F0F1F7;  border:solid 1px #AAAAAA; font-size:125%;color:#444444">
+
*Use BLAST to identify all PDB files that contain APSES domains that are clearly homologous to your target. (Document that you have searched in the correct subsection of the GenBank holdings). For the hits you find, consider how these structures differ and which features would make each more or less suitable for your task. Comment briefly on what options you have, select one template and note why you have decided to use this particular structure as a template. Include aspects of sequence similarity, length of the sequence, presence or absence of ligands and their potential effect on the structure, and experimental method and quality in your reasoning.
Preparation, submission and due date
 
</div>
 
  
Please read carefully. Be sure you have understood all parts of the assignment and cover all questions in your answers! Sadly, we always get assignments back in which people have simply overlooked crucial questions. Sadly, we always get assignments back in which people have not described procedural details. If you did not notice that the above were two different sentences, you are still not reading carefully enough.
+
*Note which sequence is contained in the coordinate section of the PDB file; note if and how this implied sequence differs from the sequences ...
  
Prepare a Microsoft Word document with a title page that contains:
+
:*listed in the seqres records;
*your full name
+
:*given in the FASTA sequence for the template that the PDB provides;
*your Student ID
+
:*and that stored by the NCBI.
*your e-mail address
 
*the organism name you have been [[Organism_list_2007|assigned]]
 
  
Follow the steps outlined below. You are encouraged to  write your answers in short answer form or point form, '''like you would document an analysis in a laboratory notebook'''. However, you must
+
* In a table, establish the correspondence of the coordinate sequence numbering (defined by the residue numbers/insertion codes in the atom records) with your target sequence numbering.
*document what you have done,
 
*note what Web sites and tools you have used,
 
*paste important data sequences, alignments, information etc.
 
  
'''If you do not document the process of your work, we will deduct marks.''' Try to be concise, not wordy! Use your judgement: are you giving us enough information so we could exactly reproduce what you have done? If not, we will deduct marks. Avoid RTF and unnecessary formatting. Do not paste screendumps or other uncompressed images. The size of your submission must remain '''below 1.5 MB'''.
+
* Retrieve the most suitable template structure coordinate file from the PDB.
  
Write your answers into separate paragraphs and give each its title. Save your document with a filename of:
+
-->
<code>A3_family name.given name.doc</code>
 
<small>(for example my submission would be named: A3_steipe.boris.doc - and don't switch the order of your given name and family name please!)</small>
 
  
Finally e-mail the document to [mailto:boris.steipe@utoronto.ca boris.steipe@utoronto.ca] before the due date.
 
  
Your document must not contain macros. Please turn off and/or remove all macros from your Word document; we will disable macros, since they pose a security risk.
+
&nbsp;
 +
&nbsp;
  
With the number of students in the course, we have to economize on processing the assignments. '''Thus we will not accept assignments that are not prepared as described above.''' If you have technical difficulties, contact the course coordinator.
+
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
 +
=== The input alignment (1 mark)===
 +
</div>
 +
&nbsp;<br>
  
'''The due date for the assignment is Monday, October 22 at 10:00.'''
+
The sequence alignment between target and template is the single most important factor that determines the quality of your model.
  
<div style="padding: 2px; background: #F0F1F7;  border:solid 1px #AAAAAA; font-size:125%;color:#444444">
+
No homology modeling process will repair an incorrect alignment, and it is useful to consider a homology model as a three-dimensional map of a sequence alignment rather than as a structure in its own right. In a homology modeling project, typically the largest amount of time should be spent on preparing the best possible alignment. Even though automated servers like the SwissModel server will align sequences and select template structures for you, it would be unwise to use these just because they are convenient, rather than the more sophisticated methods and more informed procedures we have discussed. Detailed analysis of fallacious models rarely leads to good results.
Grading
 
</div>
 
  
Don't wait until the last day to find out there are problems! Assignments that are received past the due date will have one mark deducted and an additional mark for every full twelve-hour period past the due date. Assignments received more than 5 days past the due date will not be assessed. If you need an extension, you '''must''' arrange this beforehand.
+
The best possible alignment is usually constructed from a multiple sequence alignment that includes at least the target and template sequence and other related sequences as well. The additional sequences are an important aid in identifying the correct placement of insertions and deletions. Typically such an alignment will also include additional optimization steps to move insertions or deletions between target and template out of the secondary structure elements of the template structure.
  
Marks are noted below in the section headings of the tasks. A total of 10 marks will be awarded if your assignment answers all of the questions. A total of 2 bonus marks (up to a maximum of 10 overall) can be awarded for particularly interesting findings or insightful comments. A total of 2 marks can be subtracted for lack of form or for glaring errors. The marks you receive will
+
Here is an excerpt from the T-Coffee aligned Mbp1 sequences: it contains all the residues of the yeast sequence that are found in the 1MB1 crystal structure - the '''template''' sequence for our homology model - and it has been edited to remove the N-terminal gaps in the sequence. Thus the N-terminus is 21 amino acids longer than the definition of the APSES domain in CDD (which starts with <code>SIMKR...</code>); the C-terminus is slightly shorter.
* count directly towards your final marks at the end of term, for BCH441 (undergraduates), or
 
* be divided by two for BCH1441 (graduates).
 
  
&nbsp;
+
Since the sequences are very similar to each other, there is no ambiguity in the alignment and the construction of a homology model should be straightforward. Normally one would spend considerable effort at this stage to consider which parts of the target sequence and the template sequence appear to be correctly aligned and to edit the alignment manually. In our case, evolutionary pressure was so strong that essentially all of the sequences have evolved without a single indel.
&nbsp;
 
  
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
+
I have added to the alignment the APSES domain of [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Protein&list_uids=116197493&dopt=GenPept XP_001224558], the ''Chaetomium globosum'' Mbp1 orthologue (MBP1_CHAGL). This will serve as the reference and fallback sequence.
==(1) Retrieve==
 
</div>
 
&nbsp;
 
&nbsp;
 
  
In [[Assignment 2]] you retrieved the protein sequences of ''Saccharomyces cerevisiae'' [http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&val=6320147 '''Mbp1'''] and its orthologue in your assigned organism. In order to produce a multiple sequence alignment, we have to define which sequences we wish to use. Then we need to retrieve the sequences from the database. Finally we have to store the sequences in a format that we can use as input for the alignment programs.
+
1MB1            NQIYSARYSGVDVYEFIHSTG---SIMKRKKDDWVNATHILKAANFAKAKRTRILEKEV
 +
MBP1_CANGL      NQIYSAKYSGVDVYEFIHPTG---SIMKRKNDGWVNATHILKAANFAKAKRTRILEKEV
 +
MBP1_EREGO      TQIYSAKYSGVEVYEFLHPTG---SIMKRKADDWVNATHILKAAKFAKAKRTRILEKEV
 +
MBP1_KLULA      NQIYSAKYSGVDVYEFIHPTG---SIMKRKADNWVNATHILKAAKFPKAKRTRILEKEV
 +
MBP1_CANAL      SQIYSATYSNVPAFEFVTSEG---PIMRRKKDSWINATHILKIAKFPKAKRTRILEKDV
 +
MBP1_DEBHA      TQIYSATYSNVPVFEFVTLEG---PIMRRKLDSWINATHILKIAKFPKAKRTRILEKDV
 +
MBP1_YARLI      MSIYKATYSGVPVYEFQCKNV---AVMRRKSDGWVNATHILKVAGFDKPQRTRILEKEV
 +
MBP1_SCHPO      SAVHVAVYSGVEVYECFIKGV---SVMRRRRDSWLNATQILKVADFDKPQRTRVLERQV
 +
MBP1_USTMA      KTIFKATYSGVPVYECIINNV---AVMRRRSDDWLNATQILKVVGLDKPQRTRVLEREI
 +
MBP1_ASPNI      SNVYSATYSSVPVYEFKIGTD---SVMRRRSDDWINATHILKVAGFDKPARTRILEREV
 +
MBP1_ASPTE      SKIYSATYSSVPVYEFKIEGD---SVMRRRADDWINATHILKVAGFDKPARTRILEREV
 +
MBP1_CRYNE      PKVYASVYSGVPVFEAMIRGI---SVMRRASDSWVNATQILKVAGVHKSARTKILEKEV
 +
MBP1_GIBZE      G-IYSASYSGVDVYEMEVNNI---AVMRRRNDSWLNATQILKVAGVDKGKRTKILEKEI
 +
MBP1_NEUCR      IYSLQATYSGVGVYEMEVNNV---AVMRRQKDGWVNATQILKVANIDKGRRTKILEKEI
 +
MBP1_MAGGR      P-IYTAVYSNVEVYEFEVNGV---AVMKRIGDSKLNATQILKVAGVEKGKRTKILEKEI
 +
MBP1_ASPFU      PQIYKAVYSNVSVYEMEVNGV---AVMKRRSDSWLNATQILKVAGVVKARRTKTLEKEI
 +
MBP1_CHAGL      AGIYSATYSGIPVYEYQFGPDMKEHVMRRREDNWINATHILKAAGFDKPARTRILERDV
 +
 +
1MB1            LKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLF
 +
MBP1_CANGL      LKEMHEKVQGGFGKYQGTWVPLNIAINLAEKFDVYQDLKPLF
 +
MBP1_EREGO      IKDTHEKVQGGFGKYQGTWVPLDIARRLAQKFEVLEELRPLF
 +
MBP1_KLULA      ITDTHEKVQGGFGKYQGTWIPLELASKLAEKFEVLDELKPLF
 +
MBP1_CANAL      QTGIHEKVQGGYGKYQGTYVPLDLGAAIARNFGVYDVLKPIF
 +
MBP1_DEBHA      QTGVHEKVQGGYGKYQGTYVPLDLGADIAKNFGVFDSLRPIF
 +
MBP1_YARLI      QKGVHEKVQGGYGKYQGTWVPLERAREIATLYDVDSHLAPIF
 +
MBP1_SCHPO      QIGAHEKVQGGYGKYQGTWVPFQRGVDLATKYKVDGIMSPIL
 +
MBP1_USTMA      QKGIHEKVQGGYGKYQGTWIPLDVAIELAERYNIQGLLQPIT
 +
MBP1_ASPNI      QKGVHEKVQGGYGKYQGTWIPLQEGRQLAERNNILDKLLPIF
 +
MBP1_ASPTE      QKGVHEKVQGGYGKYQGTWIPLPEGRLLAERNNIIDKLRPIF
 +
MBP1_CRYNE      LNGIHEKIQGGYGKYQGTWVPLDRGRDLAEQYGVGSYLSSVF
 +
MBP1_GIBZE      QTGEHEKVQGGYGKYQGTWIKFERGLQVCRQYGVEELLRPLL
 +
MBP1_NEUCR      QIGEHEKVQGGYGKYQGTWIPFERGLEVCRQYGVEELLSKLL
 +
MBP1_MAGGR      QTGEHEKVQGGYGKYQGTWIKYERALEVCRQYGVEELLRPLL
 +
MBP1_ASPFU      AAGEHEKVQGGYGKYQGTWVNYQRGVELCREYHVEELLRPLL
 +
MBP1_CHAGL      QKDVHEKIQGGYGKYQGTWIPLEQGRALAQRNNIYDRLRPIF
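How similar a target row is to the template row, and whether any indels separate them, can be checked directly on the aligned strings. The following is only a minimal sketch in Python: the function names <code>percent_identity</code> and <code>count_indel_columns</code> are hypothetical helpers written for this illustration, and identity is computed over columns in which neither sequence has a gap, which is one of several common conventions.

```python
def percent_identity(row_a, row_b):
    """Percent identical residues over columns where neither row has a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    pairs = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def count_indel_columns(row_a, row_b):
    """Count columns in which exactly one of the two rows has a gap."""
    return sum(1 for a, b in zip(row_a, row_b) if (a == "-") != (b == "-"))


if __name__ == "__main__":
    # First block of the excerpt above: template (1MB1) and the fallback
    # target sequence (MBP1_CHAGL).
    template = "NQIYSARYSGVDVYEFIHSTG---SIMKRKKDDWVNATHILKAANFAKAKRTRILEKEV"
    target = "AGIYSATYSGIPVYEYQFGPDMKEHVMRRREDNWINATHILKAAGFDKPARTRILERDV"
    print(f"identity: {percent_identity(template, target):.1f} %")
    print(f"indel columns: {count_indel_columns(template, target)}")
```

Columns counted by <code>count_indel_columns</code> are the places where the alignment deserves a closer look before modeling.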
  
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
 
===(1.1) Input data for multiple alignments (1 mark)===
 
</div>
 
 
&nbsp;<br>
 
&nbsp;<br>
  
In your second assignment, you used BLAST to find the best matches to the yeast Mbp1 protein in your assigned organism's genome. To avoid ambiguity, I have generated a reference list of these homologues using the canonical procedure defined below. This was not entirely straightforward in all cases and several departures from the procedure are noted below the table; I consider these variations quite normal for a database query. You need to be familiar with exceptions such as the ones described below and know how to deal with them.
+
It should be obvious to you by now how you can copy a string of amino acids from such an alignment and create a FASTA file. However we need to take a little detour: this detour brings us to the question of sequence numbers.
  
# Retrieved the Mbp1 protein sequence by searching [http://www.ncbi.nlm.nih.gov/ Entrez] for <code>Mbp1 AND "saccharomyces cerevisiae"[organism]</code>
+
It is not at all straightforward how to number sequences in such a project. The "natural" way would be to start numbering at the start codon of the full-length protein and proceed sequentially from there. However, imagine what would happen if a curator discovered that one of the splice sites for a gene had been missed in automatic annotation. All of a sudden, a corrected sequence would have a different length than the one that may have been used for earlier studies. Unfortunately, there is no mechanism (''wouldn't it be nice!'') that automatically goes back through the literature and your lab journal and updates the revised sequence numbering... But there are other possible complications regarding sequence numbers. The first residue of the CDD-APSES domain is not Residue 1 of the Mbp1 protein. The first residue of the 1MB1 FASTA file ''is'' the first residue of Mbp1 protein, but the last five residues are an artificial His tag. Is H125 of 1MB1 the equivalent residue to R125 in MBP1_SACCE? The N-terminus of the Mbp1 crystal structure is disordered. The first residue in the structure is ASN 3, whereas the SEQRES records start with MET ... and so on. The take-home message is that a sequence number is nothing absolute, but something that makes sense only in a particular context. To emphasize this, we will write a FASTA header for our '''target''' sequence that lists the residues of the source sequence it corresponds to. In terms of actual sequence numbering, we will adopt the numbering of the 1MB1 protein throughout to be able to consistently label particular amino acids.
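The two bookkeeping steps described here (writing a FASTA record whose header states which source residues it covers, and labelling alignment columns with the template numbering) can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the function names and the header wording are hypothetical, and the start position of the target excerpt in its full-length sequence is a placeholder that has to be looked up. The template start of 3 follows from the first ordered residue of 1MB1 being ASN 3.

```python
def aligned_row_to_fasta(name, aligned, first_residue, width=60):
    """Strip gaps from one alignment row and format it as a FASTA record.

    first_residue is the number, in the full-length source sequence, of the
    first amino acid of the excerpt; the caller has to look this up.
    """
    seq = aligned.replace("-", "")
    last_residue = first_residue + len(seq) - 1
    header = f">{name} residues {first_residue}-{last_residue} of source sequence"
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join([header] + lines)


def column_numbering(template_aligned, target_aligned, template_start):
    """Pair each alignment column with the template's residue number.

    Returns (template_number, template_residue, target_residue) tuples;
    columns where the template has a gap get None as their number.
    """
    number = template_start
    rows = []
    for t, q in zip(template_aligned, target_aligned):
        if t == "-":
            rows.append((None, t, q))
        else:
            rows.append((number, t, q))
            number += 1
    return rows


if __name__ == "__main__":
    template = "NQIYSARYSGVDVYEFIHSTG---SIMKRKKDDWVNATHILKAANFAKAKRTRILEKEV"
    target = "AGIYSATYSGIPVYEYQFGPDMKEHVMRRREDNWINATHILKAAGFDKPARTRILERDV"
    # first_residue=1 is a placeholder, NOT the real position in MBP1_CHAGL.
    print(aligned_row_to_fasta("MBP1_CHAGL", target, first_residue=1))
    for number, t, q in column_numbering(template, target, template_start=3)[:5]:
        print(number, t, q)
```

Target residues that fall into template gaps have no template number of their own; how to label them (for example with insertion codes) is a decision that should be documented.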
# Clicked on the ''RefSeq tab'' to find the RefSeq ID "<code>[http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Protein&list_uids=6320147&dopt=GenPept NP_010227]</code>"
+
 
# Accessed the [http://www.ncbi.nlm.nih.gov/blast '''BLAST'''] form, followed the link to the list of all genomic BLAST databases and clicked on the (B) icon, next to Fungi to navigate to the [http://www.ncbi.nlm.nih.gov/sutils/genom_table.cgi?organism=fungi Fungi Genomic BLAST page.]
+
Access the sequence of "your" organism's Mbp1 Orthologue at UniProt. (You can use the links I have provided in the table below).  
# Pasted "<code>NP_010227</code>" into the ''query field''. Chose ''Protein'' for both Query and Database, kept default parameters but set the ''Filter'' option to ''none''. Clicked on the check-box of each of the fungal species we have considered in the previous assignment. Ran BLAST.
 
#On the results page, checked the checkbox next to the alignment to select ''the most significant hit from each organism'' we are studying.
 
#Clicked on the "Get selected sequences" button.
 
#Separately searched for sequences from organisms that were either not included in the list or for which no hits were reported. Verified all ambiguous cases, as explained in the notes below. 
 
#Verified that each of these sequences finds Mbp1 as the best match in the ''Saccharomyces cerevisiae'' genome by clicking on each "Blink" ([http://www.ncbi.nlm.nih.gov/sutils/blink.cgi?pid=68465419 <small>click for example</small>]) in the retrieved list. Scrolled down the list to confirm that the '''top hit of a ''Saccharomyces cerevisiae'' protein''' is indeed Mbp1 (<code>NP_010227</code>).
 
#Obtained UniProt accessions for all sequences, with a single query using the UniProt [http://www.pir.uniprot.org/search/idmapping.shtml ID mapping service]. This service accepts a comma-delimited list of RefSeq IDs, GI numbers or GenPept accession numbers and returns a list of UniProt accession numbers.
 
  
Since it was thus confirmed that each of these sequences is the protein that is most similar to yeast Mbp1 in its respective organism's genome, and that yeast Mbp1 is the most similar yeast protein to each of them, they all fulfil the criterion of a '''reciprocal best match''' with yeast Mbp1. Accordingly we can postulate that this list contains the fungal '''orthologues''' to Mbp1.
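The logic of the reciprocal best match criterion can be written down compactly. The sketch below, in Python, assumes that BLAST results have already been collected by hand into dictionaries mapping subject IDs to E-values; the function names, the E-values and the ID <code>OTHER_YEAST_PROTEIN</code> are hypothetical and serve only to illustrate the test.

```python
def best_hit(hits):
    """Return the subject ID with the lowest E-value from {subject_id: evalue}."""
    return min(hits, key=hits.get)


def is_reciprocal_best_match(query_id, forward_hits, reverse_hits_by_subject):
    """Check whether the best forward hit finds the query again as its best hit.

    forward_hits: hits of the query in the other organism's genome.
    reverse_hits_by_subject: for each subject ID, its hits in the query's genome.
    """
    candidate = best_hit(forward_hits)
    return best_hit(reverse_hits_by_subject[candidate]) == query_id


if __name__ == "__main__":
    # E-values are invented for illustration; only the IDs NP_010227 and
    # XP_748947 are taken from the table of orthologues.
    forward = {"XP_748947": 1e-50}
    reverse = {"XP_748947": {"NP_010227": 1e-48, "OTHER_YEAST_PROTEIN": 1e-10}}
    print(is_reciprocal_best_match("NP_010227", forward, reverse))
```

Ranking by E-value is one possible choice; ranking by bit score would serve equally well as long as the same measure is used in both directions.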
 
  
<br>&nbsp;
 
<br>&nbsp;
 
 
<table style="border-left:1px solid #AAAAAA; border-bottom:1px solid #AAAAAA;" cellpadding="10" cellspacing="0">
 
<table style="border-left:1px solid #AAAAAA; border-bottom:1px solid #AAAAAA;" cellpadding="10" cellspacing="0">
 
<tr style="background: #A6AFD0;">
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;" colspan="6"><b>Mbp1 and its orthologues</b></td>
 
</tr>
 
 
 
<tr style="background: #BDC3DC;">
 
<tr style="background: #BDC3DC;">
 
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><b><i>Organism</i></b></td>
 
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><b><i>Organism</i></b></td>
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><code>CODE</code></td>
+
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><b>Uniprot Accession</b></td>
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><b>GI</b></td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><b>NCBI</b></td>
 
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><b>Uniprot</b></td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><b>Most similar yeast gene</b></td>
 
 
</tr>
 
</tr>
  
 
<tr style="background: #FFFFFF;">
 
<tr style="background: #FFFFFF;">
 
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Aspergillus fumigatus</i></td>
 
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Aspergillus fumigatus</i></td>
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><code>ASPFU</code></td>
+
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q4WGN2_ASPFU Q4WGN2]</td>
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">70986922</td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">XP_748947</td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"> Q4WGN2_ASPFU </td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">Mbp1</td>
 
 
</tr>
 
</tr>
  
 
<tr style="background: #E9EBF3;">
 
<tr style="background: #E9EBF3;">
 
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Aspergillus nidulans</i></td>
 
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Aspergillus nidulans</i></td>
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><code>ASPNI</code></td>
+
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q5B8H6_EMENI Q5B8H6]</td>
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">40739343</td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">EAA58533</td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"> Q5AYB5_EMENI </td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">Mbp1</td>
 
 
</tr>
 
</tr>
  
 
<tr style="background: #FFFFFF;">
 
<tr style="background: #FFFFFF;">
 
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Aspergillus terreus</i></td>
 
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Aspergillus terreus</i></td>
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><code>ASPTE</code></td>
+
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q0CQJ5_ASPTE Q0CQJ5]</td>
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">115391425</td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">XP_001213217</td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"> Q0CQJ5_ASPTN </td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">Mbp1</td>
 
 
</tr>
 
</tr>
  
 
<tr style="background: #E9EBF3;">
 
<tr style="background: #E9EBF3;">
 
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Candida albicans</i></td>
 
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Candida albicans</i></td>
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><code>CANAL</code></td>
+
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q5ANP5_CANAL Q5ANP5]</td>
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">46444933</td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">EAL04204</td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"> Q5ANP5_CANAL </td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">Mbp1</td>
 
 
</tr>
 
</tr>
  
 
<tr style="background: #FFFFFF;">
 
<tr style="background: #FFFFFF;">
 
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Candida glabrata</i></td>
 
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Candida glabrata</i></td>
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><code>CANGL</code></td>
+
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q6FWD6_CANGL Q6FWD6]</td>
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">50286059</td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">XP_445458</td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"> Q6FWD6_CANGA </td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">Mbp1</td>
 
 
</tr>
 
</tr>
  
 
<tr style="background: #E9EBF3;">
 
<tr style="background: #E9EBF3;">
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Coprinopsis cinerea</i></td>
+
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Cryptococcus neoformans</i></td>
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><code>COPCI</code></td>
+
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q5KHS0_CRYNE Q5KHS0]</td>
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">116501415</td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">EAU84310</td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"> N.A. </td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">Mbp1</td>
 
 
</tr>
 
</tr>
  
 
<tr style="background: #FFFFFF;">
 
<tr style="background: #FFFFFF;">
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Cryptococcus neoformans</i></td>
+
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Debaryomyces hansenii</i></td>
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><code>CRYNE</code></td>
+
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q6BSN6_DEBHA Q6BSN6]</td>
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">134110416</td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">XP_776035</td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"> Q5KHS0_CRYNE </td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">Mbp1</td>
 
 
</tr>
 
</tr>
  
 
<tr style="background: #E9EBF3;">
 
<tr style="background: #E9EBF3;">
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Debaryomyces hansenii</i></td>
+
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Eremothecium gossypii</i></td>
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><code>DEBHA</code></td>
+
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q752H3_ASHGO Q752H3]</td>
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">50420495</td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">XP_458784</td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"> Q6BSN6_DEBHA </td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">Mbp1</td>
 
 
</tr>
 
</tr>
  
 
<tr style="background: #FFFFFF;">
 
<tr style="background: #FFFFFF;">
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Eremothecium gossypii</i></td>
+
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Gibberella zeae</i></td>
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><code>EREGO</code></td>
+
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q4IEY8_GIBZE Q4IEY8]</td>
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">45199118</td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">NP_986147</td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"> Q752H3_ASHGO </td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">Mbp1</td>
 
 
</tr>
 
</tr>
  
 
<tr style="background: #E9EBF3;">
 
<tr style="background: #E9EBF3;">
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Gibberella zeae</i></td>
+
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Kluyveromyces lactis</i></td>
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><code>GIBZE</code></td>
+
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=MBP1_KLULA P39679]</td>
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">46116756</td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">XP_384396</td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"> UPI000023DBF3 </td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">Mbp1</td>
 
 
</tr>
 
</tr>
  
 
<tr style="background: #FFFFFF;">
 
<tr style="background: #FFFFFF;">
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Kluyveromyces lactis</i></td>
+
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Magnaporthe grisea</i></td>
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><code>KLULA</code></td>
+
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q3S405_MAGGR Q3S405]</td>
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">50308375</td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">XP_454189</td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"> MBP1_KLULA </td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">Mbp1</td>
 
 
</tr>
 
</tr>
  
 
<tr style="background: #E9EBF3;">
 
<tr style="background: #E9EBF3;">
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Magnaporthe grisea</i></td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><code>MAGGR</code></td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">74274844</td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">ABA02072 </td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"> Q3S405_MAGGR </td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">Mbp1*</td>
 
</tr>
 
 
<tr style="background: #FFFFFF;">
 
 
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Neurospora crassa</i></td>
 
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Neurospora crassa</i></td>
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><code>NEUCR</code></td>
+
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q7SBG9_NEUCR Q7SBG9]</td>
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">157070373</td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">EAA33731</td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"> ''Q7SBG9'' </td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">Mbp1</td>
 
</tr>
 
 
 
<tr style="background: #E9EBF3;">
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Pichia stipitis</i></td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><code>PICST</code></td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">149388844</td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">EAZ62798</td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"> A3GHD6_PICST </td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">Mbp1</td>
 
 
</tr>
 
</tr>
  
 
<tr style="background: #FFFFFF;">
 
<tr style="background: #FFFFFF;">
 
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Saccharomyces cerevisiae</i></td>
 
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Saccharomyces cerevisiae</i></td>
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><code>SACCE</code></td>
+
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=MBP1_YEAST P39678]</td>
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">6320147 </td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">NP_010227</td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"> MBP1_YEAST </td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">Mbp1</td>
 
 
</tr>
 
</tr>
  
 
<tr style="background: #E9EBF3;">
 
<tr style="background: #E9EBF3;">
 
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Schizosaccharomyces pombe</i></td>
 
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Schizosaccharomyces pombe</i></td>
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><code>SCHPO</code></td>
+
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=RES2_SCHPO P41412]</td>
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">19113944</td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">NP_593032</td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"> RES2_SCHPO </td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">Mbp1</td>
 
 
</tr>
 
</tr>
  
 
<tr style="background: #FFFFFF;">
 
<tr style="background: #FFFFFF;">
 
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Ustilago maydis</i></td>
 
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Ustilago maydis</i></td>
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><code>USTMA</code></td>
+
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q4P117_USTMA Q4P117]</td>
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">46101867</td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">EAK87100</td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"> Q4P117_USTMA </td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">Mbp1</td>
 
 
</tr>
 
</tr>
  
 
<tr style="background: #E9EBF3;">
 
<tr style="background: #E9EBF3;">
 
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Yarrowia lipolytica</i></td>
 
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Yarrowia lipolytica</i></td>
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><code>YARLI</code></td>
+
   <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q6CGF5_YARLI Q6CGF5]</td>
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">50545439</td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">XP_500257</td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"> Q6CGF5_YARLI </td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">Mbp1</td>
 
 
</tr>
 
</tr>
  
 
</table>
 
</table>
  
<small>Table of yeast Mbp1 orthologues in genome-sequenced fungi. Columns from left to right: Systematic name, organism code (simply a string that lets us identify the organism in alignments), GI number, RefSeq ID (if existing) or GenPept accession, Uniprot accession, most similar yeast protein.
 
 
Note: ''Coprinopsis cinerea'' accession numbers are not yet in UniProt.
 
 
Note: For ''Gibberella zeae'' and ''Magnaporthe grisea'', the protein BLAST search had to be run against the entire '''nr''' database with an organism restriction, since genomic BLAST was not enabled.
 
 
Note: For ''Gibberella zeae'' XP_384396 no UniProt ID was returned as a cross-reference. EBI-BLAST retrieved FG04220, which is largely identical except for short stretches that are absent in GenPept: apparently UniProt has a different gene model for this protein.
 
 
Note: The ''Neurospora crassa'' protein EAA33731 has no direct cross-reference in UniProt. The closest match is Q7SBG9 which is largely identical, except for short stretches that are absent in GenPept: apparently UniProt has a different gene-model for this protein.
 
 
Note: The ''Magnaporthe grisea'' protein ABA02072 has greater local C-terminal similarity to the yeast protein Swi6 than to Mbp1, whereas the N-terminal APSES domain is most similar to yeast Mbp1. However a '''global''' Needleman-Wunsch alignment (BLOSUM 30, gaps: 8.0/1.0) shows greater '''overall''' similarity to yeast Mbp1 than to Swi6. Accordingly I consider this an orthologue to Mbp1 even though its database annotation calls ABA02072  the ''M. grisea'' Swi6 homologue.
 
 
Note: For ''Pichia stipitis'', BLAST finds two very similar sequences in GenPept as candidate Mbp1 orthologues; the RefSeq sequence XP_001386821.1 is translated according to the standard code, the EMBL-generated entry EAZ62798.2 according to the alternative nuclear code [http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=c#SG12 '''12''']. To decide which translation is correct, the conservation of the affected residues was examined in the BLAST alignment; better conservation indeed supports the alternative code translation.
 
 
Note: The ''Ustilago maydis'' protein EAK87100 is only the second-best hit in the original BLAST list, however local optimal alignment (EMBOSS water) shows a much higher percentage of identity to yeast Mbp1 in the APSES domain than the top BLAST hit EAK86587 and global alignment  (after trimming the N- and C- terminal extensions, respectively) also shows a slightly higher degree of similarity for EAK87100 than EAK86587. Accordingly, EAK87100 is considered the Mbp1 orthologue, even though it is the second highest hit according to BLAST. This emphasizes the fact that optimal sequence alignments are not entirely equivalent to BLAST alignments.
 
 
</small>
 
 
&nbsp;<br>
 
Our second task is to obtain all FASTA sequences based on a list of identifiers and to save them in a format in which we can use them as input for other programs or services. This is easy: we simply paste all GI numbers as a comma separated list into the Entrez search form and select Display FASTA, send to Text on the results page, then save the contents as a Text file.
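The same retrieval can be scripted. The following is only a minimal sketch, assuming the NCBI E-utilities <code>efetch</code> endpoint is used in place of the Entrez web form; the function name <code>efetch_fasta_url</code> is my own and not part of any library. It merely builds the request URL from a list of identifiers, so nothing is downloaded until you open the URL. Since NCBI has moved away from GI numbers, the RefSeq or GenPept accessions from the table may be the safer identifiers to pass in.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(ids):
    """Build an efetch URL that requests protein records as FASTA text.

    `ids` is any iterable of identifiers (hypothetical helper, not an NCBI API).
    The identifiers are joined into one comma-separated list, just as they
    would be pasted into the Entrez search form.
    """
    params = {
        "db": "protein",
        "id": ",".join(str(i) for i in ids),
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


if __name__ == "__main__":
    # Two identifiers taken from the table above (SACCE and KLULA rows).
    print(efetch_fasta_url(["NP_010227", "XP_454189"]))
```

The printed URL can be opened in a browser and the result saved as a text file, which is equivalent to the manual procedure described above.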
 
&nbsp;<br>
 
 
 
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
 
*We have applied the "reciprocal best match criterion" to assert that these sequences are '''orthologues to yeast Mbp1''', and this is how orthologues are commonly defined computationally. Briefly explain why this criterion will distinguish between orthologues and paralogues (when no genes have been lost). Consider at least the following three cases: (''i'') a gene duplication has occurred before a speciation event, (''ii'') a gene duplication in the query organism has occurred after a speciation event, (''iii'') a gene duplication in the target organism has occurred after a speciation event. Use sketches to illustrate the cases. (1 mark)
 
 
*Review the resulting multi-FASTA file for the  [[All_Mbp1_proteins|'''Mbp1 proteins (linked here)''']] and make sure you understand the procedure that led to it. Depending on your personal learning style you may either carefully review the described procedure, reproduce key steps of the procedure, reproduce the entire procedure paying special attention to the problem cases discussed in the notes, or develop your own procedure. Whatever you do, you must be confident in the end that you could have produced the same input file.<br>
 
  
 +
<div style="padding: 5px; background: #EEEEEE;">
 +
*Copy your organism's Mbp1 sequence from the alignment above. Then define the start- and end- sequence numbers of the '''target''' sequence relative to the full-length protein. Prepare a FASTA formatted file for the '''target''' sequence in your organism, giving it an appropriate header and include the sequence numbers. Refer to the [[Assignment_5_fallback_data|'''Fallback data''']] file if you are not sure about the format. (1 mark)
 
</div>
 
</div>
 
&nbsp;<br>
 
&nbsp;<br>
Mbp1 orthologues are not the only proteins that contain APSES domains. In order to find all the rest, a PSI-BLAST search was performed using the yeast Mbp1 APSES domain as query. From the list of hits, the APSES domains were extracted and summarized in a file.
 
  
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
+
Your FASTA sequence should look similar to this:
*Review the resulting file for the  [[All_APSES_domains|'''APSES domains (linked here)''']] and make sure you understand the procedure that was used in its construction, as above.
 
</div>
 
&nbsp;<br>
 
  
<div style="padding: 5px; background: #E9EBF3; border:solid 1px #AAAAAA;">
+
>1MB1: Mbp1_SACCE 1..100
 +
NQIYSARYSGVDVYEFIHSTG---SIMKRKKDDWVNATHILKAANFAKAKRTRILEKEV
 +
  LKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLF
  
===(1.2)  Orthologues (1 mark)===
+
&nbsp;
</div>
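If you prefer to produce such a file with a script rather than by hand, the layout shown in the box above can be sketched as follows. This is only an illustration under my own assumptions: the function name <code>format_fasta</code>, the header layout (label, then <code>start..end</code>) and the line width of 60 are choices made here, not a requirement of any server.

```python
def format_fasta(label, sequence, start, end, width=60):
    """Return a FASTA record with the residue range written into the header.

    `label` is free text (e.g. an organism-coded name), `start` and `end` are
    the sequence numbers relative to the full-length protein. Gap characters
    in `sequence` are kept as they are. Hypothetical helper for illustration.
    """
    header = f">{label} {start}..{end}"
    # Slice the sequence into fixed-width lines; slicing (rather than a text
    # wrapper) guarantees that "-" gap characters never trigger a line break.
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + lines) + "\n"


if __name__ == "__main__":
    # Made-up sequence and numbers, only to show the layout.
    print(format_fasta("Mbp1_XXXXX", "ACDEFGHIKLMNPQRSTVWY" * 4, 5, 84), end="")
```

Save the output as "text only", and avoid periods in the label, for the reasons given in the upload notes further below.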
+
&nbsp;
&nbsp;<br>
 
  
For '''one''' of the APSES domains from your assigned organism, determine whether it is orthologous to a yeast APSES domain:
+
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
  
# Choose at random one of the [[All_APSES_domains|'''APSES domains''']] from your organism (but not one from an Mbp1 orthologue) and copy its [[All_APSES_domains|sequence]] into the input window of a genomic [http://www.ncbi.nlm.nih.gov/blast/ BLAST] search against ''Saccharomyces cerevisiae'' proteins.
+
==(2) Homology model==
# Run the search and determine the gene name of the best hit. (This is the best match.)
 
# The BLAST-retrieved sequence may be truncated on the results page and not cover the entire APSES domain: find the sequence of your best match in yeast in the [[All_APSES_domains| sequence file]]. (Since the file contains all yeast APSES domains, your best match should be in this file, labeled with <code>????_SACCE</code>.)
 
# Copy that sequence from the Wiki page and perform the same kind of BLAST search with this yeast sequence, against the proteins in your organism's genome. (This finds the reciprocal match.)
 
 
 
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
 
* Document the process and report briefly what you have found on the forward and on the reverse search. Does the gene you have chosen have an APSES domain that fulfils the ''reciprocal best match'' criterion for orthology with a yeast gene? (1 mark)
 
</div>
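The logic of the forward and reverse search can be written down compactly. The sketch below assumes that the BLAST results have already been reduced to simple score tables (query name, then hit name, then score); these dictionaries, the gene names and the function names are hypothetical and only serve to make the criterion explicit.

```python
def best_hit(scores, query):
    """Return the highest-scoring hit for `query`, or None if there is none.

    `scores` maps each query name to a dict of {hit name: score}.
    """
    hits = scores.get(query, {})
    if not hits:
        return None
    return max(hits, key=hits.get)


def is_reciprocal_best_match(forward, reverse, query):
    """True if the best hit of `query` finds `query` again as its own best hit.

    `forward` holds the searches of your organism's sequences against yeast,
    `reverse` the searches of yeast sequences against your organism.
    """
    hit = best_hit(forward, query)
    if hit is None:
        return False
    return best_hit(reverse, hit) == query


if __name__ == "__main__":
    # Invented names and scores, purely for illustration.
    forward = {"geneA_XXXXX": {"Swi4_SACCE": 120.0, "Mbp1_SACCE": 95.0}}
    reverse = {"Swi4_SACCE": {"geneA_XXXXX": 118.0, "geneB_XXXXX": 60.0}}
    print(is_reciprocal_best_match(forward, reverse, "geneA_XXXXX"))
```

Note that the sketch only encodes the criterion itself; whether the best BLAST hit is also the best match under an optimal alignment is a separate question, as the ''Ustilago maydis'' note above shows.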
 
&nbsp;<br>
 
 
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
 
 
 
==(2) Align==
 
 
</div>
 
</div>
&nbsp;
 
&nbsp;
 
 
Actually performing multiple sequence alignments used to involve downloading and installing software on your own computer. While most tools were available on the Web in principle, many groups restricted the total number of sequences or the total number of characters to be aligned. The EBI, however, offers three of the most commonly used tools with few limitations, and it was possible to run MSAs for all Mbp1 orthologues jointly.
 
 
&nbsp;
 
&nbsp;
 
&nbsp;
 
&nbsp;
  
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
===(2.1) Aligning the Mbp1 orthologues (1 mark)===
+
=== (2.1) SwissModel (1 mark)===
 
</div>
 
</div>
 
&nbsp;<br>
 
&nbsp;<br>
 
I used the following three servers:
 
  
* [http://www.ebi.ac.uk/clustalw/ '''CLUSTAL-W''']  is a progressive alignment program. It is the most popular and most widely referenced MSA algorithm, and it is reasonably fast and easy to use. However, alignment errors that are made early in the process cannot be corrected later, and thus CLUSTAL is prone to misalign sets of sequences that have poor (<30% ID) local similarity. It is no longer considered state-of-the-art for carefully done alignments.
+
Access the Swissmodel server at [http://swissmodel.expasy.org '''http://swissmodel.expasy.org'''] . Navigate to the '''Alignment Interface'''.
* [http://www.ebi.ac.uk/muscle/ '''MUSCLE'''] essentially starts out from a CLUSTAL like alignment as a draft, then identifies similar groups of sequences from which it calculates profiles, it then re-aligns the group to the profile. This procedure is iterated.
 
* [http://www.ebi.ac.uk/t-coffee/ '''T-Coffee'''] is one of my favourites - the tradeoffs appear to be especially well balanced. It too starts from a set of pairwise global alignments, like CLUSTAL, then additionally calculates sets of best local alignments. Global and local alignments are then combined to a similarity matrix and based on this matrix a guide-tree is constructed. This determines the order of steps in which sequences are added to the multiple alignment. A nice feature of T-Coffee is color coded output that allows you to quickly judge the local reliability of the alignment.
 
  
We shall perform multiple sequence alignments for all 18 Mbp1 orthologues and compare the results. Since the results will all look the same for the same input file, I have simply prepared them. Of course you are welcome to run an alignment on your own for your own learning experience, but it is not required. The first alignment was run with CLUSTAL.
+
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
 
+
*Copy from the alignment above the 1MB1 sequence and the sequence from your organism, and paste it into the form field. Refer to the [[Assignment_5_fallback_data|'''Fallback Data file''']] if you are not sure about the format.
[[Image:A03_01.jpg|frame|none|Assignment 3, Figure 01<br>
+
:(You have to choose the format, and, if e.g. you choose a CLUSTAL format, you have to include a header line and a blank line. Other common problems uploading your alignment may include uploading a file that has not been saved as "text only" and periods i.e.   "."  in sequence names. Underscores appear to be safe.)
The guide tree computed by CLUSTAL-W. The algorithm uses this tree to determine the best order for its progressive alignment for the 18 Mbp1 orthologue sequences. This tree is based on a matrix of pairwise distances.]]
 
  
Subsequently, sequence alignments were performed with T-Coffee and MUSCLE. However, the input files were re-ordered to correspond to the order of the CLUSTAL output, and the option to order the alignments according to the ''input sequences'' was chosen on the form. This makes it much easier to compare alignments, since all MSAs are displayed in the same relative order.
+
* Click '''submit''' and define your '''target''' and '''template''' sequence. For the '''template sequence''' define the coordinate file and chain. (In our case the coordinate file is <code>'''1MB1'''</code> and the chain is "<code>'''_'''</code>" i.e. none, since the PDB file does not contain more than one chain.)
  
 +
*Click '''submit''' and request the construction of a homology model: Enter your e-mail address and check the button for '''Normal Mode''', not "Swiss-PDB Viewer mode". (This is important, since there will be problems with the output otherwise.) Click '''submit'''. You should receive four files by e-mail within half an hour or so. (1 mark)
  
The result files are linked here:
+
(You do not need to submit any coordinate files with your assignment.)
  
* [[All_Mbp1_CLUSTAL|Mbp1 proteins '''CLUSTAL''' aligned]]
 
* [[All_Mbp1_MUSCLE|Mbp1 proteins '''MUSCLE''' aligned]]
 
* [[All_Mbp1_T-COFFEE|Mbp1 proteins '''T-Coffee''' aligned (text version)]] and [http://biochemistry.utoronto.ca/undergraduates/courses/BCH441H/resources/T-coffee_scores.html (coloured according to scores)]
 
 
Globally speaking, the alignments are quite similar. Let's first look at the common themes, before we discuss details of the results. The  [http://biochemistry.utoronto.ca/undergraduates/courses/BCH441H/resources/T-coffee_scores.html (score-colored T-COFFEE alignment)] is well suited to look at general relationships between the sequences, since outliers can be easily identified.  For example, if one of the sequences had a low-scoring domain that aligns poorly to the others of the group, that domain may have been acquired in a separate evolutionary event and may not be homologous to the others. We would notice an isolated stretch of poorly alignable sequence, i.e. a segment coloured with a low score in a set of otherwise high-scoring segments. A gene may also have acquired N- or C-terminal extensions of significant length which may not be homologous (unless they are the result of an internal duplication).
 
 
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
 
*Review the  [http://biochemistry.utoronto.ca/undergraduates/courses/BCH441H/resources/T-coffee_scores.html (score-colored T-COFFEE alignment)]. Based on this alignment, how do you feel about our initial assertion that these 18 proteins should be considered orthologous? (Answer briefly, but with reference to specific evidence in the alignment. Note that this question does not ask about the general level of conservation, but about whether significant segments (of about the length of a domain) do not appear related/alignable at all in regions where the rest of the group are reasonably well conserved.) (1 mark)
 
 
</div>
 
</div>
 +
&nbsp;<br>
 +
In case you do not wish to submit the modelling job yourself, you can access the result files from the  [[Assignment_5_fallback_data|'''Fallback Data file''']].
  
&nbsp;
 
&nbsp;
 
  
 
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
 
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
  
==(3) Mbp1 orthologues: analysis of full length MSAs==
+
==(3) Model analysis==
 
</div>
 
</div>
 
&nbsp;
 
&nbsp;
 
&nbsp;
 
&nbsp;
 
What do we mean by a ''good'' versus a ''poor'' multiple sequence alignment?
 
 
Let us first consider some of the features of the yeast Mbp1 protein that we have defined in the second assignment (and some structural features I have compiled from various sources). Below is the yeast Mbp1 sequence with a number of annotations, compiled according to the following procedure.
 
 
# Performed [http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi '''CDD'''] search with yeast Mbp1 protein sequence. This retrieves alignments of Mbp1 with the APSES and the ANKYRIN domains. These are profile based alignments and thus they are more reliable than pairwise alignments.
 
# Performed  [http://smart.embl-heidelberg.de/ '''SMART'''] search with yeast Mbp1 protein sequence. This retrieved the APSES domain, annotated a number of low-complexity regions and a stretch of coiled coil.
 
# Performed a [http://www.ebi.ac.uk/thornton-srv/databases/sas/ '''SAS'''] search with yeast Mbp1 protein sequence. This retrieved pairwise alignments with the structures 1MB1 (APSES) and chain D of 1IKN (ankyrin domains of I<sub>kappa</sub>b), together with their respective secondary structure annotations.
 
# Copied GenPept sequence into Word-processor.
 
# Transferred annotations of low complexity and coiled-coil regions from SMART.
 
# Transferred annotations of APSES secondary structure from SAS (this is a ''direct'' annotation, since the experimentally determined structure 1MB1 is a fragment of the Mbp1 protein). The central helix that was proposed to be part of the DNA binding region is slightly distorted and SAS annotates a break in the helix; this break was bridged with lowercase "h" in the annotation.
 
# Ankyrin domain annotation was not as straightforward. While CDD, SMART and SAS all annotate the same general regions, they disagree in details of the domain boundaries and on the precise alignment. Used the profile-based CDD alignment of 1IKN. Transferred annotations of secondary structure from SAS output for 1IKN to sequence (this is a ''transferred'' annotation, the original annotation was for 1IKN and we assume that it applies to Mbp1 as well).
 
 
 
MBP1_SACCE
 
Annotations based on
 
- CDD domain analysis,
 
- SAS structure annotation and
 
- literature data on binding region
 
 
Keys:
 
 
C  Coiled coil regions predicted by Coils2 program
 
x  Low complexity region
 
*  Proposed binding region
 
+  positively charged residues, oriented for possible DNA binding interactions
 
-  negatively charged residues, oriented for possible DNA binding interactions
 
 
E  beta strand
 
H  alpha helix
 
t  beta turn
 
 
 
                  10        20        30        40        50        60
 
          MSNQIYSARY SGVDVYEFIH STGSIMKRKK DDWVNATHIL KAANFAKAKR TRILEKEVLK
 
1MB1      ----EEEEEt t-EEEEEEEE t-EEEEEEtt ---EEHHHHH HH----HHHH HHHHhhhHHH
 
                                                                * *+**-+****
 
 
                  70        80        90        100        110        120
 
          ETHEKVQGGF GKYQGTWVPL NIAKQLAEKF SVYDQLKPLF DFTQTDGSAS PPPAPKHHHA
 
1MB1      ---EEE---- tt--EEEE-H HHHHHHHHH- --HHHHtt-        xxx xxxxxxxxxx
 
          **+*+***** ****
 
 
                  130        140        150        160        170        180
 
          SKVDRKKAIR SASTSAIMET KRNNKKAEEN QFQSSKILGN PTAAPRKRGR PVGSTRGSRR
 
          x                                                                         
 
 
 
                  190        200        210        220        230        240
 
          KLGVNLQRSQ SDMGFPRPAI PNSSISTTQL PSIRSTMGPQ SPTLGILEEE RHDSRQQQPQ
 
                                                                      xxxxx
 
 
 
                  250        260        270        280        290        300
 
          QNNSAQFKEI DLEDGLSSDV EPSQQLQQVF NQNTGFVPQQ QSSLIQTQQT ESMATSVSSS
 
          x                                        xx xxxxxxxxxx xxxxxxxxxx
 
 
 
                  310        320        330        340        350        360
 
          PSLPTSPGDF ADSNPFEERF PGGGTSPIIS MIPRYPVTSR PQTSDINDKV NKYLSKLVDY
 
          xxxxxxx
 
 
                  370        380        390        400        410        420
 
          FISNEMKSNK SLPQVLLHPP PHSAPYIDAP IDPELHTAFH WACSMGNLPI AEALYEAGTS
 
ANKYRIN                                -- t----HHHHH HH---HHHHH t-t--t-t--
 
 
 
                  430        440        450        460        470        480
 
          IRSTNSQGQT PLMRSSLFHN SYTRRTFPRI FQLLHETVFD IDSQSQTVIH HIVKRKSTTP
 
ANKYRIN  t----t---- HHHHHHHH-- -------HHH HHHHHH-ttH HH-----HHH HHHH--tH--
 
 
 
                  490        500        510        520        530        540
 
          SAVYYLDVVL SKIKDFSPQY RIELLLNTQD KNGDTALHIA SKNGDVVFFN TLVKMGALTT
 
ANKYRIN  HHHHHHHHH- ---------- -----t---- tt---HHHHH HH---HHHHH HHH--t-tt-
 
 
 
                  550        560        570        580        590        600
 
          ISNKEGLTAN EIMNQQYEQM MIQNGTNQHV NSSNTDLNIH VNTNNIETKN DVNSMVIMSP
 
ANKYRIN  ---t----HH HHHHHH--HH HHH-t--HHH -t----HHHH HHH--tHHHH HHHHHH---t
 
 
 
                  610        620        630        640        650        660
 
          VSPSDYITYP SQIATNISRN IPNVVNSMKQ MASIYNDLHE QHDNEIKSLQ KTLKSISKTK
 
ANKYRIN  ---tt----H HHHHHH---H HHHHHHH      CCCCCCCC CCCCCCCCCC CCCCC
 
 
 
                  670        680        690        700        710        720
 
          IQVSLKTLEV LKESSKDENG EAQTNDDFEI LSRLQEQNTK KLRKRLIRYK RLIKQKLEYR
 
                                                    x xxxxxxxxxx xxxxxxx
 
 
                  730        740        750        760        770        780
 
          QTVLLNKLIE DETQATTNNT VEKDNNTLER LELAQELTML QLQRKNKLSS LVKKFEDNAK
 
 
 
                  790        800        810        820        830
 
          IHKYRRIIRE GTEMNIEEVD SSLDVILQTL IANNNKNKGA EQIITISNAN SHA
 
 
 
A '''good''' MSA comprises only columns of residues that play similar roles in the proteins' mechanism and/or that evolve in a comparable structural context. Since it is a result of biological selection and conservation, it has relatively few indels and the indels it has are usually not placed into elements of secondary structure or into functional motifs.
 
 
A '''poor''' MSA has many errors in its columns: they contain residues that actually have different functions or structural roles, even though they may look similar to a scoring matrix. It may also have introduced indels in biologically irrelevant positions, to maximize spurious sequence similarities.
 
 
In order to evaluate the MSAs for our proteins, we will analyze alignments relative to the features we have annotated above.
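One of these criteria, whether indels fall inside or between elements of secondary structure, can be counted rather than eyeballed. The following is a rough sketch under several assumptions of mine: the alignment has been reduced to two rows (the annotated reference and one other sequence), gaps are written as <code>-</code>, and the annotation string has one character per reference residue using the keys given above. The function name <code>indels_in_structure</code> is hypothetical.

```python
def indels_in_structure(ref_aligned, other_aligned, ref_annotation, structured="HE"):
    """Count indel columns inside and outside secondary structure elements.

    ref_aligned / other_aligned: two rows of the same alignment ("-" = gap).
    ref_annotation: one character per ungapped reference residue, e.g. "H"
    for helix and "E" for strand; anything not in `structured` counts as
    connecting sequence. Returns the tuple (inside, outside).
    """
    if len(ref_aligned) != len(other_aligned):
        raise ValueError("aligned rows must have the same length")
    if len(ref_aligned.replace("-", "")) != len(ref_annotation):
        raise ValueError("annotation must match the ungapped reference")

    inside = outside = 0
    pos = 0  # index of the next reference residue in ref_annotation
    for ref_char, other_char in zip(ref_aligned, other_aligned):
        if ref_char != "-":
            in_element = ref_annotation[pos] in structured
            pos += 1
            if other_char == "-":  # deletion relative to the reference
                if in_element:
                    inside += 1
                else:
                    outside += 1
        elif other_char != "-":  # insertion relative to the reference
            # An insertion splits an element only if the reference residues
            # on both sides of it are annotated as structured.
            before = pos > 0 and ref_annotation[pos - 1] in structured
            after = pos < len(ref_annotation) and ref_annotation[pos] in structured
            if before and after:
                inside += 1
            else:
                outside += 1
    return inside, outside
```

A high "inside" count does not prove that an alignment is wrong, but it points to the columns that deserve a closer look. If you want the bridged helix positions (lowercase "h") to count as structured, pass <code>structured="HEh"</code>.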
 
&nbsp;
 
 
  
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
===(3.1) APSES domains (1 mark)===
+
=== (3.1) The PDB file (1 mark)===
 
</div>
 
</div>
 
&nbsp;<br>
 
&nbsp;<br>
  
The APSES domains in all of our Mbp1 orthologues are highly conserved and any program must be able to align such obviously similar regions.
+
Open your  '''model''' coordinates in a text-editor (make sure you view the PDB file in a fixed-width font) and consider the following questions: (Alternatively, view the coordinates linked to the [[Assignment_5_fallback_data|'''Fallback Data file''']].)
  
 
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
 
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
*Consider the CLUSTAL, Muscle and T-Coffee alignments of the Mbp1 orthologues.  Orient yourselves as to where the APSES domains are located. Briefly note whether the three alignments agree and, for one of the alignments, whether the charged residues in the proposed binding region are wholly or partially conserved across all 18 proteins. (Refer to the specific residues labelled (+) or (-) in the Mbp1 annotation above). (1 mark) <!-- Sequence variation may indicate variations in binding site -->
+
*What is the residue number of the first residue in the '''model'''? What should it be, based on the alignment? If the putative DNA binding region was reported to be residues 50-74 in the Mbp1 protein, which residues of the '''model''' correspond to that? (1 mark)
 
</div>
 
</div>
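Reading residue numbers off a PDB file by eye is error-prone, because the format is column-based rather than whitespace-delimited. The sketch below shows how the numbers could be extracted with a few lines of code. It relies on the fixed columns of PDB <code>ATOM</code> records (chain identifier in column 22, residue number in columns 23 to 26). The function names are my own, and the simple offset in <code>to_model_numbering</code> is an assumption that only holds if there are no indels between the two numbering schemes in the region of interest.

```python
def residue_numbers(pdb_lines, chain=None):
    """Return the sorted, unique residue numbers found in ATOM records.

    `pdb_lines` is an iterable of text lines from a PDB file. If `chain`
    is given, only records with that chain identifier are considered.
    """
    numbers = set()
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        if chain is not None and line[21] != chain:  # column 22, 0-based 21
            continue
        numbers.add(int(line[22:26]))  # columns 23-26 hold the residue number
    return sorted(numbers)


def to_model_numbering(first, last, model_first, expected_first):
    """Shift a residue range from protein numbering to model numbering.

    `model_first` is the number of the first residue as found in the model,
    `expected_first` the number that residue has in the full-length protein.
    Assumes a constant offset, i.e. no indels within the range.
    """
    offset = model_first - expected_first
    return first + offset, last + offset


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py model.pdb  (the file name is up to you)
    with open(sys.argv[1]) as handle:
        found = residue_numbers(handle)
    print("first residue:", found[0], "last residue:", found[-1])
```

The script only reports what is in the file; deciding what the first residue number ''should'' be still requires going back to the alignment.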
&nbsp;<br>
+
 
 +
<!-- discuss flagging of loops - setting of B-factor to 99.0 -->
  
 
&nbsp;
 
&nbsp;
Line 527: Line 292:
  
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
 
+
===(3.2) first visualization (3 marks)===
===(3.2) Ankyrin domains (1 mark)===
 
 
</div>
 
</div>
 
&nbsp;<br>
 
&nbsp;<br>
  
The Ankyrin domains are more highly diverged, the boundaries are less well defined and not even CDD, SMART and SAS agree on the precise annotations. Nevertheless we would hope that a good alignment would recognize homology in that region and that ideally the required indels would be placed between the secondary structure elements, not in their middle.
+
In assignment 2 you have already studied the 1MB1 coordinate file and compared it to your organism's Mbp1 APSES domain. Since a homology model inherits its structural details from the '''template''', the model should look very similar to the original structure but contain the sequence of the '''target'''.
  
 
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
 
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
*For one of the alignments of your choice (CLUSTAL, T-Coffee or MUSCLE), identify the helices in the Ankyrin repeat region of Mbp1, based on the annotations given above. (This is probably easiest done by pasting that part of the alignment into a word-processor and highlighting the residues you are discussing). Briefly state whether the indels in this region are concentrated in segments that connect the helices, or if they are more or less evenly distributed along the entire region of similarity. Conclude whether the assertion that ''indels should not be placed in elements of secondary structure'' has merit in this case; in particular if you notice indels that violate this rule-of-thumb, consider whether the location of the indel has strong support from aligned sequence motifs, or whether it could apparently be placed into a different location without much loss in alignment quality. Support your conclusions with specific reference to particular elements of the alignment. (1 mark)
+
*Save the attachment of your '''model''' coordinates to your hard disk and visualize it in RasMol. (Alternatively, copy and save the coordinates from the [[Assignment_5_fallback_data|'''Fallback Data file''']] to your hard disk.) Make an informative view in divergent stereo and paste it into your assignment. (3 marks)
</div>
 
&nbsp;<br>
 
 
 
&nbsp;
 
&nbsp;
 
 
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
 
  
===(3.3)  Other features (1 mark)===
 
 
</div>
 
</div>
 
&nbsp;<br>
 
&nbsp;<br>
  
Aligning functional features like ''coiled coil domains'' or ''intrinsically disordered regions'' is even more difficult, since such features are to a large degree a property of the amino acid composition rather than of the precise sequence. Thus we would expect it to be difficult to detect the correspondence between sequences in such regions.  I have annotated four low complexity regions of the yeast Mbp1 sequence.
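To make the idea of "a property of the amino acid composition" concrete, here is a minimal sketch that flags windows of low Shannon entropy in a sequence. This is a simplified stand-in for illustration only; it is not the procedure that SMART uses, and the window size and threshold are arbitrary assumptions.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (in bits) of the residue composition of a segment."""
    counts = Counter(segment)
    n = len(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_windows(sequence, window=12, threshold=2.2):
    """Return 0-based start positions of windows whose entropy is below threshold.

    window and threshold are illustrative defaults, not calibrated values.
    """
    return [
        i
        for i in range(len(sequence) - window + 1)
        if shannon_entropy(sequence[i:i + window]) < threshold
    ]


# A made-up sequence: a varied stretch followed by a glutamine-rich stretch.
print(low_complexity_windows("ACDEFGHIKLMNQQQQQQQQQQQQ"))
```

Note that two segments can both score as low complexity without sharing any alignable sequence, which is exactly why such regions are hard for an MSA program to match.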
 
  
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
+
[[Image:A5_Mbp1_subdomain.jpg|frame|none|Stereo-view of a subdomain within the 1MB1 structure that includes residues 36 to 76. The color gradient ramps from blue (36) to green (76).]]
*Copy the Mbp1 sequence from your organism from the multi-FASTA files and run a [http://smart.embl-heidelberg.de/ SMART] sequence analysis: paste your FASTA formatted sequence (or its Uniprot accession number), check only the checkbox for detecting '''intrinsic protein disorder''' and click "Sequence SMART". Locate the segments of '''low complexity''' for your sequence (they are in the lower part of the results page since they overlap with disordered segments). Now comment on '''one''' of the multiple sequence alignments: do the proteins '''have''' similar low complexity regions, and have these been '''aligned''' by the MSA algorithm? Briefly describe the situation: state whether these segments are found in the same general region, in the same detailed location, or perhaps even conserved in sequence, when you compare them to the ''Saccharomyces cerevisiae'' protein.  Back up your conclusions with specific reference to particular elements of the alignment.
 
  
* Briefly discuss whether this observation should lead you to conclude that disorder in these proteins appears to be a conserved feature, i.e. a feature that is selected for in evolution. (1 mark)
 
</div>
 
&nbsp;<br>
 
 
&nbsp;
 
&nbsp;
 
<!-- add at a later time similar analysis of coils via 2ZIP server - conserved feature? [http://2zip.molgen.mpg.de/index.html 2Zip server], also add VMD alignment on ankyrin prototype.
 
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
 
*Task
 
</div>
 
&nbsp;<br>
 
 
&nbsp;
 
&nbsp;
-->
 
  
  
 +
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
  
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
+
===(3.3) modeling a DNA ligand (4 marks)===
 
 
==(4) APSES domain homologues: analysis of domain MSAs==
 
 
</div>
 
</div>
 
&nbsp;<br>
 
&nbsp;<br>
  
You have read how to generate a source sequence file based on the results of a PSI-BLAST search for all APSES domains in fungi. Of course, since PSI-BLAST has detected these sequences due to their high similarity to a sequence profile, this similarity implies an alignment; this is a model-based MSA because the sequences are aligned to a prototypic model and not to each other. To align these domains, the MUSCLE server is the tool of choice for such highly diverged sequences. For comparison, a CLUSTAL alignment has been computed as well.
+
The really interesting question we could begin to address with our model is how sequence variation might be converted into changing DNA recognition sites, and then lead to changed cognate DNA binding sequences. But in order to address this, we would need to add a plausible model for a bound DNA molecule to our model.
  
* The [[APSES_domains_PSI-BLAST| resulting alignment derived from the '''PSI-BLAST''' profile]] as an example of a model-based alignment. <small>Note that PSI-BLAST has not been optimized to work as an alignment program, thus the conclusion that model-based alignments are inferior because this example is a poor alignment is not justified.</small>
+
Since there is currently no software available that would accurately model such a complex from first principles, we will base this on homology modeling as well. This means we need to find a similar structure for which the complex structure is known. However, you may remember from the third assignment that the APSES domains in fungi seem to be a relatively small family. And there is no structure available of a protein-DNA complex. Now what?
* The [[APSES_domains_CLUSTAL| '''CLUSTAL-W''' alignment]] as an example of a progressive alignment.
 
* The [[APSES_domains_MUSCLE| '''MUSCLE''' alignment]] as an example of a consistency-based alignment.
 
  
If we compare the alignments, we notice immediately that they disagree over significant portions of the sequences.
+
Remember that homologous sequences can have diverged to the point where their sequence similarity is no longer recognizable, even though their structure may be quite well conserved. Thus if we could find similar structures in the PDB, these might provide us with some plausible hypotheses for how DNA is bound by APSES domains. We thus need a tool similar to BLAST, not for sequence alignment but for structure alignment. A kind of BLAST for structures.
&nbsp;
 
&nbsp;
 
  
===(4.1)  Manual improvement  (1 mark)===
+
However, just as with BLAST, we might not want to search with the entire protein if all we are interested in is a subdomain that binds to DNA. Attempting to match all structural elements in addition to the ones we are actually interested in is likely to make the search less specific: we would find false positives that are similar to some irrelevant part of our structure. However, defining too small a subdomain would also lead to a loss of specificity: in the extreme it is easy to imagine that the search for e.g. a single helix would retrieve very many hits that would be quite meaningless. The arrangement of the residues from 50 to 74 that we have already discussed in Assignment 2 suggests that the compact subdomain from 36 to 76 (see the image above) might be a useful structure to search with: it contains the residues we are interested in and enough connected secondary structure elements to be structurally meaningful.
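Cutting such a subdomain out of a coordinate file is a simple filtering task. The following is a minimal sketch, assuming a standard fixed-column PDB file (chain identifier in column 22, residue number in columns 23 to 26); the function name is hypothetical, and the chain identifier you pass in is something you need to check in your own copy of the file.

```python
def extract_residues(pdb_lines, chain, first, last):
    """Keep the ATOM records of one chain with residue numbers in [first, last]."""
    kept = []
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue  # skip headers, HETATM, TER and everything else
        if line[21] == chain and first <= int(line[22:26]) <= last:
            kept.append(line)
    return kept


# Hypothetical CA-only records for residues 35, 36, 76 and 77 of a chain A.
example = [
    "ATOM      1  CA  ALA A" + "%4d" % number + "       0.000   0.000   0.000"
    for number in (35, 36, 76, 77)
]
for record in extract_residues(example, "A", 36, 76):
    print(record)
```

With a downloaded coordinate file you would pass the open file instead of the example list, e.g. <code>extract_residues(open("1MB1.pdb"), "A", 36, 76)</code>, where both the filename and the chain identifier "A" are assumptions.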
  
Often errors or inconsistencies are easy to spot, and manually editing an MSA is not generally frowned upon, even though this is not a strictly objective procedure. The main goal of manual editing is to make an alignment biologically more plausible. Most commonly this means to minimize the number of rare evolutionary events that the alignment suggests and/or to emphasize conservation of known functional motifs. Here are some examples of what one might aim for in manually editing an alignment:
+
At the '''NCBI''', [http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml VAST] is a structural similarity search tool for this purpose. Unfortunately it does not seem to be able to handle a query with such a structural subdomain (the process did not finish after several days), but at least you can get a list of structural neighbors of the 1MB1 full-length template structure by entering the PDB ID in a small form field on the VAST home page, and then clicking on the colored bar labeled "Chain" on the MMDB structure summary page. This precomputed page for the 1MB1 structure shows a number of diverse proteins matching to various helices and strands of the structure.
  
* Reduce number of indels
+
At the '''EBI''' there are a number of very well designed structure analysis tools linked off the [http://www.ebi.ac.uk/Tools/structural.html '''Structural Analysis''' page]. As part of its MSD Services, the SSM (Secondary Structure Matching service) provides a well thought out interface for searching files from the PDB or uploading coordinates.
  
From a Probcons alignment:
+
After uploading the coordinates for residues 36 to 76 of the 1MB1 structure, running the search, and sorting the results by alignment length, the top hits include a number of nucleotide binding proteins such as a replication terminator (1F4K), the LexA repressor (1MVD) and a "Winged Helix" protein (1KQ8). These are all members of a much larger superfamily, the "winged helix" DNA binding domains ([http://cathwww.biochem.ucl.ac.uk/cgi-bin/cath/GotoCath.pl?cath=1.10.10.10 CATH 1.10.10.10]), of which hundreds of structures have been solved. They represent one branch of the tree of helix-turn-helix (HTH) DNA binding modules. (A recent review on HTH proteins is linked from the resources section at the bottom of this page). Winged Helix domains typically bind their cognate DNA with a "recognition helix" which precedes the beta hairpin and binds into the major groove; additional stabilizing interactions are provided by the edge of the beta strand binding into the minor groove.
0447_DEBHA    ILKTE-K<span style="color:#FF0000;">-</span>T<span style="color:#FF0000;">---</span>K--SVVK      ILKTE----KTK---SVVK
 
9978_GIBZE    MLGLN<span style="color:#FF0000;">-</span>PGLKEIT--HSIT      MLGLNPGLKEIT---HSIT
 
1513_CANAL    ILKTE-K<span style="color:#FF0000;">-</span>I<span style="color:#FF0000;">---</span>K--NVVK      ILKTE----KIK---NVVK
 
6132_SCHPO    ELDDI-I<span style="color:#FF0000;">-</span>ESGDY--ENVD      ELDDI-IESGDY---ENVD
 
1244_ASPFU    ----N<span style="color:#FF0000;">-</span>PGLREIC--HSIT  ->  ----NPGLREIC---HSIT
 
0925_USTMA    LVKTC<span style="color:#FF0000;">-</span>PALDPHI--TKLK      LVKTCPALDPHI---TKLK
 
2599_ASPTE    VLDAN<span style="color:#FF0000;">-</span>PGLREIS--HSIT      VLDANPGLREIS---HSIT
 
9773_DEBHA    LLESTPKQYHQHI--KRIR      LLESTPKQYHQHI--KRIR
 
0918_CANAL    LLESTPKEYQQYI--KRIR      LLESTPKEYQQYI--KRIR
 
  
<small>Gaps marked in red were moved. The sequence similarity in the alignment does not change considerably, however the total number of indels in this excerpt is reduced to 13 from the original 22</small>
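The indel counts quoted for the Probcons excerpt can be reproduced by counting every maximal run of gap characters as one indel. The following minimal sketch does this for the two versions of the excerpt; note that it also counts the leading gap in 1244_ASPFU as one run, which is an assumption about how the count was made.

```python
import re


def count_gap_runs(rows):
    """Count maximal runs of '-' over all rows; each run is one implied indel."""
    return sum(len(re.findall(r"-+", row)) for row in rows)


original = [
    "ILKTE-K-T---K--SVVK",  # 0447_DEBHA
    "MLGLN-PGLKEIT--HSIT",  # 9978_GIBZE
    "ILKTE-K-I---K--NVVK",  # 1513_CANAL
    "ELDDI-I-ESGDY--ENVD",  # 6132_SCHPO
    "----N-PGLREIC--HSIT",  # 1244_ASPFU
    "LVKTC-PALDPHI--TKLK",  # 0925_USTMA
    "VLDAN-PGLREIS--HSIT",  # 2599_ASPTE
    "LLESTPKQYHQHI--KRIR",  # 9773_DEBHA
    "LLESTPKEYQQYI--KRIR",  # 0918_CANAL
]
edited = [
    "ILKTE----KTK---SVVK",
    "MLGLNPGLKEIT---HSIT",
    "ILKTE----KIK---NVVK",
    "ELDDI-IESGDY---ENVD",
    "----NPGLREIC---HSIT",
    "LVKTCPALDPHI---TKLK",
    "VLDANPGLREIS---HSIT",
    "LLESTPKQYHQHI--KRIR",
    "LLESTPKEYQQYI--KRIR",
]
print(count_gap_runs(original), count_gap_runs(edited))  # prints: 22 13
```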
+
<!-- The other service the EBI structure links to is the DALI server. DALI was one of the first algorithms capable of large-scale protein structure searches; it was developed by Liisa Holm and is now hosted by her group in Helsinki. Submitting our search domain generates the e-mailed result linked to here. Both results (there are only two) are also found in the top 100 list of the SSM service. The winged helix domain 1DP7 merits some comment though: its structure shows a novel mode of binding for DNA. Here, it is the beta-wing, not the "recognition helix" that inserts into the major groove! We will consider this in more detail below.
  
* Move indels to more plausible position
+
First we shall explore some of the structures that SSM has returned. The SSM server presents its result details in Web pages, but it also allows to download the entire result set in an XML formatted file. This is a common method of data-interchange in bioinformatics but you would not want to actually read such a file and manually extract information (even though you could, in principle). Thus I have prepared a summary file of the alignment details of the SSM results. This should allow you to rapidly find the exact aligned residues in the matched domains. While I have derived this file from the output through a computer program I have written, you could easily have accessed the same information on the Web, had you run the query yourself. -->
  
From a CLUSTAL alignment:
+
This is good news: once we have determined that the APSES domain is actually an example of a larger group of transcription factors, we can pick one of these for which a DNA complex structure is known. I have picked one such structure from the list of hits that were returned by SSM: it is the Elk-1 transcription factor.
4966_CANGL    MKHEKVQ------GGYGRFQ---GTW      MKHEKV<span style="color:#00AA00;">Q</span>------GGYGRFQ---GTW
 
1513_CANAL    KIKNVVK------VGSMNLK---GVW      KIKNVV<span style="color:#00AA00;">K</span>------VGSMNLK---GVW
 
6132_SCHPO    VDSKHP<span style="color:#FF0000;">-</span>----------<span style="color:#FF0000;">Q</span>ID---GVW  ->  VDSKHP<span style="color:#00AA00;">Q</span>-----------ID---GVW
 
1244_ASPFU    EICHSIT------GGALAAQ---GYW      EICHSI<span style="color:#00AA00;">T</span>------GGALAAQ---GYW
 
  
<small>The two characters marked in red were swapped. This does not change the number of indels but places the "Q" into a column in which it is more highly conserved (green). Progressive alignments are especially prone to this type of error.</small>
+
[[Image:A5_canonical_wHTH.jpg|frame|none|Stereo-view of the canonical DNA binding mode of the Winged Helix domain family. Shown here is the Elk-1 transcription factor - an ETS DNA binding domain - in complex with a high-affinity binding site (pdb|1DUX). Note how the "recognition helix" inserts into the major groove of the DNA molecule. The color gradient ramps from blue (34) to green (84). Note how the first helix of the "helix-turn-helix" architecture serves only to position the recognition helix and makes few interactions by itself.]]
  
* Conserve motifs
+
Now all that is left to do is to bring the DNA molecule  into the correct orientation for our '''model''' and then to combine the two files. We need to superimpose the Elk-1 protein/DNA complex onto our '''model'''.
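Superimposing one structure onto another amounts to applying a single rotation and a single translation to every atom of the structure that is moved, including the DNA. The following is a minimal sketch of that last step only, assuming the rotation matrix and translation vector are already known and that the file follows the standard fixed-column PDB layout (x, y, z in columns 31 to 54). The function name and the example values are hypothetical; finding the optimal rotation is the job of the superposition program and is not shown here.

```python
def apply_transform(pdb_lines, rotation, translation):
    """Rotate, then translate, the coordinates of every ATOM and HETATM record."""
    out = []
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")):
            x, y, z = (float(line[i:i + 8]) for i in (30, 38, 46))
            moved = [
                rotation[r][0] * x + rotation[r][1] * y + rotation[r][2] * z
                + translation[r]
                for r in range(3)
            ]
            # Write the new coordinates back into columns 31-54.
            line = line[:30] + "".join("%8.3f" % v for v in moved) + line[54:]
        out.append(line)
    return out


# Hypothetical example: rotate 90 degrees about z, then shift 10 A along x.
quarter_turn = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
shift = [10.0, 0.0, 0.0]
water = "HETATM    1  O   HOH A 101    " + "%8.3f%8.3f%8.3f" % (1.0, 2.0, 3.0)
print(apply_transform([water], quarter_turn, shift)[0])
```

Because the same transformation is applied to <code>ATOM</code> and <code>HETATM</code> records alike, ligands and solvent move together with the protein.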
  
From a CLUSTAL alignment:
+
;Structure superposition
6166_SCHPO      --DKR<span style="color:#FF0000;">V</span>A---<span style="color:#FF0000;">G</span>LWVPP      --DKR<span style="color:#FF0000;">V</span>A--<span style="color:#FF0000;">G</span>-LWVPP
+
There are quite a number of superposition servers available on the Web; a remarkably comprehensive overview can be found in [http://en.wikipedia.org/wiki/Structural_alignment Wikipedia]. However, overengineering and black-box mentality make our task more difficult than it need be: most tools do not allow users to specify particular alignment zones but attempt to automatically define the zones of residues to be superimposed according to some geometric target function. Almost none return the actual rotation matrix and translation vector that is used for the superposition. And almost none transform the coordinates of heteroatoms such as solvent, ligands or DNA molecules along with the protein coordinates. An exception that I have found to be very usable is the [http://www.predictioncenter.org/local/lga/lga.html Local-Global Alignment server ('''LGA''')], written by Adam Zemla. The procedure is quite straightforward:
XBP1_SACCE      GGYIK<span style="color:#FF0000;">I</span>Q---<span style="color:#FF0000;">G</span>TWLPM      GGYIK<span style="color:#FF0000;">I</span>Q--<span style="color:#FF0000;">G</span>-TWLPM
 
6355_ASPTE      --DE<span style="color:#FF0000;">I</span>A<span style="color:#FF0000;">G</span>---NVWISP  ->  ---DE<span style="color:#FF0000;">I</span>A--<span style="color:#FF0000;">G</span>NVWISP
 
5262_KLULA      GGYIK<span style="color:#FF0000;">I</span>Q---<span style="color:#FF0000;">G</span>TWLPY      GGYIK<span style="color:#FF0000;">I</span>Q--<span style="color:#FF0000;">G</span>-TWLPY
 
  
<small>The first of the two residues marked in red is a conserved, solvent exposed hydrophobic residue that may mediate domain interactions. The second residue is the conserved glycine in a beta turn that cannot be mutated without structural disruption. Changing the position of a gap and insertion in one sequence improves the conservation of both motifs.</small>
+
*Define the structure to be rotated (1DUX in this case). This is a dimer, so download the file from the PDB and manually edit it to contain only DNA chains A and B and protein chain C.
 +
*Define the structure to be held constant (1MB1 in this case). Download from PDB.
 +
*Use the "browse" option to define both files as input on the LGA input form.
 +
*Use the option to have both coordinate sets included in your output: <code>-o2</code>
 +
*Submit
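The manual edit of the 1DUX file in the first step above can also be scripted. The following is a minimal sketch, assuming a standard fixed-column PDB file with the chain identifier in column 22; the function name is hypothetical, and the sketch simply drops coordinate records of all chains other than the ones you ask for while keeping every other record unchanged.

```python
def keep_chains(pdb_lines, chains):
    """Keep coordinate records of the given chains; pass other records through."""
    coordinate_records = ("ATOM", "HETATM", "ANISOU", "TER")
    kept = []
    for line in pdb_lines:
        if line.startswith(coordinate_records):
            if len(line) > 21 and line[21] in chains:
                kept.append(line)
        else:
            kept.append(line)  # HEADER, REMARK, END and similar records
    return kept


# Hypothetical records: one DNA atom (chain A), two protein atoms (chains C, F).
example = [
    "HEADER    hypothetical example",
    "ATOM      1  P    DA A   1",
    "ATOM      2  CA  ALA C   5",
    "ATOM      3  CA  ALA F   5",
    "END",
]
for record in keep_chains(example, {"A", "B", "C"}):
    print(record)
```

Records that refer to the removed chains elsewhere in the file (for example in the header section) are not cleaned up by this sketch, so you may still want to inspect the result in a text editor.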
  
 
+
The results arrive by e-mail. I have linked the resulting PDB file to the [[Assignment_5_fallback_data|'''Fallback Data page''']]. <small>If you run this analysis on your own, you may want to review the types of edits I made to the PDB file to get it displayed correctly in RasMol.</small>
&nbsp;
 
&nbsp;
 
 
 
Please consider the following excerpt from the PSI-BLAST alignment:
 
 
 
'''Mbp1_SACCE  RILEKEV-LKET-HE--KVQG-GF-GK-----------Y-----------QGTW'''
 
MbpA_ASPTE  KTLEKEI-AAGE-HE--KVQG-GY-GK-----------Y-----------QGTW
 
MbpC_CANAL  NYFDNEI-LSNLKYF--GSSS-NT-PQ-----------YLDLRKHQNIYLQGIW
 
MbpB_CANAL  KLLESTP-KEYQ-QYIKRIRG-GF-LK-----------I-----------QGTW
 
MbpA_CANAL  KILEKGV-QQGL-HE--KVQG-GF-GR-----------F-----------QGTW
 
Swi4_CANGL  KILEKES-TNMK-HE--KVQG-GY-GR-----------F-----------QGTW
 
MbpA_COPCI  KMIDSQPDLAPL-IR--RVRG-GY-LK-----------I-----------QGTW
 
MbpA_CRYNE  RVLEREV-QKGE-HE--KVQG-GY-GK-----------Y-----------QGTW
 
MbpB_DEBHA  KLLESTP-KQYH-QHIKRIRG-GF-LK-----------I-----------QGTW
 
MbpA_DEBHA  KILEKGV-QQGL-HE--KIQG-GY-GR-----------F-----------QGTW
 
Swi4_DEBHA  NFLNNEI-LTNT-QY--LSSG-GSNPQFNDLRNHEVRDL-----------RGLW
 
Swi4_KLULA  KILEKEA-NEIK-HE--KIQG-GY-GR-----------F-----------QGTW
 
Swi4_SACCE  KILEKES-NDMQ-HE--KVQG-GY-GR-----------F-----------QGTW
 
Swi4_USTMA  KILEKSI-LTGE-HE--KIQG-GY-GK-----------F-----------QGTW
 
  
  
 
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
 
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
*Find at least one example where this alignment could be manually improved. Show the original version, the improved version, highlight the changes in red and explain your rationale for the change. (1 mark)
+
*Save the superimposed coordinates in a file, open and view in RasMol and note how well the "recognition helix" and adjacent beta strands superimpose! (Alternatively, copy and save the coordinates from the [[Assignment_5_fallback_data|'''Fallback Data file''']] to your hard disk.) Make an informative view in divergent stereo and paste it into your assignment. (4 marks)
</div>
 
 
 
&nbsp;
 
&nbsp;
 
 
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
 
 
 
===(4.2) Patterns of residue conservation  (1 mark)===
 
 
</div>
 
</div>
 
&nbsp;<br>
 
&nbsp;<br>
 
 
With any computational tool, we have to consider whether the program's objective function corresponds to our requirements. For example, the lack of conservation in a particular column does not necessarily mean that a residue has changed in evolution - sometimes this is simply a consequence of an alignment that has matched residues with a higher score at the expense of conserving columns we believe to be biologically important. MSAs can only take sequence information into account, while we may have complementary information available on structural and functional conservation patterns. This may include secondary structure (gaps should be moved out of regions of secondary structure, where possible), structurally required residues (these are expected to be conserved across all structurally similar sequences), and functionally conserved residues (these are expected to have a high likelihood of being conserved within groups of orthologues, but varying between paralogues).
 
 
In terms of structural conservation, we expect motif or consistency based alignments to be more accurate since they align to the "big picture". In terms of functional variation we expect progressive alignments to be more accurate, since they align to local similarities.
 
 
Let us consider the alignments in terms of their biological relevance. I have annotated the ligand-binding residues for the yeast Mbp1 APSES domain in the multiple sequence alignments by color coding the charged residues that putatively could bind DNA <span style="color:#FF0000;">'''red'''</span> (-) and <span style="color:#0066FF;">'''blue'''</span> (+).  Thus these residues label columns of the alignment in which we expect ''functional'' conservation. I have also highlighted two residues that are associated with important structural features of the APSES domain in <span style="color:#00AA33;">'''green'''</span>. These two residues are G75, a mandatory glycine in the third position of a particular type of beta-turn, and W77, a key component of the domain's hydrophobic core. Thus these two residues label columns in which we expect ''structural'' conservation. Let's assume (''i'') that all the APSES domains fold into similar structures and (''ii'') that they all bind DNA, but (''iii'') they do not necessarily bind the same cognate sequence, as a consequence of the functional diversification of paralogues. This should allow you to discuss the following questions:
 
 
 
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
 
Consider any '''one''' of the three APSES domain alignments. 
 
 
*Are the patterns of sequence variation for ''functionally conserved'' residues compatible with the notion that orthologues have conserved binding specificities and paralogues have acquired new functions by binding to different sequences?
 
*Are the patterns of sequence variation for ''structurally conserved'' residues compatible with the notion that all APSES domains have a common fold? (1 mark)
 
 
For both cases, state briefly (but with reference to specific sequences and residues) what you would expect (hypothesis) and whether the alignment supports or contradicts your expectations (observation). We have determined that the sequences labelled as Mbp1 are orthologues, and the other labels were constructed to identify the yeast gene that each sequence is most similar to (although a reciprocal search was not done). This means you may group Mbp1 sequences as orthologues, Swi4, Sok2, and Phd1 sequences are presumably orthologous, and all sequences originating from the same organism are of course groups of paralogues. However, labels such as MbpA, MbpB etc. are arbitrary: these sequences as a group are paralogous to e.g. Mbp1 but not necessarily orthologous to each other. Your discussion ''may'' be easier if you sort the sequences differently than they are presented, this is easy to do in a text editor. Re-sorting does not change the alignment.
 
</div>
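Columns of strict structural conservation are easy to find programmatically. As a minimal sketch, the following function lists the columns in which every sequence has the same residue; the input is the last four columns of the PSI-BLAST excerpt shown in section 4.1, typed in by hand, and the function name is hypothetical.

```python
def invariant_columns(rows):
    """Return 0-based indices of columns where all rows share one non-gap residue."""
    return [
        index
        for index, column in enumerate(zip(*rows))
        if len(set(column)) == 1 and column[0] != "-"
    ]


# Last four columns of the 14 rows of the excerpt: twelve end in QGTW,
# MbpC_CANAL ends in QGIW and Swi4_DEBHA ends in RGLW.
tail = ["QGTW"] * 12 + ["QGIW", "RGLW"]
print(invariant_columns(tail))  # prints: [1, 3]
```

The two invariant columns are the glycine and the tryptophan, which is what we would expect if these positions are required for the fold; the variable columns are where you should look for differences between orthologues and paralogues.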
 
 
&nbsp;
 
 
&nbsp;
 
&nbsp;
  
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
 
===(4.3)  Visualization and analysis of alignment with VMD  (2 marks)===
 
</div>
 
&nbsp;<br>
 
 
VMD offers a very well constructed set of tools for the analysis of sequence and structural conservation: the '''MultiSeq''' extension. In this part of the assignment you will use VMD to analyse and visualize conservation patterns and comment on the alignments the servers have produced. I highly recommend that you familiarize yourself with MultiSeq; the developers have produced an [http://www.ks.uiuc.edu/Training/Tutorials/#evolution excellent tutorial on the evolution of tRNA synthetases] to showcase the program's capabilities. However I am not ''requiring'' this for the course and we will be using only a subset of the available MultiSeq functions. The tool is intuitive enough that beginning to use it should require no more than following the steps below.
 
 
Proceed through the following steps:
 
:(1) Save an alignment of the APSES domains on your computer.
 
::(A) Choose either the CLUSTAL or MUSCLE alignment of all APSES domains, copy it from the Wiki page and save it on your computer, as a '''text file''' with some convenient filename and the extension .aln . This is a CLUSTAL formatted input file.
 
::(B) Edit it to contain only the aligned sequences, i.e. remove any header lines and rows of conservation symbols. Make sure you are not saving the file in MS-Word binary format (.doc).
 
 
:(2) Open the Multiseq extension in VMD.
 
::(A) start VMD and load one of the APSES domain structures (1BM8 or 1MB1).
 
::(B) choose a stereo representation that will show you the fold of the domain and the sidechains of key residues. For example you could use a Tube representation for the protein backbone and a Licorice representation for the selection <code>((sidechain or type CA) and not element H) and resid 30 to 90</code>.  (And switch the axes display off! The axes carry no information you need).
 
::(C) On the VMD Main form navigate to Extensions &rarr; Analysis &rarr; MultiSeq
 
::(D) When you run MultiSeq for the first time, you will be asked for a directory in which to store metadata. You can use the default or a directory of your choice; you may subsequently skip all steps that ask you to install "required" databases locally since we will not need them for this task.
 
::(E) A window will appear, the ''MultiSeq'' window; it contains the sequence of the APSES domain you are visualizing. MultiSeq will also generate an additional cartoon representation of the structure.
 
 
:(3) Load the APSES alignment.
 
::(A) In the MultiSeq Window, navigate to File &rarr; Import Data...; Choose "From Files" and Browse to the location of the alignment you have saved. The File navigation window gives you options which files to enable: choose to Enable ALN files (these are CLUSTAL formatted multiple sequence alignments).
 
::(B) Open the alignment file, click on Ok to Import Data, it will take a short while to load.
 
::(C) find the Mbp1_SACCE sequence in the list, click on it and move it to the top of the '''Sequences''' list with your mouse (the list is not static, you can re-order the sequences in any way you like).
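The clean-up described in step (1B) above can be done in a text editor, but it can also be sketched in a few lines of code. The following is a minimal sketch that assumes the usual CLUSTAL layout, in which the header line starts with the program name and conservation rows start with whitespace; the function name and the tiny example alignment are made up for illustration.

```python
def strip_clustal(lines):
    """Drop the header line and conservation rows; keep sequence rows."""
    kept = []
    for line in lines:
        text = line.rstrip("\n")
        if text.startswith(("CLUSTAL", "MUSCLE")):
            continue  # header line
        if not text.strip():
            kept.append("")  # keep blank lines that separate alignment blocks
            continue
        if text[0].isspace():
            continue  # conservation row: padded with blanks, holds only * : .
        kept.append(text)
    return kept


# Hypothetical two-sequence alignment block.
example = [
    "CLUSTAL W (1.83) multiple sequence alignment",
    "",
    "Mbp1_SACCE  QGTW",
    "Swi4_DEBHA  RGLW",
    "             * *",
]
print("\n".join(strip_clustal(example)))
```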
 
 
You will see that the structure's sequence and the APSES domain sequence do not match; at the beginning the structure has extra sequence extending its N-terminus and in the middle the APSES sequences have gaps inserted.
 
 
:(4) Bring the structure's sequence in register with the APSES alignment.
 
::(A) MultiSeq supports typical text-editor selection mechanisms. Clicking on a residue selects it, clicking on a row selects the whole sequence. Dragging with the mouse selects several residues, shift-clicking selects ranges, and option-clicking toggles the selection on or off for individual residues. Using the mouse and/or the shift key as required, select the '''entire first column''' of the sequence group.
 
::(B) Select Edit &rarr; Enable Editing... &rarr; Gaps only to allow changing indels.
 
::(C) Pressing the spacebar once should insert a gap character before the selected column in all sequences. Insert as many gaps as you need to align the beginning of sequences with the corresponding residues of the structure <code>S&nbsp;I&nbsp;M&nbsp;...</code>.
 
::(D) Now insert as many gaps as you need into the '''structure''' sequence, to align it completely with the Mbp1_SACCE APSES domain sequence. (Simply select residues in the sequence and use the space bar to insert gaps.) <small>(Note: I have noticed a bug that sometimes prevents slider or keyboard input to the MultiSeq window; it fails to ''regain focus'' after operations in a different window. I don't know whether this is a Mac related problem or a more general bug in MultiSeq. When this happens I quit VMD and restore the session from a saved state. It is a bit annoying but not mission-critical.)</small>
 
::(E) When you are done, it may be prudent to save the state of your alignment. Use File &rarr; Save Session...
 
 
:(5) Color by similarity
 
::(A) Use the View &rarr; Coloring &rarr; Sequence similarity &rarr; BLOSUM30 option to color the residues in the alignment and structure. This clearly shows you where conserved and variable residues are located and allows you to analyze their structural context.
 
::(B) You can adjust the color scale in the usual way by navigating to VMD main &rarr; Graphics &rarr; Colors..., choosing the Color Scale tab and adjusting the scale midpoint (0.75 works well for me).
 
::(C) Navigate to the Representations window and apply the color scheme to your tube-and-sidechain representation: double-click on the NewCartoon representation to hide it and use '''User''' coloring of your Tube and Licorice representations to apply the sequence similarity color gradient that MultiSeq has calculated. The example below shows in principle what you could expect to see (without sidechains).
 
 
[[Image:A03_02.jpg|frame|none|Assignment 3, Figure 02<br>
 
Stereo view of a tube representation of an APSES domain structure, colored according to residue similarity of all fungal APSES domains as defined in this assignment. A BLOSUM30 similarity matrix was applied and a gradient midpoint of 0.75. The domain is oriented with the putative recognition helix towards the front, left and the "wing" on the right.]]
 
 
::(D) Now delete all non-Mbp1 sequences from the alignment and recalculate the similarity coloring using only the Mbp1 orthologues. You may want to shift the gradient midpoint to 0.9 or so since overall conservation is much higher. Again study the conservation patterns.
 
 
[[Image:A03_03.jpg|frame|none|Assignment 3, Figure 03<br>
 
Stereo view of a tube representation of an APSES domain structure, colored according to residue similarity of all Mbp1 orthologue APSES domains, as defined in this assignment. A BLOSUM50 similarity matrix was applied and a gradient midpoint of 0.90. The domain is oriented with the putative recognition helix towards the front, left and the "wing" on the right.]]
 
 
 
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
 
 
*Generate two parallel stereo views that show the APSES domain backbone and selected sidechains as described above. One should be colored by sequence similarity among all APSES domains, the other by similarity among only the Mbp1 orthologues. Scale and rotate the structure so that the putative DNA binding domain is easily visible. Paste both views into your assignment in a compressed format, as was explained for Assignment 2.
 
 
*Briefly discuss what you see (with reference to specific residues and sidechains) and what you conclude about residue conservation in the alignment of all APSES domains. Are the patterns of sequence variation for ''structurally conserved'' residues compatible with the notion that all APSES domains have a common fold?
 
 
*Briefly discuss how the situation changes when you compare only Mbp1 orthologues with each other. Never mind that overall conservation is higher: does the '''distribution''' of conserved residues in the context of the domain change, and if so, how? Are the patterns of sequence variation for ''functionally conserved'' residues compatible with the notion that all Mbp1 orthologues have a similar function?
 
 
*The structure makes it easy to confirm where gaps in the alignment have been placed. Discuss briefly (but with reference to specific instances) whether the indel placements of CLUSTAL or MUSCLE appear more plausible. To do this, define where you would expect to find indels and where they have been placed by the MSA program. (2 marks total)
 
 
</div>
 
 
&nbsp;
 
&nbsp;
 
  
 
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
 
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
  
==(5) Summary of Resources==
+
==(4) Summary of Resources==
 
</div>
 
</div>
 
&nbsp;<br>
 
&nbsp;<br>
  
 
;Links
 
;Links
:* [[Organism_list_2007|Assigned Organisms]]
+
:* [http://biochemistry.utoronto.ca/undergraduates/courses/BCH441H/restricted/Peitsch_2002_UseOfModels.pdf '''Review (PDF, restricted)''' Manuel Peitsch on Homology Modeling]
:* [http://www.ncbi.nlm.nih.gov/blast '''BLAST''']
+
:* [http://biochemistry.utoronto.ca/undergraduates/courses/BCH441H/restricted/Aravind_2005_HTHdomains.pdf '''Review (PDF, restricted)''' Aravind ''et al.'' Helix-turn-helix domains] (background reading, not required reading)
:* [http://www.pir.uniprot.org/search/idmapping.shtml '''Uniprot ID mapping''' service]
+
:* [[Organism_list_2006|Assigned Organisms]]
:* [http://www.ncbi.nlm.nih.gov/sutils/blink.cgi?pid=68465419  A '''BLink''' example]
+
:* [http://www.rcsb.org/pdb/file_formats/pdb/pdbguide2.2/guide2.2_frame.html '''PDB file format''']
:* [http://www.ebi.ac.uk/clustalw/ EBI '''CLUSTAL-W''' server]
+
:* [http://en.wikipedia.org/wiki/Structural_alignment Wikipedia on '''Structural Superposition'''] <small>(although the article is called "Structural Alignment")</small>
:* [http://www.ebi.ac.uk/muscle/ EBI '''MUSCLE''' server]
 
:* [http://www.ebi.ac.uk/t-coffee/ EBI '''T-Coffee''' server]
 
:* [http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi '''CDD''']
 
:* [http://smart.embl-heidelberg.de/ '''SMART''']
 
:* [http://www.ebi.ac.uk/thornton-srv/databases/sas/ '''SAS''']
 
  
;Sequences
+
:* [[Assignment_5_fallback_data|'''Fallback Data page''']]
:* [[All_Mbp1_proteins|'''All Mbp1 proteins''']]
 
:* [[All_APSES_domains|'''All APSES domains''']]
 
  
 
;Alignments
 
;Alignments
:'''Mbp1 proteins:'''
 
:* [[All_Mbp1_CLUSTAL|Mbp1 proteins '''CLUSTAL''' aligned]]
 
:* [[All_Mbp1_MUSCLE|Mbp1 proteins '''MUSCLE''' aligned]]
 
 
:* [[All_Mbp1_T-COFFEE|Mbp1 proteins '''T-Coffee''' aligned (text version)]]
 
:* [[All_Mbp1_T-COFFEE|Mbp1 proteins '''T-Coffee''' aligned (text version)]]
:* [http://biochemistry.utoronto.ca/undergraduates/courses/BCH441H/resources/T-coffee_scores.html Mbp1 proteins '''T-Coffee''' aligned (coloured according to scores)]
 
 
:'''APSES domains:'''
 
:* [[APSES_domains_PSI-BLAST|All APSES domains - alignment based on '''PSI-BLAST''' results]]
 
:* [[APSES_domains_CLUSTAL|All APSES domains -  '''CLUSTAL-W''' alignment]]
 
:* [[APSES_domains_MUSCLE|All APSES domains -  '''MUSCLE''' alignment]]
 
 
  
 
&nbsp;
 
&nbsp;

Revision as of 15:28, 27 October 2007

Note! This assignment is currently inactive. Major and minor unannounced changes may be made at any time.

 
 


   

Assignment 4 - Homology modeling

How could the search for ultimate truth have revealed so hideous and visceral-looking an object?
Max Perutz (on his first glimpse of the Hemoglobin structure)

   

Where is the hidden beauty in structure, and where, the "ultimate truth"? In the previous assignments we have studied sequence conservation in APSES family domains and looked at how these domains have evolved over time. We have seen that this is an ancient family that already had several members in the cenancestor of all fungi, an organism that lived in the Vendian period of the Proterozoic era of Precambrian times, more than 600,000,000 years ago.

In order to understand how particular residues in the sequence contribute to the putative function of the protein, and why and how they are conserved throughout evolution, we would need to consider an explicit molecular model of an APSES domain protein, bound to its cognate DNA sequence. In particular, it would be interesting to correlate the conservation patterns we have observed in the MSAs with specific DNA binding interactions. Unfortunately, the 1MB1 structure does not have DNA bound and the evidence we have considered in Assignment 2 (Taylor et al., 2000) is not sufficient to define the details of how a DNA double helix might be bound. These details would require the structure of a complex that contains protein as well as DNA. No such complex of an APSES domain has yet been crystallized.

In this assignment you will construct a molecular model of the Mbp1 orthologue in your assigned organism, identify similar structures of distantly related domains for which protein-DNA complexes are known, define whether the available evidence allows you to distinguish between different modes of ligand binding, and assemble a hypothetical complex structure.

For what follows, please remember this terminology:

Target
The protein that you are planning to model.
Template
The protein whose structure you are using as a guide to build the model.
Model
The structure that results from the modeling process. It has the Target sequence and is similar to the Template structure.

 

A brief overview article on the construction and use of homology models is linked to the resource section at the bottom of this page. That section also contains links to other sites and resources you might require.

Preparation, submission and due date

Read carefully.
Be sure you have understood all parts of the assignment and cover all questions in your answers! Sadly, we see too many assignments which, arduously effected, nevertheless intimate nescience of elementary tenets of molecular biology. If the sentence above did not trigger an urge to open a dictionary, you are trying to guess, rather than confirm possibly important information.

Review the guidelines for preparation and submission of BCH441 assignments.

The due date for the assignment is Monday, November 5 at 10:00 in the morning.

   


(1) Preparation



   

The input alignment (1 mark)

 

The sequence alignment between target and template is the single most important factor that determines the quality of your model.

No homology modeling process will repair an incorrect alignment, and it is useful to think of a homology model as a three-dimensional map of a sequence alignment rather than as a structure in its own right. In a homology modeling project, typically the largest amount of time should be spent on preparing the best possible alignment. Even though automated servers like the SwissModel server will align sequences and select template structures for you, it would be unwise to use these just because they are convenient, rather than the more sophisticated methods and more informed procedures we have discussed. Detailed analysis of fallacious models rarely leads to good results.

The best possible alignment is usually constructed from a multiple sequence alignment that includes at least the target and template sequence and other related sequences as well. The additional sequences are an important aid in identifying the correct placement of insertions and deletions. Typically such an alignment will also include additional optimization steps to move insertions or deletions between target and template out of the secondary structure elements of the template structure.

Here is an excerpt from the T-Coffee aligned Mbp1 sequences: it contains all the residues of the yeast sequence that are found in the 1MB1 crystal structure - the template sequence for our homology model - and it has been edited to remove the N-terminal gaps in the sequence. Thus the N-terminus is 21 amino acids longer than the definition of the APSES domain in CDD (which starts with SIMKR...), and the C-terminus is slightly shorter.

Since the sequences are very similar to each other, there is no ambiguity in the alignment and the construction of a homology model should be straightforward. Normally one would spend considerable effort at this stage to consider which parts of the target sequence and the template sequence appear to be correctly aligned, and to edit the alignment manually. In our case, evolutionary pressure was so strong that essentially all of these sequences have evolved without a single indel.

I have added to the alignment the APSES domain of XP_001224558, the Chaetomium globosum Mbp1 orthologue (MBP1_CHAGL). This will serve as the reference and fallback sequence.

1MB1            NQIYSARYSGVDVYEFIHSTG---SIMKRKKDDWVNATHILKAANFAKAKRTRILEKEV
MBP1_CANGL      NQIYSAKYSGVDVYEFIHPTG---SIMKRKNDGWVNATHILKAANFAKAKRTRILEKEV
MBP1_EREGO      TQIYSAKYSGVEVYEFLHPTG---SIMKRKADDWVNATHILKAAKFAKAKRTRILEKEV
MBP1_KLULA      NQIYSAKYSGVDVYEFIHPTG---SIMKRKADNWVNATHILKAAKFPKAKRTRILEKEV
MBP1_CANAL      SQIYSATYSNVPAFEFVTSEG---PIMRRKKDSWINATHILKIAKFPKAKRTRILEKDV
MBP1_DEBHA      TQIYSATYSNVPVFEFVTLEG---PIMRRKLDSWINATHILKIAKFPKAKRTRILEKDV
MBP1_YARLI      MSIYKATYSGVPVYEFQCKNV---AVMRRKSDGWVNATHILKVAGFDKPQRTRILEKEV
MBP1_SCHPO      SAVHVAVYSGVEVYECFIKGV---SVMRRRRDSWLNATQILKVADFDKPQRTRVLERQV
MBP1_USTMA      KTIFKATYSGVPVYECIINNV---AVMRRRSDDWLNATQILKVVGLDKPQRTRVLEREI
MBP1_ASPNI      SNVYSATYSSVPVYEFKIGTD---SVMRRRSDDWINATHILKVAGFDKPARTRILEREV
MBP1_ASPTE      SKIYSATYSSVPVYEFKIEGD---SVMRRRADDWINATHILKVAGFDKPARTRILEREV
MBP1_CRYNE      PKVYASVYSGVPVFEAMIRGI---SVMRRASDSWVNATQILKVAGVHKSARTKILEKEV
MBP1_GIBZE      G-IYSASYSGVDVYEMEVNNI---AVMRRRNDSWLNATQILKVAGVDKGKRTKILEKEI
MBP1_NEUCR      IYSLQATYSGVGVYEMEVNNV---AVMRRQKDGWVNATQILKVANIDKGRRTKILEKEI
MBP1_MAGGR      P-IYTAVYSNVEVYEFEVNGV---AVMKRIGDSKLNATQILKVAGVEKGKRTKILEKEI
MBP1_ASPFU      PQIYKAVYSNVSVYEMEVNGV---AVMKRRSDSWLNATQILKVAGVVKARRTKTLEKEI
MBP1_CHAGL      AGIYSATYSGIPVYEYQFGPDMKEHVMRRREDNWINATHILKAAGFDKPARTRILERDV

1MB1            LKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLF
MBP1_CANGL      LKEMHEKVQGGFGKYQGTWVPLNIAINLAEKFDVYQDLKPLF
MBP1_EREGO      IKDTHEKVQGGFGKYQGTWVPLDIARRLAQKFEVLEELRPLF
MBP1_KLULA      ITDTHEKVQGGFGKYQGTWIPLELASKLAEKFEVLDELKPLF
MBP1_CANAL      QTGIHEKVQGGYGKYQGTYVPLDLGAAIARNFGVYDVLKPIF
MBP1_DEBHA      QTGVHEKVQGGYGKYQGTYVPLDLGADIAKNFGVFDSLRPIF
MBP1_YARLI      QKGVHEKVQGGYGKYQGTWVPLERAREIATLYDVDSHLAPIF
MBP1_SCHPO      QIGAHEKVQGGYGKYQGTWVPFQRGVDLATKYKVDGIMSPIL
MBP1_USTMA      QKGIHEKVQGGYGKYQGTWIPLDVAIELAERYNIQGLLQPIT
MBP1_ASPNI      QKGVHEKVQGGYGKYQGTWIPLQEGRQLAERNNILDKLLPIF
MBP1_ASPTE      QKGVHEKVQGGYGKYQGTWIPLPEGRLLAERNNIIDKLRPIF
MBP1_CRYNE      LNGIHEKIQGGYGKYQGTWVPLDRGRDLAEQYGVGSYLSSVF
MBP1_GIBZE      QTGEHEKVQGGYGKYQGTWIKFERGLQVCRQYGVEELLRPLL
MBP1_NEUCR      QIGEHEKVQGGYGKYQGTWIPFERGLEVCRQYGVEELLSKLL
MBP1_MAGGR      QTGEHEKVQGGYGKYQGTWIKYERALEVCRQYGVEELLRPLL
MBP1_ASPFU      AAGEHEKVQGGYGKYQGTWVNYQRGVELCREYHVEELLRPLL
MBP1_CHAGL      QKDVHEKIQGGYGKYQGTWIPLEQGRALAQRNNIYDRLRPIF
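How similar two rows of this alignment are, and where the gaps fall, can be checked with a few lines of code. The following is a minimal Python sketch, not part of the course material: the function names are made up for illustration, and identity is computed only over columns in which neither sequence has a gap, which is one of several common conventions.

```python
def percent_identity(row_a, row_b):
    """Percent identity over the columns in which neither row has a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    pairs = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def gap_columns(row):
    """0-based alignment columns at which this row carries a gap character."""
    return [i for i, residue in enumerate(row) if residue == "-"]


# First alignment block, copied from the rows above.
template = "NQIYSARYSGVDVYEFIHSTG---SIMKRKKDDWVNATHILKAANFAKAKRTRILEKEV"  # 1MB1
target = "AGIYSATYSGIPVYEYQFGPDMKEHVMRRREDNWINATHILKAAGFDKPARTRILERDV"  # MBP1_CHAGL

print(f"identity: {percent_identity(template, target):.1f}%")
print("template gap columns:", gap_columns(template))
```

Gap columns in the template row mark places where the target carries residues that have no counterpart in the 1MB1 coordinates.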

 

It should be obvious to you by now how you can copy a string of amino acids from such an alignment and create a FASTA file. However we need to take a little detour: this detour brings us to the question of sequence numbers.

It is not at all straightforward to number sequences in such a project. The "natural" way would be to start numbering at the start codon of the full-length protein and proceed sequentially from there. However, imagine what would happen if a curator discovered that one of the splice sites for a gene had been missed in automatic annotation. All of a sudden, the corrected sequence would have a different length than the one that may have been used for earlier studies. Unfortunately, there is no mechanism (wouldn't it be nice!) that automatically goes back through the literature and your lab journal and updates the revised sequence numbering...

But there are other possible complications regarding sequence numbers. The first residue of the CDD-APSES domain is not residue 1 of the Mbp1 protein. The first residue of the 1MB1 FASTA file is the first residue of the Mbp1 protein, but the last five residues are an artificial His tag. Is H125 of 1MB1 the equivalent residue to R125 in MBP1_SACCE? The N-terminus of the Mbp1 crystal structure is disordered. The first residue in the structure is ASN 3, whereas the SEQRES records start with MET ... and so on. The take-home message is that a sequence number is nothing absolute, but something that makes sense only in a particular context. To emphasize this, we will write a FASTA header for our target sequence that lists the residues of the source sequence it corresponds to. In terms of actual sequence numbering, we will adopt the numbering of the 1MB1 protein throughout, to be able to label particular amino acids consistently.

Access the sequence of "your" organism's Mbp1 Orthologue at UniProt. (You can use the links I have provided in the table below).


Organism                     UniProt accession
Aspergillus fumigatus        Q4WGN2
Aspergillus nidulans         Q5B8H6
Aspergillus terreus          Q0CQJ5
Candida albicans             Q5ANP5
Candida glabrata             Q6FWD6
Cryptococcus neoformans      Q5KHS0
Debaryomyces hansenii        Q6BSN6
Eremothecium gossypii        Q752H3
Gibberella zeae              Q4IEY8
Kluyveromyces lactis         P39679
Magnaporthe grisea           Q3S405
Neurospora crassa            Q7SBG9
Saccharomyces cerevisiae     P39678
Schizosaccharomyces pombe    P41412
Ustilago maydis              Q4P117
Yarrowia lipolytica          Q6CGF5


  • Copy your organism's Mbp1 sequence from the alignment above. Then define the start and end sequence numbers of the target sequence relative to the full-length protein. Prepare a FASTA formatted file for the target sequence in your organism, giving it an appropriate header that includes the sequence numbers. Refer to the Fallback data file if you are not sure about the format. (1 mark)

 

Your FASTA sequence should look similar to this:

>1MB1: Mbp1_SACCE 1..100
NQIYSARYSGVDVYEFIHSTG---SIMKRKKDDWVNATHILKAANFAKAKRTRILEKEV
LKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLF
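The bookkeeping behind such a header can be sketched in Python. This is only an illustration under stated assumptions: the function names are invented, the flanking residues of the "full-length" sequence are made-up placeholders (in practice you would paste the sequence from the UniProt entry), and the sketch removes gap characters from the fragment, whereas the example above retains them.

```python
def ungap(aligned_row):
    """Remove alignment gap characters from one row of an alignment."""
    return aligned_row.replace("-", "")


def locate(fragment, full_length):
    """Return the 1-based (first, last) positions of fragment in full_length."""
    index = full_length.find(fragment)
    if index < 0:
        raise ValueError("fragment not found in full-length sequence")
    return index + 1, index + len(fragment)


def fasta_record(name, fragment, first, last, width=60):
    """Format a FASTA record whose header states the source residue range."""
    header = f">{name} {first}..{last}"
    lines = [fragment[i:i + width] for i in range(0, len(fragment), width)]
    return "\n".join([header] + lines) + "\n"


# One aligned row from above; the flanks added below are placeholders only.
row = "NQIYSARYSGVDVYEFIHSTG---SIMKRKKDDWVNATHILKAANFAKAKRTRILEKEV"
fragment = ungap(row)
toy_full_length = "MS" + fragment + "XXXXX"  # hypothetical full-length sequence

first, last = locate(fragment, toy_full_length)
print(fasta_record("Mbp1_EXAMPLE", fragment, first, last), end="")
```

Deriving the range by searching the full-length sequence, rather than counting by hand, keeps the header correct if the database entry is later revised.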

   

(2) Homology model

   

(2.1) SwissModel (1 mark)

 

Access the SwissModel server at http://swissmodel.expasy.org and navigate to the Alignment Interface.

 

  • Copy the 1MB1 sequence and the sequence from your organism from the alignment above, and paste them into the form field. Refer to the Fallback Data file if you are not sure about the format.
(You have to choose the format, and if you choose e.g. a CLUSTAL format, you have to include a header line and a blank line. Other common problems when uploading your alignment include uploading a file that has not been saved as "text only", and periods, i.e. ".", in sequence names. Underscores appear to be safe.)
  • Click submit and define your target and template sequence. For the template sequence define the coordinate file and chain. (In our case the coordinate file is 1MB1 and the chain is "_", i.e. none, since the PDB file does not contain more than one chain.)
  • Click submit and request the construction of a homology model: enter your e-mail address and check the button for Normal Mode, not "Swiss-PDB Viewer mode". (This is important, since there will be problems with the output otherwise.) Click submit. You should receive four files by e-mail within half an hour or so. (1 mark)

(You do not need to submit any coordinate files with your assignment.)
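The formatting requirements mentioned above (a header line, a blank line, no periods in sequence names) can be sketched as a small Python helper. This is an illustrative sketch only: the function name is invented, and the exact header text the server accepts is an assumption; by convention, CLUSTAL-format files begin with a line that starts with the word CLUSTAL.

```python
def clustal_text(names, rows, width=60):
    """Write aligned rows as CLUSTAL-style text: header, blank line, blocks."""
    if len(names) != len(rows):
        raise ValueError("need one name per aligned row")
    if len({len(row) for row in rows}) != 1:
        raise ValueError("aligned rows must all have the same length")
    for name in names:
        if "." in name or " " in name:
            raise ValueError(f"unsafe character in sequence name: {name!r}")
    pad = max(len(name) for name in names) + 4
    # Header wording is an assumption; parsers usually only check "CLUSTAL".
    out = ["CLUSTAL W (1.83) multiple sequence alignment", ""]
    for start in range(0, len(rows[0]), width):
        for name, row in zip(names, rows):
            out.append(name.ljust(pad) + row[start:start + width])
        out.append("")  # blank line between blocks
    return "\n".join(out)


names = ["1MB1", "MBP1_CHAGL"]
rows = [
    "NQIYSARYSGVDVYEFIHSTG---SIMKRKKDDWVNATHILKAANFAKAKRTRILEKEV",
    "AGIYSATYSGIPVYEYQFGPDMKEHVMRRREDNWINATHILKAAGFDKPARTRILERDV",
]
print(clustal_text(names, rows))
```

Only the first alignment block is shown here; for a real submission the second block of each sequence would be appended to its row before formatting.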

 
In case you do not wish to submit the modelling job yourself, you can access the result files from the Fallback Data file.


(3) Model analysis

   

(3.1) The PDB file (1 mark)

 

Open your model coordinates in a text-editor (make sure you view the PDB file in a fixed-width font) and consider the following questions: (Alternatively, view the coordinates linked to the Fallback Data file.)

 

  • What is the residue number of the first residue in the model? What should it be, based on the alignment? If the putative DNA binding region was reported to be residues 50-74 in the Mbp1 protein, which residues of the model correspond to that? (1 mark)
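Both parts of this question can also be checked programmatically. The Python sketch below is an illustration, not the expected answer: the function names are invented, the toy coordinate lines are made up, and the slice positions assume the standard fixed-column PDB layout (residue name in columns 18-20, chain in column 22, residue number in columns 23-26).

```python
def residue_list(pdb_text, chain=None):
    """Ordered (residue number, residue name) pairs from the ATOM records."""
    seen = []
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        if chain is not None and line[21] != chain:
            continue
        key = (int(line[22:26]), line[17:20].strip())
        if not seen or seen[-1] != key:
            seen.append(key)
    return seen


def map_numbering(template_row, target_row, template_first, target_first):
    """Map template residue numbers to target residue numbers via an alignment.

    Each sequence is assumed to be numbered consecutively from its own first
    residue; only columns in which both rows have a residue are mapped.
    """
    mapping = {}
    t_num, q_num = template_first, target_first
    for t_res, q_res in zip(template_row, target_row):
        if t_res != "-" and q_res != "-":
            mapping[t_num] = q_num
        if t_res != "-":
            t_num += 1
        if q_res != "-":
            q_num += 1
    return mapping


# Made-up coordinate records, for illustration only.
toy_pdb = "\n".join([
    "ATOM      1  N   ASN A   3      10.000  20.000  30.000",
    "ATOM      2  CA  ASN A   3      11.000  20.000  30.000",
    "ATOM      3  N   GLN A   4      12.000  20.000  30.000",
])
print(residue_list(toy_pdb)[0])
print(map_numbering("AB--CD", "ABXYCD", template_first=3, target_first=1))
```

Whether the model file is actually numbered consecutively from its first residue is exactly what the question asks you to verify in the text editor.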


   

(3.2) First visualization (3 marks)

 

In Assignment 2 you have already studied the 1MB1 coordinate file and compared it to your organism's Mbp1 APSES domain. Since a homology model inherits its structural details from the template, the model should look very similar to the original structure but contain the sequence of the target.

 

  • Save the attachment of your model coordinates to your hard disk and visualize it in RasMol. (Alternatively, copy and save the coordinates from the Fallback Data file to your hard disk.) Make an informative divergent stereo view and paste it into your assignment. (3 marks)

 


Stereo-view of a subdomain within the 1MB1 structure that includes residues 36 to 76. The color gradient ramps from blue (36) to green (76).

   


(3.3) Modeling a DNA ligand (4 marks)

 

The really interesting question we could begin to address with our model is how sequence variation might be converted into changing DNA recognition sites, and then lead to changed cognate DNA binding sequences. But in order to address this, we would need to add a plausible model for a bound DNA molecule to our model.

Since there is currently no software available that would accurately model such a complex from first principles, we will base this on homology modeling as well. This means we need to find a similar structure for which the complex structure is known. However, you may remember from the third assignment that the APSES domains in fungi seem to be a relatively small family. And there is no structure available of a protein-DNA complex. Now what?

Remember that homologous sequences can have diverged to the point where their sequence similarity is no longer recognizable, while their structure may still be quite well conserved. Thus if we could find similar structures in the PDB, these might provide us with some plausible hypotheses for how DNA is bound by APSES domains. We thus need a tool similar to BLAST, not for sequence alignment but for structure alignment: a kind of BLAST for structures.

However, just as with BLAST, we might not want to search with the entire protein if all we are interested in is a subdomain that binds to DNA. Attempting to match all structural elements in addition to the ones we are actually interested in is likely to make the search less specific: we would find false positives that are similar to some irrelevant part of our structure. However, defining too small a subdomain would also lead to a loss of specificity: in the extreme, it is easy to imagine that the search for e.g. a single helix would retrieve very many hits that would be quite meaningless. The arrangement of the residues from 50 to 74 that we have already discussed in Assignment 2 suggests that the compact subdomain from 36 to 76 (see the image above) might be a useful structure to search with: it contains the residues we are interested in and enough connected secondary structure elements to be structurally meaningful.
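Cutting such a subdomain out of a coordinate file is a matter of filtering ATOM records by residue number. The following Python sketch shows one way to do it; the function name and the file names in the usage note are hypothetical, and the slice positions assume the standard fixed-column PDB layout.

```python
def extract_range(pdb_text, first, last, chain=None):
    """Keep only ATOM records with residue numbers in [first, last]."""
    kept = []
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue  # drop header, HETATM and other records
        if chain is not None and line[21] != chain:
            continue
        if first <= int(line[22:26]) <= last:
            kept.append(line)
    kept.append("END")
    return "\n".join(kept) + "\n"
```

Assuming the template coordinates were saved as `1MB1.pdb`, the subdomain could then be written with `open("1MB1_36_76.pdb", "w").write(extract_range(open("1MB1.pdb").read(), 36, 76))`.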

At the NCBI, VAST is a structural similarity search tool for this purpose. Unfortunately, it does not seem to be able to handle a query with such a structural subdomain (the process did not finish after several days), but at least you can get a list of structural neighbors of the full-length 1MB1 template structure by entering the PDB ID in a small form field on the VAST home page and then clicking on the colored bar labeled "Chain" on the MMDB structure summary page. This precomputed page for the 1MB1 structure shows a number of diverse proteins matching various helices and strands of the structure.

At the EBI there are a number of very well designed structure analysis tools linked off the Structural Analysis page. As part of its MSD Services, the SSM (Secondary Structure Matching service) provides a well thought out interface for searching files from the PDB or uploading coordinates.

After uploading the coordinates for residues 36 to 76 of the 1MB1 structure, running the search, and sorting the results by alignment length, the top hits include a number of nucleotide binding proteins such as a replication terminator (1F4K), the LexA repressor (1MVD) and a "Winged Helix" protein (1KQ8). These are all members of a much larger superfamily, the "winged helix" DNA binding domains (CATH 1.10.10.10), of which hundreds of structures have been solved. They represent one branch of the tree of helix-turn-helix (HTH) DNA binding modules. (A recent review on HTH proteins is linked from the resources section at the bottom of this page.) Winged Helix domains typically bind their cognate DNA with a "recognition helix" which precedes the beta hairpin and binds into the major groove; additional stabilizing interactions are provided by the edge of the beta strand binding into the minor groove.


This is good news: once we have determined that the APSES domain is actually an example of a larger group of transcription factors, we can pick one of these for which a DNA complex structure is known. I have picked one such structure from the list of hits that were returned by SSM: it is the Elk-1 transcription factor.

Elk-1 transcription factor bound to DNA (1DUX). Note how the "recognition helix" inserts into the major groove of the DNA molecule. The color gradient ramps from blue (34) to green (84). Note how the first helix of the "helix-turn-helix" architecture serves only to position the recognition helix and makes few interactions by itself.

Now all that is left to do is to bring the DNA molecule into the correct orientation for our model and then to combine the two files. We need to superimpose the Elk-1 protein/DNA complex onto our model.

Structure superposition

There are quite a number of superposition servers available on the Web; a remarkably comprehensive overview can be found in Wikipedia. However, overengineering and black-box mentality make our task more difficult than it need be: most tools do not allow users to specify particular alignment zones but attempt to automatically define the zones of residues to be superimposed according to some geometric target function. Almost none return the actual rotation matrix and translation vector that is used for the superposition. And almost none transform the coordinates of heteroatoms such as solvent, ligands or DNA molecules along with the protein coordinates. An exception that I have found to be very usable is the Local-Global Alignment server (LGA), written by Adam Zemla. The procedure is quite straightforward:

  • Define the structure to be rotated (1DUX in this case). This is a dimer, so download the file from the PDB and edit it manually to contain only DNA chains A and B and protein chain C.
  • Define the structure to be held constant (1MB1 in this case). Download from PDB.
  • Use the "browse" option to define both files as input on the LGA input form
  • Use the option to have both coordinate sets included in your output: -o2
  • Submit

The results arrive by e-mail. I have linked the resulting PDB file to the Fallback Data page. If you run this analysis on your own, you may want to review the types of edits I made to the PDB file to get it displayed correctly in RasMol.
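To make concrete what "transforming the heteroatoms along with the protein" means, here is a minimal Python sketch that applies one rotation matrix and one translation vector to every ATOM and HETATM record of a coordinate file. It does not reproduce how LGA works internally; the function name is invented, and the matrix and vector would have to come from a superposition program.

```python
def transform_pdb(pdb_text, rotation, translation):
    """Apply x' = R x + t to all ATOM and HETATM records; keep other lines."""
    out = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            # Standard PDB layout: x, y, z in columns 31-38, 39-46, 47-54.
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            new = [
                sum(rotation[i][j] * xyz[j] for j in range(3)) + translation[i]
                for i in range(3)
            ]
            line = (
                f"{line[:30]}{new[0]:8.3f}{new[1]:8.3f}{new[2]:8.3f}{line[54:]}"
            )
        out.append(line)
    return "\n".join(out)
```

Because protein, DNA and solvent records all pass through the same transformation, the DNA stays in place relative to the protein it was crystallized with, which is what allows it to be carried over onto the model.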


 

  • Save the superimposed coordinates in a file, open and view it in RasMol, and note how well the "recognition helix" and adjacent beta strands superimpose! (Alternatively, copy and save the coordinates from the Fallback Data page to your hard disk.) Make an informative divergent stereo view and paste it into your assignment. (4 marks)

 
 


(4) Summary of Resources

 

Links
Alignments

   

[End of assignment]

If you have any questions at all, don't hesitate to mail me at boris.steipe@utoronto.ca or post your question to the Course Mailing List.