Difference between revisions of "Application Exam Questions"

Revision as of 22:26, 18 October 2007

Bioinformatics and computational biology are huge and growing areas of active research with a dynamic array of subspecialities: phylogenetic analysis has advanced in leaps and bounds with the availability of molecular data; genomic, transcriptomic, proteomic and other cross-sectional views are slowly beginning to unravel some of the intricacies of the cell's inner workings; and predictive models for molecular medicine and bioengineering may help shape our society's future. These tasks are what motivate bioinformatics worldwide in seeking to develop novel ways to reason about biology.

2003 - Consequences of Homology

A bacterial genome has been sequenced. Comparison with tRNA synthetase COGs has failed to turn up an annotation for a Glutaminyl tRNA synthetase gene. Assume that the genome sequence is complete, has been assembled without frameshift errors and all gene models are correct. The sequence for Glutaminyl tRNA synthetase has not simply been overlooked. Answer the following questions briefly.

What is a COG ?
State two reasons why the endpoint of a biochemical pathway could be present in an organism but a gene on the pathway would fail to be found in a COG analysis.
Briefly(!) state what wet-lab experiments you would propose to confirm (or disprove) the hypotheses you have stated above ?

Comment

2003 - Domains and Homology

Following Entrez-provided links to "sh3" domains may ultimately bring you to the CDART entry for a protein. This screenshot shows the first of ten pages of results for the eukaryotic SH3 domains.

Briefly define homologs, orthologs and paralogs.
Are the proteins depicted above homologs ?
Briefly explain the contents, implications and use of this CDART entry.

(I would consider the words "contents, implications and use" to be too vague for a future exam question.)

2003 - Homology

The COGs database employs a definition of orthologs that is based on reciprocal highest similarity.

Briefly explain the principle and how it distinguishes orthologs from paralogs.
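
As a rough illustration of what "reciprocal highest similarity" means, the sketch below pairs genes from two genomes only when each is the other's best-scoring hit. It is a toy under stated assumptions, not the actual COG construction procedure: the gene names, the similarity scores and the function names `best_hit` and `reciprocal_best_hits` are hypothetical.

```python
def best_hit(scores):
    """Map each query gene to its highest-scoring subject gene."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return gene pairs (a, b) that are each other's best hit."""
    forward = best_hit(a_vs_b)
    reverse = best_hit(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical similarity scores (e.g. alignment scores) between
# genes a1, a2 of genome A and genes b1, b2 of genome B.
a_vs_b = {("a1", "b1"): 250, ("a1", "b2"): 120,
          ("a2", "b1"): 130, ("a2", "b2"): 90}
b_vs_a = {("b1", "a1"): 245, ("b1", "a2"): 128,
          ("b2", "a1"): 118, ("b2", "a2"): 92}

pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

In this toy data only a1 and b1 choose each other; a2 and b2 each have a best hit that does not point back to them, so they are not paired.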

You have built a homology model based on the pairwise alignment of a C. elegans Glutaminyl-tRNA synthetase sequence to a template that you have identified with a BLAST search. Upon analysis of your model structure, you note that the presumed ATP-binding site of the model appears to be blocked by an arginine sidechain.

Briefly define the key steps involved in building a homology model.
Based on these steps, identify two probable sources of the discrepancy between the model and the function of the protein.
Suggest a bioinformatics strategy to improve your model, given the problems you have identified above.

2003 - Expression Analysis

In order to study the coordination of gene expression in the yeast cell cycle, you have created a synchronized, growing culture of Saccharomyces cerevisiae cells and harvested mRNA at ten successive time points along one replication cycle. You now plan to perform two-color microarray (spotted array) experiments.

Describe the principle and key steps of this experiment, from mRNA samples to a series of scanned images.
Describe the key steps involved in going from a scanned image to expression profiles of individual genes.
After clustering genes according to similarity of expression profiles, you will have sets of genes with correlated expression profiles. You hypothesize that these may be coregulated genes. What bioinformatics procedure(s) can you suggest to help you annotate shared functions for your clusters of genes ?
What bioinformatics procedure(s) can you suggest to pursue the question whether these genes may be coregulated ?
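
"Similarity of expression profiles" can be quantified in several ways; one common choice is the Pearson correlation coefficient. The sketch below is only an illustration under assumptions: the three profiles are invented log-ratio values at ten time points, and the gene names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical log-ratio expression values at ten time points.
profiles = {
    "geneA": [0.1, 0.8, 1.5, 0.9, 0.2, -0.6, -1.4, -0.8, 0.0, 0.7],
    "geneB": [0.2, 0.9, 1.4, 1.0, 0.1, -0.5, -1.5, -0.9, 0.1, 0.6],
    "geneC": [-0.2, -0.9, -1.4, -1.0, -0.1, 0.5, 1.5, 0.9, -0.1, -0.6],
}

r_ab = pearson(profiles["geneA"], profiles["geneB"])
r_bc = pearson(profiles["geneB"], profiles["geneC"])
print(f"A vs B: {r_ab:.3f}, B vs C: {r_bc:.3f}")
```

A clustering procedure would typically turn such pairwise values into a distance (for example 1 - r) before grouping genes; geneC is constructed here as the mirror image of geneB to show an anticorrelated pair.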

2003 - Integrated processes

Read the following abstract:

Structure of TCTP reveals unexpected relationship with guanine nucleotide-free chaperones

Paul Thaw, Nicola J. Baxter, Andrea M. Hounslow, Clive Price, Jonathan P. Waltho and C. Jeremy Craven: Nature Struct Biol 8: 701–704 (2001)

The translationally controlled tumor-associated proteins (TCTPs) are a highly conserved and abundantly expressed family of eukaryotic proteins that are implicated in both cell growth and the human acute allergic response but whose intracellular biochemical function has remained elusive. We report here the solution structure of the TCTP from Schizosaccharomyces pombe, which, on the basis of sequence homology, defines the fold of the entire family. We show that TCTPs form a structural superfamily with the Mss4/Dss4 family of proteins, which bind to the GDP/GTP free form of Rab proteins (members of the Ras superfamily) and have been termed guanine nucleotide-free chaperones (GFCs). Mss4 also acts as a relatively inefficient guanine nucleotide exchange factor (GEF). We further show that the Rab protein binding site on Mss4 coincides with the region of highest sequence conservation in the TCTP family. This is the first link to any other family of proteins that has been established for the TCTP family and suggests the presence of a GFC/GEF at extremely high abundance in eukaryotic cells.

This abstract reports several pieces of data and mentions several pieces of prior information.

List the most important data and information entities and databases that have been used in this study.
Summarize on approximately one-half page or less the essential steps of how these entities were related to each other in this study. You may use any representation that is reasonable such as pseudocode, a flowchart, or other type of sketch.

Note that you are not required to understand the biochemical processes that are described here, nor are you required to comment on the cell-biological implications. One of the key steps has been underlined by me - you must understand how such a conclusion can be drawn in the situation that is described. You are to summarize the flow of data: the entities that are being referred to, and the experimental and computational procedures.

2003 - Integrated processes

Read the following abstract:

Insights into DNA recombination from the structure of a RAD51-BRCA2 complex.

Pellegrini L, Yu DS, Lo T, Anand S, Lee M, Blundell TL, Venkitaraman AR.: Nature 420:287-293 (2002)

In this landmark paper on the breast cancer susceptibility gene BRCA2, the authors provide an intriguing mechanistic link with RAD51, a recombinase that plays an essential role in DNA repair by facilitating homologous recombination of intact and damaged DNA strands. Sequence repeats in the BRCA2 gene were reported as early as 1996 and have allowed the authors to define and express an isolated repeat domain (BRC4). It was known that the sequence repeats mediate binding of BRCA2 to RAD51. Sequence similarity (30 % identity) between RAD51 and E. coli RecA had allowed the definition and expression of a domain of RAD51. The BRC4-RAD51 complex was prepared, crystallized, and its structure was solved at high resolution. The crystal structure shows extensive specific interactions between BRC4 and RAD51; additionally, it demonstrates structural similarity between RAD51 and RecA, supporting the idea of a shared mechanism. Two crucial insights emerge: firstly, the authors are able to map breast cancer associated mutations of BRCA2 (from the NIH Breast Cancer Information Core database) to specific BRC4 residues that appear to have critical roles for RAD51 binding. As a general rule: cancer associated mutations appear to disrupt interactions that seem to be critical for binding. And secondly, the mode of interaction between BRC4 and RAD51 allows one to hypothesize about the function of the repeats. RecA forms helical nucleoprotein filaments as part of its function. Superposition of the RAD51 structure on the model of the RecA filament thus allows one to model the interactions of RAD51 domains in a RAD51 filament. Comparison with the BRC4 complex structure shows that BRC4 binds RAD51 in a location that would be occupied by another RAD51 molecule in a RAD51 helical filament.
Furthermore, the conformation of BRC4 in this epitope mimics the conformation of the RAD51 epitope it apparently displaces; as well, the BRC consensus sequence motif of this epitope (GFXTASG) is similar to the highly conserved sequence of this epitope in RAD51 homologues (GFTTATE).

List briefly the key entities (models, abstractions, observations, database resources ... ) that have been used in this analysis.
Sketch the key bioinformatic procedures with which these entities were related to each other in this study. You may use any representation that is reasonable such as a Data Flow Diagram, or other type of sketch, or a description in pseudocode.

2003 - Phylogenetic analysis

Phylogenetic relationships among 64 placental mammals and two marsupials based on analysis of 9,779 bp from 15 nuclear and three mtDNA genes. (from: Murphy WJ, Eizirik E, Johnson WE, Zhang YP, Ryder OA, O'Brien SJ (2001) Molecular phylogenetics and the origins of placental mammals. Nature 409:614-618)

Briefly explain the principles of this graph in terms of what can be observed, what must be inferred and how this graph integrates this information.
What do the numbers among the branches mean ?
Which parts of this graph correspond to the following terms:
- Clade
- Bifurcation
- Root
Briefly state the principle of maximum parsimony and briefly explain how it is applied in principle to construct graphs of the type shown above.
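
Maximum parsimony prefers the tree that requires the fewest character changes. As a minimal sketch of how a single candidate tree can be scored, the code below applies Fitch's algorithm to one alignment column on a fixed, rooted binary topology. The taxa, states and trees are hypothetical; a full analysis would sum this count over all alignment columns and search over many topologies, which is not shown here.

```python
def fitch(tree, states):
    """Return (candidate state set, minimum number of changes) for a
    rooted binary tree given as nested 2-tuples with taxon-name leaves."""
    if isinstance(tree, str):               # leaf: observed state
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:                              # children agree: no extra change
        return common, left_cost + right_cost
    # children disagree: one change is needed on this part of the tree
    return left_set | right_set, left_cost + right_cost + 1


# Hypothetical states of one alignment column in four taxa.
column = {"A": "G", "B": "G", "C": "T", "D": "T"}

tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))

cost_1 = fitch(tree_1, column)[1]
cost_2 = fitch(tree_2, column)[1]
print(cost_1, cost_2)
```

For this column the first topology needs one change and the second needs two, so the first is the more parsimonious of the two candidates.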

2003 - Phylogenetic analysis

You have studied the relationship between GlnRS and GluRS throughout your assignments (in 2003). An important contribution to this topic was made in September of 2003, with the discovery of a GluRS isozyme with intriguing specificity in H. pylori. It appears that GluRS-1 aminoacylates tRNAGlu while GluRS-2 specifically aminoacylates tRNAGln. Is this a missing link in the evolution of laterally transferred GlxRS ? Is it an alternative route to independent evolution of GlnRS activity ? Or is it just a refinement of the GluRS → GatABC pathway, perhaps for better metabolic regulation ? Here is some of the evidence:

(After: Skouloubris S, DePouplana LR, DeReuse H, Hendrickson TL (2003) A noncognate aminoacyl-tRNA synthetase that may resolve a missing link in protein evolution. PNAS 100:11297-11302.) Phylogenetic tree of GluRS sequences in proteobacteria. Species with ORFs similar to Helicobacter GluRS-1 or GluRS-2 are boxed; those few not in a box are identified with the suffix "-1" or "-2", respectively. Bacteria with a GlnRS ORF are labelled with a black diamond. GlnRS sequences are more distantly related to any of these GluRS sequences than the GluRS sequences are to each other.

Briefly explain the principles of this phylogenetic tree in terms of what can be observed, what must be inferred and how this tree integrates this information.
What do the numbers in the figure mean ?
Briefly state how the principle of maximum parsimony is applied to construct phylogenetic trees such as this one.
Briefly discuss the evidence in the tree for (1) independent events of horizontal gene transfer into bacteria, and (2) gene duplication leading to GluRS-1 and -2.

Here are the 2004 questions to the same sketch:

Briefly describe all input data that is required to build a phylogenetic tree like the one shown above and how the data needs to be prepared in principle.
What do the numbers in the figure mean? How could they have been derived? What are they useful for?
Briefly discuss the evidence the figure shows for more than one independent event of horizontal gene transfer of GlnRS into bacteria.

2004 - Gene regulation

Here is an excerpt from a recent abstract using microarray technology to study pathogen gene regulation:

Identification of iron-activated and -repressed Fur-dependent genes by transcriptome analysis of Neisseria meningitidis group B.

Grifantini R, Sebastian S, Frigimelica E, Draghi M, Bartolini E, Muzzi A, Rappuoli R, Grandi G, Genco CA. Proc Natl Acad Sci U S A (2003) 100:9542-9547

Iron is limiting in the human host, and bacterial pathogens respond to this environment by activating genes required for bacterial virulence. Transcriptional regulation in response to iron in Gram-negative bacteria is largely mediated by the ferric uptake regulator protein Fur, which in the presence of iron binds to a specific sequence in the promoter regions of genes under its control and acts as a repressor. Here we describe DNA microarray, computational and in vitro studies to define the Fur regulon in the human pathogen Neisseria meningitidis group B (strain MC58). After iron addition to an iron-depleted bacterial culture, 153 genes were up-regulated and 80 were down-regulated. [...] Forty-two promoter regions were amplified and 32 of these were shown to bind Fur by gel-shift analysis. Among these genes, many of which had never been described before to be Fur-regulated, 10 were up-regulated on iron addition, demonstrating that Fur can also act as a transcriptional activator. [...]

Describe the principle and key steps of this experiment up to the generation of expression profiles.
What is the "Fur regulon" ?
After clustering genes according to the similarity of expression profiles, you obtain sets of genes with correlated expression profiles. You hypothesize that these genes may be coregulated. What bioinformatics procedure(s) can you suggest to support this idea and to identify further genes that may be regulated by the same factors?

2004 - Genomes and Pathways

Far from being dull components of dusty textbook knowledge about basic cellular metabolism, tRNA synthetases appear in surprising variations in the microbial world. Consider the excerpt from the following abstract about the very problematic human pathogen Pseudomonas aeruginosa:

Direct glutaminyl-tRNA biosynthesis and indirect asparaginyl-tRNA biosynthesis in Pseudomonas aeruginosa PAO1.

Akochy PM, Bernard D, Roy PH, Lapointe J. J Bacteriol. (2004) 186:767-776

The genomic sequence of Pseudomonas aeruginosa PAO1 was searched for the presence of open reading frames (ORFs) encoding enzymes potentially involved in the formation of Gln-tRNA and of Asn-tRNA. We found ORFs similar to known glutamyl-tRNA synthetases (GluRS), glutaminyl-tRNA synthetases (GlnRS), aspartyl-tRNA synthetases (AspRS), and trimeric tRNA-dependent amidotransferases (AdT) but none similar to known asparaginyl-tRNA synthetases (AsnRS).

[ The authors then describe biochemical experiments to conclude ... ]

These results show that P. aeruginosa PAO1 is the first organism known to synthesize Asn-tRNA via the indirect pathway and to synthesize Gln-tRNA via the direct pathway. The essential role of AdT in the formation of Asn-tRNA in P. aeruginosa and the absence of a similar activity in the cytoplasm of eukaryotic cells identifies AdT as a potential target for antibiotics to be designed against this human pathogen. Such novel antibiotics could be active against other multidrug-resistant gram-negative pathogens such as Burkholderia and Neisseria as well as all pathogenic gram-positive bacteria.

Briefly describe the computational strategy the authors probably used and what data/resources were required.
Discuss briefly how you could expand this procedure to systematically identify many more potential P. aeruginosa drug targets. (To answer this, you will need to think a little about what makes a promising drug target ...)

2004 - Protein Structure Domains

The recognition that proteins are frequently organized into domains and that domains appear to provide an efficient means of combinatorial generation of function, is one of the remarkable insights we have gained from structural studies and sequence analysis. Among the numerous different definitions for "protein domains" are the following:

A domain is:

an independently folding fragment of protein sequence
a separately inherited sequence fragment
a compact set of amino acids in a structure
a functional unit

Briefly sketch a possible computational strategy for two of these definitions in order to use data from primary(!) databases to compile a list of the domains they contain.

You could structure this as follows:

Propose a computable procedure that relates to one of the definitions
Define the source data it would operate on
Describe how the source data would be used in a bioinformatics procedure
Note what results this would yield and how the result would relate to the definition

2004 - Homology modeling and model analysis

We have studied protein structure and homology models in the assignments and discussed them at length in various lectures. The single most important factor affecting the quality of a homology model is the sequence alignment to the template, and this can be a problem, since optimal sequence alignment and "true" structure alignment need not always coincide. However, there are also other factors to be considered, especially when you have more than one homologous structure that could serve as a template.

Short answers please!

Discuss briefly how the overall degree of target–template sequence similarity would affect your choice of a template structure for homology modeling if you had several options available. Would you rather choose a closely related paralogue or a more distantly related orthologue?
Discuss briefly how the presence of insertions or deletions in target–template alignments would affect your choice of a template structure for homology modeling if you had several options available.
The Thermus thermophilus GluRS structure with the PDB-ID 1N78 contains the substrate analog glutamol-AMP. Can you copy and paste coordinates for glutamate and ATP from the structure file 1J09 (Thermus thermophilus GluRS structure solved at 1.9Å resolution and containing both glutamate and ATP) into a homology model based on 1N78? If yes, how? If no, why not?
You have commented in several parts of your assignments on the presence of hydrogen bonds. Please sketch an approximate distribution of N – O distances in protein structure hydrogen bonds, pointing out minimum, maximum and mode of the distribution.
The following two records have been copied from the coordinate file 1N78.PDB. From the information given here, explain what these atoms appear to be a part of, what relationship they could have to each other and how one could calculate the distance between the atoms (in Å)?

[...]
ATOM     66  OG  SER A   9      42.867  81.416  54.733  1.00 28.69           O  
[...]
ATOM   9214  O3*   A C 576      41.539  83.821  54.257  1.00 73.54           O  
[...]
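
For the distance part of the last question, the orthogonal x, y, z coordinates in ATOM records are given in Å, so the Euclidean distance between the two positions can be computed directly. The sketch below copies the coordinates from the two records above; the variable and function names are illustrative choices.

```python
import math

# x, y, z coordinates (in Å) copied from the two ATOM records above.
ser9_og = (42.867, 81.416, 54.733)    # OG  SER A   9
a576_o3 = (41.539, 83.821, 54.257)    # O3*   A C 576


def distance(p, q):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


d = distance(ser9_og, a576_o3)
print(f"{d:.2f} Å")   # prints "2.79 Å"
```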

2005 - Genome Analysis

Monodelphis domestica - the short-tailed opossum, a marsupial - is one of the organisms whose genome has recently been sequenced and made available on the Ensembl Web site. This section discusses some issues in assembling the genome, defining the genes and arriving at annotations.

From the Ensembl Web site:

"This site presents the first preliminary genome assembly (version 0.5) of the gray short-tailed opossum. [...] The assembly, from the Broad Institute, has a base coverage of approximately 7.19X, constructed from 19348 supercontigs. [...] This release of M.domestica data is assembled into scaffolds, so there are no chromosomes available to browse."

Monodelphis domestica

Briefly describe how the genome has been sequenced and assembled.

From the Ensembl Web site:

"The gene set for Opossum was built using a modified version of the standard Ensembl genebuild pipeline. The species-specific sequence resources (opossum cDNA and protein) are very limited, so the vast majority of gene models are based on genewise alignments of proteins from other species. Most of the proteins being aligned were from species genetically distant to opossum. To improve the accuracy of models generated from these proteins, the genewise alignments were made to stretches of genomic sequence[...]. Opossum and human cDNAs were aligned and used to add UTRs to the genewise predictions where possible. The gene models were assessed by generating sets of potential orthologs to genes from other mammalian species. [...]"

What is a gene model?

BLASTing the Ets-1 transcription factor winged helix domain sequence from 1DUX.PDB against the M. domestica scaffolds (TBLASTN) yields a highly conserved match:

Query                 Scaffold                                  Stats
Start   End  Ori      Name           Start     End     Ori      Score    E-val    %ID    Length
1       70   +        scaffold_134   566740    566946  +        551      1.4e-58  91.43  70

This match has a link to the Ensembl genome browser, leading to the following screen (slightly edited):

Result screen of the Ensembl gene browser: UniGene is a nonredundant set of GenBank sequences. UniProtKB is the successor database to Swiss-Prot. Genscan is a program to predict genes in genome sequences. This is a clickable map; many of its features are hyperlinked to sequences or other information.

A feature in the sense used here is an annotation that is attached to a position of the genomic sequence.

Identify the feature that shows where the region of sequence similarity has been found.
Identify the feature that corresponds to the current gene model which the similarity search has retrieved.
Identify the feature that corresponds to the observed protein sequence which best supports the current gene model.
Identify the features that correspond to nucleotide sequences of similar proteins.
Is the current gene model consistent with all observations integrated here? Explain.

This question intentionally focussed on a site that the class had (most likely) not encountered so far, in order to evaluate the transfer of knowledge into a new context. Tools on the Web change rapidly.

2005 - Model and Algorithm

In the post-genomic era, many of us perceive the comprehensive enumeration of all protein-protein interactions to be the logical and necessary next step following genome analysis. Obviously, careful experimental analysis does not scale well for a problem of this magnitude (a chordate organism may have on the order of half a million protein-protein interactions, not even counting small molecules, DNA and RNA) and for the foreseeable future computational procedures are going to play a crucial role in this endeavor. After all, experimental procedures can only determine the physical aspect of interactions; however, whether an interaction is important for the organism, i.e. whether it contributes to the organism's fitness function - the interaction is conserved - is a question from the domain of theoretical analysis.

In a recent seminar, Elisabeth Tillier discussed computational methods for the prediction of protein-protein interactions. While her own approach is based on the analysis of co-evolutionary change in protein pairs, four other computational methods were briefly mentioned. These are the analysis of:

Conservation of gene order;
Gene fusion events;
Correlated expression profiles; and
Phylogenetic profiles.

For two of these methods describe briefly (short answers, keywords):
- the underlying biology that gives rise to the observation;
- the input data that is required;
- the computational process;
- the types of results that can be expected and how they would be interpreted;
- if the result is a yes/no answer: how the process' performance can be evaluated;
- if the process results in a probability for interaction: how a decision-threshold can be determined

2006 Data, Concept and Procedure

STRING at the EBI is a search tool for known and predicted functional protein/protein interactions. STRING quantitatively integrates interaction data for a large number of organisms, and transfers information between these organisms where applicable. Obviously, experimental results are included in the prediction, but there are also a number of purely theoretical approaches:

The following theoretical approaches contribute to the prediction:

Gene neighborhood: in prokaryotes, functionally related genes are often found in operons.
Gene fusion: interacting proteins are sometimes found to be fused in related organisms.
Gene occurrence (Phylogenetic profile): the conservation of a gene across a large number of species can be summarized in a profile that characterises the distribution of this gene across evolution. Functionally related genes often have similar profiles.
Gene sets in databases: Curated datasets such as MIPS complexes or KEGG pathways contain sets of genes that form known functional units.

For two of these approaches:

define the source data that is being used,
describe the rationale behind applying the biological observation to predict a physical or functional interaction,
outline the computational procedure for the interaction prediction, and
explain how to interpret the results.
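
To make the "similar profiles" idea concrete, a phylogenetic profile can be written as a presence/absence vector over a fixed list of species, and two profiles can then be compared with any set-similarity measure. The sketch below uses the Jaccard index on invented data; it is an illustration only and does not reproduce how STRING actually scores gene occurrence. The gene names, the profiles and the function name `jaccard` are assumptions.

```python
def jaccard(p, q):
    """Jaccard similarity of two presence/absence profiles (lists of 0/1)."""
    both = sum(1 for a, b in zip(p, q) if a and b)
    either = sum(1 for a, b in zip(p, q) if a or b)
    return both / either if either else 0.0


# Hypothetical profiles: one column per species, 1 = gene present.
profiles = {
    "geneX": [1, 1, 0, 1, 0, 0, 1, 1],
    "geneY": [1, 1, 0, 1, 0, 0, 1, 0],
    "geneZ": [0, 0, 1, 0, 1, 1, 0, 0],
}

sim_xy = jaccard(profiles["geneX"], profiles["geneY"])
sim_xz = jaccard(profiles["geneX"], profiles["geneZ"])
print(sim_xy, sim_xz)
```

Here geneX and geneY co-occur in four of the five species that contain either gene, whereas geneX and geneZ never co-occur; how such values translate into a prediction depends on a threshold or scoring scheme that has to be calibrated separately.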

