Database Exam Questions


One aspect of Bioinformatics concerns itself with the storage, organisation, and retrieval of biological information. The questions in this section consider the contents and use of some of the key abstractions (sequences, structures, graphs ...) that we deal with, and the databases we store them in.

   

2003

In the excerpt from the PDBsum database shown here, please comment briefly on the following points. Questions correspond to the numbers shown on the image. The region within the circle is an enlargement of the original, for better legibility.


  • (5.1) What is the relationship between PDBsum and the database linked from this button ("PDB")?
  • (5.2) What do these terms mean, and what use can you make of this information when you analyse the structure ("Resolution", "R-Factor", "R-free")?
  • (5.3) What is the purpose of the database linked from this button, and what use can you make of its contents when you analyse the structure ("CATH")?
  • (5.4) What is this information and what use can you make of it ("Residue interactions: * with DNA + with ligand")?
  • (5.5) What is this sequence and why is it here?
PDBsum page for Glutamyl-tRNA Synthetase tRNA complex 1EAD


2004 - PDB Format

Despite its many shortcomings and inconsistencies, the PDB format for coordinate datasets is still the most widely accepted format, chiefly due to the large number of legacy programs that use it, but also because it is human-readable. The following is an excerpt from the PDB file of pea defensin (1JKZ.PDB).

[...]
ATOM    404  N   ALA A  28       1.084   7.614   2.493  1.00  0.00           N  
ATOM    405  CA  ALA A  28       0.164   7.660   3.616  1.00  0.00           C  
ATOM    406  C   ALA A  28       0.842   7.090   4.856  1.00  0.00           C  
ATOM    407  O   ALA A  28       0.731   5.902   5.139  1.00  0.00           O  
ATOM    408  CB  ALA A  28      -1.123   6.911   3.287  1.00  0.00           C  
ATOM    409  H   ALA A  28       1.535   6.768   2.288  1.00  0.00           H  
ATOM    410  HA  ALA A  28      -0.085   8.696   3.802  1.00  0.00           H  
ATOM    411 1HB  ALA A  28      -1.278   6.918   2.218  1.00  0.00           H  
ATOM    412 2HB  ALA A  28      -1.957   7.396   3.773  1.00  0.00           H  
ATOM    413 3HB  ALA A  28      -1.047   5.891   3.634  1.00  0.00           H  
[...]


  • Which atom numbers correspond to the backbone atoms and which atom numbers correspond to the sidechain of this amino acid?
  • Describe the information in the following columns (indicated by the values in the first record): " N  ", "A", "1.084", "1.00", "0.00".
  • Are any of these columns optional, and if so, would their absence shift the positions of the other columns?
  • Briefly discuss the relationship between SEQRES records in a PDB file, the genetic sequence of a protein, and the sequence that can be derived from the coordinate records.
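
The fixed-column layout that these questions refer to can be illustrated with a short sketch. This is a minimal Python example, not a full parser: the column ranges are those given for ATOM/HETATM records in the wwPDB format description, while the function name parse_atom_record and the dictionary keys are illustrative choices and not part of any library.

```python
def parse_atom_record(line):
    """Split one ATOM/HETATM line into fields by fixed column position.

    Column ranges (1-based, inclusive) are noted next to each slice.
    The key names are illustrative, not an official vocabulary.
    """
    line = line.ljust(80)  # pad short lines so that every slice is valid
    return {
        "record": line[0:6].strip(),        # columns  1-6   record name
        "serial": int(line[6:11]),          # columns  7-11  atom serial number
        "name": line[12:16].strip(),        # columns 13-16  atom name
        "alt_loc": line[16].strip(),        # column  17     alternate location
        "res_name": line[17:20].strip(),    # columns 18-20  residue name
        "chain": line[21].strip(),          # column  22     chain identifier
        "res_seq": int(line[22:26]),        # columns 23-26  residue number
        "x": float(line[30:38]),            # columns 31-38  x coordinate
        "y": float(line[38:46]),            # columns 39-46  y coordinate
        "z": float(line[46:54]),            # columns 47-54  z coordinate
        "occupancy": float(line[54:60]),    # columns 55-60  occupancy
        "temp_factor": float(line[60:66]),  # columns 61-66  temperature factor
        "element": line[76:78].strip(),     # columns 77-78  element symbol
    }


# First record of the 1JKZ excerpt shown above.
example = "ATOM    404  N   ALA A  28       1.084   7.614   2.493  1.00  0.00           N  "
atom = parse_atom_record(example)
print(atom["serial"], atom["name"], atom["res_name"], atom["chain"], atom["res_seq"])
print(atom["x"], atom["y"], atom["z"], atom["occupancy"], atom["temp_factor"])
```

The sketch slices by position rather than splitting on whitespace, so a blank field is read as an empty string instead of disturbing the fields that follow it.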

Here is a sequence file for this protein.

>gi|20139322|sp|P81929|PSD1_PEA Defense-related peptide 1
KTCEHLADTYRGVCFTNASCDDHCKNKAHLISGTCHNWKCFCTQNC

  • What is the name of this file-format?
  • Find the amino acid for which the coordinates were given above, in this sequence. (Write its one-letter code into your exam booklet together with the preceding and the following amino acid, and underline it, e.g. ABC.)
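
Looking up a residue in a sequence file of this kind can be sketched in a few lines of Python. The sketch assumes the common convention that a header line starts with ">" and that every following line holds sequence characters; the function names read_fasta and residue_with_neighbours are hypothetical. It also assumes that the residue number in the coordinate file equals the 1-based position in the sequence, which need not hold in general.

```python
def read_fasta(text):
    """Return a list of (header, sequence) tuples from header-plus-sequence text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def residue_with_neighbours(sequence, position):
    """Return the residue at a 1-based position together with its neighbours."""
    i = position - 1  # convert to a 0-based index
    return sequence[max(i - 1, 0):i + 2]


text = """>gi|20139322|sp|P81929|PSD1_PEA Defense-related peptide 1
KTCEHLADTYRGVCFTNASCDDHCKNKAHLISGTCHNWKCFCTQNC
"""
header, sequence = read_fasta(text)[0]
identifier, description = header.split(None, 1)  # split at the first whitespace
print(identifier.split("|"))  # database tags and accession fields
print(description, len(sequence))
print(residue_with_neighbours(sequence, 28))  # assumes file numbering = sequence index
```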

   

In the coordinate file of the immunoglobulin domain 2IMM.pdb you find the following record.

HETATM  877  O   HOH     1      -4.169  60.050  40.145  1.00  3.00           O 

  • What does this record describe?
  • When you display the structure of 2IMM.pdb with RasMol, the protein is displayed as a wireframe model but you see nothing that corresponds to the above record. What do you need to do to make it visible?

(Indeed, since the RasMol tutorial was a task of the first assignment, a question like this may turn up every now and then.)

2003 - Entrez

This is a screenshot of the result of searching the NCBI Entrez database with the search string "sh3".

Briefly discuss each database that the following terms link to, what relationship the results have to the search term and what use one can make of the links.

  • PubMed
  • Protein
  • CDD
  • OMIM

 
 

(A similar question was given in the 2004 practice exam and I like the following format much better:)

Discuss briefly which of the links you would follow to solve the following problems and summarize what the respective database contains and how you would use it. (Be reasonably complete, more than one link may be needed or helpful. Assume you know nothing about the problem but what is stated in the question.)

  • Retrieve 1 kb of upstream sequence for each yeast protein that contains an SH3 domain.
  • Check whether a mutation in a residue of a yeast protein is known to cause disease in its closest human homologue.

 
 

2005

Mbp1 contains ankyrin repeats, which are common protein-protein interaction modules.

This image is a screenshot of a page retrieved while pursuing an Entrez query for the string "ankyrin" into NCBI's CDD.

 

  • Where does the information that is presented here come from?
  • Explain the semantics (meaning) of the identifier that is indicated with the label 3.2 1IKN_D.
  • Describe one way to identify the source organism of the protein in the third row of the alignment.
  • Explain what the numbers in square brackets mean (".[n].") and why they are absent in some rows.
  • Describe how the nucleotide sequence for the protein in the third row of the alignment can be retrieved.

 
 

2006 Yeast Genome Browser

The summary paragraph for Mbp1 on the record at the Saccharomyces Genome Database (SGD) states:

Mbp1p is a DNA-binding protein that forms MBF complex (Mlu1 cell cycle box [MCB] Binding Factor) with Swi6p. MBF is a sequence-specific transcription factor that regulates gene expression during the G1/S transition of the cell cycle. Several genes activated or repressed by MBF have been identified, many of which are involved in DNA synthesis and DNA repair (for example, CDC21, CDC8, and CDC9, and also G1 cyclins).

The record for CDC21 links to the following genome-browser page:

 

View of a part of the Saccharomyces cerevisiae genome with a number of annotation tracks, created online, on-the-fly from the genome database with GBrowse.

 

  • Which genomic region is being displayed?
  • What is the significance of the arrows? Why do they go in different directions?
  • What makes YOR073W-A particularly "dubious"?
  • Where is the information that Cdc21 is regulated by Mbp1 shown?
  • The label "Harbison et al (2004)" is a so-called "Track"; such Tracks can be switched on or off to customize the information that is being presented. What information does this particular track contain?

This question related to biological facts that were known in principle from course assignments, but this particular view and the information that is contained in such genome browser pages was not spelled out explicitly. I expect that this would have been new to most students. One of the course's objectives is to teach how to use novel services and results: analyze what you see, apply your background knowledge, interpret the meaning of the presentation.