Difference between revisions of "Glossary"
(18 intermediate revisions by 2 users not shown) | |||
Line 1: | Line 1: | ||
− | + | <div style="padding: 5px; background: #A6AFD0; border:solid 1px #AAAAAA; font-size:200%;font-weight:bold;"> | |
+ | A Bioinformatics Glossary | ||
+ | </div> | ||
+ | | ||
+ | | ||
− | ====FASTA format | + | ;Glossary requests: |
+ | :If you think a term should be included in this glossary, please e-mail the course coordinator. | ||
+ | <br> | ||
+ | <br> | ||
+ | <br> | ||
+ | |||
+ | ==Accession Number== | ||
+ | ;The unique identifier for a database entry (also ID) | ||
+ | |||
+ | Accession numbers are unique identifiers, i.e. an accession number matches exactly one record. The term "identifier" is a synonym for "accession number". It is useful to be familiar with the syntax of common accession numbers since they can tell us the source database of a data object and often allow us to infer the level of curation and the contents of a record. | ||
+ | |||
+ | In the '''NCBI''' system, three sets of accession numbers are commonly used: accession numbers (such as <tt>U12345</tt> or <tt>AA123456</tt>), RefSeq IDs (such as <tt>NM_12345</tt> or <tt>XP_123456</tt>) and GI numbers (such as <tt>1234567</tt>). GI numbers (GenInfo Identifiers) form an identifier system that runs in parallel to GenBank, GenPept and RefSeq identifiers. Every NCBI data record has a GI, even though it may also have another identifier. GI numbers form the basis of NCBI's cross-referencing system, i.e. these are the numbers that NCBI uses internally. Accession numbers and RefSeq IDs can be versioned (e.g. <tt>123456.1</tt>, <tt>123456.2</tt> etc.), whereas a new GI is assigned for every update of a sequence. | ||
+ | |||
+ | RefSeq IDs are prefixed to show the contents of a record: | ||
+ | |||
+ | <tt>NT_123456 </tt>constructed genomic contigs<br> | ||
+ | <tt>NC_123456 </tt>chromosomes<br> | ||
+ | <tt>NM_123456 </tt>mRNAs<br> | ||
+ | <tt>NP_123456 </tt>proteins<br> | ||
+ | ... and the <tt>XM_</tt>, <tt>XP_</tt> etc. prefix refers to predicted or inferred sequence, e.g. from genome annotation. | ||
+ | |||
+ | The '''EBI''' database world is moving towards the UniProt system and older identifiers are being phased out. In addition to EMBL identifiers such as <tt>A1BC23</tt>, we might encounter the prefix <tt>UniRef100_</tt> for an entry from the UniRef project of clusters of similar sequences, or <tt>UPI000000AB12</tt> for the UniParc project on non-redundant unique sequences. Moreover, the highly annotated, carefully curated '''Swiss-Prot''' records have their own identifier set: GeneName_OrganismTag, e.g. <tt>MBP1_YEAST</tt>. | ||
+ | |||
+ | '''PDB''' identifiers are of the form <tt>1ABC</tt>, where the first character is a number and the following three characters are uppercase letters or numbers; sometimes the chain identifier of a protein or nucleic acid is specified as well, as in <tt>1ABCA</tt>, <tt>1ABC:A</tt> or <tt>1ABC-A</tt>. | ||
+ | |||
+ | ;In depth: | ||
+ | :[http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html the GenBank sample record] | ||
+ | |||
+ | ==E-value== | ||
+ | ;Expectation-value of a BLAST search | ||
+ | |||
+ | The E-value reported for each BLAST-hit represents the number of alternate alignments, with the same or better total score, that could be expected to occur in the database purely by chance. Thus, the lower the E-value, the more significant the match. The value depends upon the quality and length of the alignment, as well as the size of the database. | ||
+ | |||
+ | ==FASTA format== | ||
:FASTA is a simple, ASCII based, text-file format for biological sequences. Minimally a FASTA file comprises a header line, initiated with the ">" character, followed by one or more lines containing nucleic acid or protein sequence in one-letter code. This is the most common input format for bioinformatics analysis programs and services. | :FASTA is a simple, ASCII based, text-file format for biological sequences. Minimally a FASTA file comprises a header line, initiated with the ">" character, followed by one or more lines containing nucleic acid or protein sequence in one-letter code. This is the most common input format for bioinformatics analysis programs and services. | ||
Line 14: | Line 51: | ||
GKYQGTWVPLNIAKQLAEKFSVYDQLKPLFDFTQTDGSASPPPAPKHHHASKVDHHHHHH | GKYQGTWVPLNIAKQLAEKFSVYDQLKPLFDFTQTDGSASPPPAPKHHHASKVDHHHHHH | ||
− | + | ;In depth: | |
+ | *[http://en.wikipedia.org/wiki/Fasta_format Wikipedia entry] | ||
+ | |||
+ | |||
+ | ==Homology== | ||
+ | ;Homologue | ||
+ | Two sequences are ''homologues'' if they are derived from a common ancestor. | ||
+ | |||
+ | ;Orthologue | ||
+ | Two homologous sequences are ''orthologues'' if their divergence is the result of a speciation event. | ||
+ | |||
+ | ;Paralogue | ||
+ | Two homologous sequences are ''paralogues'' if their divergence is the result of a gene duplication event. | ||
+ | <br> | ||
+ | <br> | ||
+ | |||
+ | These concepts are the keystones of bioinformatics analysis. Since evolution usually proceeds gradually, homologous proteins usually have similar functions. Almost without exception they also have similar 3D structures. We expect orthologues to have, as nearly as possible, '''identical roles''' in different species; we expect that paralogues have acquired distinctly '''different roles''', else they would not have been propagated as duplicated, independently selected genes. | ||
+ | It is a common mistake to speak of sequences as e.g. ''"35% homologous"''. As defined above, homology is a quality, not a quantity. The quantity we should speak of is '''similarity''' and we usually measure sequence similarity to infer whether two genes may be homologues. Strictly speaking, this cannot be proven, since in general we have no way of confirming the evolutionary path through every single ancestral generation. | ||
− | ====multi FASTA | + | In practice we usually use the ''reciprocal best match'' condition to create a mapping of orthologues between genomes. |
+ | |||
+ | |||
+ | ;In depth: | ||
+ | *[http://biochemistry.utoronto.ca/undergraduates/courses/BCH441H/restricted/Fitch_HomologyDefinitions.pdf '''Walter Fitch''' on "homology"] | ||
+ | |||
+ | ==HSP== | ||
+ | ;High Scoring Pair | ||
+ | The fundamental ''unit'' of BLAST output. An HSP consists of an ungapped, local alignment result. HSPs are extended by the algorithm to so-called BLAST hits. | ||
+ | |||
+ | ;Example | ||
+ | |||
+ | Query   42  KVQGGFGKYQGTWVPLNIAKQLAEK
+ |             +++GG+ K QGTW+P+ I++ L  +
+ | Sbjct  347  RIRGGYIKIQGTWLPMEISRLLCLR
+ | |||
+ | ==multi FASTA file== | ||
:A sequence file that contains more than one [[#FASTA_format|FASTA formatted]] sequence. The sequences are simply concatenated. This is a common input format for multiple sequence alignment or motif-finding programs. | :A sequence file that contains more than one [[#FASTA_format|FASTA formatted]] sequence. The sequences are simply concatenated. This is a common input format for multiple sequence alignment or motif-finding programs. | ||
Line 29: | Line 99: | ||
KQTEVDCELLKRCCETLTDENRRLHRELQELRALKLATAAAAPHH | KQTEVDCELLKRCCETLTDENRRLHRELQELRALKLATAAAAPHH | ||
− | ([http://en.wikipedia.org/wiki/ | + | ;In depth: |
+ | *[http://en.wikipedia.org/wiki/Fasta_format Wikipedia entry] | ||
+ | |||
+ | ==Phylogenetic Tree== | ||
+ | :A graph that represents evolutionary relationships. Current species or genes are represented as the endpoints (leaves, OTUs) of the graph, branchpoints (internal nodes) represent ancestral species or genes and the lines between the nodes represent inheritance from ancestral genes/species to their descendants. | ||
+ | |||
+ | ;Example | ||
+ | |||
+ | ;In depth: | ||
+ | *[http://en.wikipedia.org/wiki/Phylogenetic_tree Wikipedia entry] | ||
+ | |||
+ | ==PSSM== | ||
+ | :('''Position Specific Scoring Matrix''', synonym ''Weight Matrix'') A matrix that scores all possible characters (usually nucleotides or amino acids) in each position of a pattern. A PSSM will typically be used in probabilistic (as opposed to ''deterministic'') pattern matching. Each substring of the sequence is evaluated against the PSSM and an aggregate score is computed. This aggregate score then characterizes the probability that this substring is or is not an example of the pattern represented by the PSSM. Often the terms ''profile'' and ''PSSM'' are used interchangeably; a possible distinction would be that a profile is a special case of a PSSM in which the scores are derived from a sequence alignment. | ||
+ | |||
+ | ; Example - excerpt from the TRANSFAC matrix entry for Gal4 | ||
+ | |||
+ | ... | ||
+ | AC M00049 | ||
+ | ID F$GAL4_01 | ||
+ | DE GAL4 | ||
+ | BF T00302 GAL4; Species: yeast, ''Saccharomyces cerevisiae''. | ||
+ | XX | ||
+ | PO A C G T | ||
+ | 01 1 5 3 2 N | ||
+ | 02 5 2 1 3 N | ||
+ | 03 3 2 1 5 N | ||
+ | 04 1 10 0 0 C | ||
+ | 05 0 0 10 1 G | ||
+ | 06 0 1 10 0 G | ||
+ | 07 4 3 3 1 N | ||
+ | 08 1 3 4 3 N | ||
+ | 09 2 4 4 1 N | ||
+ | 10 7 0 2 2 A | ||
+ | 11 1 8 2 0 C | ||
+ | 12 4 1 0 6 W | ||
+ | 13 1 3 5 2 N | ||
+ | 14 0 2 1 8 T | ||
+ | 15 1 6 2 2 C | ||
+ | 16 1 5 4 1 S | ||
+ | 17 2 1 1 7 T | ||
+ | 18 0 10 1 0 C | ||
+ | 19 0 11 0 0 C | ||
+ | 20 0 0 11 0 G | ||
+ | 21 8 0 0 3 A | ||
+ | 22 7 0 4 0 R | ||
+ | 23 2 6 3 0 S | ||
+ | XX | ||
+ | ... | ||
+ | |||
+ | ;In depth: | ||
+ | *[http://bioinformatics.oxfordjournals.org/cgi/content/abstract/16/1/16?ijkey=6086844f842fa18cbd55821a0dd25e365b7b6a1a&keytype2=tf_ipsecsha Review by Gary D. Stormo] | ||
+ | |||
+ | |||
+ | ==UPGMA== | ||
+ | :('''Unweighted Pair Group Method With Arithmetic Mean''') The simplest tree-building method for distance data. Trees are constructed in a bottom-up fashion, where at first each OTU is in its own cluster. At each step, the two clusters nearest to each other are combined into a higher-level cluster which replaces the originals. The distance between any two clusters A and B is taken to be the average of all distances between pairs of objects a in A and b in B. The result is a rooted tree. However, the implicit assumption that the evolutionary rates are constant is frequently not justified, thus UPGMA is not a very highly regarded method in phylogenetic analysis. | ||
+ | |||
+ | ;In depth: | ||
+ | *[http://www.icp.ucl.ac.be/~opperd/private/upgma.html Detailed example and illustration, Fred Opperdoes at UCL, Belgium] |
Latest revision as of 11:57, 24 September 2008
A Bioinformatics Glossary
- Glossary requests
- If you think a term should be included in this glossary, please e-mail the course coordinator.
Accession Number
- The unique identifier for a database entry (also ID)
Accession numbers are unique identifiers, i.e. an accession number matches exactly one record. The term "identifier" is a synonym for "accession number". It is useful to be familiar with the syntax of common accession numbers since they can tell us the source database of a data object and often allow us to infer the level of curation and the contents of a record.
In the NCBI system, three sets of accession numbers are commonly used: accession numbers (such as U12345 or AA123456), RefSeq IDs (such as NM_12345 or XP_123456) and GI numbers (such as 1234567). GI numbers (GenInfo Identifiers) form an identifier system that runs in parallel to GenBank, GenPept and RefSeq identifiers. Every NCBI data record has a GI, even though it may also have another identifier. GI numbers form the basis of NCBI's cross-referencing system, i.e. these are the numbers that NCBI uses internally. Accession numbers and RefSeq IDs can be versioned (e.g. 123456.1, 123456.2 etc.), whereas a new GI is assigned for every update of a sequence.
RefSeq IDs are prefixed to show the contents of a record:
NT_123456 constructed genomic contigs
NC_123456 chromosomes
NM_123456 mRNAs
NP_123456 proteins
... and the XM_, XP_ etc. prefix refers to predicted or inferred sequence, e.g. from genome annotation.
The EBI database world is moving towards the UniProt system and older identifiers are being phased out. In addition to EMBL identifiers such as A1BC23, we might encounter the prefix UniRef100_ for an entry from the UniRef project of clusters of similar sequences, or UPI000000AB12 for the UniParc project on non-redundant unique sequences. Moreover, the highly annotated, carefully curated Swiss-Prot records have their own identifier set: GeneName_OrganismTag, e.g. MBP1_YEAST.
PDB identifiers are of the form 1ABC, where the first character is a number and the following three characters are uppercase letters or numbers; sometimes the chain identifier of a protein or nucleic acid is specified as well, as in 1ABCA, 1ABC:A or 1ABC-A.
- In depth
- the GenBank sample record
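The identifier syntaxes described above can be recognized with a few regular expressions. The following is only a minimal sketch: the patterns are derived from the example identifiers in this entry, not from any official specification, and the function name `classify` and the category labels are made up for illustration. Short all-digit strings are ambiguous between GI numbers and PDB-style codes; the sketch simply takes the first pattern that matches.

```python
import re

# Illustrative patterns based on the examples in this glossary entry.
# They are assumptions, not complete definitions of each database's syntax.
PATTERNS = [
    ("RefSeq", re.compile(r"^[A-Z]{2}_\d+(\.\d+)?$")),
    ("UniParc", re.compile(r"^UPI[0-9A-F]{10}$")),
    ("UniRef", re.compile(r"^UniRef(100|90|50)_\w+$")),
    ("Swiss-Prot entry name", re.compile(r"^[A-Z0-9]{1,5}_[A-Z0-9]{1,5}$")),
    ("PDB", re.compile(r"^\d[A-Z0-9]{3}([:-]?[A-Za-z0-9])?$")),
    ("GI", re.compile(r"^\d+$")),
    ("GenBank accession", re.compile(r"^[A-Z]{1,2}\d{5,6}(\.\d+)?$")),
]


def classify(identifier):
    """Return the label of the first pattern that matches, or 'unknown'."""
    for label, pattern in PATTERNS:
        if pattern.match(identifier):
            return label
    return "unknown"


for example in ["U12345", "NM_12345", "1234567", "MBP1_YEAST", "1ABC:A"]:
    print(example, "->", classify(example))
```

The order of the list matters: RefSeq is tested before the Swiss-Prot entry name pattern because an identifier such as NM_12345 would otherwise match both.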
E-value
- Expectation-value of a BLAST search
The E-value reported for each BLAST-hit represents the number of alternate alignments, with the same or better total score, that could be expected to occur in the database purely by chance. Thus, the lower the E-value, the more significant the match. The value depends upon the quality and length of the alignment, as well as the size of the database.
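The dependence on alignment score and database size can be illustrated with the commonly quoted relation E = m · n · 2^(−S'), where S' is the bit score, m the query length and n the total database length. This is a sketch under simplifying assumptions: BLAST itself uses length-corrected ("effective") values for m and n, and the function name and the numbers below are chosen only for illustration.

```python
def expect_value(bit_score, query_len, db_len):
    """Approximate E-value from a bit score: E = m * n * 2**(-S').

    Uses raw lengths; BLAST applies effective-length corrections,
    so reported values will differ somewhat.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical numbers: a 250-residue query against 10**9 database residues.
for bits in (30, 40, 50):
    print(bits, expect_value(bits, 250, 10**9))
```

Each additional bit of score halves the E-value, while doubling the database size doubles it, which is why the same alignment appears less significant in a larger database.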
FASTA format
- FASTA is a simple, ASCII based, text-file format for biological sequences. Minimally a FASTA file comprises a header line, initiated with the ">" character, followed by one or more lines containing nucleic acid or protein sequence in one-letter code. This is the most common input format for bioinformatics analysis programs and services.
- Example
 >gi|3402004|pdb|1MB1| Mbp1 From Saccharomyces Cerevisiae
 MSNQIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGF
 GKYQGTWVPLNIAKQLAEKFSVYDQLKPLFDFTQTDGSASPPPAPKHHHASKVDHHHHHH
- In depth
Homology
- Homologue
Two sequences are homologues if they are derived from a common ancestor.
- Orthologue
Two homologous sequences are orthologues if their divergence is the result of a speciation event.
- Paralogue
Two homologous sequences are paralogues if their divergence is the result of a gene duplication event.
These concepts are the keystones of bioinformatics analysis. Since evolution usually proceeds gradually, homologous proteins usually have similar functions. Almost without exception they also have similar 3D structures. We expect orthologues to have, as nearly as possible, identical roles in different species; we expect that paralogues have acquired distinctly different roles, else they would not have been propagated as duplicated, independently selected genes.
It is a common mistake to speak of sequences as e.g. "35% homologous". As defined above, homology is a quality, not a quantity. The quantity we should speak of is similarity and we usually measure sequence similarity to infer whether two genes may be homologues. Strictly speaking, this cannot be proven, since in general we have no way of confirming the evolutionary path through every single ancestral generation.
In practice we usually use the reciprocal best match condition to create a mapping of orthologues between genomes.
- In depth
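The reciprocal best match condition mentioned above can be sketched in a few lines. The gene names and scores below are invented for illustration, and the input layout (a dictionary of similarity scores per query) is an assumption; in practice the scores would come from all-against-all searches between two genomes.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject."""
    return {query: max(subjects, key=subjects.get)
            for query, subjects in scores.items() if subjects}


def reciprocal_best_matches(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is the best hit of a AND a is the best hit of b."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical similarity scores between genes of genome A and genome B.
a_vs_b = {"a1": {"b1": 200, "b2": 80}, "a2": {"b1": 150, "b2": 60}}
b_vs_a = {"b1": {"a1": 210, "a2": 140}, "b2": {"a1": 75, "a2": 65}}

print(reciprocal_best_matches(a_vs_b, b_vs_a))
```

In this toy data, a2 finds b1 as its best hit, but b1's best hit is a1, so only the pair (a1, b1) satisfies the condition.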
HSP
- High Scoring Pair
The fundamental unit of BLAST output. An HSP consists of an ungapped, local alignment result. HSPs are extended by the algorithm to so-called BLAST hits.
- Example
 Query   42  KVQGGFGKYQGTWVPLNIAKQLAEK
             +++GG+ K QGTW+P+ I++ L  +
 Sbjct  347  RIRGGYIKIQGTWLPMEISRLLCLR
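The middle line of an HSP marks identical residues with the letter itself and conservative substitutions with "+". As a small sketch, the counts usually reported as "identities" and "positives" can be recovered from the three lines of the example. The midline string below assumes the column spacing of the original BLAST output, with one character per alignment column.

```python
query   = "KVQGGFGKYQGTWVPLNIAKQLAEK"
midline = "+++GG+ K QGTW+P+ I++ L  +"   # spacing restored, one char per column
sbjct   = "RIRGGYIKIQGTWLPMEISRLLCLR"

# An ungapped HSP pairs residues column by column, so all lines have equal length.
assert len(query) == len(midline) == len(sbjct)

# Identities: columns where both sequences carry the same residue.
identities = sum(q == s for q, s in zip(query, sbjct))

# Positives: identities plus columns flagged "+" (positive substitution score).
positives = identities + midline.count("+")

print(f"Identities = {identities}/{len(query)}, Positives = {positives}/{len(query)}")
```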
multi FASTA file
- A sequence file that contains more than one FASTA formatted sequence. The sequences are simply concatenated. This is a common input format for multiple sequence alignment or motif-finding programs.
- Example
 >Homeobox associated Leucine Zipper from gi|3868845 (134..178)
 KQTEVDCELLRKCCASLTEENRRLQMEVDQLRALSTTQLHFSDFV
 >Homeobox associated Leucine Zipper from gi|21264431 (168..212)
 KQTEVDCEFLKKCCETLADENIRLQKEIQELKTLKLTQPFYMHMP
 >Homeobox associated Leucine Zipper from gi|6634483 (212..256)
 KQTEVDCELLKRCCETLTDENRRLHRELQELRALKLATAAAAPHH
- In depth
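Because a multi FASTA file is just a concatenation of records, reading one amounts to starting a new record at every ">" line and joining the sequence lines that follow. The function below is a minimal sketch of that idea (the name `parse_fasta` is arbitrary); established libraries handle more edge cases.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            if header is not None:        # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)           # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """\
>Homeobox associated Leucine Zipper from gi|3868845 (134..178)
KQTEVDCELLRKCCASLTEENRRLQMEVDQLRALSTTQLHFSDFV
>Homeobox associated Leucine Zipper from gi|6634483 (212..256)
KQTEVDCELLKRCCETLTDENRRLHRELQELRALKLATAAAAPHH
"""

for header, sequence in parse_fasta(example):
    print(len(sequence), header)
```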
Phylogenetic Tree
- A graph that represents evolutionary relationships. Current species or genes are represented as the endpoints (leaves, OTUs) of the graph, branchpoints (internal nodes) represent ancestral species or genes and the lines between the nodes represent inheritance from ancestral genes/species to their descendants.
- Example
- In depth
PSSM
- (Position Specific Scoring Matrix, synonym Weight Matrix) A matrix that scores all possible characters (usually nucleotides or amino acids) in each position of a pattern. A PSSM will typically be used in probabilistic (as opposed to deterministic) pattern matching. Each substring of the sequence is evaluated against the PSSM and an aggregate score is computed. This aggregate score then characterizes the probability that this substring is or is not an example of the pattern represented by the PSSM. Often the terms profile and PSSM are used interchangeably; a possible distinction would be that a profile is a special case of a PSSM in which the scores are derived from a sequence alignment.
- Example - excerpt from the TRANSFAC matrix entry for Gal4
 ...
 AC  M00049
 ID  F$GAL4_01
 DE  GAL4
 BF  T00302 GAL4; Species: yeast, Saccharomyces cerevisiae.
 XX
 PO   A   C   G   T
 01   1   5   3   2   N
 02   5   2   1   3   N
 03   3   2   1   5   N
 04   1  10   0   0   C
 05   0   0  10   1   G
 06   0   1  10   0   G
 07   4   3   3   1   N
 08   1   3   4   3   N
 09   2   4   4   1   N
 10   7   0   2   2   A
 11   1   8   2   0   C
 12   4   1   0   6   W
 13   1   3   5   2   N
 14   0   2   1   8   T
 15   1   6   2   2   C
 16   1   5   4   1   S
 17   2   1   1   7   T
 18   0  10   1   0   C
 19   0  11   0   0   C
 20   0   0  11   0   G
 21   8   0   0   3   A
 22   7   0   4   0   R
 23   2   6   3   0   S
 XX
 ...
- In depth
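To make the scoring step concrete, the sketch below takes the counts of positions 04 to 06 of the Gal4 matrix above (consensus CGG), converts them to log-odds scores and slides the matrix along a short sequence. The pseudocount of 1, the uniform background frequency of 0.25 and the test sequence are assumptions chosen for illustration; they are not part of the TRANSFAC entry.

```python
import math

# Counts for positions 04-06 of the matrix above (columns A, C, G, T).
COUNTS = [
    {"A": 1, "C": 10, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 10, "T": 1},
    {"A": 0, "C": 1, "G": 10, "T": 0},
]


def log_odds(counts, pseudocount=1.0, background=0.25):
    """Convert count columns to log2(observed frequency / background)."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + pseudocount * len(column)
        matrix.append({base: math.log2((n + pseudocount) / total / background)
                       for base, n in column.items()})
    return matrix


def scan(sequence, matrix):
    """Score every window of the sequence; returns (offset, score) pairs."""
    width = len(matrix)
    return [(i, sum(matrix[j][sequence[i + j]] for j in range(width)))
            for i in range(len(sequence) - width + 1)]


pssm = log_odds(COUNTS)
hits = scan("ATCGGA", pssm)          # hypothetical sequence, A/C/G/T only
best_offset, best_score = max(hits, key=lambda hit: hit[1])
print(best_offset, round(best_score, 2))
```

The aggregate score is the sum of the per-position scores; the window starting at the CGG triplet scores highest because every one of its bases is the most frequent one in its column.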
UPGMA
- (Unweighted Pair Group Method With Arithmetic Mean) The simplest tree-building method for distance data. Trees are constructed in a bottom-up fashion, where at first each OTU is in its own cluster. At each step, the two clusters nearest to each other are combined into a higher-level cluster which replaces the originals. The distance between any two clusters A and B is taken to be the average of all distances between pairs of objects a in A and b in B. The result is a rooted tree. However, the implicit assumption that the evolutionary rates are constant is frequently not justified, thus UPGMA is not a very highly regarded method in phylogenetic analysis.
- In depth
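The clustering procedure described above can be sketched as follows. This is a minimal illustration that returns only the tree topology as nested tuples and omits branch lengths; the OTU names and distances are invented. The size-weighted update used here is equivalent to averaging all pairwise distances between the members of two clusters.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a rooted tree (nested tuples) from pairwise distances.

    names: list of OTU labels
    dist:  {frozenset({a, b}): distance} for every pair of OTUs
    """
    clusters = {name: 1 for name in names}      # cluster -> number of OTUs
    d = dict(dist)
    while len(clusters) > 1:
        # Find the two nearest clusters.
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        # Distance to every remaining cluster: average over all member pairs.
        for c in clusters:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


# Hypothetical distance data for four OTUs.
pairs = {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
         ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10}
distances = {frozenset(pair): value for pair, value in pairs.items()}

tree = upgma(["A", "B", "C", "D"], distances)
print(tree)
```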