Glossary

E-value

Expectation-value of a BLAST search

The E-value reported for each BLAST-hit represents the number of alternate alignments, with the same or better total score, that could be expected to occur in the database purely by chance. Thus, the lower the E-value, the more significant the match. The value depends upon the quality and length of the alignment, as well as the size of the database.
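
The dependence on score and database size can be made concrete with a small sketch. It assumes the standard relation between a normalized bit score S' and the expected number of chance hits, E = m * n * 2^(-S'), where m is the query length and n is the total database length. The function and parameter names are illustrative only, and the sketch ignores the effective-length corrections that real BLAST implementations apply.

```python
def expected_chance_hits(bit_score, query_length, database_length):
    """Approximate E-value from a bit score: E = m * n * 2**(-S').

    Assumption: raw lengths are used in place of BLAST's effective
    lengths, so values will differ somewhat from real BLAST output.
    """
    search_space = query_length * database_length
    return search_space * 2.0 ** (-bit_score)


# Doubling the database size doubles the E-value for the same score,
# while each additional bit of score halves it.
small_db = expected_chance_hits(40, 250, 1_000_000)
large_db = expected_chance_hits(40, 250, 2_000_000)
```

With these assumptions, the same alignment becomes less significant simply because the database has grown.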

FASTA format

FASTA is a simple, ASCII-based text-file format for biological sequences. At minimum, a FASTA file comprises a header line that begins with the ">" character, followed by one or more lines containing a nucleic acid or protein sequence in one-letter code. This is the most common input format for bioinformatics analysis programs and services.

Example

>gi|3402004|pdb|1MB1|  Mbp1 From Saccharomyces Cerevisiae
MSNQIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGF
GKYQGTWVPLNIAKQLAEKFSVYDQLKPLFDFTQTDGSASPPPAPKHHHASKVDHHHHHH

In depth

Wikipedia entry

Homology

Homologue

Two sequences are homologues if they are derived from a common ancestor.

Orthologue

Two homologous sequences are orthologues if their divergence is the result of a speciation event.

Paralogue

Two homologous sequences are paralogues if their divergence is the result of a gene duplication event.

These concepts are the keystones of bioinformatics analysis. Since evolution usually proceeds gradually, homologous proteins usually have similar functions. Without exception they also have similar 3D structure. We expect orthologues to have roles that are as nearly identical as possible in different species; we expect paralogues to have acquired distinctly different roles, else they would not have been propagated as duplicated, independently selected genes.

It is a common mistake to speak of sequences as e.g. "35% homologous". As defined above, homology is a quality, not a quantity. The quantity we should speak of is similarity and we usually measure sequence similarity to infer whether two genes may be homologues. Strictly speaking, this cannot be proven, since in general we have no way of confirming the evolutionary path through every single ancestral generation.

In practice we usually use the reciprocal best match condition to create a mapping of orthologues between genomes.
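
The reciprocal best match condition can be sketched as follows. This is a minimal illustration, not a full orthologue-mapping pipeline: it assumes that all-against-all search scores between two genomes are already available as nested dictionaries, and the gene identifiers (`a1`, `b1`, ...) and scores are made up for the example.

```python
def best_hits(scores):
    """Map each query to its single highest-scoring subject.

    scores: dict of query id -> dict of subject id -> score.
    """
    return {query: max(hits, key=hits.get)
            for query, hits in scores.items() if hits}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return (a, b) pairs where a's best hit is b AND b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical scores: genome A searched against genome B, and vice versa.
a_vs_b = {"a1": {"b1": 200, "b2": 50}, "a2": {"b1": 180, "b2": 40}}
b_vs_a = {"b1": {"a1": 210, "a2": 170}, "b2": {"a1": 45, "a2": 60}}

pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

Here `a2` also finds `b1` as its best match, but `b1` prefers `a1`, so only the pair (`a1`, `b1`) satisfies the condition.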

HSP

High Scoring Pair

The fundamental unit of BLAST output. An HSP consists of an ungapped, local alignment result. HSPs are extended by the algorithm to so-called BLAST hits.

Example

Query  42   KVQGGFGKYQGTWVPLNIAKQLAEK
            +++GG+ K QGTW+P+ I++ L  + 
Sbjct  347  RIRGGYIKIQGTWLPMEISRLLCLR
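
Because an HSP is ungapped, its two segments can be compared position by position. The following sketch counts the identical positions in the example above and reports percent identity, a simple similarity measure. The function name is illustrative, and the "+" marks in the middle line (conservative substitutions) are not reproduced here because that would require a substitution matrix.

```python
QUERY = "KVQGGFGKYQGTWVPLNIAKQLAEK"
SBJCT = "RIRGGYIKIQGTWLPMEISRLLCLR"


def count_identities(seq_a, seq_b):
    """Count positions with identical residues in an ungapped alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("ungapped alignment requires equal-length segments")
    return sum(a == b for a, b in zip(seq_a, seq_b))


identities = count_identities(QUERY, SBJCT)
percent_identity = 100.0 * identities / len(QUERY)
```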

multi FASTA file

A sequence file that contains more than one FASTA formatted sequence. The sequences are simply concatenated. This is a common input format for multiple sequence alignment or motif-finding programs.

Example

>Homeobox associated Leucine Zipper from gi|3868845  (134..178)
KQTEVDCELLRKCCASLTEENRRLQMEVDQLRALSTTQLHFSDFV
>Homeobox associated Leucine Zipper from gi 21264431 (168..212)
KQTEVDCEFLKKCCETLADENIRLQKEIQELKTLKLTQPFYMHMP
>Homeobox associated Leucine Zipper from gi|6634483  (212.. 256)
KQTEVDCELLKRCCETLTDENRRLHRELQELRALKLATAAAAPHH
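
Since the records are simply concatenated, a reader only needs to start a new record at every ">" line. The following is a minimal sketch of such a parser; the function name is illustrative, and the sketch does not validate the sequence alphabet or handle format variants beyond what is described here.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text.

    Sequence lines belonging to one record are joined into a single string.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>seq1 first record
KQTEVDCELL
RKCCASLTEE
>seq2 second record
KQTEVDCEFL
"""
records = parse_fasta(example)
```

The same function reads a single-sequence FASTA file, which is just the special case of one record.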

In depth

Wikipedia entry

PSSM

(Position Specific Scoring Matrix, synonym Weight Matrix) A matrix that scores all possible characters (usually nucleotides or amino acids) in each position of a pattern. A PSSM will typically be used in probabilistic (as opposed to deterministic) pattern matching. Each substring of the sequence is evaluated against the PSSM and an aggregate score is computed. This aggregate score then characterizes the probability that this substring is or is not an example of the pattern represented by the PSSM. Often the terms profile and PSSM are used interchangeably; a possible distinction would be that a profile is a special case of a PSSM in which the scores are derived from a sequence alignment.

Example - excerpt from the TRANSFAC matrix entry for Gal4

...
AC   M00049
ID   F$GAL4_01 
DE   GAL4
BF   T00302 GAL4; Species: yeast, Saccharomyces cerevisiae.
XX
PO      A      C      G      T 
01      1      5      3      2      N
02      5      2      1      3      N
03      3      2      1      5      N
04      1     10      0      0      C
05      0      0     10      1      G
06      0      1     10      0      G
07      4      3      3      1      N
08      1      3      4      3      N
09      2      4      4      1      N
10      7      0      2      2      A
11      1      8      2      0      C
12      4      1      0      6      W
13      1      3      5      2      N
14      0      2      1      8      T
15      1      6      2      2      C
16      1      5      4      1      S
17      2      1      1      7      T
18      0     10      1      0      C
19      0     11      0      0      C
20      0      0     11      0      G
21      8      0      0      3      A
22      7      0      4      0      R
23      2      6      3      0      S
XX
...
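
Scoring substrings against such a matrix can be sketched as follows. The example uses only rows 04 to 06 of the count matrix above (consensus C, G, G) to stay short. Converting counts to log-odds scores with a pseudocount of 1 and a uniform background of 0.25 per nucleotide is an assumption made for illustration; it is not part of the TRANSFAC entry, and the function names are hypothetical.

```python
import math

# Rows 04-06 of the Gal4 count matrix quoted above.
COUNTS = [
    {"A": 1, "C": 10, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 10, "T": 1},
    {"A": 0, "C": 1, "G": 10, "T": 0},
]


def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Turn per-position counts into log2-odds scores (assumed scheme)."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + pseudocount * len(column)
        matrix.append({
            base: math.log2(((n + pseudocount) / total) / background)
            for base, n in column.items()
        })
    return matrix


def scan(sequence, matrix):
    """Score every substring of matrix width; return (start, score) pairs."""
    width = len(matrix)
    return [
        (start, sum(matrix[i][sequence[start + i]] for i in range(width)))
        for start in range(len(sequence) - width + 1)
    ]


scores = scan("ATCGGA", log_odds_matrix(COUNTS))
best_start, best_score = max(scores, key=lambda item: item[1])
```

The aggregate score is the sum of the per-position scores, so the window that matches the consensus at every position scores highest.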

In depth

Review by Gary D. Stormo

UPGMA

(Unweighted Pair Group Method With Arithmetic Mean) The simplest tree-building method for distance data. Trees are constructed in a bottom-up fashion, where at first each OTU is in its own cluster. At each step, the two clusters nearest to each other are combined into a higher-level cluster which replaces the originals. The distance between any two clusters A and B is taken to be the average of all distances between pairs of objects a in A and b in B. The result is a rooted tree. However, the implicit assumption that evolutionary rates are constant is frequently not justified, so UPGMA is not a highly regarded method in phylogenetic analysis.
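
The clustering procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: distances are supplied as a dictionary keyed by pairs of OTU labels, the tree is returned as nested tuples, and the OTU names and distances are made up. Updating distances as a size-weighted mean of the two merged clusters is equivalent to averaging over all pairs of members.

```python
from itertools import combinations


def upgma(names, pair_dist):
    """Build a rooted tree (nested tuples) and node heights from distances.

    names: list of OTU labels; pair_dist: dict of (a, b) -> distance.
    """
    size = {name: 1 for name in names}
    height = {name: 0.0 for name in names}
    d = {frozenset(pair): float(v) for pair, v in pair_dist.items()}
    active = list(names)
    while len(active) > 1:
        # Pick the two nearest clusters.
        a, b = min(combinations(active, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        size[merged] = size[a] + size[b]
        height[merged] = d[frozenset((a, b))] / 2
        active = [c for c in active if c not in (a, b)]
        # Average distance from the new cluster to every remaining cluster.
        for c in active:
            d[frozenset((merged, c))] = (
                size[a] * d[frozenset((a, c))] + size[b] * d[frozenset((b, c))]
            ) / size[merged]
        active.append(merged)
    return active[0], height


# Hypothetical distances between four OTUs.
distances = {
    ("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
    ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10,
}
tree, heights = upgma(["A", "B", "C", "D"], distances)
```

Placing each internal node at half the distance between its two child clusters is exactly where the constant-rate assumption enters.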

In depth

Detailed example and illustration, Fred Opperdoes at UCL, Belgium
