Difference between revisions of "Glossary"

Revision as of 23:48, 22 November 2006


E-value

Expectation-value of a BLAST search

The E-value reported for each BLAST hit is the number of alternative alignments, with the same or better total score, that would be expected to occur in the database purely by chance. Thus, the lower the E-value, the more significant the match. The value depends on the quality and length of the alignment, as well as on the size of the database.
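
As a rough illustration of how the E-value depends on alignment score and database size, the following sketch uses the Karlin-Altschul relation E = m * n * 2^(-S'), where S' is the normalized bit score, m the query length and n the total database length. The function and parameter names are hypothetical, and the effective-length corrections applied by real BLAST programs are not modeled here.

```python
def expected_hits(bit_score, query_length, database_length):
    """Approximate E-value from a normalized bit score.

    Assumes the simplified Karlin-Altschul relation E = m * n * 2**(-S'),
    without the edge-effect corrections used by real BLAST programs.
    """
    search_space = query_length * database_length
    return search_space * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Same score and query, database of increasing size: the E-value grows
    # in proportion to the database, so the same hit becomes less significant.
    for db_length in (1_000_000, 10_000_000, 100_000_000):
        print(db_length, expected_hits(40.0, 250, db_length))
```

Each additional bit of score halves the expected number of chance alignments, while doubling the database doubles it.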

FASTA format

FASTA is a simple, ASCII-based text-file format for biological sequences. At a minimum, a FASTA file comprises a header line beginning with the ">" character, followed by one or more lines containing a nucleic acid or protein sequence in one-letter code. This is the most common input format for bioinformatics analysis programs and services.
Example
>gi|3402004|pdb|1MB1|  Mbp1 From Saccharomyces Cerevisiae
MSNQIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGF
GKYQGTWVPLNIAKQLAEKFSVYDQLKPLFDFTQTDGSASPPPAPKHHHASKVDHHHHHH
In depth


HSP

High Scoring Pair

The fundamental unit of BLAST output. An HSP is an ungapped local alignment. The algorithm extends HSPs into so-called BLAST hits.

Example
Query  42   KVQGGFGKYQGTWVPLNIAKQLAEK
            +++GG+ K QGTW+P+ I++ L  + 
Sbjct  347  RIRGGYIKIQGTWLPMEISRLLCLR
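
Because an HSP is ungapped, its two segments can be compared position by position. The sketch below rebuilds only the identity part of the midline for the example above. The "+" marks for conservative substitutions are omitted because they require a substitution matrix, and the function name is hypothetical.

```python
def identity_midline(query, subject):
    """Return a midline showing identical residues of an ungapped pair."""
    if len(query) != len(subject):
        raise ValueError("ungapped segments must have equal length")
    # Keep the residue where both segments agree, a blank otherwise.
    return "".join(q if q == s else " " for q, s in zip(query, subject))


QUERY = "KVQGGFGKYQGTWVPLNIAKQLAEK"
SBJCT = "RIRGGYIKIQGTWLPMEISRLLCLR"

midline = identity_midline(QUERY, SBJCT)
identities = sum(ch != " " for ch in midline)

if __name__ == "__main__":
    print(QUERY)
    print(midline)
    print(SBJCT)
    print(f"identities: {identities}/{len(QUERY)}")
```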

multi FASTA file

A sequence file that contains more than one FASTA-formatted sequence. The sequences are simply concatenated. This is a common input format for multiple sequence alignment or motif-finding programs.
Example
>Homeobox associated Leucine Zipper from gi|3868845  (134..178)
KQTEVDCELLRKCCASLTEENRRLQMEVDQLRALSTTQLHFSDFV
>Homeobox associated Leucine Zipper from gi|21264431 (168..212)
KQTEVDCEFLKKCCETLADENIRLQKEIQELKTLKLTQPFYMHMP
>Homeobox associated Leucine Zipper from gi|6634483  (212..256)
KQTEVDCELLKRCCETLTDENRRLHRELQELRALKLATAAAAPHH
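
Since records are simply concatenated, a reader only needs to start a new record at each ">" line and join the sequence lines that follow. The following is a minimal sketch, not a full-featured parser; the function name is hypothetical, and the sample text reuses two records from the examples on this page.

```python
def parse_fasta(text):
    """Split FASTA or multi FASTA text into (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


EXAMPLE = """\
>gi|3402004|pdb|1MB1|  Mbp1 From Saccharomyces Cerevisiae
MSNQIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGF
GKYQGTWVPLNIAKQLAEKFSVYDQLKPLFDFTQTDGSASPPPAPKHHHASKVDHHHHHH
>Homeobox associated Leucine Zipper from gi|3868845  (134..178)
KQTEVDCELLRKCCASLTEENRRLQMEVDQLRALSTTQLHFSDFV
"""

records = parse_fasta(EXAMPLE)

if __name__ == "__main__":
    for header, sequence in records:
        print(header, len(sequence))
```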
In depth
Wikipedia entry: http://en.wikipedia.org/wiki/Fasta_format

PSSM

(Position Specific Scoring Matrix, synonym Weight Matrix) A matrix that scores all possible characters (usually nucleotides or amino acids) at each position of a pattern. A PSSM is typically used in probabilistic (as opposed to deterministic) pattern matching. Each substring of the sequence is evaluated against the PSSM and an aggregate score is computed. This aggregate score then characterizes the probability that the substring is or is not an example of the pattern represented by the PSSM. The terms profile and PSSM are often used interchangeably; one possible distinction is that a profile is a special case of a PSSM in which the scores are derived from a sequence alignment.
Example - excerpt from the TRANSFAC matrix entry for Gal4
...
AC   M00049
ID   F$GAL4_01 
DE   GAL4
BF   T00302 GAL4; Species: yeast, Saccharomyces cerevisiae.
XX
PO      A      C      G      T 
01      1      5      3      2      N
02      5      2      1      3      N
03      3      2      1      5      N
04      1     10      0      0      C
05      0      0     10      1      G
06      0      1     10      0      G
07      4      3      3      1      N
08      1      3      4      3      N
09      2      4      4      1      N
10      7      0      2      2      A
11      1      8      2      0      C
12      4      1      0      6      W
13      1      3      5      2      N
14      0      2      1      8      T
15      1      6      2      2      C
16      1      5      4      1      S
17      2      1      1      7      T
18      0     10      1      0      C
19      0     11      0      0      C
20      0      0     11      0      G
21      8      0      0      3      A
22      7      0      4      0      R
23      2      6      3      0      S
XX
...
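
To make the scoring step concrete, the sketch below converts the counts from the excerpt above into log-odds scores and slides the resulting matrix along a sequence. The pseudocount of 1, the uniform background frequency of 0.25 and all function names are assumptions chosen for illustration; they are not part of the TRANSFAC entry.

```python
import math

ALPHABET = "ACGT"

# Counts per position (A, C, G, T), copied from the matrix excerpt above.
COUNTS = [
    (1, 5, 3, 2), (5, 2, 1, 3), (3, 2, 1, 5), (1, 10, 0, 0),
    (0, 0, 10, 1), (0, 1, 10, 0), (4, 3, 3, 1), (1, 3, 4, 3),
    (2, 4, 4, 1), (7, 0, 2, 2), (1, 8, 2, 0), (4, 1, 0, 6),
    (1, 3, 5, 2), (0, 2, 1, 8), (1, 6, 2, 2), (1, 5, 4, 1),
    (2, 1, 1, 7), (0, 10, 1, 0), (0, 11, 0, 0), (0, 0, 11, 0),
    (8, 0, 0, 3), (7, 0, 4, 0), (2, 6, 3, 0),
]


def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Turn raw counts into log2-odds scores (assumed uniform background)."""
    matrix = []
    for row in counts:
        total = sum(row) + pseudocount * len(row)
        matrix.append({
            base: math.log2(((count + pseudocount) / total) / background)
            for base, count in zip(ALPHABET, row)
        })
    return matrix


def score_window(matrix, window):
    """Aggregate score: sum of the per-position scores of one substring."""
    return sum(column[base] for column, base in zip(matrix, window))


def best_hit(matrix, sequence):
    """Score every substring of matrix width; return (score, start index)."""
    width = len(matrix)
    return max(
        (score_window(matrix, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


def consensus(counts):
    """Most frequent base per position (first base wins ties)."""
    return "".join(ALPHABET[row.index(max(row))] for row in counts)


if __name__ == "__main__":
    pssm = log_odds_matrix(COUNTS)
    sequence = "TTTT" + consensus(COUNTS) + "AAAA"
    print(best_hit(pssm, sequence))
```

Summing log-odds scores treats the positions as independent, which is the usual simplifying assumption behind a weight matrix.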
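In depth
Review by Gary D. Stormo: http://bioinformatics.oxfordjournals.org/cgi/content/abstract/16/1/16?ijkey=6086844f842fa18cbd55821a0dd25e365b7b6a1a&keytype2=tf_ipsecsha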