BIO bootstrapping with PHYLIP


Bootstrapping PHYLIP trees

A brief overview of how to produce bootstrap results for PHYLIP trees.



 

Principle

  1. Create multiple bootstrapped copies (e.g. 100) of your input data using seqboot.
  2. Run your tree estimation program of choice using the M input option (analyze multiple data sets).
  3. Use the program consense to calculate your consensus tree.
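To make step 1 concrete, here is a minimal Python sketch of the resampling idea: alignment columns are drawn with replacement until the replicate is as long as the original. This is not seqboot's code; the function name `bootstrap_columns`, the toy alignment and the seed are assumptions made for illustration only.

```python
import random

def bootstrap_columns(alignment, rng):
    """Return one bootstrap replicate: columns sampled with replacement."""
    n_cols = len(next(iter(alignment.values())))
    # Draw n_cols column indices with replacement; sorting keeps the
    # sampled columns in their original left-to-right order.
    picks = sorted(rng.randrange(n_cols) for _ in range(n_cols))
    return {name: "".join(seq[i] for i in picks)
            for name, seq in alignment.items()}

# Toy alignment: ten columns from three sequences (illustration only).
alignment = {
    "Mbp1_SACCE": "IHSTGSIMKR",
    "Mbp1_NEUCR": "-VNNVAVMRR",
    "Mbp1_CANAL": "VTSEGPIMRR",
}

rng = random.Random(1)  # arbitrary seed for reproducibility
replicates = [bootstrap_columns(alignment, rng) for _ in range(100)]
print(replicates[0])
```

Every sequence in a replicate uses the same sampled columns, so the alignment stays consistent; only the set of sites changes from one replicate to the next.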


 

Input data

Create a PHYLIP input file with the usual infile filename. Something like this:

 7 77
KilA_ESCCO   ---------R AKDGYINATS MCRTAGKLLS DYTRLLSRDM GIPISEIQSF
Mbp1_SACCE   IHSTGSIMKR KKDDWVNATH ILKAANFAKA KRTRILEKEV LKE--THEKV
Mbp1_NEUCR   -VNNVAVMRR RHDDWVNATH ILKAAGFDKP ARTRILEREV QKD--THEKI
Mbp1_CANAL   VTSEGPIMRR KKDSWINATH ILKIAKFPKA KRTRILEKDV QTG--IHEKV
Mbp1_USTMA   IINNVAVMRR RSDDWLNATQ ILKVVGLDKP QRTRVLEREI QKG--IHEKV
Mbp1_ASPNI   -----SVMRR RSDDWINATH ILKVAGFDKP ARTRILEREV QKG--VHEKV
Mbp1_SCHPO   -IKGVSVMRR RRDSWLNATQ ILKVADFDKP QRTRVLERQV QIG--AHEKV

             KGGRPENQGT WVHPDIAINL AQ-----
             QGGFGKYQGT WVPLNIAKQL AEKFSVY
             QGGYGRYQGT WIPLEQAEAL ARRNNIY
             QGGYGKYQGT YVPLDLGAAI ARNFGVY
             QGGYGKYQGT WIPLDVAIEL AERYNI-
             QGGYGKYQGT WIPLQEGRQL AERNNI-
             QGGYGKYQGT WVPFQRGVDL ATKYKV-


 

seqboot

  1. Read the documentation for the seqboot program.
  2. Run seqboot on your infile.
  3. Set your parameters. I have used the defaults for this example. The random seed should be of the form 4n+1.
  4. The usual outfile is created. Here is the first bootstrap replicate from the run.
    7    77
KilA_ESCCO ---------- -RKKGGGYIA TTMMCCRRRL SIISSEIQQQ GGRRRNQQQQ GTWVPIIIAI
Mbp1_SACCE HHSSTGSIMK KRKKDDDWVA TTIILLKRRL E----THEEE GGFFFYQQQQ GTWVLIIIAK
Mbp1_NEUCR VVNNNVAVMR RRHHDDDWVA TTIILLKRRL E----THEEE GGYYYYQQQQ GTWILQQQAE
Mbp1_CANAL TTSSEGPIMR RRKKSSSWIA TTIILLKRRL E----IHEEE GGYYYYQQQQ GTYVLLLLGA
Mbp1_USTMA IINNNVAVMR RRSSDDDWLA TTIILLKRRL E----IHEEE GGYYYYQQQQ GTWILVVVAI
Mbp1_ASPNI ------SVMR RRSSDDDWIA TTIILLKRRL E----VHEEE GGYYYYQQQQ GTWILEEEGR
Mbp1_SCHPO IIKKGVSVMR RRRRSSSWLA TTIILLKRRL E----AHEEE GGYYYYQQQQ GTWVFRRRGV

           INNLLAAAQQ Q------
           KQQLLAAAEE EKKSSVY
           EAALLAAARR RRRNNIY
           AAAIIAAARR RNNGGVY
           IEELLAAAEE ERRNNI-
           RQQLLAAAEE ERRNNI-
           VDDLLAAATT TKKKKV-

Note how approximately 1/3 of the columns are duplicates of other columns, because the columns are resampled with replacement.
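The "about 1/3" figure can be checked with a short calculation. When n columns are drawn with replacement from n columns, the chance that a given original column is never drawn is (1 - 1/n)^n, which approaches 1/e for large n. The sketch below evaluates this for the 77 columns of the example; the function name is hypothetical.

```python
def expected_missing_fraction(n_cols):
    """Expected fraction of original columns absent from one replicate."""
    # Each of the n_cols draws misses a given column with probability
    # (1 - 1/n_cols); the draws are independent.
    return (1.0 - 1.0 / n_cols) ** n_cols

n_cols = 77  # alignment length from the input data
missing = expected_missing_fraction(n_cols)
print(f"expected fraction of columns left out: {missing:.3f}")
```

The slots freed by the left-out columns are filled by repeated copies of other columns, which is why roughly a third of each replicate consists of duplicates.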


 

proml

The output of seqboot works for most of the tree estimation programs. Be aware that running time will increase by a factor of 100 for 100 bootstrap replicates.

  1. Read the documentation for the proml program.
  2. Rename the previous outfile as the new infile.
  3. Run proml on your infile.
  4. Set your parameters. I have used the defaults for this example, with these exceptions: option S (not speedy and rough); option M for multiple datasets, answering D for data (not weights) when prompted, followed by the number of replicates (100) and a random seed; and "jumbling" only once. (While this is running, 5 minutes or so for my example, you can read about common input options, such as what "jumble" means.)
  5. The usual outfile and outtree are created. Have a look. Here are the first two trees from my outfile:
Data set # 1:
           +-Mbp1_USTMA
  +--------2  
  |        |  +-Mbp1_ASPNI
  |        +--3  
  |           |      +-Mbp1_NEUCR
  |           +------1  
  |                  |      +--------Mbp1_SCHPO
  |                  +------5  
  |                         +----------Mbp1_CANAL
  |  
  4---Mbp1_SACCE
  |  
  +-----------------------KilA_ESCCO
Ln Likelihood =  -818.27365

Data set # 2:
                  +----Mbp1_USTMA
               +--3  
               |  |   +------Mbp1_NEUCR
      +--------1  +---4  
      |        |      +Mbp1_ASPNI
  +---5        |  
  |   |        +-----Mbp1_SCHPO
  |   |  
  |   +---Mbp1_SACCE
  |  
  2-------Mbp1_CANAL
  |  
  +-----------------------------------------------KilA_ESCCO
Ln Likelihood =  -825.36962
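With 100 data sets in the outfile, it can help to pull out the log-likelihoods rather than scroll through the trees. The following is a minimal sketch that assumes the outfile uses exactly the "Data set # N:" and "Ln Likelihood =" lines seen in the excerpt above; the function name and the regular expressions are my own assumptions, so check them against your actual file.

```python
import re

def parse_log_likelihoods(outfile_text):
    """Map data set number -> Ln Likelihood, based on the excerpt's layout."""
    results = {}
    current = None
    for line in outfile_text.splitlines():
        m = re.match(r"\s*Data set # *(\d+):", line)
        if m:
            current = int(m.group(1))
            continue
        m = re.match(r"\s*Ln Likelihood\s*=\s*(-?\d+(?:\.\d+)?)", line)
        if m and current is not None:
            results[current] = float(m.group(1))
    return results

# Shortened sample with the tree drawings left out.
sample = """Data set # 1:
Ln Likelihood =  -818.27365

Data set # 2:
Ln Likelihood =  -825.36962
"""
likelihoods = parse_log_likelihoods(sample)
print(likelihoods)
```

To use it on a real run, read the file first, e.g. `parse_log_likelihoods(open("outfile").read())`.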


 

consense

You can use consense to calculate a consensus tree.

  1. Read the documentation for the consense program.
  2. Rename the previous outtree as the new intree.
  3. Run consense on your intree.
  4. Set your parameters. I have used the defaults for this example.
  5. The usual outfile is created, along with the consensus tree (outtree). Have a look.
          +-------------------------------Mbp1 SCHPO
          |
          |                       +-------Mbp1 SACCE
  +-------|               +--61.0-|
  |       |       +--52.0-|       +-------Mbp1 CANAL
  |       |       |       |
  |       +--26.0-|       +---------------KilA ESCCO
  |               |
  |               |               +-------Mbp1 NEUCR
  |               +----------69.0-|
  |                               +-------Mbp1 ASPNI
  |
  +---------------------------------------Mbp1 USTMA

The bootstrap values are poor overall. The reason is that the sequences are short to begin with, and eliminating 1/3 of the information by resampling makes the estimation process quite brittle. The topology of the tree is not quite right either: in order to get the correct species tree, the (SCHPO/USTMA) clade branchpoint would need to be moved up in the tree one level.
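To illustrate what a bootstrap value on a branch means, here is a toy Python sketch that counts how often each grouping of taxa appears across a list of Newick trees. It is not how consense is implemented. It assumes simple Newick strings without internal node labels, ignores branch lengths, and treats the trees as unrooted by recording each split as the side that does not contain the alphabetically first taxon. All function names and the example trees are hypothetical.

```python
import re
from collections import Counter

def clades(newick):
    """Return the leaf sets below every internal node of one Newick tree."""
    # Drop branch lengths and spaces; this sketch does not use them.
    text = re.sub(r":[-+0-9.eE]+", "", newick).replace(" ", "").rstrip(";")
    found, stack, name = set(), [], ""
    for ch in text:
        if ch == "(":
            stack.append([])          # start collecting leaves for a new node
        elif ch in ",)":
            if name:
                stack[-1].append(name)
                name = ""
            if ch == ")":
                leaves = stack.pop()
                found.add(frozenset(leaves))
                if stack:             # pass the leaves up to the parent node
                    stack[-1].extend(leaves)
        else:
            name += ch
    return found

def splits(newick):
    """Return the non-trivial splits of a tree, independent of rooting."""
    cl = clades(newick)
    all_leaves = max(cl, key=len)
    ref = min(all_leaves)             # reference taxon for a canonical side
    out = set()
    for c in cl:
        side = all_leaves - c if ref in c else c
        if 2 <= len(side) <= len(all_leaves) - 2:
            out.add(side)
    return out

def split_support(trees):
    """Fraction of trees in which each split occurs."""
    counts = Counter()
    for tree in trees:
        counts.update(splits(tree))
    return {s: n / len(trees) for s, n in counts.items()}

# Three made-up four-taxon trees; the second has branch lengths.
trees = [
    "((A,B),(C,D));",
    "(A:0.1,B:0.2,(C:0.3,D:0.4):0.05);",
    "((A,C),(B,D));",
]
support = split_support(trees)
for side, frac in sorted(support.items(), key=lambda kv: -kv[1]):
    print(sorted(side), round(frac, 2))
```

In this toy input the grouping of C with D is found in two of three trees, so its support is 2/3. A value such as 69.0 in the consense output has the same kind of meaning, counted over the replicate trees.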

This is what the tree looks like when I use retree to redraw it with KilA-N as the outgroup. However, the bootstrap values (shown here as fractions of the 100 replicates) had to be entered by hand from the data in outfile; PHYLIP can't do that for you :-(

  ┌─────────│KilA ESCCO
  │  
  │                                 ┌───────────────────│Mbp1 NEUCR
  │                   ┌────────0.69─│  
──│                   │             └───────────────────│Mbp1 ASPNI
  │         ┌──0.52───│  
  │         │         │    ┌───────────────────────────────────────│Mbp1 USTMA
  │         │         └0.26│  
  └─────────│              └───────────────────│Mbp1 SCHPO
            │  
            │           ┌───────────────────│Mbp1 SACCE
            └──────0.61─│
                        └───────────────────│Mbp1 CANAL



 

 

Further reading and resources