BIO bootstrapping with PHYLIP
Bootstrapping PHYLIP trees
A brief overview of how to produce bootstrap results for PHYLIP trees.
Principle
- Create multiple bootstrapped copies (e.g. 100) of your input data using seqboot.
- Run your tree estimation program of choice using the M input option (analyze multiple data sets).
- Use the program consense to calculate your consensus tree.
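Between these steps, each program's output has to be renamed to the file name the next program reads (outfile becomes infile, outtree becomes intree). A minimal Python sketch of that housekeeping follows. The helper name `stage_for_next_step` and the `.prev` backup suffix are assumptions made for illustration; they are not part of PHYLIP.

```python
import shutil
import tempfile
from pathlib import Path

def stage_for_next_step(workdir, produced, expected):
    """Move one program's output to the input name the next program reads.

    Example: produced="outfile", expected="infile" between seqboot and the
    tree program. An existing `expected` file is kept under a ".prev"
    suffix (hypothetical convention) so the original data are not lost.
    """
    workdir = Path(workdir)
    src, dst = workdir / produced, workdir / expected
    if not src.exists():
        raise FileNotFoundError(f"{src} not found; did the previous step run?")
    if dst.exists():
        dst.replace(dst.with_name(dst.name + ".prev"))
    shutil.move(str(src), str(dst))
    return dst

# Demonstration in a scratch directory with placeholder file contents.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "infile").write_text("original alignment\n")
    Path(tmp, "outfile").write_text("bootstrap replicates\n")
    stage_for_next_step(tmp, "outfile", "infile")
    staged = Path(tmp, "infile").read_text()
    backup = Path(tmp, "infile.prev").read_text()
```

Keeping a backup of the previous infile is a design choice of this sketch: it makes it easy to return to the original alignment after the bootstrap run.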
Input data
Create a PHYLIP input file with the usual infile filename. Something like this:
7 77
KilA_ESCCO ---------R AKDGYINATS MCRTAGKLLS DYTRLLSRDM GIPISEIQSF
Mbp1_SACCE IHSTGSIMKR KKDDWVNATH ILKAANFAKA KRTRILEKEV LKE--THEKV
Mbp1_NEUCR -VNNVAVMRR RHDDWVNATH ILKAAGFDKP ARTRILEREV QKD--THEKI
Mbp1_CANAL VTSEGPIMRR KKDSWINATH ILKIAKFPKA KRTRILEKDV QTG--IHEKV
Mbp1_USTMA IINNVAVMRR RSDDWLNATQ ILKVVGLDKP QRTRVLEREI QKG--IHEKV
Mbp1_ASPNI -----SVMRR RSDDWINATH ILKVAGFDKP ARTRILEREV QKG--VHEKV
Mbp1_SCHPO -IKGVSVMRR RRDSWLNATQ ILKVADFDKP QRTRVLERQV QIG--AHEKV
KGGRPENQGT WVHPDIAINL AQ-----
QGGFGKYQGT WVPLNIAKQL AEKFSVY
QGGYGRYQGT WIPLEQAEAL ARRNNIY
QGGYGKYQGT YVPLDLGAAI ARNFGVY
QGGYGKYQGT WIPLDVAIEL AERYNI-
QGGYGKYQGT WIPLQEGRQL AERNNI-
QGGYGKYQGT WVPFQRGVDL ATKYKV-
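The file above is in interleaved layout: a header with the number of sequences and sites, a first block in which each line starts with a 10-character name, and further blocks without names. As a rough illustration of that layout, here is a small Python sketch that reads such a file into a dictionary. The function name and the toy alignment are made up for this example, and the sketch assumes strict 10-character names and no blank lines inside a block.

```python
def parse_interleaved_phylip(text):
    """Parse an interleaved PHYLIP alignment into {name: sequence}.

    Assumes the first 10 characters of each line in the first block are
    the name, and that later blocks list sequences in the same order.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_seq, n_sites = (int(x) for x in lines[0].split()[:2])
    names, seqs = [], []
    for i, line in enumerate(lines[1:]):
        if i < n_seq:
            names.append(line[:10].strip())
            seqs.append("")
            chunk = line[10:]
        else:
            chunk = line
        # Remove the spaces that separate groups of ten residues.
        seqs[i % n_seq] += "".join(chunk.split())
    for name, seq in zip(names, seqs):
        if len(seq) != n_sites:
            raise ValueError(f"{name}: expected {n_sites} sites, got {len(seq)}")
    return dict(zip(names, seqs))

# Made-up toy alignment: 2 sequences, 14 sites, two interleaved blocks.
toy_text = """ 2 14
Taxon_0001 ACDEFGHIKL
Taxon_0002 ACDEYGHIKL
MNPQ
MNPR
"""
toy_alignment = parse_interleaved_phylip(toy_text)
```

The length check against the header catches the most common formatting slip, a block with a missing or extra line.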
seqboot
- Read the documentation for the seqboot program.
- Run seqboot on your infile.
- Set your parameters. I have used the defaults for this example. The random seed should be of the form 4n+1.
- The usual outfile is created. Here is the first bootstrap replicate from the run.
7 77
KilA_ESCCO ---------- -RKKGGGYIA TTMMCCRRRL SIISSEIQQQ GGRRRNQQQQ GTWVPIIIAI
Mbp1_SACCE HHSSTGSIMK KRKKDDDWVA TTIILLKRRL E----THEEE GGFFFYQQQQ GTWVLIIIAK
Mbp1_NEUCR VVNNNVAVMR RRHHDDDWVA TTIILLKRRL E----THEEE GGYYYYQQQQ GTWILQQQAE
Mbp1_CANAL TTSSEGPIMR RRKKSSSWIA TTIILLKRRL E----IHEEE GGYYYYQQQQ GTYVLLLLGA
Mbp1_USTMA IINNNVAVMR RRSSDDDWLA TTIILLKRRL E----IHEEE GGYYYYQQQQ GTWILVVVAI
Mbp1_ASPNI ------SVMR RRSSDDDWIA TTIILLKRRL E----VHEEE GGYYYYQQQQ GTWILEEEGR
Mbp1_SCHPO IIKKGVSVMR RRRRSSSWLA TTIILLKRRL E----AHEEE GGYYYYQQQQ GTWVFRRRGV
INNLLAAAQQ Q------
KQQLLAAAEE EKKSSVY
EAALLAAARR RRRNNIY
AAAIIAAARR RNNGGVY
IEELLAAAEE ERRNNI-
RQQLLAAAEE ERRNNI-
VDDLLAAATT TKKKKV-
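Note how approximately 1/3 of the columns are duplicates. Columns are sampled with replacement, so about 1/e (roughly 37%) of the original columns are left out of each replicate and others appear more than once.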
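To make the resampling concrete, here is a minimal Python sketch of column bootstrapping. It is not seqboot's code: it uses Python's own random number generator, and the function names and toy sequences are assumptions for illustration. The sampled column indices are sorted so that duplicated columns sit next to each other, as they do in the replicate shown above. The seed check only mirrors the 4n+1 recommendation; Python's generator does not require it.

```python
import random

def is_4n_plus_1(seed):
    """True if the seed has the form 4n+1 recommended in the text."""
    return seed % 4 == 1

def bootstrap_columns(seqs, rng):
    """Resample alignment columns with replacement.

    seqs maps names to equal-length aligned strings. Returns the replicate
    and the sorted list of sampled column indices.
    """
    n_sites = len(next(iter(seqs.values())))
    picks = sorted(rng.randrange(n_sites) for _ in range(n_sites))
    replicate = {name: "".join(seq[i] for i in picks)
                 for name, seq in seqs.items()}
    return replicate, picks

def expected_unsampled_fraction(n_sites):
    """Probability that a given column is never drawn in n_sites draws."""
    return (1 - 1 / n_sites) ** n_sites

# Made-up toy alignment, not the Mbp1 data.
toy = {"Taxon_0001": "ACDEFGHIKLMNPQ", "Taxon_0002": "ACDEYGHIKLMNPR"}
seed = 5  # 4*1 + 1
assert is_4n_plus_1(seed)
replicate, picks = bootstrap_columns(toy, random.Random(seed))
duplicates = len(picks) - len(set(picks))
print(replicate, duplicates, round(expected_unsampled_fraction(77), 3))
```

For the 77-column alignment used here, `expected_unsampled_fraction(77)` is a little below 1/e, which is where the "approximately 1/3" comes from.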
proml
The output of seqboot works for most of the tree estimation programs. Be aware that running time will increase by a factor of 100 for 100 bootstrap replicates.
- Read the documentation for the proml program.
- Rename the previous outfile as the new infile.
- Run proml on your infile.
- Set your parameters. I have used the defaults for this example, except for choosing the option S (not speedy and rough), the M option for multiple datasets and, as prompted, D for data (not weights), the number of replicates (100), a random seed, and "jumbling" only once. (While this is running, 5 minutes or so for my example, you can read about common input options, such as what "jumble" means, in the PHYLIP documentation.)
- The usual outfile and outtree are created. Have a look. Here are the first two trees from my outfile:
Data set # 1:
+-Mbp1_USTMA
+--------2
| | +-Mbp1_ASPNI
| +--3
| | +-Mbp1_NEUCR
| +------1
| | +--------Mbp1_SCHPO
| +------5
| +----------Mbp1_CANAL
|
4---Mbp1_SACCE
|
+-----------------------KilA_ESCCO
Ln Likelihood = -818.27365
Data set # 2:
+----Mbp1_USTMA
+--3
| | +------Mbp1_NEUCR
+--------1 +---4
| | +Mbp1_ASPNI
+---5 |
| | +-----Mbp1_SCHPO
| |
| +---Mbp1_SACCE
|
2-------Mbp1_CANAL
|
+-----------------------------------------------KilA_ESCCO
Ln Likelihood = -825.36962
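With 100 data sets the outfile gets long, and it can help to pull out one number per replicate. The following Python sketch collects the log likelihoods, assuming every data set reports a line of the form "Ln Likelihood = ..." as in the excerpt above. The function name is hypothetical and the sample text is cut down from the two trees shown.

```python
import re

# Matches lines such as "Ln Likelihood = -818.27365".
LNL_PATTERN = re.compile(r"Ln Likelihood\s*=\s*(-?\d+(?:\.\d+)?)")

def log_likelihoods(outfile_text):
    """Return the log likelihoods in the order they appear in the text."""
    return [float(m) for m in LNL_PATTERN.findall(outfile_text)]

sample = """Data set # 1:
Ln Likelihood = -818.27365
Data set # 2:
Ln Likelihood = -825.36962
"""
values = log_likelihoods(sample)
print(len(values), min(values), max(values))
```

On a real run one would read the text with `open("outfile").read()` and check that the number of values matches the number of replicates.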
consense
You can use consense to calculate a consensus tree.
- Read the documentation for the consense program.
- Rename the previous outtree as the new intree.
- Run consense on your intree.
- Set your parameters. I have used the defaults for this example.
- The usual outfile is created, along with the consensus tree (outtree). Have a look.
+-------------------------------Mbp1 SCHPO
|
| +-------Mbp1 SACCE
+-------| +--61.0-|
| | +--52.0-| +-------Mbp1 CANAL
| | | |
| +--26.0-| +---------------KilA ESCCO
| |
| | +-------Mbp1 NEUCR
| +----------69.0-|
| +-------Mbp1 ASPNI
|
+---------------------------------------Mbp1 USTMA
The bootstrap values are poor overall. The reason is that the sequences are short to begin with, and eliminating 1/3 of the information by resampling makes the estimation process quite brittle. The topology of the tree is not quite right either. This is what it looks like when I use retree to redraw it with KilA-N as the outgroup. However, the bootstrap values had to be put in by hand from the data in outfile; PHYLIP can't do that for you :-(
┌─────────│KilA ESCCO
│
│ ┌───────────────────│Mbp1 NEUCR
│ ┌────────0.69─│
──│ │ └───────────────────│Mbp1 ASPNI
│ ┌──0.52───│
│ │ │ ┌───────────────────────────────────────│Mbp1 USTMA
│ │ └0.26│
└─────────│ └───────────────────│Mbp1 SCHPO
│
│ ┌───────────────────│Mbp1 SACCE
└──────0.61─│
└───────────────────│Mbp1 CANAL
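The support values in the tree above are the fraction of bootstrap trees in which a given split of the taxa occurs (61.0 out of 100 trees becomes 0.61). As a way to cross-check values copied by hand, here is a rough Python sketch that counts split frequencies in a set of Newick trees. It is not how consense is implemented. It assumes trees without internal node labels, treats them as unrooted, and uses made-up four-taxon trees as the example.

```python
import re
from collections import Counter

def clades(newick):
    """Return the leaf set below every internal node; the root comes last."""
    text = re.sub(r":[^,();]+", "", newick)  # drop branch lengths
    stack, found = [], []
    for tok in re.findall(r"[(),;]|[^(),;\s]+", text):
        if tok == "(":
            stack.append(set())
        elif tok == ")":
            done = frozenset(stack.pop())
            found.append(done)
            if stack:
                stack[-1].update(done)
        elif tok not in (",", ";"):
            stack[-1].add(tok)  # leaf name
    return found

def split_support(trees):
    """Map each non-trivial split to the fraction of trees containing it.

    A split is stored as the side that does not contain the
    alphabetically first taxon, so rooting does not matter.
    """
    counts = Counter()
    for newick in trees:
        found = clades(newick)
        taxa = found[-1]
        ref = min(taxa)
        splits = set()
        for clade in found[:-1]:
            side = taxa - clade if ref in clade else clade
            if 2 <= len(side) <= len(taxa) - 2:
                splits.add(side)
        counts.update(splits)
    return {split: n / len(trees) for split, n in counts.items()}

# Made-up example: three small trees on taxa A to D.
toy_trees = ["((A,B),C,D);", "(A,B,(C,D));", "((A,C),B,D);"]
support = split_support(toy_trees)
```

A file such as intree holds one tree per semicolon, so a list of trees can be obtained with something like `[t + ";" for t in text.split(";") if t.strip()]`.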
Further reading and resources