Tools for the bioinformatics lab
EMBOSS
EMBOSS installation
- Outdated (written 2006).
Download
- 1. Navigate to the EMBOSS download page on SourceForge and read the information on the latest download there. As of this writing, the latest major release is version 3.0.
- 2. Download this compressed archive.
- 3. open a terminal session, navigate to your download directory and type the usual (remember to use the tab key for filename completion :-):
gunzip EMBOSS-3.0.0.tar.gz
tar -xvf EMBOSS-3.0.0.tar
rm EMBOSS-3.0.0.tar
cd EMBOSS-3.0.0
Before you begin, it may be a good idea to browse through some of the downloaded files to get oriented. These include:
INSTALL
KNOWN_BUGS (this is an empty file in this release)
README
Compile
EMBOSS needs a number of system-specific options to be set, so its makefile must first be generated by running the program configure (1). Type:
configure
Then type
make
Compilation will run for some time. Then type
sudo make install
and finally
make clean
Test
First see whether installation was successful in principle. Typing
ls /usr/local/share/EMBOSS/data/
should list some of the data resources that have been installed and where they are located. Now open a new shell and type
tfm needle
You should see man-like help pages for EMBOSS commands. In fact, tfm is itself an EMBOSS command: it runs a program that formats and displays help files. If the above works, it tells you two things: (i) that the EMBOSS programs have been compiled and installed, and (ii) that the installation is on your PATH.
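The PATH part of this check can also be scripted. The following is a minimal Python sketch, assuming Python 3 is installed (it is not part of EMBOSS); the helper name on_path is hypothetical and only wraps the standard library call shutil.which.

```python
import shutil


def on_path(command):
    """Return the full path of `command` if it is found on PATH, else None."""
    return shutil.which(command)


# Report where the two EMBOSS commands used in these notes were found, if at all.
for name in ("tfm", "needle"):
    location = on_path(name)
    print(name, "->", location if location else "not found on PATH")
```

If either command is reported as not found, check that the directory into which make install placed the binaries is listed in the output of echo $PATH.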
Next, try a simple pairwise alignment. Create two sequence files (2):
- HBA.fa
>HBA_HUMAN
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH
GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKL
LSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
- HBB.fa
>HBB_HUMAN
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLST
PDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDP
ENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
and type the following command (note: I am using the multiline command character "\" here to wrap the command, but it could also be typed all on one line):
needle -asequence HBA.fa -bsequence HBB.fa \
  -gapopen 10.0 -gapextend 0.5 -datafile EBLOSUM62 \
  -outfile test.ali
Then typing
cat test.ali
should give you output like the following (the Rundate and Report_file lines will reflect your own run):
########################################
# Program: needle
# Rundate: Sun Mar 02 2006 14:32:03
# Align_format: srspair
# Report_file: test2.ali
########################################

#=======================================
#
# Aligned_sequences: 2
# 1: HBA_HUMAN
# 2: HBB_HUMAN
# Matrix: EBLOSUM62
# Gap_penalty: 10.0
# Extend_penalty: 0.5
#
# Length: 148
# Identity: 63/148 (42.6%)
# Similarity: 88/148 (59.5%)
# Gaps: 9/148 ( 6.1%)
# Score: 290.5
#
#
#=======================================

HBA_HUMAN          1 -VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DL     48
                      .|:|.:|:.|.|.||||  :..|.|.|||.|:.:.:|.|:.:|..| ||
HBB_HUMAN          1 VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDL     48

HBA_HUMAN         49 S-----HGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRV     93
                     |     .|:.:||.|||||..|.::.:||:|::....:.||:||..||.|
HBB_HUMAN         49 STPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHV     98

HBA_HUMAN         94 DPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR    141
                     ||.||:||.:.|:..||.|...||||.|.|:..|.:|.|:..|..||.
HBB_HUMAN         99 DPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH    146


#---------------------------------------
#---------------------------------------
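To see how the Length, Identity and Gaps lines in the report header relate to the alignment itself, the two aligned rows can be compared column by column. This is only an illustrative Python sketch, not how needle computes its report: the function name alignment_stats is hypothetical, and the two rows are the aligned HBA_HUMAN and HBB_HUMAN lines from the output, joined across the three blocks.

```python
# Aligned rows copied from the needle output ('-' marks a gap).
ROW_A = (
    "-VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DL"
    "S-----HGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRV"
    "DPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR"
)
ROW_B = (
    "VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDL"
    "STPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHV"
    "DPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH"
)


def alignment_stats(row_a, row_b):
    """Return (length, identical_columns, gap_columns) for two aligned rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    # A column is identical when both rows carry the same residue (not a gap).
    identical = sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")
    # A column counts as a gap when either row has '-'.
    gaps = sum(1 for a, b in zip(row_a, row_b) if a == "-" or b == "-")
    return len(row_a), identical, gaps


length, identical, gaps = alignment_stats(ROW_A, ROW_B)
print(f"Identity: {identical}/{length} ({100 * identical / length:.1f}%)")  # 63/148 (42.6%)
print(f"Gaps: {gaps}/{length} ({100 * gaps / length:.1f}%)")                # 9/148 (6.1%)
```

The Similarity line cannot be recomputed this way without the EBLOSUM62 matrix, since it depends on which substitutions score positively.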
- That should be all
- as usual: e-mail me in case things do not work as expected.
Notes
(1) In these notes, I assume the "." directory is on your PATH - if it is not, you may have to prepend "./" to commands to tell the operating system the executable file for the command is in your current working directory.
(2) My favorite quick and dirty way to create a text file (e.g. called file.txt) based on something I can copy and paste, is to type
cat > file.txt
then I simply paste the contents and close the file with <ctrl>d. Ask me if you don't understand how this works.
Clustal
Clustal installation
- Outdated (written 2006).
- Download
- 1. navigate to the CLUSTAL homepage at the EBI:
- 2. At the top of the page, there are icons for Mac and Linux installations (Windows too). Clicking on the Apple folder downloads the latest precompiled Mac OS X version (clustalw1.82.mac-osx.tar.gz as of this writing). Don't do this even if you are on a Mac! (1). Clicking on the folder with the penguin icon takes you to an FTP directory which also contains sources for parallel architecture machines. clustalw1.83.UNIX.tar.gz appears to be the current UNIX version as of this writing. Download this compressed archive.
- 3. open a terminal session, navigate to your download directory and type the usual (remember to use the tab key for filename completion :-):
gunzip clustalw1.83.UNIX.tar.gz
tar -xvf clustalw1.83.UNIX.tar
cd clustalw1.83
- Compile
Type:
make
On my system this compiles with one warning about a redefined symbol, which does not appear to be of any consequence, and generates the executable clustalw. The makefile is not really up to standard since it has no provisions for make test or make install, so we will run our own very simple test and installation. First remove the object files (they are no longer needed after being compiled and linked into the executable). Type
make clean
Since you also do not require the C sources anymore and could download them from the server at anytime if you did, you may also type
rm *.c *.h
to clean up the directory.
- Test
The directory contains a test input by the name globin.pep. These are globin sequences in the dated PIR format. I have transformed them into Fasta format below. Copy the following sequences and save them in a file by the name globin.mfa.
>HBB_HUMAN
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLST
PDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDP
ENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
>HBB_HORSE
VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSN
PGAVMGNPKVKAHGKKVLHSFGEGVHHLDNLKGTFAALSELHCDKLHVDP
ENFRLLGNVLVVVLARHFGKDFTPELQASYQKVVAGVANALAHKYH
>HBA_HUMAN
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH
GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKL
LSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
>HBA_HORSE
VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHFDLSH
GSAQVKAHGKKVGDALTLAVGHLDDLPGALSNLSDLHAHKLRVDPVNFKL
LSHCLLSTLAVHLPNDFTPAVHASLDKFLSSVSTVLTSKYR
>MYG_PHYCA
VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLK
TEAEMKASEDLKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIP
IKYLEFISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELG
YQG
>GLB5_PETMA
PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQE
FFPKFKGLTTADQLKKSADVRWHAERIINAVNDAVASMDDTEKMSMKLRD
LSGKHAKSFQVDPQYFKVLAAVIADTVAAGDAGFEKLMSMICILLRSAY
>LGB2_LUPLU
GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGT
SEVPQNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSK
GVADAHFPVVKEAILKTIKEVVGAKWSEELNSAWTIAYDELAIVIKKEMN
DAA
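Since the sequences are pasted by hand, it can be worth confirming that the file really contains seven records before running clustalw. The following is a small Python sketch of a FASTA reader for that purpose; the function name read_fasta is hypothetical, and the parser assumes the simple layout used here (a ">" header line followed by one or more sequence lines).

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict mapping record name to sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:].strip()  # header line starts a new record
            records[name] = ""
        elif name is not None:
            records[name] += line  # sequence lines are concatenated
    return records


# Inline example using the HBA_HUMAN record from these notes.
sample = """>HBA_HUMAN
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH
GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKL
LSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
"""
for record_name, sequence in read_fasta(sample).items():
    print(record_name, len(sequence))
```

To check your own file, pass its contents instead of the inline sample, for example read_fasta(open("globin.mfa").read()), and confirm that seven names are listed.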
Now type
clustalw -options
to verify that the program runs in principle (this will print a list of the commandline options). Then run the following command:
clustalw -infile=globin.mfa
This should have created the two files globin.aln and globin.dnd with the following contents:
- globin.aln
CLUSTAL W (1.83) multiple sequence alignment


HBB_HUMAN       --------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLST
HBB_HORSE       --------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQRFFDSFGDLSN
HBA_HUMAN       ---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-
HBA_HORSE       ---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHF-DLS-
GLB5_PETMA      PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
MYG_PHYCA       ---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT
LGB2_LUPLU      --------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE
                          *:  :   :   *  .           :  .:   * :   *  :   .

HBB_HUMAN       PDAVMGNPKVKAHGKKVLGAFSDGLAHLDN-----LKGTFATLSELHCDKLHVDPENFRL
HBB_HORSE       PGAVMGNPKVKAHGKKVLHSFGEGVHHLDN-----LKGTFAALSELHCDKLHVDPENFRL
HBA_HUMAN       ----HGSAQVKGHGKKVADALTNAVAHVDD-----MPNALSALSDLHAHKLRVDPVNFKL
HBA_HORSE       ----HGSAQVKAHGKKVGDALTLAVGHLDD-----LPGALSNLSDLHAHKLRVDPVNFKL
GLB5_PETMA      ADQLKKSADVRWHAERIINAVNDAVASMDDT--EKMSMKLRDLSGKHAKSFQVDPQYFKV
MYG_PHYCA       EAEMKASEDLKKHGVTVLTALGAILKKKGH-----HEAELKPLAQSHATKHKIPIKYLEF
LGB2_LUPLU      VP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG-VADAHFPV
                      . .:: *.  :   .                  :  *.  *  .  :    : .

HBB_HUMAN       LGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH------
HBB_HORSE       LGNVLVVVLARHFGKDFTPELQASYQKVVAGVANALAHKYH------
HBA_HUMAN       LSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR------
HBA_HORSE       LSHCLLSTLAVHLPNDFTPAVHASLDKFLSSVSTVLTSKYR------
GLB5_PETMA      LAAVIADTVAAG---------DAGFEKLMSMICILLRSAY-------
MYG_PHYCA       ISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELGYQG
LGB2_LUPLU      VKEAILKTIKEVVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA---
                :   :  .:            ...       .   :
- globin.dnd
(
(
(
(
HBB_HUMAN:0.08080,
HBB_HORSE:0.08359)
:0.23578,
(
HBA_HUMAN:0.06516,
HBA_HORSE:0.05541)
:0.19444)
:0.07579,
GLB5_PETMA:0.37023)
:0.02699,
MYG_PHYCA:0.37220,
LGB2_LUPLU:0.47094);
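The globin.dnd file is a tree in Newick (nested parentheses) format, with a branch length after each colon. To compare your tree with the one shown here, the leaf names and their branch lengths can be pulled out with a regular expression. This is a rough Python sketch, not a general Newick parser; the function name leaf_branch_lengths is hypothetical, and it assumes leaf names contain no parentheses, commas, colons or semicolons.

```python
import re

# The tree from globin.dnd, with line breaks placed arbitrarily.
DND = """((((HBB_HUMAN:0.08080,HBB_HORSE:0.08359):0.23578,
(HBA_HUMAN:0.06516,HBA_HORSE:0.05541):0.19444):0.07579,
GLB5_PETMA:0.37023):0.02699,MYG_PHYCA:0.37220,LGB2_LUPLU:0.47094);"""


def leaf_branch_lengths(newick):
    """Return {leaf_name: branch_length} for every named leaf in a Newick string."""
    compact = "".join(newick.split())  # drop line breaks and spaces
    # A leaf is a name directly after '(' or ',' and followed by ':length'.
    # Internal branch lengths follow ')' and are therefore not matched.
    pairs = re.findall(r"[(,]([^(),:;]+):([0-9.]+)", compact)
    return {name: float(length) for name, length in pairs}


for leaf, length in leaf_branch_lengths(DND).items():
    print(f"{leaf}\t{length:.5f}")
```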
- Install
To be able to run clustal from the commandline, it needs to be in a directory on your PATH. This could either be done by putting the program into a directory on the path, or by modifying the PATH appropriately. My preferred way to do this is to keep the executables in /usr/local/bin. First I copy the executable to the directory in which I keep my locally installed programs (type echo $PATH if you are not sure that this directory exists and is on your path on your own machine).
sudo cp clustalw /usr/local/bin
Finally I copy the help file into the same directory, so clustalw can find it:
sudo cp clustalw_help /usr/local/bin
- That should be all
- as usual: e-mail me in case things do not work as expected.
- Notes
- (1) The Mac OS X archives contain two compiled binaries and an unintelligible readme.html. The binaries appear to run but lack any help files or documentation. This is useless pseudosupport: the only thing you save yourself is the trivial task of compiling the executables, whereas when you compile from source you at least get the complete kit and everything is nicely in its place.
GBrowse
GBrowse: viewing annotations
Viewing annotations in GBrowse is actually quite straightforward.
If you study the section on third-party annotations in the GBrowse tutorial, you will notice that you can load GFF files from a remote server. So all you actually need to do is write a CGI script that returns a GFF-formatted record. Try the following: put the following file into your /usr/local/apache/cgi-bin directory and call it annotest:
#!/usr/bin/perl -w
use strict;

print "Content-type: text/plain\n"; # MIME header
print "\n";                         # Blank line: payload begins here
# GFF columns must be separated by tabs
print "ctgA\texample\tmotif\t1\t15000\t.\t+\t.\tMotif mxy ; Note \"this is a test\"";
exit;
Note the special MIME type text/plain!
Now first execute this by typing into your browser:
http://localhost/cgi-bin/annotest
Then access the GBrowse tutorial volvox example and type the same URL into the URL field for "Add remote annotations"...
This shows the principle. Of course, to do something useful, we would like to send some parameters with the request. Type the following script and save it as /usr/local/apache/cgi-bin/annotate. Set the right ownership (sudo chown root annotate) and permissions (sudo chmod 755 annotate).
#!/usr/bin/perl -w
# reads input from CGI in the form
# http://localhost/cgi-bin/annotate?id=ctgA;start=1020;end=12250;accession=1XYZ:20..250
# returns an annotation in GFF format;
# cf. http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml

use strict;
use CGI;

my $input = CGI->new();

my $acc_ID = 'XXX';
my $acc_start = '000';
my $acc_end = '000';
my $accession=$input->param('accession');
if ($accession =~ m/^([^:]+):(\d+)\.\.(\d+)$/) {
    $acc_ID = $1;
    $acc_start = $2;
    $acc_end = $3;
}

my $seqid = $input->param('id');
my $source = "Annotbot";
my $type = "region"; # cf. SOFA ontology
                     # http://cvs.sourceforge.net/viewcvs.py/song/ontology/sofa.ontology
my $start = $input->param('start');
my $end = $input->param('end');
my $score = 0.0;
my $strand = '-';
my $phase = 0;
my @attributes;
$attributes[0]= "$type \"Test annotation\";";
$attributes[1]= "Note \"Accession No. $acc_ID from $acc_start to $acc_end\";";

print"Content-type: text/plain\n"; # MIME header
print "\n"; # Blank line: payload begins here

print "$seqid\t";
print "$source\t";
print "$type\t";
print "$start\t";
print "$end\t";
print "$score\t";
print "$strand\t";
print "$phase\t";
foreach my $att (@attributes) {
    print $att;
}
exit;
Then try this out by typing the following into your browser
http://localhost/cgi-bin/annotate?id=ctgA;start=1020;end=1250;accession=1XYZ:20..250
... and finally paste this into the "remote annotations" field of the Volvox example database. Then try changing some of the parameters.
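When changing parameters, it helps to know what line the annotate script should produce for a given query string. The following Python sketch mirrors the logic of the Perl script so that the expected output can be checked offline; it is an illustration under stated assumptions, not a replacement for the CGI script. The function names parse_query, parse_accession and gff_line are hypothetical, and the query string is assumed to use ";" as separator as in the example URL.

```python
import re


def parse_query(query):
    """Split 'key=value;key=value' into a dict (';' separator as in the example URL)."""
    return dict(pair.split("=", 1) for pair in query.split(";") if "=" in pair)


def parse_accession(accession):
    """Split 'ID:start..end' into (ID, start, end); fall back to the script's placeholders."""
    match = re.match(r"^([^:]+):(\d+)\.\.(\d+)$", accession or "")
    if match:
        return match.group(1), match.group(2), match.group(3)
    return "XXX", "000", "000"


def gff_line(seqid, start, end, accession):
    """Build the tab-separated nine-column line that the annotate script prints."""
    acc_id, acc_start, acc_end = parse_accession(accession)
    attributes = (
        'region "Test annotation";'
        f'Note "Accession No. {acc_id} from {acc_start} to {acc_end}";'
    )
    # source, type, score, strand and phase are fixed values in the Perl script;
    # Perl prints the numeric score 0.0 as "0".
    columns = [seqid, "Annotbot", "region", str(start), str(end), "0", "-", "0", attributes]
    return "\t".join(columns)


params = parse_query("id=ctgA;start=1020;end=1250;accession=1XYZ:20..250")
print(gff_line(params["id"], params["start"], params["end"], params.get("accession")))
```

A malformed accession such as 1XYZ without a range leaves the placeholder values XXX and 000 in the Note attribute, which is a quick way to spot a mistyped URL.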
GBrowse installation
(Outdated: written 2006)
Refer to http://www.gmod.org/ to ensure the installation instructions are current.
- Download
- 1. navigate to the GMOD download pages on sourceforge
- 2. Find the most recent version of the Generic-Genome-Browser (1.64 as of this writing). Download this compressed archive.
- 3. open a terminal session, navigate to your download directory and type the usual (remember to use the tab key for filename completion :-):
gunzip Generic-Genome-Browser-1.64.tar.gz
tar -xvf Generic-Genome-Browser-1.64.tar
cd Generic-Genome-Browser-1.64
- Compile
Before you continue, read through the entire page of installation information. There is information on how to install into non-default directories and how to install without requiring root access, and this may be useful for your specific situation. If you decide to go the default way, it is simply a question of typing:
perl Makefile.PL
make
sudo make install
make clean
- Test
The installation instruction page discusses a quick test run with data that is supplied with the installation. Point your browser to http://localhost/cgi-bin/gbrowse (of course your Apache server has to be running for this to work).
More instructions and a more detailed tutorial are found at http://localhost/gbrowse/tutorial/tutorial.html .
Further reading and resources