Tools for the bioinformatics lab

From "A B C"
Revision as of 10:58, 18 September 2012 by Boris (talk | contribs)

Title ...


The contents of this page have recently been imported from an older version of this Wiki. This page may contain outdated information, information that is irrelevant for this Wiki, information that needs to be structured differently, outdated syntax, and/or broken links. Use with caution!


Summary ...



EMBOSS

EMBOSS installation

Outdated (written 2006).

Download

1. Navigate to the EMBOSS download page on SourceForge and read the information on the latest download there. As of this writing, the latest major release is version 3.0.
2. Download this compressed archive.
3. Open a terminal session, navigate to your download directory and type the usual (remember to use the tab key for filename completion :-):
gunzip EMBOSS-3.0.0.tar.gz
tar -xvf EMBOSS-3.0.0.tar
rm EMBOSS-3.0.0.tar
cd EMBOSS-3.0.0
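
The gunzip, tar and rm steps above can usually be collapsed into a single tar call, since both GNU tar and BSD tar accept -z to decompress while extracting. The following is only a sketch of that idea: it builds a tiny stand-in archive in a scratch directory (demo-1.0 is a made-up name, not a real package) so that it runs without any download. Note that, unlike gunzip, this leaves the .tar.gz file in place.

```shell
# Work in a scratch directory so nothing real is touched
scratch=$(mktemp -d)
cd "$scratch"

# Build a tiny stand-in archive (demo-1.0 is a hypothetical name)
mkdir demo-1.0
echo "read me first" > demo-1.0/README
tar -czf demo-1.0.tar.gz demo-1.0
rm -r demo-1.0

# One step instead of gunzip + tar -xvf: -z decompresses while extracting
tar -xzvf demo-1.0.tar.gz
cat demo-1.0/README
```

For the real archive the equivalent would be tar -xzvf EMBOSS-3.0.0.tar.gz, followed by cd EMBOSS-3.0.0.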

Before you begin, it may be a good idea to browse through some of the files that have been unpacked to get oriented. These include:

INSTALL
KNOWN_BUGS  (this is an empty file in this release)
README

Compile

EMBOSS requires a number of system-specific options to be set, so its makefile has to be generated before it can be used. This is done by running the program configure. Type (1):

configure

Then type

make

Compilation will run for some time. Then type

sudo make install

and finally

make clean
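
The four build commands above can also be chained with && so that each step only runs if the previous one succeeded. This is a sketch of that idea, not part of the EMBOSS instructions: the run helper and the DRY_RUN variable are my own inventions, and as written the script only prints what it would do. I have written ./configure here so that it does not depend on "." being on your PATH.

```shell
DRY_RUN=1   # hypothetical switch: set to 0 to really run the commands

# Either print the command or execute it, depending on DRY_RUN
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Each step runs only if the one before it returned success
build_and_install() {
    run ./configure &&
    run make &&
    run sudo make install &&
    run make clean
}

build_log=$(build_and_install)
echo "$build_log"
```

With DRY_RUN=0 this has to be started from inside the EMBOSS-3.0.0 directory.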

Test

First see whether installation was successful in principle. Typing

ls /usr/local/share/EMBOSS/data/

should list some of the data resources that have been installed and where they are located. Now open a new shell and type

tfm needle

You should see man-like help pages for EMBOSS commands. In fact, tfm is itself an EMBOSS command: it runs a program that formats and displays help files. If the above works, it tells you two things: (i) that the EMBOSS programs have been compiled and installed, and (ii) that the installation is on your PATH.
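
If tfm does not start, it helps to find out whether the shell can locate the programs at all. The following is a minimal sketch using the shell built-in command -v; the on_path function is a name I made up for this example.

```shell
# Succeeds if the named command can be found through PATH
on_path() {
    command -v "$1" >/dev/null 2>&1
}

for cmd in tfm needle; do
    if on_path "$cmd"; then
        echo "$cmd: found at $(command -v "$cmd")"
    else
        echo "$cmd: not found on PATH"
    fi
done
```

If both are reported as not found although the installation finished, the install directory is probably missing from your PATH (type echo $PATH to check).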

Next, try a simple pairwise alignment. Create two sequence files (2):

HBA.fa
>HBA_HUMAN
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH
GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKL
LSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
HBB.fa
>HBB_HUMAN
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLST
PDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDP
ENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH

and type the following command (note: I am using the multiline command character "\" here to wrap the command, but it could also be typed all on one line):

needle -asequence HBA.fa -bsequence HBB.fa \
-gapopen 10.0 -gapextend 0.5 -datafile EBLOSUM62 \
-outfile test.ali

Then typing

cat test.ali

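should give you output like the following (the Rundate line will show your own date, and Report_file will name your own output file, here test.ali):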

########################################
# Program: needle
# Rundate: Sun Mar 02 2006 14:32:03
# Align_format: srspair
# Report_file: test2.ali
########################################

#=======================================
#
# Aligned_sequences: 2
# 1: HBA_HUMAN
# 2: HBB_HUMAN
# Matrix: EBLOSUM62
# Gap_penalty: 10.0
# Extend_penalty: 0.5
#
# Length: 148
# Identity:      63/148 (42.6%)
# Similarity:    88/148 (59.5%)
# Gaps:           9/148 ( 6.1%)
# Score: 290.5
# 
#
#=======================================

HBA_HUMAN          1 -VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DL     48
                      .|:|.:|:.|.|.||||  :..|.|.|||.|:.:.:|.|:.:|..| ||
HBB_HUMAN          1 VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDL     48

HBA_HUMAN         49 S-----HGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRV     93
                     |     .|:.:||.|||||..|.::.:||:|::....:.||:||..||.|
HBB_HUMAN         49 STPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHV     98

HBA_HUMAN         94 DPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR    141
                     ||.||:||.:.|:..||.|...||||.|.|:..|.:|.|:..|..||.
HBB_HUMAN         99 DPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH    146


#---------------------------------------
#---------------------------------------
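
If you only want the summary numbers from a report like the one above, the header lines are easy to pick apart with awk. This is a minimal sketch: needle_header_demo.txt is a made-up file holding a few header lines copied from the output above, so that the sketch runs without EMBOSS. For a real run you would point awk at test.ali instead.

```shell
# Recreate a few header lines of the report so the sketch is self-contained
cat > needle_header_demo.txt <<'EOF'
# Length: 148
# Identity:      63/148 (42.6%)
# Similarity:    88/148 (59.5%)
# Gaps:           9/148 ( 6.1%)
# Score: 290.5
EOF

# awk splits on whitespace: "#" is field 1, the label is field 2, the value is field 3
identity=$(awk '/^# Identity:/ {print $3}' needle_header_demo.txt)
score=$(awk '/^# Score:/ {print $3}' needle_header_demo.txt)
echo "identity=$identity score=$score"
```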



Notes

(1) In these notes, I assume the "." directory is on your PATH - if it is not, you may have to prepend "./" to commands to tell the operating system the executable file for the command is in your current working directory.

(2) My favorite quick and dirty way to create a text file (e.g. called file.txt) from something I can copy and paste is to type

cat > file.txt

then I simply paste the contents and close the file with <ctrl>d. Ask me if you don't understand how this works.
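
How this works: cat without a file argument copies its standard input to its standard output, the > sends that output into the file, and <ctrl>d signals end of input. In a script the same thing can be written with a here-document, which supplies the "pasted" text inline. A small sketch (heredoc_demo.txt and its contents are made up for the example):

```shell
# Everything between the two EOF markers becomes cat's standard input.
# Quoting the first 'EOF' keeps the shell from expanding $ or backticks in the text.
cat > heredoc_demo.txt <<'EOF'
>DEMO_SEQ
ACDEFGHIK
EOF

cat heredoc_demo.txt
```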


Phylip

Phylip installation

Outdated (written 2006).
Download
1. Navigate to the download section of the PHYLIP homepage.
2. Read the instructions ... depending on your platform, there may be an easier way than installing from source. Nevertheless, since this is the most general way, here I will compile from source.
3. Download the compressed archive.
4. Open a terminal session, navigate to your download directory and type the usual (remember to use the tab key for filename completion :-):
gunzip phylip-3.65.tar.gz
tar -xvf phylip-3.65.tar
cd phylip3.65/src
Compile

PHYLIP uses graphical routines in some of its programs. These have to be linked against X-terminal libraries. The Makefile should know where to find them on your system. On my Mac I need to type make -f Makefile.osx install; on your Linux boxes, it should simply work in the standard way. Type:

make install

On my system the whole package compiles with almost no nag. Bravo Joe Felsenstein, for understanding the benefit of writing plain, robust, portable code.

The executables are placed in the exe directory at the top level of the distribution. As usual, either that directory has to be added to your PATH, or (my preferred way) the executables have to be copied to a directory that is already on it, such as /usr/local/bin.

cd ..
ls -l exe
sudo cp exe/* /usr/local/bin
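
If you would rather change your PATH than copy files, the following sketch appends a directory to PATH only if it is not already listed. The path_append function is my own helper, and $HOME/phylip3.65/exe is an assumed unpack location; adjust it to wherever your exe directory actually lives.

```shell
# Append a directory to PATH unless it is already one of its entries
path_append() {
    case ":$PATH:" in
        *":$1:"*) ;;                # already listed, nothing to do
        *) PATH="$PATH:$1" ;;
    esac
    export PATH
}

path_append "$HOME/phylip3.65/exe"   # assumed location of the executables
echo "$PATH"
```

This only affects the current shell session. To make it permanent, the same lines would go into your shell startup file.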
Test

...

That should be all


Clustal

Clustal installation

Outdated (written 2006).


Download
1. Navigate to the CLUSTAL homepage at the EBI.
2. At the top of the page, there are icons for Mac and Linux installations (and Windows too). Clicking on the Apple folder downloads the latest precompiled Mac OS X version (clustalw1.82.mac-osx.tar.gz as of this writing). Don't do this even if you are on a Mac! (1) Clicking on the folder with the penguin icon takes you to an ftp directory which also contains sources for parallel architecture machines. clustalw1.83.UNIX.tar.gz appears to be the current UNIX version as of this writing. Download this compressed archive.
3. Open a terminal session, navigate to your download directory and type the usual (remember to use the tab key for filename completion :-):
gunzip clustalw1.83.UNIX.tar.gz
tar -xvf clustalw1.83.UNIX.tar
cd clustalw1.83
Compile

Type:

make

On my system this compiles with one warning about a redefined symbol, which does not appear to be of any consequence. The executable clustalw is generated. The makefile is not really up to standard since it has no provisions for make test or make install, so we will run our own very simple test and installation. First remove the object files (they are no longer needed after being compiled and linked into the executable). Type

make clean

Since you no longer need the C sources either, and could download them from the server again at any time if you did, you may also type

rm *.c *.h

to clean up the directory.


Test

The directory contains a test input by the name globin.pep. These are globin sequences in the dated PIR format. I have transformed them into Fasta format below. Copy the following sequences and save them in a file by the name globin.mfa.

>HBB_HUMAN
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLST
PDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDP
ENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH

>HBB_HORSE
VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSN
PGAVMGNPKVKAHGKKVLHSFGEGVHHLDNLKGTFAALSELHCDKLHVDP
ENFRLLGNVLVVVLARHFGKDFTPELQASYQKVVAGVANALAHKYH

>HBA_HUMAN
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH
GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKL
LSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR

>HBA_HORSE
VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHFDLSH
GSAQVKAHGKKVGDALTLAVGHLDDLPGALSNLSDLHAHKLRVDPVNFKL
LSHCLLSTLAVHLPNDFTPAVHASLDKFLSSVSTVLTSKYR

>MYG_PHYCA
VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLK
TEAEMKASEDLKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIP
IKYLEFISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELG
YQG

>GLB5_PETMA
PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQE
FFPKFKGLTTADQLKKSADVRWHAERIINAVNDAVASMDDTEKMSMKLRD
LSGKHAKSFQVDPQYFKVLAAVIADTVAAGDAGFEKLMSMICILLRSAY

>LGB2_LUPLU
GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGT
SEVPQNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSK
GVADAHFPVVKEAILKTIKEVVGAKWSEELNSAWTIAYDELAIVIKKEMN
DAA
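
A quick sanity check after copying and pasting is to count the header lines, since every FASTA record starts with a ">" line. The sketch below runs on a small made-up file (fasta_count_demo.mfa and its two short records are invented for the example); run against globin.mfa, the same grep should report 7.

```shell
# A two-record stand-in file so the sketch is self-contained
cat > fasta_count_demo.mfa <<'EOF'
>SEQ_ONE
VHLTPEEKSA

>SEQ_TWO
VLSPADKTNV
EOF

# -c counts matching lines; '^>' matches only lines that begin with ">"
nseq=$(grep -c '^>' fasta_count_demo.mfa)
echo "sequences: $nseq"
```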

Now type

clustalw -options

to verify that the program runs in principle (this will print a list of the commandline options). Then run the following command:

clustalw -infile=globin.mfa

This should have created the two files globin.aln and globin.dnd with the following contents:

globin.aln
CLUSTAL W (1.83) multiple sequence alignment


HBB_HUMAN       --------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLST
HBB_HORSE       --------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQRFFDSFGDLSN
HBA_HUMAN       ---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-
HBA_HORSE       ---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHF-DLS-
GLB5_PETMA      PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
MYG_PHYCA       ---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT
LGB2_LUPLU      --------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE
                          *:  :   :   *  .           :  .:   * :   *  :   . 

HBB_HUMAN       PDAVMGNPKVKAHGKKVLGAFSDGLAHLDN-----LKGTFATLSELHCDKLHVDPENFRL
HBB_HORSE       PGAVMGNPKVKAHGKKVLHSFGEGVHHLDN-----LKGTFAALSELHCDKLHVDPENFRL
HBA_HUMAN       ----HGSAQVKGHGKKVADALTNAVAHVDD-----MPNALSALSDLHAHKLRVDPVNFKL
HBA_HORSE       ----HGSAQVKAHGKKVGDALTLAVGHLDD-----LPGALSNLSDLHAHKLRVDPVNFKL
GLB5_PETMA      ADQLKKSADVRWHAERIINAVNDAVASMDDT--EKMSMKLRDLSGKHAKSFQVDPQYFKV
MYG_PHYCA       EAEMKASEDLKKHGVTVLTALGAILKKKGH-----HEAELKPLAQSHATKHKIPIKYLEF
LGB2_LUPLU      VP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG-VADAHFPV
                      . .:: *.  :   .                  :  *.  *  .  :    : .

HBB_HUMAN       LGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH------
HBB_HORSE       LGNVLVVVLARHFGKDFTPELQASYQKVVAGVANALAHKYH------
HBA_HUMAN       LSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR------
HBA_HORSE       LSHCLLSTLAVHLPNDFTPAVHASLDKFLSSVSTVLTSKYR------
GLB5_PETMA      LAAVIADTVAAG---------DAGFEKLMSMICILLRSAY-------
MYG_PHYCA       ISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELGYQG
LGB2_LUPLU      VKEAILKTIKEVVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA---
                :   :  .:            ...       .   :         
globin.dnd
(
(
(
(
HBB_HUMAN:0.08080,
HBB_HORSE:0.08359)
:0.23578,
(
HBA_HUMAN:0.06516,
HBA_HORSE:0.05541)
:0.19444)
:0.07579,
GLB5_PETMA:0.37023)
:0.02699,
MYG_PHYCA:0.37220,
LGB2_LUPLU:0.47094);
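
The .dnd file is the guide tree in Newick format, spread over many lines. Two small checks can be done with standard tools: counting the leaves (there should be one per input sequence) and joining the tree into a single line, which some programs find easier to read. This is a sketch that assumes the layout shown above, with one leaf per line; guide_tree_demo.dnd is a made-up file name holding a copy of that tree.

```shell
# Copy of the guide tree shown above, so the sketch runs without clustalw
cat > guide_tree_demo.dnd <<'EOF'
(
(
(
(
HBB_HUMAN:0.08080,
HBB_HORSE:0.08359)
:0.23578,
(
HBA_HUMAN:0.06516,
HBA_HORSE:0.05541)
:0.19444)
:0.07579,
GLB5_PETMA:0.37023)
:0.02699,
MYG_PHYCA:0.37220,
LGB2_LUPLU:0.47094);
EOF

# Leaf lines start with a sequence name; branch-length lines start with ":"
nleaves=$(grep -c '^[A-Z][A-Z0-9_]*:' guide_tree_demo.dnd)

# Newick ignores line breaks, so deleting them leaves the same tree
tree=$(tr -d '\n' < guide_tree_demo.dnd)

echo "leaves: $nleaves"
echo "$tree"
```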

Install

To be able to run clustal from the commandline, it needs to be in a directory on your PATH. This could either be done by putting the program into a directory on the path, or by modifying the PATH appropriately. My preferred way to do this is to keep the executables in /usr/local/bin. First I copy the executable to the directory in which I keep my locally installed programs (type echo $PATH if you are not sure that this directory exists and is on your path on your own machine).

sudo cp clustalw /usr/local/bin

Finally I copy the help file into the same directory, so clustalw can find it:

sudo cp clustalw_help /usr/local/bin
That should be all. As usual, e-mail me in case things do not work as expected.



Notes
(1) The Mac OS X archives contain two compiled binaries and an unintelligible readme.html. The binaries appear to run but lack any help files or documentation. This is useless pseudosupport: the only thing you save yourself is the trivial task of compiling the executables, whereas when you compile from source you at least get the complete kit, with everything nicely in its place.


GBrowse

GBrowse: viewing annotations

Viewing annotations in GBrowse is actually quite straightforward.

If you study the section on third-party annotations in the GBrowse tutorial, you will notice that you can load GFF files from a remote server. So all you actually need to do is write a CGI script that returns a GFF-formatted record. To try this, put the following file into your /usr/local/apache/cgi-bin directory and call it annotest:

#!/usr/bin/perl -w
use strict;

print "Content-type: text/plain\n";   # MIME header
print "\n";                           # Blank line: payload begins here
# GFF fields are tab-separated, so the tabs are written out explicitly as \t
print "ctgA\texample\tmotif\t1\t15000\t.\t+\t.\tMotif mxy ; Note \"this is a test\"\n";

exit;

Note the special MIME type text/plain!
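
One thing that easily goes wrong with a GFF record is the separators: the nine columns have to be divided by tabs, and tabs tend to turn into spaces when a line is copied from a web page. The following sketch writes the record from the script above with explicit tabs and counts the columns with awk. The file name annotest_demo.gff is made up for the example.

```shell
# printf turns each \t into a real tab character
printf 'ctgA\texample\tmotif\t1\t15000\t.\t+\t.\tMotif mxy ; Note "this is a test"\n' > annotest_demo.gff

# Split on tabs only and report the number of columns and the feature type (column 3)
nfields=$(awk -F '\t' '{print NF}' annotest_demo.gff)
ftype=$(cut -f 3 annotest_demo.gff)
echo "columns: $nfields, type: $ftype"
```

A record that reports fewer than nine columns here most likely contains spaces where tabs should be.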

Now first execute this by typing into your browser:

http://localhost/cgi-bin/annotest

Then access the GBrowse tutorial volvox example and type the same URL into the URL field for "Add remote annotations"...

This shows the principle. Of course, to do something useful, we would like to send some parameters with the request. Type the following script and save it as /usr/local/apache/cgi-bin/annotate. Set the right ownership (sudo chown root annotate) and permissions (sudo chmod 755 annotate).


#!/usr/bin/perl -w
# reads input from CGI in the form
# http://localhost/cgi-bin/annotate?id=ctgA;start=1020;end=1250;accession=1XYZ:20..250
# returns an annotation in GFF format;
# cf. http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml

use strict;
use CGI;

my $input = CGI->new();

my $acc_ID = 'XXX';
my $acc_start = '000';
my $acc_end = '000';

my $accession = $input->param('accession') // '';   # empty string if the parameter is missing (needs Perl 5.10+)
if ($accession =~ m/^([^:]+):(\d+)\.\.(\d+)$/) {
    $acc_ID = $1;
    $acc_start = $2;
    $acc_end = $3;
}

my $seqid = $input->param('id');
my $source = "Annotbot";
my $type = "region";        # cf. SOFA ontology
                            # http://cvs.sourceforge.net/viewcvs.py/song/ontology/sofa.ontology
my $start = $input->param('start');
my $end = $input->param('end');
my $score = 0.0;
my $strand = '-';
my $phase = 0;
my @attributes;

$attributes[0]= "$type \"Test annotation\";";
$attributes[1]= "Note \"Accession No. $acc_ID from $acc_start to $acc_end\";";

print"Content-type: text/plain\n";   # MIME header
print "\n";                         # Blank line: payload begins here

print "$seqid\t";
print "$source\t";
print "$type\t";
print "$start\t";
print "$end\t";
print "$score\t";
print "$strand\t";
print "$phase\t";
foreach my $att (@attributes) {
    print $att;
}
print "\n";                         # terminate the GFF record

exit;

Then try this out by typing the following into your browser

http://localhost/cgi-bin/annotate?id=ctgA;start=1020;end=1250;accession=1XYZ:20..250

... and finally paste this into the "remote annotations" field of the Volvox example database. Then try changing some of the parameters.
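
To try out different parameters without retyping the whole URL each time, the query string can be assembled by a small shell function. This is only a convenience sketch: annotate_url is a name I made up, and it does no escaping, so it assumes the values contain no spaces or other special characters.

```shell
# Build the request URL from id, start, end and accession (in that order)
annotate_url() {
    printf 'http://localhost/cgi-bin/annotate?id=%s;start=%s;end=%s;accession=%s\n' \
        "$1" "$2" "$3" "$4"
}

url=$(annotate_url ctgA 1020 1250 1XYZ:20..250)
echo "$url"
```

If you pass such a URL to a command-line program rather than pasting it into the browser, keep it in quotes, because the shell otherwise treats each ";" as a command separator.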


GBrowse installation

(Outdated: written 2006)

Refer to http://www.gmod.org/ to ensure the installation instructions are current.

Download
1. Navigate to the GMOD download pages on SourceForge.
2. Find the most recent version of the Generic-Genome-Browser (1.64 as of this writing). Download this compressed archive.
3. Open a terminal session, navigate to your download directory and type the usual (remember to use the tab key for filename completion :-):
gunzip Generic-Genome-Browser-1.64.tar.gz
tar -xvf Generic-Genome-Browser-1.64.tar
cd Generic-Genome-Browser-1.64
Compile

Before you continue, read through the entire page of installation information. There is information on how to install into non-default directories and how to install without requiring root access, and this may be useful for your specific situation. If you decide to go the default way, it is simply a question of typing:

perl Makefile.PL
make
sudo make install
make clean
Test

The installation instruction page discusses a quick test run with data that is supplied with the installation. Point your browser to http://localhost/cgi-bin/gbrowse (of course your Apache server has to be running for this to work).

More instructions and a more detailed tutorial are found at http://localhost/gbrowse/tutorial/tutorial.html .


   

Further reading and resources