Unix

From "A B C"
Revision as of 01:48, 16 September 2012 by Boris (talk | contribs)


The contents of this page have recently been imported from an older version of this Wiki. This page may contain outdated information, information that is irrelevant for this Wiki, information that needs to be differently structured, outdated syntax, and/or broken links. Use with caution!



Related Pages


 

Introductory reading

Stein (2007) Unix survival guide. Curr Protoc Bioinformatics Appendix 1:Appendix 1C. (pmid: 18428775)

For a mixture of historical and practical reasons, much of the bioinformatics software discussed in this series runs on Linux, Mac OSX, Solaris, or one of the many other Unix variants. This appendix provides the novice with easy-to-understand information needed to survive in the Unix environment.


 

Principles

[...]

Unix is case sensitive!
  • Unix commands are case sensitive.
  • Usernames are case sensitive.
  • Passwords are case sensitive.
  • Filenames are case sensitive.

Abort, stop and end

ctrl c
INTERRUPT running program
ctrl d
END text input (EOF)
ctrl z
SUSPEND (stop) a running program. Use jobs to list all suspended jobs (programs); use fg to bring the program back into the foreground and keep it running, or bg to let it continue in the background.
other ...
Other possible keys to stop a running program depend on the program - they may include q, quit, stop, exit, bye, ZZ, :q! ... see: PLOKTA.

Operators

command &
run command in the background. This is useful when the command runs for a long time, or opens a second, interactive window, and running it in the foreground would block your terminal.
acroread  manuscript.pdf &
command > filename
redirect output of command into filename. Normally the output would be sent to STDOUT; the > operator redirects it into a file. The file is created if it does not yet exist and overwritten if it does.
cat myfile.txt > new.txt
command >> filename
append output of command to filename. This works like the > operator, except that an existing file is not overwritten, but the data is appended to it.
command_1 | command_2
redirect output of command_1 to be used as input for command_2.
command < filename
redirect input of command to be read from filename. Normally the input would be read from STDIN; the < operator redirects it to read from a file. However, many commands accept an input filename as a parameter. Therefore
cat test.txt

has the same effect as
cat < test.txt

but for a very different reason. In the first case, test.txt is a filename parameter. The cat program opens the file that the parameter names, and reads from it. In the second case, cat reads from STDIN, but STDIN has been redirected to refer to test.txt instead. Make sure you understand the subtle difference. Once you do, you understand the concept of redirection.
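
The operators described above can be tried out safely on throwaway files. The following is a minimal sketch, assuming a POSIX-like shell; the scratch directory and the file name demo.txt are made up for illustration.

```shell
# Work in a scratch directory so that no real files are overwritten.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

printf 'gamma\nalpha\n' > demo.txt   # ">" creates demo.txt (or overwrites it)
printf 'beta\n' >> demo.txt          # ">>" appends a third line to it

sorted=$(sort < demo.txt)            # "<" makes sort read STDIN from the file
count=$(cat demo.txt | wc -l)        # "|" connects cat's STDOUT to wc's STDIN

sleep 1 &                            # "&" runs the command in the background
wait                                 # block until the background job is done

echo "$sorted"
echo "lines: $count"
```

Running the second printf line again would add a fourth line, whereas running the first one again would reset the file to two lines.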

Login

passwd
change your password. The command prompts for the current password and then for the new one.
ssh user@hostname
establish a secure terminal connection to the machine called hostname and log in as user.
ssh root@biochemistry.utoronto.ca

 

Commands

This is a survival guide containing the most frequently used Unix commands, the ones you must know.

[...]

 

The Pipe

The pipeline operator "|" is a powerful tool to string together elementary commands.

In this section, we will cover a number of Unix commands for text manipulation and practice their use in a pipelined command sequence.

Read about the following Unix commands:

 cat
 wc
 grep
 sort
 cut
 tr

We will pipeline these commands together, to extract the sequence from a PDB file.

Using your browser or wget, download and save the file 1JKZ.pdb from the PDB.

Remember that many Unix commands read their input from a file, but if none is specified they will read from STDIN by default. The redirection operators "<" and ">" or ">>" can be used to associate STDIN and STDOUT, respectively, with other files, and the "|" pipeline character can be used to connect STDOUT of one command with STDIN of another.


cat

Try the following two commands. Note that they do exactly the same thing. Make sure you understand why this is the case.

cat 1JKZ.pdb

(a) Command reads from filename that was given as argument.

cat < 1JKZ.pdb

(b) No filename was specified, thus command reads from STDIN, but STDIN is redirected to read from the file.
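
With cat the two forms print identical output, so the difference is invisible. A command that reports the name of its input makes it visible. The sketch below uses wc -l for this purpose; the file sample.txt is made up for illustration.

```shell
# Build a three-line sample file in a scratch directory.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
printf 'one\ntwo\nthree\n' > sample.txt

as_argument=$(wc -l sample.txt)   # wc opens the file itself, so it knows the name
as_stdin=$(wc -l < sample.txt)    # wc sees only an anonymous STDIN stream

echo "argument: $as_argument"     # the count, followed by the file name
echo "STDIN:    $as_stdin"        # the count only
```

In the second case the shell opens the file, and wc never learns its name.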


wc

There is a lot of output on the terminal because all 13000-something records, for all 20 models in this multiple-model NMR structure file, get sent to the terminal. How many exactly? Let's use the command wc (word count) to find out. Think about what the following command will do, then type it and see if you were right.

 cat 1JKZ.pdb | wc

The three numbers that wc reports are lines, words and characters. Lines and characters may be obvious, but what are words? For wc, they are strings of characters that are delimited by whitespace: blank, tab, return. So the first number is the number of lines in the file.

To be exact, the third number is the number of bytes.
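
To see the three counts on input small enough to check by hand, wc can be asked for each number separately with the -l, -w and -c flags. This is a small sketch; the two input lines are made up.

```shell
# Two short lines: "ATOM 1 N" (8 characters) and "ATOM 2 CA" (9 characters),
# each followed by a newline character.
lines=$(printf 'ATOM 1 N\nATOM 2 CA\n' | wc -l)   # newline characters
words=$(printf 'ATOM 1 N\nATOM 2 CA\n' | wc -w)   # whitespace-delimited strings
bytes=$(printf 'ATOM 1 N\nATOM 2 CA\n' | wc -c)   # bytes, newlines included

echo "lines: $lines, words: $words, bytes: $bytes"
```

The result is 2 lines, 6 words and 19 bytes: 8 + 9 visible characters plus the two newlines.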


grep

Now let's discuss how to limit the output, ultimately to only the sequence of the protein. We will use the following strategy.

  • Limit output to only coordinate records (using grep )
  • then from these, print only CA records (again using grep )
  • then print only the first of the 20-fold repeated set of CA records (using sort )
  • then extract the three-letter amino acid code from each line (using cut )
  • then remove the newline characters (using tr )

grep is an extremely versatile utility to filter information from the contents of a file. In principle it works like cat, but it takes an expression as its first argument and then prints only lines that contain a match to that expression. Let's use this to count the number of coordinate records.

cat 1JKZ.pdb | grep 'ATOM  ' | wc


We include the two spaces of the six-character record type identifier, since the string 'ATOM' might also occur in the header section.

Try this without the wc command - or better (for brevity) try this with only the first 150 lines of the coordinate file:

head -150 1JKZ.pdb | grep 'ATOM  '

The file contains a large header section, and in fact the first 150 lines cover only the first 20 coordinate records.
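
The effect of the two trailing spaces can be checked on a miniature file. The sketch below uses a made-up three-line file (not a real PDB entry) and the -c flag of grep, which prints the number of matching lines instead of the lines themselves. The last command additionally anchors the pattern to the start of the line with "^", a regular-expression feature that is an addition here and not part of the strategy described in the text.

```shell
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# A made-up miniature file; this is not a real PDB entry.
cat > mini.pdb << 'EOF'
REMARK   1 MADE-UP HEADER LINE THAT MENTIONS ATOMS
ATOM      1  N   LYS A   1     -11.000  -5.000  12.000  1.00  0.00           N
ATOM      2  CA  LYS A   1     -10.853  -4.810  13.198  1.00  0.00           C
EOF

loose=$(grep -c 'ATOM' mini.pdb)         # also counts the REMARK line
strict=$(grep -c 'ATOM  ' mini.pdb)      # counts the coordinate records only
anchored=$(grep -c '^ATOM  ' mini.pdb)   # "^" pins the match to the line start

echo "loose: $loose, strict: $strict, anchored: $anchored"
```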

Consider the structure of a typical PDB coordinate record:

ATOM      2  CA  LYS A   1     -10.853  -4.810  13.198  1.00  0.00           C  

Obviously, when we want to output only records that contain CA, we run into the problem that this string might also occur in other, non-ATOM records. But searching for the entire string explicitly won't do, as you'll immediately realize when you type:

 cat 1JKZ.pdb | grep 'ATOM      2  CA'

If you did not get 20 records of the first amino acid, you probably did not include the right number of spaces: here they are again, as periods, for clarity 'ATOM......2..CA'. But even if you did, we actually wanted all CA records, not just those of the first amino acid.

The problem is that the atom numbers - "2" in this case, are different for every CA atom and the expression is specific for only one of them. The most powerful feature of grep is its ability to use "regular expressions" to specify complex search patterns. If you have ever searched anything with a "*" wildcard character, you have used a regular expression. We will discuss regular expressions in detail at a later time - for now, we will use a simple alternative strategy - we run grep a second time, on the output of the first command.


As an aside: using "." to denote "any single character", a one-step command would be

cat 1JKZ.pdb | grep 'ATOM   ....  CA'

i.e. four single-character wildcards take the place where digits could appear in the coordinate record. The simpler two-step strategy pipes the output of the first grep into a second grep:


cat 1JKZ.pdb | grep 'ATOM  ' | grep '  CA  '

Again, use wc on the output and confirm that 920 records are returned - one for each of 46 amino acids in twenty models.

To restrict the output to coordinates only from the first model, we could simply truncate the output to the first 46 lines, using head ....

cat 1JKZ.pdb | grep 'ATOM  ' | grep '  CA  ' | head -46

... but this requires us to know ahead of time how many residues are in a single model of our file.
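
Both ways of selecting CA records, and the truncation with head, can be compared on a miniature file. The following is a sketch under stated assumptions: mini.pdb is a made-up file with two models of two residues each, so head has to keep 2 lines rather than 46. head -n 2 is the POSIX spelling of head -2.

```shell
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# A made-up miniature file with two models of two residues each.
cat > mini.pdb << 'EOF'
MODEL        1
ATOM      1  N   LYS A   1     -11.000  -5.000  12.000  1.00  0.00           N
ATOM      2  CA  LYS A   1     -10.853  -4.810  13.198  1.00  0.00           C
ATOM      3  CA  THR A   2      -9.000  -2.000  14.000  1.00  0.00           C
ENDMDL
MODEL        2
ATOM      1  N   LYS A   1     -11.500  -5.500  12.500  1.00  0.00           N
ATOM      2  CA  LYS A   1     -10.900  -4.700  13.100  1.00  0.00           C
ATOM      3  CA  THR A   2      -9.100  -2.100  14.100  1.00  0.00           C
ENDMDL
EOF

# Two greps in a row versus one pattern with "." wildcards.
two_step=$(grep 'ATOM  ' mini.pdb | grep '  CA  ')
one_step=$(grep 'ATOM   ....  CA' mini.pdb)

# Keep only as many CA records as one model has residues (here: 2).
first_model=$(grep 'ATOM  ' mini.pdb | grep '  CA  ' | head -n 2)

echo "$two_step"
echo "---"
echo "$first_model"
```

On this file both selections return the same four CA records, and head keeps the two that belong to the first model.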


sort

To make our command more general, we will use the Unix sort command (in a somewhat non-obvious way) to achieve the desired result. sort accepts data from STDIN and writes results to STDOUT. Its behaviour can be controlled by command line flags. Try the following:

cat 1JKZ.pdb | grep 'ATOM  ' | grep '  CA  ' | sort

You should see all CA records neatly grouped. sort works from left to right, sorting the records in alphabetic order. Thus the records are sorted according to the atom number, and since the atom numbers for CA records are the same in all models, all CAs for a residue are sorted one after another. An immensely useful feature of sort is that you can sort on specific "keys" of a record - individual fields or field ranges, where each field is delimited by the transition from a non-whitespace to a whitespace character. Thus we can sort the output globally on e.g. the Y coordinate alone, in the following way:

cat 1JKZ.pdb | grep 'ATOM  ' | grep '  CA  ' | sort -k 8,8

But note that the output starts sorted on increasing positive Y coordinates, then decreases on negative Y coordinates. This is because by default sort sorts alphabetically and "-" is larger than " ". To sort numerically, you have to use the -n flag, as in

cat 1JKZ.pdb | grep 'ATOM  ' | grep '  CA  ' | sort -nk 8,8
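
The difference between the two sort orders is easier to follow on three records than on 920. The sketch below uses made-up CA records and prints only the Y coordinate (field 8) after sorting. tr -s ' ' squeezes runs of blanks into one so that cut -d ' ' can split the line into fields, and LC_ALL=C is an assumption added to make the sort order independent of the local language settings.

```shell
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# Three made-up CA records; field 8 holds the Y coordinate.
cat > ca.txt << 'EOF'
ATOM      2  CA  LYS A   1     -10.853  -4.810  13.198  1.00  0.00           C
ATOM     24  CA  THR A   2      -9.000   3.250  14.000  1.00  0.00           C
ATOM     38  CA  CYS A   3      -7.500 -12.100  15.000  1.00  0.00           C
EOF

# Sort on field 8 alphabetically, then numerically, and show only that field.
alphabetic=$(LC_ALL=C sort -k 8,8 ca.txt | tr -s ' ' | cut -d ' ' -f 8)
numeric=$(LC_ALL=C sort -nk 8,8 ca.txt | tr -s ' ' | cut -d ' ' -f 8)

echo "alphabetic:"; echo "$alphabetic"
echo "numeric:";    echo "$numeric"
```

Only the numeric sort lists the values from the most negative to the most positive Y coordinate.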

This concept of specifying defined field ranges for sort to work on has many additional uses. For example, sort can be instructed to output only unique records with the -u flag, i.e. unique not over the whole record, but in the specified key. Thus we can specify that we want only one each from the batch of 20 versions of every coordinate record, in the following way:

cat 1JKZ.pdb | grep 'ATOM  ' | grep '  CA  ' | sort -unk 2,2

Here the field restriction -k 2,2 specifies the second field - atom numbers - to be sorted uniquely; to be on the safe side, we are using the -n flag as well, to ensure numeric sort order for the field. A quick check of the residue number column confirms that we are now getting a single, unique set of CA atoms. Are these the CA atoms of the first model? Not necessarily. Instructing sort to sort on the second key does not guarantee that records with equal keys are kept in input order, and which record of each set the -u flag retains depends on the sort implementation. In our case the records were additionally sorted on the following fields, which caused the command to return the CA atoms with the lowest X-coordinate from each set.
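
The unique sort can be tried on a handful of records. This is a minimal sketch with made-up CA records for two residues in two models. It checks only how many records survive and which residues they belong to, because the model from which each surviving record is taken may differ between sort implementations.

```shell
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# Made-up CA records for two residues, each present in two models.
cat > ca.txt << 'EOF'
ATOM      2  CA  LYS A   1     -10.853  -4.810  13.198  1.00  0.00           C
ATOM     24  CA  THR A   2      -9.000   3.250  14.000  1.00  0.00           C
ATOM      2  CA  LYS A   1     -10.900  -4.700  13.100  1.00  0.00           C
ATOM     24  CA  THR A   2      -9.100   3.150  14.100  1.00  0.00           C
EOF

# Keep one record per atom number (field 2), in numeric order.
unique=$(LC_ALL=C sort -unk 2,2 ca.txt)
echo "$unique"

# Residue names of the surviving records, in order.
residues=$(echo "$unique" | cut -c 18-20 | tr '\n' ' ')
echo "$residues"
```

Since every model lists the same residues under the same atom numbers, the residue names come out the same whichever copy of a record is kept.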


cut

But we were planning to extract only the sequence of the protein. cut extracts specific ranges of characters from a line. With the -c flag, characters can be specified as individual positions (e.g. 1,2,4) or ranges (e.g. 1-4). PDB format stores the three-letter amino acid type in columns 18 to 20.

cat 1JKZ.pdb | grep 'ATOM  ' | grep '  CA  ' | sort -unk 2,2 | cut -c 18-20

Is this input getting too long for your command line? You can continue a command over several lines by typing "\" as the last character of each line that is to be continued.
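
Both points can be checked on the single coordinate record shown earlier. In the sketch below the variable names are made up. The second command is the same as the first, but split over two lines with a trailing "\".

```shell
record='ATOM      2  CA  LYS A   1     -10.853  -4.810  13.198  1.00  0.00           C'

# Characters 18 to 20 of the record hold the residue name.
residue=$(echo "$record" | cut -c 18-20)

# The same command, continued on a second line with a trailing "\".
residue_again=$(echo "$record" \
  | cut -c 18-20)

echo "$residue"
echo "$residue_again"
```

Both commands print LYS.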


tr

Finally, in order to put all amino acids on the same line, we will use tr. This command takes two sets of characters as arguments; it then replaces every occurrence of a character from the first set with the corresponding character from the second set. Thus we can use it to replace each linefeed '\n' with a blank ' ', which removes the line breaks.

cat 1JKZ.pdb | grep 'ATOM  ' | grep '  CA  ' | sort -unk 2,2 | cut -c 18-20 | tr '\n' ' '

The final output is

LYS THR CYS GLU HIS LEU ALA ASP THR TYR ARG GLY VAL CYS PHE THR ASN ALA SER CYS ASP ASP HIS CYS LYS ASN LYS ALA HIS LEU ILE SER GLY THR CYS HIS ASN TRP LYS CYS PHE CYS THR GLN ASN CYS 

If you are picky about wanting a linefeed after the last character, so it won't be directly followed by the prompt, you could add a second command after the first one (multiple commands can be put on the command line if they are separated by ";"), namely echo "".

cat 1JKZ.pdb | grep 'ATOM  ' | grep '  CA  ' | sort -unk 2,2 | cut -c 18-20 | tr '\n' ' '; echo ""
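
The behaviour of tr can be verified without the PDB file. In this small sketch three residue names stand in for the output of cut, and the second command illustrates that tr maps characters one by one instead of replacing whole strings. The inputs are made up.

```shell
# Three residue names, one per line, stand in for the output of cut.
sequence=$(printf 'LYS\nTHR\nCYS\n' | tr '\n' ' ')
echo "[$sequence]"    # the brackets make the trailing blank visible

# tr works character by character: every a becomes x, every b becomes y.
swapped=$(echo 'abc cab' | tr 'ab' 'xy')
echo "$swapped"
```

The first command yields the names on one line with a trailing blank, which is why the final output above ends in a blank and not in a linefeed. The second yields xyc cxy.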


   

Further reading and resources