RPR-RegEx

Regular Expressions (regex) with R

Keywords: Regular expressions

1 Abstract
2 This unit ...
- 2.1 Prerequisites
- 2.2 Objectives
- 2.3 Outcomes
- 2.4 Deliverables
- 2.5 Evaluation
3 Contents
4 Regular Expressions
5 Regular Expressions in Perl
6 Syntax
- 6.1 Specifying symbols
- 6.2 Character Sets
- 6.3 The complement
- 6.4 Specifying quantity
- 6.5 Specifying position (anchoring)
- 6.6 Operators that use regular expressions
7 Behaviour
- 7.1 Returning values
  - 7.1.1 Capturing matches directly
- 7.2 Modifiers
- 7.3 Greed
8 Regular Expressions in PHP
9 Regular expressions in Python
- 9.1 Example
10 Regular Expressions in R
11 Regular Expressions in Javascript
12 Regular Expressions in POSIX (Unix, the shell)
13 Discussion points
14 Exercises
- 14.1 Counting lines
  - 14.1.1 ...CA atoms only
- 14.2 eMail addresses
- 14.3 Multiple sequence alignment
- 14.4 Screenscraping
- 14.5 Labeling
15 Appendix I: Metacharacters and their meaning
16 Appendix II: Character classes and their meaning
17 Appendix III: Anchor codes and their meaning
18 Appendix IV: Modifiers and their meaning
- 18.1 A Brief First Encounter of Regular Expressions
19 Further reading, links and resources
20 Notes
21 Self-evaluation

Sorry!

This page is only a stub; it is here as a placeholder to establish the logical framework of the site but there is no significant content as yet. Do not work with this material until it is updated to "live" status.

Abstract

...

This unit ...

Prerequisites

You need to complete the following units before beginning this one:

RPR-Introduction

Objectives

...

Outcomes

...

Deliverables

Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
Journal: Document your progress in your course journal.
Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.

Evaluation

Evaluation: NA

This unit is not evaluated for course marks.

Regular Expressions

A Regular Expression is a specification of a pattern of characters. The typical use of a regular expression is to find occurrences of the pattern in a string. Regular expressions have a flexible syntax that allows them to handle a range of tasks - from trivial substring matches to complex nested motifs. The syntax of regular expressions is a programming language in its own right, and is a powerful way of concisely and uniquely defining a pattern.

Regular expressions are examples of deterministic pattern matching - they either match a particular expression or not. This is in contrast to probabilistic pattern matching in which a pattern is more or less similar to an example.

Theory
- According to the Chomsky hierarchy regular expressions are a Type-3 (regular) grammar, thus their use forms a regular language. Therefore, like all Type-3 grammatical expressions they can be decided by a finite-state machine, i.e. a "machine" that is defined by possible states, and triggering conditions that control transitions between states. Think of such automata as a (possibly elaborate) if ... else construct. The "regex" processor translates the search pattern into such an automaton, which is then applied to the search domain - the string in which the occurrence of the pattern is to be sought.
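
To make the finite-state machine idea concrete, here is a minimal sketch in Python (used only for illustration; the function name `matches_ab_star_c` and the state names are made up for this example). It hand-codes an automaton for the pattern ab*c applied to a whole string, and checks that it accepts the same strings as Python's re module.

```python
import re

def matches_ab_star_c(text):
    """Hand-written finite-state machine for the whole-string pattern ab*c."""
    state = "start"
    for ch in text:
        if state == "start" and ch == "a":
            state = "seen_a"      # read the leading "a"
        elif state == "seen_a" and ch == "b":
            state = "seen_a"      # loop: any number of "b"
        elif state == "seen_a" and ch == "c":
            state = "accept"      # read the final "c"
        else:
            return False          # no transition defined: reject
    return state == "accept"

# Python's re module accepts exactly the same strings for this pattern
for s in ["ac", "abbbc", "abca", "bc", ""]:
    assert matches_ab_star_c(s) == bool(re.fullmatch(r"ab*c", s))
```

Each if/elif branch is one transition of the automaton; a string is accepted only if the machine ends in the "accept" state.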

What are they good for?
- Most pattern matching tasks: screen scraping, data reformatting, simple parsing of log files, searching through large tables, and so on. They ought to be part of your everyday toolkit.

When should they not be used; what are alternatives for these cases?
- Since they are Type-3 grammars, they fail when asked to parse a more complex grammar, such as one with arbitrarily nested structures. In particular, you can't reliably parse HTML with regular expressions; use a real HTML or XML parser instead. This particular topic has been discussed at length in many threads on stackoverflow (e.g. see here), and see here for a discussion of when regular expressions should not be used.
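
A small Python sketch of the nesting problem (for illustration only; the HTML snippet and the class name `DepthTracker` are invented for this example). A non-greedy regular expression pairs the first opening tag with the first closing tag, whereas the parser from Python's standard library keeps track of nesting.

```python
import re
from html.parser import HTMLParser

html = "<div>outer <div>inner</div> tail</div>"

# The regex stops at the FIRST closing tag and cuts the nested element apart
naive = re.search(r"<div>(.*?)</div>", html).group(1)

class DepthTracker(HTMLParser):
    """Records how deeply tags are nested, which requires counting."""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        self.depth -= 1

tracker = DepthTracker()
tracker.feed(html)
tracker.close()

print(naive)              # outer <div>inner
print(tracker.max_depth)  # 2
```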

Regular Expressions in Perl

Many programming languages support their own style of regular expressions. The one we are discussing here is the one that Perl uses, although most of its syntax is the same as that of Unix or PHP regular expressions. The support of regular expressions in Perl is one of its main strengths. Regular expressions in Perl can be used

to match patterns in strings for use in if() or while() conditions, or to retrieve specific instances of patterns,
to substitute occurrences of patterns with strings,
to translate all occurrences of a pattern into different characters, or
to split strings into substrings that are delimited by the occurrence of a pattern.

Accordingly, a basic knowledge of regular expressions is needed to read and write code, especially code that parses text.

Syntax

Regular expressions are formed of characters and/or numbers, enclosed in special quotation marks.

/a/

is a regular expression. The lowercase "a" is the expression, the "/" are delimiters that bound the expression. This expression specifies the single character a exactly.

Specifying symbols

The power of regular expressions lies in their flexible syntax, which allows you to specify character ranges, classes of characters, unspecified characters and much more. This can sometimes be confusing, because the symbols that specify ranges, options, wildcards and the like are of course themselves characters. Characters that specify information about other characters are called metacharacters; they include ".", "*", "[" and "]" and more. The opposite is also possible: some plain characters can be turned into metacharacters to symbolize character classes.

In Perl the "\" (Perl's escape character) distinguishes whether a character is to be taken literally or interpreted as a metacharacter. Note that some symbols have to be escaped to be read literally, while some letters have to be escaped to be read as metacharacters.

Letters whose special meaning as a metacharacter is turned on with the escape character:

Character	Means
`w`	the letter "w"
`\w`	a "word" character, i.e. one of A-Z, a-z, 0-9 and "_"
`s`	the letter "s"
`\s`	a "space" character, i.e. one of " ", tab or newline
`b`	the letter "b"
`\b`	a word boundary

Metacharacters whose special meaning is turned off with the escape character:

Character	Means
`+`	One or more repetitions of the preceding expression
`\+`	the character "+"
`\`	the escape character
`\\`	the character "\"
`.`	any single character except the newline (\n)
`\.`	a period

Note that these examples are not exhaustive. A (nearly) complete table of symbols and meanings is given in the appendix.
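
These escapes can be tried out interactively. The following is a minimal sketch in Python (used only for illustration, since its re module follows the same conventions for these particular symbols; the sample string is made up).

```python
import re

text = "w+s = 3.5"

assert re.findall(r"w", text) == ["w"]                  # the plain letter "w"
assert re.findall(r"\w", text) == ["w", "s", "3", "5"]  # "word" characters
assert re.findall(r"\+", text) == ["+"]                 # a literal plus sign
assert re.findall(r"\.", text) == ["."]                 # a literal period
assert len(re.findall(r".", text)) == len(text)         # "." matches every character here
assert re.findall(r"\bs\b", text) == ["s"]              # "s" bounded by word boundaries
```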

Character Sets

Square brackets specify when more than one specific character can match at a position.

Expression	Means
`[acgtACGT]`	Any non-degenerate nucleotide

For example: /[AGR]AATT[CTY]/ matches all occurrences of an ApoI restriction site, either specified explicitly, or through the nucleotide ambiguity codes R (purines) or Y (pyrimidines).

Within character sets, hyphens can specify character ranges.

Expression	Means
`[a-z]`	lowercase letters
`[0-9]`	digits
`[0-9+*\/=^-]`	digits and arithmetic symbols

Within character sets, some metacharacters that otherwise have special meanings do not need to be escaped. In the example above, only "/" is escaped, because it would otherwise terminate the regular expression. Other characters that need to be escaped include "$" and "@", since the Perl compiler would otherwise try to interpolate them as variables.
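
As a quick check of the character sets above, here is a sketch in Python (for illustration only; the DNA string is invented). Note that Python patterns have no "/" delimiters, so the "/" inside the set does not need an escape there.

```python
import re

# ApoI site: explicit bases or the ambiguity codes R / Y at the ends
apoI = re.compile(r"[AGR]AATT[CTY]")
dna = "CCGAATTCGGRAATTYTTCAATTC"
sites = apoI.findall(dna)
print(sites)      # ['GAATTC', 'RAATTY']  ("CAATTC" does not match: C is not in [AGR])

# Digits and arithmetic symbols; "^" and "-" are literal in these positions
symbols = re.findall(r"[0-9+*/=^-]", "x = 2^4 - 1")
print(symbols)    # ['=', '2', '^', '4', '-', '1']
```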

The complement

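The caret character "^", placed as the first character inside the square brackets, denotes the complement of a character set, i.e. every character that is not in the set. Anywhere else inside the brackets, as in the arithmetic example above, it is a literal caret.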

Expression	Means
`[^9]`	Everything but the digit "9"
`[^ACGT]`	Not a nucleotide code letter

Note that outside of character sets, the "^" character denotes "beginning of the string". This can be confusing.

For character classes, the class in upper case denotes the complement. This can also be confusing!

Character	Means
`\W`	not a word character
`\S`	not a space character
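
The two meanings of "^" and the upper-case classes can be compared side by side. This is a minimal sketch in Python (for illustration only; the sample strings are made up).

```python
import re

seq = "ACGTNNACRT"
ambiguous = re.findall(r"[^ACGT]", seq)            # complement of a set
print(ambiguous)                                   # ['N', 'N', 'R']

cleaned = re.sub(r"\W", "", "p53, BRCA1; TP-63")   # delete all non-word characters
print(cleaned)                                     # p53BRCA1TP63

# First "^": anchor at the start of the string. Second "^": complement of the set.
is_data = re.search(r"^[^#]", "data line") is not None
is_comment = re.search(r"^[^#]", "# comment") is None
```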

Specifying quantity

Special characters in regular expressions control how often a pattern must be present in order to match:

Expression	What it means	Example (meaning)
`?`	match zero or one times	"? (there may or may not be a quote mark)
`+`	match one or more	[A-Z]+ (there's at least one uppercase letter)
`*`	match zero or more times	.* (there may be some characters)
`{min,max}`	match between min and max times (assumes 0 and infinity respectively if not specified)	[acgt]{20,200} (a stretch of between 20 and 200 non-ambiguous bases)

For example: /AAUAAA[ACGU]{10,30}$/ defines a polyadenylation site: an AAUAAA motif, followed by 10 to 30 of any nucleotide before the end of the RNA.
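
The effect of the {min,max} quantifier in the polyadenylation example can be tested with constructed sequences. This is a sketch in Python (for illustration only; the three test RNAs are invented and only differ in the number of bases after the motif).

```python
import re

polyA = re.compile(r"AAUAAA[ACGU]{10,30}$")

ok        = "GGCC" + "AAUAAA" + "ACGU" * 5   # 20 bases after the motif
too_short = "GGCC" + "AAUAAA" + "ACG"        # 3 bases after the motif
too_long  = "GGCC" + "AAUAAA" + "C" * 31     # 31 bases after the motif

print(bool(polyA.search(ok)))         # True
print(bool(polyA.search(too_short)))  # False
print(bool(polyA.search(too_long)))   # False
```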

Specifying position (anchoring)

If a pattern must be matched at a particular location, special terms denote string anchors.

Anchoring Term	Meaning
`^`	Start of a line or string
`$`	End of a line or string
`\A`	Start of the string
`\Z`	End of the string (or just before a final newline)
`\G`	Last global match end
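
Anchors are easiest to understand on a string with more than one line. The following sketch uses Python (for illustration only), and some details differ from Perl: in Python "^" and "$" match at every line only with the re.MULTILINE flag, \Z matches only at the very end of the string, and there is no \G.

```python
import re

text = "first line\nsecond line\n"

start_default   = re.findall(r"^\w+", text)                # start of the string only
start_multiline = re.findall(r"^\w+", text, re.MULTILINE)  # start of every line
end_default     = re.findall(r"\w+$", text)                # "$" also matches before a final newline
end_multiline   = re.findall(r"\w+$", text, re.MULTILINE)  # end of every line

print(start_default)    # ['first']
print(start_multiline)  # ['first', 'second']
print(end_default)      # ['line']
print(end_multiline)    # ['line', 'line']

# \A is never affected by re.MULTILINE
assert re.search(r"\Asecond", text, re.MULTILINE) is None
# Python's \Z is strict: the string ends with "\n", not with "line"
assert re.search(r"line\Z", text) is None
assert re.search(r"line\n\Z", text) is not None
```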

Operators that use regular expressions

Of course specifying a regular expression does not yet do anything with it. Below are the most important Perl operators that use regular expressions. Write the small Perl program samples that are provided below and test how the operators and regular expressions work.

Matching

Matching is the default behaviour of Perl regular expressions. The matching operator is

m

and the syntax is

m/>Expression</>Modifier<

>Expression< is a regular expression.
>Modifier< is one or more characters from a list of modifiers detailed below.

Since m is the default behaviour for a regular expression in a Perl program ...

/>Expression</

... works the same way.

There is one difference though: if the m operator is specified the default delimiter "/" can be replaced with any other character, for matching. Thus ...

/a/
m/a/
m:a:

... are all valid regular matching operations, but ...

:a:

... is not.

The matching (binding) operators =~ and !~

The =~ operator makes Perl apply the regular expression on the right to the variable on the left. For a match, it returns TRUE if the variable contains the pattern, FALSE otherwise. It can be used in conditional expressions (if (...) { }), and it binds a variable when matching (m//), substituting (s/abc/xyz/) or transliterating (tr/A-Z/a-z/).

$test =~ /\w/;

is TRUE if the variable $test contains at least one word character.

Its inverse is the !~ operator, for example

$line !~ m/^\s*#/;

is TRUE if the string contained in $line does not start with "#", which may or may not be preceded by whitespace. This is useful for ignoring comment lines.

The regular expression above is decomposed as follows:

m the matching operator (optional)
/ the opening delimiter of the regular expression
^ the beginning of the line
\s any whitespace character ...
* ... repeated 0 or more times
# the hash character
/ the closing delimiter of the regular expression

The following example would process a file and store all lines that are not comments in an array:

#!/usr/bin/perl
use strict;
use warnings;

my @input;
while (my $line = <STDIN>) {     # while something is being read
   if ($line !~ m/^\s*#/) {      # if it's not a comment ...
      push(@input, $line);       # ... store line in array
   }
}
print(@input,"\n");              # print whole array

exit();
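
For comparison, the same comment filter can be sketched in Python (for illustration only; the function name `drop_comments` and the sample lines are made up, and the lines are taken from a list rather than from STDIN so that the example is self-contained).

```python
import re

def drop_comments(lines):
    """Return only the lines that do not start with optional whitespace plus '#'."""
    comment = re.compile(r"^\s*#")
    return [line for line in lines if not comment.search(line)]

sample = [
    "# header\n",
    "ATOM 1\n",
    "   # indented comment\n",
    "value # trailing comment\n",
]
kept = drop_comments(sample)
print(kept)   # ['ATOM 1\n', 'value # trailing comment\n']
```

The last sample line is kept because the pattern is anchored: a "#" later in the line does not make it a comment line.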

Substitution - s

The substitution operator s replaces a match of the expression in the first part with the replacement in the second part. Without modifiers, only the first match in the string is replaced. Its syntax is

s/>Expression</>Replacement</>Modifier<

>Expression< is a regular expression. >Replacement< is the replacement string. >Modifier< is one or more characters from a list of modifiers detailed below.

Example (substitutes the first instance of ugly in a line with pretty):

$line =~ s/ugly/pretty/;

Try the following example:

#!/usr/bin/perl
use strict;
use warnings;

print("input>");
my $line = <STDIN>;
$line =~ s/[^0-9+*\/=^-]//g;  # substitute
print($line,"\n");

exit();

The key is the following command:

$line =~ s/[^0-9+*\/=^-]//g;

The substitution is applied to the contents of the variable $line. It is of the form

s/...//g;

which means substitute all occurrences (g modifier!) of the pattern […] with nothing (because the replacement pattern is empty). This deletes all matching characters from the string.

The expression itself is a character set. It matches any character which is not a digit (0-9), a "+" or "*" character, a "/" character (which has to be preceded with an escape, as "\/", otherwise it would be parsed as the delimiter of the expression), or an "=", "^", or "-" character. Since it is itself a negation, only the characters specified thus are not deleted.

For example the input

aa2bb^4cc,.<>=16....

is changed into the output:

2^4=16
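
The same substitution can be sketched in Python (for illustration only; the function name `keep_arithmetic` is made up). One difference worth knowing: Python's re.sub() replaces all matches by default, like Perl's g modifier, and count=1 restricts it to the first match.

```python
import re

def keep_arithmetic(line):
    """Delete every character that is not a digit or an arithmetic symbol."""
    return re.sub(r"[^0-9+*/=^-]", "", line)

print(keep_arithmetic("aa2bb^4cc,.<>=16...."))          # 2^4=16

# Replace only the first occurrence, as s/ugly/pretty/ does without g
first_only = re.sub(r"ugly", "pretty", "ugly ugly", count=1)
print(first_only)                                       # pretty ugly
```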

Transliteration - tr

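The transliteration operator tr replaces each character from one list of characters with the corresponding character from a second list. The lists are not regular expressions: ranges are written without square brackets.

$line =~ tr/a-z/A-Z/;

turns all lowercase letters in $line into uppercase.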
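
Character-by-character translation exists in other languages too. As a sketch, Python (used only for illustration) offers str.maketrans() and str.translate(); the complement table below is an example chosen for this sketch, not taken from the text above.

```python
# Map each character of the first string to the character at the same
# position in the second string, like tr/ACGTacgt/TGCAtgca/ would.
complement = str.maketrans("ACGTacgt", "TGCAtgca")

print("GATTaca".translate(complement))   # CTAAtgt
print("gattaca".upper())                 # GATTACA (the built-in way to uppercase)
```

As with tr, no pattern matching is involved: every character is looked up in a fixed table.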

split()

Another operator that makes use of regular expressions is the split operator. You can split on a regular expression and thus remove unneeded characters from input, as in the following example:

#!/usr/bin/perl -w
use strict;
my $string = "A :colon:delimited: string: with:  random :spaces";
my ( @lines ) = split(/\s*:\s*/, $string);
# splits on colons surrounded by optional spaces
...

@lines now contains each entry in its own array element, without colons or whitespace.

In practice, when should you use matching, and when is split() more appropriate?

Use matching when you know what you want to keep

@words = $input =~ /\w+/g; # captures all blocks of word characters

Use split() when you know what you want to discard

@words = split( /\s+/, $input); # splits on whitespace
                                # and discards it

Consider how punctuation marks would influence the results of these examples.
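
To see the influence of punctuation, both approaches can be run on the same string. This is a sketch in Python (for illustration only; the sentence is made up), where re.findall() plays the role of the global match and re.split() that of split().

```python
import re

text = "Hello, world! It's regex-time."

kept  = re.findall(r"\w+", text)   # keep what matches
parts = re.split(r"\s+", text)     # discard what matches

print(kept)    # ['Hello', 'world', 'It', 's', 'regex', 'time']
print(parts)   # ['Hello,', 'world!', "It's", 'regex-time.']
```

Matching on \w+ loses the punctuation but also breaks "It's" apart; splitting on whitespace keeps the words whole but leaves the punctuation attached.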

The most frequent use of the split function is for processing structured input data, such as comma- or tab delimited text:

#!/usr/bin/perl
use strict;
use warnings;

while (my $line = <STDIN>) {          # read one line at a time
   chomp($line);                      # remove the trailing newline
   my @fields = split(/\t/, $line);   # tab separated values
   # ... process fields
}
exit();

Behaviour

Returning values

It is often desirable to group terms together. This is done with various forms of parentheses. By default, grouping with parentheses captures the actual match into the special variables $1, $2, $3, etc. Groups are numbered in the order of their opening parentheses, from left to right; for nested groups this means from outermost to innermost!

Here is one example - the groupings are shown below the parentheses.

This is how it works:

( ( ) ( ( ) ) )
1-------------1
  2-2
      3-----3
        4-4

This is how it does not work:

( ( ) ( ( ) ) )
1---1
  2-------2
      3-----3
        4-----4

Grouping Syntax	Meaning	Where it occurs in the regex
`()`	Group what's between the brackets and remember match	Anywhere
`(?: … )`	Group what's between the brackets, but discard match	Anywhere
`(?= … )`	must follow the match	End of a regex
`(?! … )`	must not follow the match	End of a regex
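
Group numbering and the grouping variants can be checked with a short sketch in Python (for illustration only; the patterns and sample strings are made up). Python exposes captured groups through methods of the match object rather than through $1, $2, ... variables.

```python
import re

# Same nesting as the diagram:  ( ( ) ( ( ) ) )
m = re.match(r"((\d+)-((\w)\w*))", "2024-abc")
print(m.groups())    # ('2024-abc', '2024', 'abc', 'a')

# (?: ... ) groups without capturing: only the name is remembered
name = re.match(r"(?:Mr|Ms)\. (\w+)", "Ms. Curie").groups()
print(name)          # ('Curie',)

# (?= ... ) must follow the match, but is not part of it
users = re.findall(r"\w+(?=@)", "alice@example.org bob")
print(users)         # ['alice']

# (?! ... ) must NOT follow the match
not_bar = [hit.start() for hit in re.finditer(r"foo(?!bar)", "foobar foobaz")]
print(not_bar)       # [7]
```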

In terms of saved values, also note that string parts are saved to special global variables.

Variable	What it contains
$`	Part of string before match
`$&`	Part of string matched
`$'`	Part of string after match

Note the following: if these variables are not used anywhere in your code, Perl doesn't bother to maintain them when your program is compiled. This makes all regexes much faster. It seems sensible to avoid them for all but quick and dirty programming work: use parentheses when you need to capture matches, and never put such special variables in modules!

Capturing matches directly

In addition to using parentheses and the special variables, you can capture values directly by assigning the result of the match operator to an array (list context) if you use the "global" modifier.

#!/usr/bin/perl
use strict;
use warnings;

my $Ubiquitin ="
MQIFVKTLTG KTITLEVEPS\n
DTIENVKAKI QDKEGIPPDQ\n
QRLIFAGKQL EDGRTLSDYN\n
IQKESTLHLV LRLRGG\n";

my @hydrophobics = $Ubiquitin =~ m/[FAMILYVW]/gs;
print @hydrophobics;

exit();
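
Collecting all matches into a list, and filtering a list of lines by a pattern, can be sketched in Python as follows (for illustration only). The sequence is the first line of the ubiquitin example above; the four record lines are invented stand-ins for a PDB file, so that no download is needed.

```python
import re

first_line = "MQIFVKTLTG KTITLEVEPS"
hydrophobics = re.findall(r"[FAMILYVW]", first_line)   # all matches, in order
print("".join(hydrophobics))                           # MIFVLILV

# Hypothetical records, only to show the filtering step
records = [
    "HEADER    EXAMPLE",
    "ATOM      1  N   MET A   1",
    "HETATM    2  O   HOH A 101",
    "ATOM      2  CA  MET A   1",
]
atoms = [r for r in records if re.match(r"ATOM  ", r)]  # like grep(/^ATOM  /, ...)
print(len(atoms))                                       # 2
```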

You can also use grep() and collect matching lines in an array. Here is an example that downloads a coordinate file from the PDB and extracts the ATOM records.

#!/usr/bin/perl
use strict;
use warnings;

my $PDBpref = "http://pdb.org/pdb/files/";
my $PDB_ID  = uc("2imm");
my $PDBsuff = ".pdb";
my $URL = $PDBpref . $PDB_ID . $PDBsuff;

my @raw = split(/\n/, `curl -s $URL`); # backtick operator captures output of system commandline function "curl"
my @atoms = grep(/^ATOM  /, @raw);

print (join("\n", @atoms), "\n"); # join lines with linebreaks, add a final linebreak at the end

exit();

Modifiers

After the trailing / delimiter of the regular expression, an i makes the match case insensitive (e.g. /foo/i will match FOO too). An x causes Perl to ignore whitespace in the regex (e.g. /foo s?/x is the same as /foos?/ and will match foo and foos; the blank is not part of the pattern). This is useful when an expression is long and may span several lines: just insert linebreaks, tabs or comments as needed.

For example the following is a valid regular expression in a Perl program that parses a Fasta file into header and sequence.

#!/usr/bin/perl
use strict;
use warnings;

my $fasta ='';
while (my $line = <STDIN>) { $fasta .= $line; }

$fasta =~ /    # Begin regular expression
    (?:.*)     # discard whatever precedes next match
    \s*        # there could be whitespaces
    >(.*\n)    # match the header line and collect its contents
    \s*        # there could again be whitespaces
    ((.*\n)*)  # match everything else to the end
    /x;        # ignore whitespace in the regex

my $header = $1;
my $sequence = $2;
$sequence =~ s/\s//g;   # remove all whitespace from sequence

print($header,"\n");
print($sequence,"\n");

exit();

Here the "x" modifier makes Perl discard both the comments and the whitespace inside the regular expression.

Contrast this to the impenetrable expression you would have had to write otherwise!

$fasta =~ /(?:.*)\s*>(.*\n)\s*((.*\n)*)/;
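
Commented, multi-line patterns are available in other languages as well. The following sketch uses Python's re.VERBOSE flag (for illustration only). The pattern is a simplified variant of the one above, not a translation, and the function name `parse_fasta` and the sample record are made up.

```python
import re

fasta_pattern = re.compile(r"""
    \s*          # there could be leading whitespace
    >(.*)\n      # the header line, without the ">" and the newline
    ([\s\S]*)    # everything else, newlines included
    """, re.VERBOSE)

def parse_fasta(text):
    """Return (header, sequence) for a single FASTA record, or None."""
    m = fasta_pattern.match(text)
    if m is None:
        return None
    header = m.group(1)
    sequence = re.sub(r"\s", "", m.group(2))   # remove all whitespace
    return header, sequence

record = ">example_protein hypothetical header\nMQIFVKTLTG\nKTITLEVEPS\n"
print(parse_fasta(record))
# ('example_protein hypothetical header', 'MQIFVKTLTGKTITLEVEPS')
```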

The s modifier makes "." match newline characters as well, so that a multi-line string (with new-line characters in it) is treated as a single line; otherwise "." stops matching at a new-line (e.g. /fo.o/s will match foo split over two lines). The g modifier is useful in loops, making consecutive attempts to match, starting at the place in the string where the previous match ended (e.g. while($foo =~ /o/g){$o_count++} will give an o_count of two if $foo contains "foo" because there are two o's in "foo").

All of the modifiers can be used together. Just type them one after another after the delimiter.
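
Most of these modifiers have counterparts in other regex implementations. As a sketch, in Python (for illustration only) they are passed as flags: re.IGNORECASE for i, re.DOTALL for s, re.VERBOSE for x; the repeated matching of g is done with re.findall() or re.finditer().

```python
import re

assert re.search(r"foo", "FOO") is None
assert re.search(r"foo", "FOO", re.IGNORECASE)        # like /foo/i

assert re.search(r"fo.o", "fo\no") is None            # "." stops at the newline
assert re.search(r"fo.o", "fo\no", re.DOTALL)         # like /fo.o/s

o_count = len(re.findall(r"o", "foo"))                # like counting with /o/g
print(o_count)                                        # 2

# Flags are combined with "|", like typing several modifiers in a row
combined = re.search(r"f o . o", "FO\nO", re.IGNORECASE | re.DOTALL | re.VERBOSE)
print(combined is not None)                           # True
```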

Greed

By default, quantifiers are "greedy", i.e. they will match the largest possible number of characters. For example

/(\w+)(\d+)/

against "abc123" yields "abc12" and "3" for $1 and $2 respectively.

This is because \w+ is greedy and grabs as many word characters as it can, giving back only the single digit that \d+ needs in order to match. A ? after a quantity specifier makes it non-greedy, therefore

/(\w+?)(\d+)/

against "abc123" yields "abc" and "123" for $1 and $2 respectively.
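
Greedy and non-greedy matching behave the same way in Python, so the example can be verified with a short sketch (for illustration only; the tag example in the second half is an addition made up for this sketch).

```python
import re

greedy = re.match(r"(\w+)(\d+)", "abc123").groups()
lazy   = re.match(r"(\w+?)(\d+)", "abc123").groups()
print(greedy)   # ('abc12', '3')
print(lazy)     # ('abc', '123')

# A classic pitfall: ".+" runs to the LAST ">" in the string
tag_greedy = re.search(r"<(.+)>", "<b>bold</b>").group(1)
tag_lazy   = re.search(r"<(.+?)>", "<b>bold</b>").group(1)
print(tag_greedy)   # b>bold</b
print(tag_lazy)     # b
```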

Regular Expressions in PHP

<?php
$string = "The quick brown fox jumps over a lazy dog";

$words = preg_split('/\s+/', $string);
print_r($words);

preg_match('/.\W./', $string, $matches);
print_r($matches);

preg_match_all('/.\W./', $string, $matches);
print_r($matches);

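# Indexed preg_replace: the patterns are applied one after another, each to
# the result of the previous replacement. Swapping words this way is broken,
# because later patterns also match text that earlier replacements inserted.
$pat = array();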
$pat[0] = '/quick brown/';
$pat[1] = '/fox/';
$pat[2] = '/lazy/';
$pat[3] = '/dog/';
$rep = array();
$rep[0] = 'lazy';
$rep[1] = 'dog';
$rep[2] = 'quick brown';
$rep[3] = 'fox';
print(preg_replace($pat, $rep, $string));
print("\n");

# Working version: swap via the intermediate placeholders "foo" and "bar"
$pat = array();
$pat[0] = '/quick brown fox/';
$pat[1] = '/lazy dog/';
$pat[2] = '/foo/';
$pat[3] = '/bar/';
$rep = array();
$rep[0] = 'foo';
$rep[1] = 'bar';
$rep[2] = 'lazy dog';
$rep[3] = 'quick brown fox';
print(preg_replace($pat, $rep, $string));
print("\n");


?>
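
The reason why the first indexed replacement above goes wrong, and an alternative to placeholders, can be sketched in Python (for illustration only; the alternative uses a replacement function, which is a different technique from the PHP code above, and the dictionary name `swap` is made up).

```python
import re

s = "The quick brown fox jumps over a lazy dog"

# Sequential replacement: each step also sees the output of the previous steps
sequential = (s.replace("quick brown", "lazy")
               .replace("fox", "dog")
               .replace("lazy", "quick brown")
               .replace("dog", "fox"))
print(sequential)   # The quick brown fox jumps over a quick brown fox

# Single pass: one alternation pattern, the replacement is looked up per match
swap = {"quick brown fox": "lazy dog", "lazy dog": "quick brown fox"}
pattern = re.compile("|".join(re.escape(key) for key in swap))
swapped = pattern.sub(lambda m: swap[m.group(0)], s)
print(swapped)      # The lazy dog jumps over a quick brown fox
```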

Regular expressions in Python

Python regular expressions are provided through the module re. See here for documentation.

The re functions operate on a string. re.match() and re.search() return a MatchObject (or None if there is no match); the MatchObject is then further analyzed with the methods it supplies.

The most frequently used functions are:

re.match(pattern, string) matches only at the beginning of the string.
re.search(pattern, string) matches anywhere in the string.
re.split(pattern, string) returns the split string as a list.
re.findall(pattern, string) returns all matches in a list.
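
A minimal sketch of the four functions on one made-up string (the gene names are only sample data):

```python
import re

line = "gene: TP53, BRCA1"

at_start = re.match(r"TP53", line)       # None: "TP53" is not at the beginning
anywhere = re.search(r"TP53", line)      # a match object
print(at_start)                          # None
print(anywhere.group(0), anywhere.start())   # TP53 6

names = re.split(r",\s*", "TP53, BRCA1,EGFR")
print(names)                             # ['TP53', 'BRCA1', 'EGFR']

symbols = re.findall(r"[A-Z]+\d+", line)
print(symbols)                           # ['TP53', 'BRCA1']
```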

Example

Download this .svg file to experiment.

# parse_SVG_example.py
# Read an svg file line by line and process path data
# to write commands separately to an output file, line by line.

import re

filePath = "/my/working/directory/whatever/"

myIn  = filePath + "sample.svg"
myOut = filePath + "test.svg"

IN  = open(myIn)
OUT = open(myOut, "w")

for line in IN:
   path = re.search(r'\sd="(.*?)"', line) # returns the MatchObject "path"
   if path:
       # Found. Process the result with a second regex.
       # path.group() is a method of the MatchObject
       pathData = re.findall(r'([aAcChHlLmMqQsStTvVzZ]|-?\d*\.?\d+)',
                             path.group(1))
       # Write it nicely formatted to output, one command per line
       OUT.write("d=\"")
       s = ""    # we accumulate output lines in this variable
       for token in pathData:
           if re.match('[aAcChHlLmMqQsStTvVzZ]', token):
               # it's a letter:
               OUT.write("\n    "+s)     # flush s to output
               s = token + " "       # new s
           else:
               s = s + token + " "   # append to s
       OUT.write("\n    " + s + "\"\n")  # flush s, close string, and add \n

   else:
       OUT.write(line)

IN.close()
OUT.close()

Regular Expressions in R

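The online help page is here. By default, R uses POSIX-style extended regular expressions with some extensions of its own, so the default behaviour is not identical to Perl's. To get Perl-compatible regular expressions, pass the perl=TRUE parameter.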

# R regular expressions in base R

string <- "The quick brown fox jumps over a lazy dog"
vector <- unlist(strsplit(string, "\\s"))

# Not all pattern searches use (and need) regular expression. Sometimes
# simple string-matching is enough.

# R has match(), the %in% operator, and grep()

# match() tests for string equality
match("fox", vector)   # 4, i.e. the 4th element matches the string
match("o", vector)     # NA: matches have to be to the WHOLE element

# equivalent to...
which(vector == "fox")

# %in% can be used for creating intersections
# find whether elements from one vector are
# contained in another:

english <- unlist(strsplit(
"what's in a name ? that which we call a rose by any other name would smell as sweet ."
                           , "\\s"))
german <- unlist(strsplit(
"was ist ein name ? was uns rose heißt , wie es auch hieße , würde lieblich duften ."
                           , "\\s"))
english
german
german %in% english
german[german %in% english]



# grep() is like match(), but uses regular expressions: parts of the string
# may match. The result is a vector of the indices of the matching elements
# (use grepl() to get a logical vector instead).

grep("fox", vector)
grep("o", vector)
grep("[opq]", vector)
english[grep("a", english)]



# strsplit()  Note: the regex comes *after* the string in default ordering
# we have seen its use to split on whitespace (\s) above.
# NOTE: the regular expression in the pattern is <backslash> "s". But if
# we write "\s" into the string, R thinks we are "escaping" the s. That's
# not what we want. We have to escape the backslash, then write "s". The
# "escaped" backslash is "\\". Thus the regex pattern as R string is "\\s".

# The return value of strsplit() is a list, thus we unlist() to use
# the result as a vector.

unlist(strsplit(english, "\\s"))



# regexpr(), regmatches()
#get all word characters adjacent to "o"
pattern <- "\\w{0,1}o\\w{0,1}" # 0-1 "\w" character left and
                               # right of "o"
regexpr(pattern, vector) # positions of matches
M <- regexpr(pattern, vector) # assign the result object
regmatches(vector, M) # use regmatches to process
                      # the match-object M against the
                      # source vector


# regexec()
# capture groups from a string. Here we don't just want to know
# whether a match exists, but what it is. Example: is there
#  a three-consonant cluster in our string?
pattern <- "([bcdfghjklmnpqrstvwxz]{3})"  # Note the parentheses
                                          # that indicate the match
                                          # should be "captured"
grep(pattern, string)

M <- regexec(pattern, string) #
regmatches(string, M)
regmatches(string, M)[[1]]
regmatches(string, M)[[1]][1]


# In R versions before 4.1.0 there is no option to capture multiple matches
# in base R: there, regexec() lacks a corresponding gregexec()...
M <- regexec("(. .)", string)
regmatches(string, M)

# ... matches only the first character/blank/character pattern,
# not all of them.


# Solution 1 (base R): you can use multiple matches in an sapply()
# statement...
pattern <- "(. .)"   # the regex: capture two characters adjacent to a single blank
sapply(regmatches(string, gregexpr(pattern, string))[[1]],
       function(M){regmatches(M, regexec(pattern, M))})


# Solution 2 (probably preferred): you can use
# str_match_all() from the very useful library "stringr" ...
if (!require(stringr, quietly=TRUE)) {
    install.packages("stringr")
    library(stringr)
}

str_match_all(string, pattern)[[1]][,2]
# [1] "e q" "k b" "n f" "x j" "s o" "r a" "y d"

An interesting new alternative/complement to the base R regex libraries is the package "ore" that uses the Oniguruma libraries and supports multiple character encodings.

if (!require(ore)) {
    install.packages("ore")
    library(ore)
}

S <- "The quick brown fox jumps over a lazy dog"

ore.search(". .", S)
ore.search(". .", S, all=TRUE)
M <- ore.search(". .", S, all=TRUE)
M$nMatches
M$match[2:4]

According to the author Jon Clayden, key advantages include:

Search results focus around the matched substrings (including parenthesised groups), rather than the locations of matches. This saves extra work with "substr" or similar to extract the matches themselves.
Substantially better performance, especially when matching against long strings.

Substitutions can be functions as well as strings.
Matches can be efficiently obtained over only part of the strings.
Fewer core functions, with more consistent names.

Regular Expressions in Javascript

Javascript is probably the most convenient choice if you want to interact directly with the DOM (Document Object Model) of a Web page. Here is example code for a "bookmarklet" that rewrites the URL of a journal-article page for access from outside the UofT network.

javascript:(function(){
   var url=window.location.href;
   var re=/\/([\w.]+)\/(.*$)/;
   var match=url.match(re);
   var newURL="http://"+match[1]+".myaccess.library.utoronto.ca/"+match[2];
   window.location.href=newURL;
})();
void 0

Regular Expressions in POSIX (Unix, the shell)

Use in:

grep

grep finds patterns in files. Patterns are regular expressions and can come in basic or extended flavors. In GNU grep there is no difference in available functionality between these, only in which metacharacters need a backslash; you switch from basic to extended syntax with the grep -E flag, which is the same as invoking egrep.

Example: what daemons run on your system?

ps -ax | egrep -o "/([^A-Z]\w+d)\b" | sort -u

Other uses of regular expressions in:

find
sed
awk
cut

... see the man pages.

Discussion points

Revisit the stackoverflow thread on regex and HTML parsing. What's your opinion on the OP's question?

Exercises

Counting lines

Hint

Write a unix command that returns the number of atoms in a PDB file.

Sample data ...

...CA atoms only

Hint

Change your unix command to count C-alpha atoms only. Work only with regular expressions. Don't get fooled by calcium atoms!

eMail addresses

Hint

Write a program in a language of your choice that reads a file from STDIN and prints any valid e-mail address this file might contain!

What is a valid eMail address ... ?

Sample input data ^[1]...

Solution

Match something at a word boundary, followed by "@", followed by something, bounded by whitespace. Group this appropriately. Then return $1, $2, $3.

Multiple sequence alignment

Hint

Write a program in a language of your choice that extracts the multi-line sequences from a CLUSTAL or MSF formatted multiple sequence alignment and concatenates them into single sequences.

Sample input data ...

CLUSTAL formatted alignment

CLUSTAL multiple sequence alignment by MUSCLE (3.8)

SOK2_SACCE      --NGISVVRRADNDMVNGTKLLN-----VTKMTRGRRDGILKAEKIR----------HVV
PHD1_SACCE      --NGISVVRRADNNMINGTKLLN-----VTKMTRGRRDGILRSEKVR----------EVV
KILA_ESCCO      -IDGEIIHLRAKDGYINATSMCRTAGKLLSDYTRLKTTQEFFDELSRDMGIPISELIQSF
MBP1_SACCE      IHSTGSIMKRKKDDWVNATHILK-----AANFAKAKRTRILEKEVLKETH-------EKV
SWI4_SACCE      ---TKIVMRRTKDDWINITQVFK-----IAQFSKTKRTKILEKESNDMQH-------EKV
                      :  * .:. :* * : .      :. :. .    :  *               .

SOK2_SACCE      KIGSMHLKGVWIPFERALAIAQREKI-
PHD1_SACCE      KIGSMHLKGVWIPFERAYILAQREQI-
KILA_ESCCO      KGGRPENQGTWVHPDIAINLAQ-----
MBP1_SACCE      QGGFGKYQGTWVPLNIAKQLAEKFSVY
SWI4_SACCE      QGGYGRFQGTWIPLDSAKFLVNKYEI-

MSF formatted alignment

PileUp

  MSF: 87  Type: A  Check: 0000  ..

 Name: SOK2_SACCE  Len: 87  Check:  9836  Weight: 0.160458
 Name: PHD1_SACCE  Len: 87  Check:  2117  Weight: 0.160458
 Name: KILA_ESCCO  Len: 87  Check:  6044  Weight: 0.256296
 Name: MBP1_SACCE  Len: 87  Check:  4979  Weight: 0.211395
 Name: SWI4_SACCE  Len: 87  Check:  5197  Weight: 0.211395

//

SOK2_SACCE    ..NGISVVRR ADNDMVNGTK LLN.....VT KMTRGRRDGI LKAEKIR...
PHD1_SACCE    ..NGISVVRR ADNNMINGTK LLN.....VT KMTRGRRDGI LRSEKVR...
KILA_ESCCO    .IDGEIIHLR AKDGYINATS MCRTAGKLLS DYTRLKTTQE FFDELSRDMG
MBP1_SACCE    IHSTGSIMKR KKDDWVNATH ILK.....AA NFAKAKRTRI LEKEVLKETH
SWI4_SACCE    ...TKIVMRR TKDDWINITQ VFK.....IA QFSKTKRTKI LEKESNDMQH

SOK2_SACCE    .......HVV KIGSMHLKGV WIPFERALAI AQREKI.
PHD1_SACCE    .......EVV KIGSMHLKGV WIPFERAYIL AQREQI.
KILA_ESCCO    IPISELIQSF KGGRPENQGT WVHPDIAINL AQ.....
MBP1_SACCE    .......EKV QGGFGKYQGT WVPLNIAKQL AEKFSVY
SWI4_SACCE    .......EKV QGGYGRFQGT WIPLDSAKFL VNKYEI.

Screenscraping

Hint

Write a PHP program that screenscrapes resolution data for a protein structure file from the PDB.

Here is a link to a PDB record to illustrate the URL format.

Labeling

Hint

Write an R script that creates meaningful labels for data elements from metadata and shows them in a plot. Use the sample data below - or any other data you are interested in.

Sample input data from GEO, and task description ...

These data were downloaded from the NCBI GEO database using the GEO2R tool; this is a microarray expression data study that compares tumor and metastasis tissue. You can access the dataset here. Grouping primary PDAC (pancreatic ductal adenocarcinoma) as "tumor" and liver/peritoneal metastasis as "metastasis", an R script on the server calculates significantly differentially expressed genes using the Bioconductor limma package. I have selected the top 100 genes, and now would like to plot significance (adjusted P value) vs. level of differential expression (logFC). Moreover, I would like to roughly identify the function of each gene if that is discernible from the "Gene title".

"ID"	"adj.P.Val"	"P.Value"	"t"	"B"	"logFC"	"Gene.symbol"	"Gene.title"
"238376_at"	"3.69e-19"	"4.53e-23"	"-49.138515"	"42.43328"	"-2.202043"	"LOC100505564///DEXI"	"uncharacterized LOC100505564///Dexi homolog (mouse)"
"214041_x_at"	"2.36e-17"	"8.74e-21"	"38.089228"	"37.60995"	"4.541989"	"RPL37A"	"ribosomal protein L37a"
"241662_x_at"	"2.36e-17"	"1.03e-20"	"-37.793765"	"37.45851"	"-2.105123"	""	""
"231628_s_at"	"2.36e-17"	"1.16e-20"	"-37.574182"	"37.34507"	"-1.97516"	"SERPINB6"	"serpin peptidase inhibitor, clade B (ovalbumin), member 6"
"224760_at"	"3.23e-17"	"2.10e-20"	"36.500909"	"36.77932"	"3.798724"	"SP1"	"Sp1 transcription factor"
"214149_s_at"	"3.23e-17"	"2.38e-20"	"36.282193"	"36.66167"	"4.246787"	"ATP6V0E1"	"ATPase, H+ transporting, lysosomal 9kDa, V0 subunit e1"
"243177_at"	"4.15e-17"	"3.57e-20"	"-35.573827"	"36.275"	"-1.801709"	""	""
"243800_at"	"5.63e-17"	"5.52e-20"	"-34.825113"	"35.85663"	"-2.018088"	"NR1H4"	"nuclear receptor subfamily 1, group H, member 4"
"238398_s_at"	"1.10e-16"	"1.21e-19"	"-33.519208"	"35.10201"	"-2.245806"	""	""
"1569856_at"	"1.48e-16"	"1.82e-19"	"-32.860752"	"34.70891"	"-1.810438"	"TPP2"	"tripeptidyl peptidase II"
"1555116_s_at"	"1.51e-16"	"2.14e-19"	"-32.598656"	"34.55"	"-1.990665"	"SLC11A1"	"solute carrier family 11 (proton-coupled divalent metal ion transporters), member 1"
"218733_at"	"1.51e-16"	"2.23e-19"	"32.535823"	"34.51169"	"2.764663"	"MSL2"	"male-specific lethal 2 homolog (Drosophila)"
"201225_s_at"	"2.72e-16"	"4.33e-19"	"31.497695"	"33.86667"	"3.447828"	"SRRM1"	"serine/arginine repetitive matrix 1"
"217052_x_at"	"4.45e-16"	"7.64e-19"	"30.636232"	"33.31345"	"1.601527"	""	""
"1569348_at"	"5.24e-16"	"9.65e-19"	"-30.289176"	"33.08577"	"-1.793925"	"TPTEP1"	"transmembrane phosphatase with tensin homology pseudogene 1"
"219492_at"	"6.96e-16"	"1.37e-18"	"29.777415"	"32.74483"	"3.586919"	"CHIC2"	"cysteine-rich hydrophobic domain 2"
"215047_at"	"7.51e-16"	"1.58e-18"	"-29.567379"	"32.60307"	"-2.033635"	"TRIM58"	"tripartite motif containing 58"
"232877_at"	"7.51e-16"	"1.66e-18"	"-29.491388"	"32.55151"	"-1.65225"	""	""
"229265_at"	"7.51e-16"	"1.75e-18"	"29.419139"	"32.50236"	"3.933071"	"SKI"	"v-ski sarcoma viral oncogene homolog (avian)"
"1553842_at"	"8.16e-16"	"2.00e-18"	"-29.226409"	"32.37061"	"-1.832581"	"BEND2"	"BEN domain containing 2"
"220791_x_at"	"1.11e-15"	"2.87e-18"	"-28.71601"	"32.01715"	"-1.969381"	"SCN11A"	"sodium channel, voltage-gated, type XI, alpha subunit"
"212911_at"	"1.17e-15"	"3.15e-18"	"28.584094"	"31.92471"	"2.143175"	"DNAJC16"	"DnaJ (Hsp40) homolog, subfamily C, member 16"
"243464_at"	"1.22e-15"	"3.43e-18"	"-28.463254"	"31.83963"	"-1.675747"	""	""
"243823_at"	"1.30e-15"	"3.81e-18"	"-28.316669"	"31.7359"	"-1.499823"	""	""
"201533_at"	"1.56e-15"	"4.80e-18"	"27.999089"	"31.5092"	"4.054743"	"CTNNB1"	"catenin (cadherin-associated protein), beta 1, 88kDa"
"210878_s_at"	"1.59e-15"	"5.06e-18"	"27.927536"	"31.45775"	"2.982033"	"KDM3B"	"lysine (K)-specific demethylase 3B"
"227712_at"	"3.18e-15"	"1.05e-17"	"26.938855"	"30.73223"	"2.426311"	"LYRM2"	"LYR motif containing 2"
"228520_s_at"	"3.56e-15"	"1.22e-17"	"26.742683"	"30.58495"	"3.744881"	"APLP2"	"amyloid beta (A4) precursor-like protein 2"
"210242_x_at"	"3.80e-15"	"1.36e-17"	"26.605262"	"30.48111"	"1.815311"	"ST20"	"suppressor of tumorigenicity 20"
"217301_x_at"	"3.80e-15"	"1.40e-17"	"26.565414"	"30.45089"	"3.275566"	"RBBP4"	"retinoblastoma binding protein 4"
"1557551_at"	"6.17e-15"	"2.35e-17"	"-25.892664"	"29.93351"	"-1.78824"	""	""
"201392_s_at"	"6.17e-15"	"2.42e-17"	"25.856344"	"29.90519"	"3.283483"	"IGF2R"	"insulin-like growth factor 2 receptor"
"210371_s_at"	"7.18e-15"	"2.91e-17"	"25.62344"	"29.72255"	"3.463431"	"RBBP4"	"retinoblastoma binding protein 4"
"204252_at"	"9.08e-15"	"3.79e-17"	"25.291186"	"29.45902"	"2.789842"	"CDK2"	"cyclin-dependent kinase 2"
"243200_at"	"1.04e-14"	"4.48e-17"	"-25.082134"	"29.29138"	"-1.539093"	""	""
"201140_s_at"	"1.16e-14"	"5.13e-17"	"24.916407"	"29.15746"	"2.834707"	"RAB5C"	"RAB5C, member RAS oncogene family"
"1559066_at"	"1.23e-14"	"5.57e-17"	"-24.813534"	"29.07387"	"-1.595061"	""	""
"201123_s_at"	"1.27e-14"	"5.91e-17"	"24.741268"	"29.01494"	"4.870779"	"EIF5A"	"eukaryotic translation initiation factor 5A"
"218291_at"	"1.41e-14"	"6.83e-17"	"24.565645"	"28.87099"	"2.605328"	"LAMTOR2"	"late endosomal/lysosomal adaptor, MAPK and MTOR activator 2"
"217704_x_at"	"1.41e-14"	"6.91e-17"	"-24.550405"	"28.85845"	"-1.711476"	"SUZ12P1"	"suppressor of zeste 12 homolog pseudogene 1"
"227338_at"	"1.44e-14"	"7.22e-17"	"-24.498114"	"28.81536"	"-2.927581"	"LOC440983"	"hypothetical gene supported by BC066916"
"210231_x_at"	"1.64e-14"	"8.47e-17"	"24.305184"	"28.65556"	"4.548338"	"SET"	"SET nuclear oncogene"
"225289_at"	"1.86e-14"	"9.82e-17"	"24.127523"	"28.50726"	"3.062123"	"STAT3"	"signal transducer and activator of transcription 3 (acute-phase response factor)"
"204658_at"	"1.93e-14"	"1.04e-16"	"24.056703"	"28.44783"	"2.868797"	"TRA2A"	"transformer 2 alpha homolog (Drosophila)"
"208819_at"	"2.54e-14"	"1.40e-16"	"23.705016"	"28.15009"	"2.593365"	"RAB8A"	"RAB8A, member RAS oncogene family"
"210011_s_at"	"2.58e-14"	"1.46e-16"	"23.660126"	"28.11176"	"2.309763"	"EWSR1"	"EWS RNA-binding protein 1"
"202397_at"	"2.58e-14"	"1.48e-16"	"23.638422"	"28.0932"	"4.332132"	"NUTF2"	"nuclear transport factor 2"
"1552628_a_at"	"2.86e-14"	"1.68e-16"	"23.492249"	"27.96778"	"2.892763"	"HERPUD2"	"HERPUD family member 2"
"233757_x_at"	"3.85e-14"	"2.31e-16"	"23.123802"	"27.64812"	"2.430056"	""	""
"201545_s_at"	"5.07e-14"	"3.16e-16"	"22.767216"	"27.33385"	"2.568005"	"PABPN1"	"poly(A) binding protein, nuclear 1"
"1562463_at"	"5.07e-14"	"3.17e-16"	"-22.763883"	"27.33089"	"-1.119718"	""	""
"219859_at"	"5.41e-14"	"3.45e-16"	"-22.669239"	"27.24664"	"-1.787549"	"CLEC4E"	"C-type lectin domain family 4, member E"
"1569136_at"	"6.91e-14"	"4.50e-16"	"-22.372385"	"26.98011"	"-1.95396"	"MGAT4A"	"mannosyl (alpha-1,3-)-glycoprotein beta-1,4-N-acetylglucosaminyltransferase, isozyme A"
"208601_s_at"	"7.15e-14"	"4.74e-16"	"-22.314594"	"26.92781"	"-1.323653"	"TUBB1"	"tubulin, beta 1 class VI"
"226194_at"	"1.11e-13"	"7.47e-16"	"21.813583"	"26.46872"	"2.331245"	"CHAMP1"	"chromosome alignment maintaining phosphoprotein 1"
"217877_s_at"	"1.15e-13"	"7.93e-16"	"21.748093"	"26.40795"	"2.862688"	"GPBP1L1"	"GC-rich promoter binding protein 1-like 1"
"225371_at"	"1.25e-13"	"8.73e-16"	"21.644444"	"26.31139"	"2.518013"	"GLE1"	"GLE1 RNA export mediator homolog (yeast)"
"1563431_x_at"	"1.44e-13"	"1.02e-15"	"21.472848"	"26.15053"	"1.874743"	"CALM3"	"calmodulin 3 (phosphorylase kinase, delta)"
"211505_s_at"	"1.45e-13"	"1.06e-15"	"21.437744"	"26.11746"	"2.642609"	"STAU1"	"staufen double-stranded RNA binding protein 1"
"201585_s_at"	"1.45e-13"	"1.07e-15"	"21.430113"	"26.11027"	"2.787833"	"SFPQ"	"splicing factor proline/glutamine-rich"
"225197_at"	"1.75e-13"	"1.31e-15"	"21.212989"	"25.90451"	"2.845005"	""	""
"220336_s_at"	"1.83e-13"	"1.41e-15"	"-21.132294"	"25.82752"	"-1.848273"	"GP6"	"glycoprotein VI (platelet)"
"216515_x_at"	"1.83e-13"	"1.42e-15"	"21.128023"	"25.82343"	"2.877477"	"MIR1244-2///MIR1244-3///MIR1244-1///PTMAP5///PTMA"	"microRNA 1244-2///microRNA 1244-3///microRNA 1244-1///prothymosin, alpha pseudogene 5///prothymosin, alpha"
"241773_at"	"3.49e-13"	"2.74e-15"	"-20.441442"	"25.15639"	"-1.835223"	""	""
"1558011_at"	"3.89e-13"	"3.15e-15"	"-20.297118"	"25.01342"	"-1.577874"	"LOC100510697"	"putative POM121-like protein 1-like"
"215240_at"	"3.89e-13"	"3.15e-15"	"-20.29699"	"25.01329"	"-1.613308"	"ITGB3"	"integrin, beta 3 (platelet glycoprotein IIIa, antigen CD61)"
"233746_x_at"	"3.95e-13"	"3.25e-15"	"20.265986"	"24.98245"	"2.364699"	"HYPK///SERF2"	"huntingtin interacting protein K///small EDRK-rich factor 2"
"1555338_s_at"	"4.10e-13"	"3.42e-15"	"-20.214797"	"24.93143"	"-1.280803"	"AQP10"	"aquaporin 10"
"217714_x_at"	"4.12e-13"	"3.48e-15"	"20.195128"	"24.91179"	"2.247023"	"STMN1"	"stathmin 1"
"202276_at"	"4.75e-13"	"4.08e-15"	"20.035595"	"24.75183"	"2.654202"	"SHFM1"	"split hand/foot malformation (ectrodactyly) type 1"
"225414_at"	"6.34e-13"	"5.52e-15"	"19.733786"	"24.44585"	"3.287225"	"RNF149"	"ring finger protein 149"
"243930_x_at"	"7.43e-13"	"6.64e-15"	"-19.55046"	"24.2578"	"-1.219467"	""	""
"1569263_at"	"7.43e-13"	"6.66e-15"	"-19.548534"	"24.25581"	"-1.662363"	""	""
"1554876_a_at"	"8.55e-13"	"7.77e-15"	"-19.397142"	"24.09923"	"-1.388081"	"S100Z"	"S100 calcium binding protein Z"
"220001_at"	"1.08e-12"	"9.97e-15"	"-19.15375"	"23.84505"	"-1.412727"	"PADI4"	"peptidyl arginine deiminase, type IV"
"228170_at"	"1.12e-12"	"1.05e-14"	"-19.106672"	"23.79554"	"-1.840114"	"OLIG1"	"oligodendrocyte transcription factor 1"
"211445_x_at"	"1.29e-12"	"1.22e-14"	"-18.959325"	"23.63981"	"-1.134266"	"NACAP1"	"nascent-polypeptide-associated complex alpha polypeptide pseudogene 1"
"1555311_at"	"1.33e-12"	"1.27e-14"	"-18.91869"	"23.59666"	"-1.45603"	""	""
"201643_x_at"	"1.47e-12"	"1.43e-14"	"18.808994"	"23.47974"	"1.867155"	"KDM3B"	"lysine (K)-specific demethylase 3B"
"216449_x_at"	"1.51e-12"	"1.48e-14"	"18.773094"	"23.44134"	"3.178009"	"HSP90B1"	"heat shock protein 90kDa beta (Grp94), member 1"
"218680_x_at"	"1.51e-12"	"1.50e-14"	"18.763896"	"23.43149"	"2.262739"	"HYPK///SERF2"	"huntingtin interacting protein K///small EDRK-rich factor 2"
"225954_s_at"	"1.65e-12"	"1.67e-14"	"18.662853"	"23.32298"	"2.405388"	"MIDN"	"midnolin"
"203102_s_at"	"1.65e-12"	"1.68e-14"	"18.658192"	"23.31796"	"2.476697"	"MGAT2"	"mannosyl (alpha-1,6-)-glycoprotein beta-1,2-N-acetylglucosaminyltransferase"
"1569345_at"	"1.69e-12"	"1.74e-14"	"18.624203"	"23.28133"	"1.236884"	""	""
"214001_x_at"	"1.71e-12"	"1.78e-14"	"18.598496"	"23.25358"	"2.570012"	""	""
"231812_x_at"	"1.72e-12"	"1.81e-14"	"18.583236"	"23.2371"	"1.678685"	"PHAX"	"phosphorylated adaptor for RNA export"
"232075_at"	"1.93e-12"	"2.06e-14"	"-18.462717"	"23.10643"	"-2.150701"	"WDR61"	"WD repeat domain 61"
"200669_s_at"	"1.96e-12"	"2.12e-14"	"18.438729"	"23.08033"	"1.891968"	"UBE2D3"	"ubiquitin-conjugating enzyme E2D 3"
"236995_x_at"	"2.04e-12"	"2.23e-14"	"-18.389604"	"23.02677"	"-1.879369"	"TFEC"	"transcription factor EC"
"218008_at"	"2.24e-12"	"2.48e-14"	"18.291537"	"22.91946"	"2.445428"	"TMEM248"	"transmembrane protein 248"
"217140_s_at"	"2.30e-12"	"2.56e-14"	"18.260017"	"22.88485"	"3.983721"	"VDAC1"	"voltage-dependent anion channel 1"
"210183_x_at"	"2.46e-12"	"2.79e-14"	"18.183339"	"22.80044"	"1.79105"	"PNN"	"pinin, desmosome associated protein"
"216954_x_at"	"2.46e-12"	"2.80e-14"	"-18.177967"	"22.79451"	"-1.090193"	"ATP5O"	"ATP synthase, H+ transporting, mitochondrial F1 complex, O subunit"
"207688_s_at"	"2.53e-12"	"2.92e-14"	"18.141153"	"22.75385"	"2.492309"	"INHBC"	"inhibin, beta C"
"218020_s_at"	"2.63e-12"	"3.06e-14"	"18.095669"	"22.70351"	"1.772689"	"ZFAND3"	"zinc finger, AN1-type domain 3"
"217756_x_at"	"3.12e-12"	"3.67e-14"	"17.930201"	"22.51939"	"1.914366"	"SERF2"	"small EDRK-rich factor 2"
"214150_x_at"	"3.42e-12"	"4.07e-14"	"-17.835551"	"22.41336"	"-1.177963"	"ATP6V0E1"	"ATPase, H+ transporting, lysosomal 9kDa, V0 subunit e1"
"208750_s_at"	"3.48e-12"	"4.18e-14"	"17.812279"	"22.38721"	"2.649599"	"ARF1"	"ADP-ribosylation factor 1"
"201749_at"	"3.59e-12"	"4.42e-14"	"17.761415"	"22.32994"	"1.917794"	"ECE1"	"endothelin converting enzyme 1"

Solution

Read the data into R. Plot the logarithm of the adjusted P value against logFC (which is already a log ratio). Define some regular expressions that identify keywords in the gene title: things like "X-ase", "Y factor", "Z gene" etc. Apply these to the gene titles using regexpr(), then extract the matched strings by passing the text and the regexpr() result to regmatches(). Finally, use text() to plot the extracted strings as labels.
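The steps of this solution can be sketched roughly as follows. This is a minimal illustration, not a reference solution: it embeds three rows (and only four columns) of the sample table so that it runs on its own, and the keyword pattern `pat` and the variable names are assumptions made for the example.

```r
# Three rows of the sample table, embedded so the sketch is self-contained.
# With the full table saved to a file, read.delim() on that file replaces this.
txt <- paste(
  '"ID"\t"adj.P.Val"\t"logFC"\t"Gene.title"',
  '"224760_at"\t"3.23e-17"\t"3.798724"\t"Sp1 transcription factor"',
  '"204252_at"\t"9.08e-15"\t"2.789842"\t"cyclin-dependent kinase 2"',
  '"243177_at"\t"4.15e-17"\t"-1.801709"\t""',
  sep = "\n")
dat <- read.delim(text = txt, stringsAsFactors = FALSE)

# Hypothetical keyword pattern: a word ending in "ase",
# or the word that precedes "factor" together with "factor".
pat <- "[A-Za-z-]+ase\\b|\\w+ factor"

# regexpr() gives the position of the first match per title (-1 if none);
# regmatches() returns the matched strings for the titles that did match.
m <- regexpr(pat, dat$Gene.title, perl = TRUE)
lab <- rep("", nrow(dat))
lab[m != -1] <- regmatches(dat$Gene.title, m)

# Significance vs. differential expression, labelled with the keywords
plot(dat$logFC, log10(dat$adj.P.Val),
     xlab = "logFC", ylab = "log10(adjusted P value)")
text(dat$logFC, log10(dat$adj.P.Val), labels = lab, pos = 3, cex = 0.7)
```

Titles without a recognizable keyword simply get an empty label; a real solution would probably combine several patterns and fall back to the gene symbol.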

Appendix I: Metacharacters and their meaning

Expression	Meaning
`\`	Escape character
`|`	Alternation character. Matches either one of the specified alternatives. For example, /Asp|Glu/i matches ASP, Asp, asp, GLU, Glu or glu.
`^`	If the caret occurs at the beginning of an expression it anchors the expression at the beginning of a line or input. For example, /^AT/ does not match the 'AT' in "HETATM" but does match it in "ATOM". If the caret occurs as the first character of a character set as in [^a-z] it specifies the complement of the character set. Everywhere else, it simply matches the character "^".
`$`	Matches end of input or line. For example, /t$/ does not match the 't' in "eater", but does match it in "eat"
`*`	Matches the preceding character 0 or more times. For example, /bo*/ matches 'boooo' in "A ghost booooed" and 'b' in "A bird warbled", but nothing in "A goat grunted".
`+`	Matches the preceding character 1 or more times. Equivalent to {1,} . For example, /a+/ matches the 'a' in "candy" and all the a's in "caaaaaaandy."
`?`	Matches the preceding character 0 or 1 time. For example, /e?le?/ matches the 'el' in "angel" and the 'le' in "angle."
`.`	(The decimal point) matches any single character except the newline character.
`(x)`	Matches 'x' and remembers the match. For example, /(foo) bar/ matches "foo bar" and stores 'foo' in the special variable $1. /(more) (joy)/ matches "more joy", then stores 'more' in $1 and 'joy' in $2.
`{n}`	Where n is a positive integer. Matches exactly n occurrences of the preceding character. For example, /a{2}/ doesn't match the 'a' in "candy," but it matches all of the a's in "caandy," and the first two a's in "caaandy."
`{n,}`	Where n is a positive integer. Matches at least n occurrences of the preceding character. For example, /a{2,}/ doesn't match the 'a' in "candy", but matches all of the a's in "caandy" and in "caaaaaaandy."
`{n,m}`	Where n and m are positive integers. Matches at least n and at most m occurrences of the preceding character. For example, /a{1,3}/ matches nothing in "cndy", the 'a' in "candy," both a's in "caandy," and the first three a's in "caaaaaaandy". Notice that when matching "caaaaaaandy", the match is "aaa", even though the original string had more a's in it.
`[xyz]`	A character set. Matches any one of the enclosed characters. You can specify a range of characters by using a hyphen. For example, [bcd] is the same as [b-d]. They match the 'c' in "cysteine" and, as the first match, the 'c' in "ached"; a global search also matches the 'd'.
`[^xyz]`	A negated or complemented character set. That is, it matches any character that is not enclosed in the brackets. Note that the caret has to be the first character in the bracket set. For example, [^abc] is the same as [^a-c]. They initially match 'l' in "alanine" and 'y' in "cysteine".
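The table uses Perl notation (/pattern/). As a rough sketch of how some of these examples translate to R, the block below passes the pattern as a plain string; `first_match` is a helper defined here for illustration, not a base R function.

```r
# regmatches() with regexpr() returns the first match in a string;
# character(0) means there was no match.
first_match <- function(pattern, x) regmatches(x, regexpr(pattern, x))

first_match("bo*", "A ghost booooed")   # "boooo"
first_match("bo*", "A bird warbled")    # "b"
first_match("bo*", "A goat grunted")    # character(0)
first_match("a{1,3}", "caaaaaaandy")    # "aaa"
first_match("e?le?", "angel")           # "el"
first_match("[^abc]", "alanine")        # "l"

# Alternation; the /i modifier corresponds to ignore.case = TRUE
grepl("Asp|Glu", c("ASP", "glu", "Ala"), ignore.case = TRUE)  # TRUE TRUE FALSE

# Captured groups: Perl's $1, $2 are written \\1, \\2 in a replacement string
sub("(more) (joy)", "\\2 \\1", "more joy")  # "joy more"
```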

Appendix II: Character classes and their meaning

Expression	Meaning
`[\b]`	Matches a backspace. (Not to be confused with \b .)
`\b`	Matches a word boundary, i.e. the zero-width position between a word character and a non-word character (such as a space), or the start or end of the string. (Not to be confused with [\b].) The difference is that one is within a character set and the other is not! For example, /\bn\w/ matches the 'no' in "noonday"; /\wy\b/ matches the 'ly' in "possibly yesterday."
`\B`	Matches a non-word boundary. For example, /\w\Bn/ matches 'on' in "noonday", and /y\B\w/ matches 'ye' in "possibly yesterday."
`\cX`	Where X is a letter from A to Z. Matches the corresponding control character in a string. For example, /\cM/ matches control-M in a string.
`\d`	Matches a digit character. Equivalent to [0-9] . For example, /\d/ or /[0-9]/ matches '2' in "B2 is the suite number."
`\D`	Matches any non-digit character. Equivalent to [^0-9] . For example, /\D/ or /[^0-9]/ matches 'B' in "B2 is the suite number."
`\f`	Matches a form-feed.
`\n`	Matches a linefeed.
`\r`	Matches a carriage return.
`\s`	Matches a single white space character, including space, tab, form feed, line feed. Equivalent to [ \f\n\r\t\v] . For example, /\s\w*/ matches ' bar' in "foo bar."
`\S`	Matches a single character other than white space. Equivalent to [^ \f\n\r\t\v] . For example, /\S\w*/ matches 'foo' in "foo bar."
`\t`	Matches a tab
`\v`	Matches a vertical tab.
`\w`	Matches any alphanumeric character including the underscore. Equivalent to [A-Za-z0-9_] . For example, /\w/ matches 'a' in "apple," '5' in "$5.28," and '3' in "3D."
`\W`	Matches any non-word character. Equivalent to [^A-Za-z0-9_] . For example, /\W/ or /[^A-Za-z0-9_]/ matches '%' in "50%."
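One practical point when using these classes in R: inside an R string literal the backslash must itself be escaped, so the class \d is typed as "\\d". The following sketch replays a few of the table's examples; using perl = TRUE throughout is a choice made here to stay close to the Perl notation of the table.

```r
# The regex \d is written "\\d" in R source code.
x <- "B2 is the suite number."
regmatches(x, regexpr("\\d", x, perl = TRUE))   # "2"
regmatches(x, regexpr("\\D", x, perl = TRUE))   # "B"

y <- "foo bar"
regmatches(y, regexpr("\\s\\w*", y, perl = TRUE))   # " bar"
regmatches(y, regexpr("\\S\\w*", y, perl = TRUE))   # "foo"

# Word boundary and non-word boundary
z <- "possibly yesterday"
regmatches(z, regexpr("\\wy\\b", z, perl = TRUE))   # "ly"
regmatches(z, regexpr("y\\B\\w", z, perl = TRUE))   # "ye"

# \W finds the first character that is not a letter, digit or underscore
regmatches("50%", regexpr("\\W", "50%", perl = TRUE))   # "%"
```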

Appendix III: Anchor codes and their meaning

Expression	Meaning
`^`	If the caret occurs at the beginning of an expression it anchors the expression at the beginning of a line or input. For example, /^AT/ does not match the 'AT' in "HETATM" but does match it in "ATOM".
`$`	Matches end of input or line. For example, /t$/ does not match the 't' in "eater", but does match it in "eat" as well as in "eat\n".
`\b`	Matches a word boundary, i.e. the zero-width position between a word character and a non-word character (such as a space), or the start or end of the string. (Not to be confused with [\b].) The difference is that one is within a character set and the other is not! For example, /\bn\w/ matches the 'no' in "noonday"; /\wy\b/ matches the 'ly' in "possibly yesterday."
`\B`	Matches a non-word boundary. For example, /\w\Bn/ matches 'on' in "noonday", and /y\B\w/ matches 'ye' in "possibly yesterday."
`\A`	Matches at the start of a string. Like "^". For example, /\AAT/ matches "AT" in "ATOM " but not in "HETATM"
`\Z`	Matches at the end of a string. Like "$". For example, /\t\Z/ matches a tab at the end of the string but not anywhere else.
`(?: … )`	Groups what is between the parentheses, but does not capture the match.
`(?= … )`	Positive lookahead: the preceding pattern must be followed by this one in order to match. The lookahead itself is not part of the match.
`(?! … )`	Negative lookahead: the preceding pattern must not be followed by this one in order to match.
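A short sketch of how the anchors and the (?…) groups behave in R. The residue labels "Asp12" and "AspX" are made-up test strings; \A and the lookahead constructs need the PCRE engine, which is selected with perl = TRUE.

```r
atoms <- c("ATOM", "HETATM")
grepl("^AT", atoms)                  # TRUE FALSE
grepl("\\AAT", atoms, perl = TRUE)   # TRUE FALSE
grepl("t$", c("eat", "eater"))       # TRUE FALSE

# Lookahead: "Asp" only if it is (or is not) followed by a digit
res <- c("Asp12", "AspX")            # hypothetical labels
grepl("Asp(?=\\d)", res, perl = TRUE)   # TRUE FALSE
grepl("Asp(?!\\d)", res, perl = TRUE)   # FALSE TRUE

# The lookahead is not part of the match itself
regmatches("Asp12", regexpr("Asp(?=\\d)", "Asp12", perl = TRUE))  # "Asp"

# Non-capturing group: (?:foo) groups but does not count as \\1
sub("(?:foo) (bar)", "\\1", "foo bar", perl = TRUE)  # "bar"
```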

Appendix IV: Modifiers and their meaning

Expression	Meaning
`g`	Matches globally - i.e. matches all occurrences of pattern, one after the other, do not stop at the first one.
`i`	Match in a case-insensitive manner. For example, /[ACGT]/i matches any specific nucleotide in upper or lower case.
`x`	Ignore whitespace in the expression
`o`	Evaluate pattern only once.
`m`	Treat the string as multiple lines: ^ and $ then also match at the start and end of each line within the string, not only at the start and end of the whole string.
`s`	Treat the whole string as a single line, i.e. let "." also match the newline character "\n". For example, /(<table>.*?<\/table>)/s captures an entire table, including newline characters. Without the modifier nothing would match if there is even a single newline in between the tags.
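R functions take no trailing modifier letters, so the modifiers above are expressed differently. The sketch below shows one possible mapping: a function argument for i, the "g" variants of the functions for g, and PCRE inline flags such as (?s) and (?m) with perl = TRUE. The strings are made up for illustration.

```r
# /i  ->  ignore.case = TRUE
grepl("[ACGT]", c("a", "G", "x"), ignore.case = TRUE)   # TRUE TRUE FALSE

# /g  ->  gsub() or gregexpr() instead of sub() or regexpr()
sub("a", "-", "banana")    # "b-nana"
gsub("a", "-", "banana")   # "b-n-n-"

# /s  ->  inline (?s), so that "." also matches "\n"
html <- "<table>\n<tr><td>1</td></tr>\n</table>"
grepl("<table>.*?</table>", html, perl = TRUE)       # FALSE
grepl("(?s)<table>.*?</table>", html, perl = TRUE)   # TRUE

# /m  ->  inline (?m), so that ^ and $ also match at line breaks
rec <- "HEADER\nATOM"
grepl("^ATOM", rec, perl = TRUE)       # FALSE
grepl("(?m)^ATOM", rec, perl = TRUE)   # TRUE
```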

A Brief First Encounter of Regular Expressions

Regular expressions are a concise description language to define patterns for pattern-matching in strings.

Truth be told, many programmers have a love-hate relationship with regular expressions. The syntax of regular expressions is very powerful and expressive, but also terse, not always intuitive, and sometimes hard to understand. I'll introduce you to a few principles here that are quite straightforward and they will probably cover 99% of the cases you will encounter.

Here is our test-case: the sequence of Mbp1, copied from the NCBI Protein database page for yeast Mbp1.

       1 msnqiysary sgvdvyefih stgsimkrkk ddwvnathil kaanfakakr trilekevlk
      61 ethekvqggf gkyqgtwvpl niakqlaekf svydqlkplf dftqtdgsas pppapkhhha
     121 skvdrkkair sastsaimet krnnkkaeen qfqsskilgn ptaaprkrgr pvgstrgsrr
     181 klgvnlqrsq sdmgfprpai pnssisttql psirstmgpq sptlgileee rhdsrqqqpq
     241 qnnsaqfkei dledglssdv epsqqlqqvf nqntgfvpqq qssliqtqqt esmatsvsss
     301 pslptspgdf adsnpfeerf pgggtspiis miprypvtsr pqtsdindkv nkylsklvdy
     361 fisnemksnk slpqvllhpp phsapyidap idpelhtafh wacsmgnlpi aealyeagts
     421 irstnsqgqt plmrsslfhn sytrrtfpri fqllhetvfd idsqsqtvih hivkrksttp
     481 savyyldvvl skikdfspqy rielllntqd kngdtalhia skngdvvffn tlvkmgaltt
     541 isnkegltan eimnqqyeqm miqngtnqhv nssntdlnih vntnnietkn dvnsmvimsp
     601 vspsdyityp sqiatnisrn ipnvvnsmkq masiyndlhe qhdneikslq ktlksisktk
     661 iqvslktlev lkesskdeng eaqtnddfei lsrlqeqntk klrkrliryk rlikqkleyr
     721 qtvllnklie detqattnnt vekdnntler lelaqeltml qlqrknklss lvkkfednak
     781 ihkyrriire gtemnieevd ssldvilqtl iannnknkga eqiitisnan sha
//

Task:
Navigate to http://regexpal.com and paste the sequence into the lower box. This site is one of a number of online regular expression testers; their immediate, visual feedback is invaluable when you are developing regular expression patterns.

Let's try some expressions:

Most characters are matched literally: Type "a" into the upper box and you will see all "a" characters matched. Then replace a with q. Now type "aa" instead. Then krnnkk. Sequences of characters are also matched literally.

The pipe character | that symbolizes logical OR can be used to define that more than one character should match: i(s|m|q)n matches isn OR imn OR iqn. Note how we can group with parentheses, and try what would happen without them.

We can more conveniently specify more than one character to match if we place it in square brackets: [lq] matches l OR q. [familyvw] matches hydrophobic amino acids.

Within square brackets, we can specify "ranges": [1-5] matches digits from 1 to 5.

Within square brackets, we can specify characters that should NOT be matched, with the caret, ^: [^0-9] matches everything EXCEPT digits. [^a-z] matches everything that is not a lower-case letter. That's what we need (try it).
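The same experiments can be repeated in R instead of the online tester. The sketch below uses only the line numbered 541 of the Mbp1 record above; `all_matches` is a small helper defined here for illustration.

```r
# One line of the Mbp1 record, with its line number and spaces
s <- "     541 isnkegltan eimnqqyeqm miqngtnqhv nssntdlnih vntnnietkn dvnsmvimsp"

# gregexpr() finds all matches in a string; regmatches() extracts them
all_matches <- function(pattern, x) regmatches(x, gregexpr(pattern, x))[[1]]

all_matches("qq", s)          # "qq": a literal sequence of characters
all_matches("i(s|m|q)n", s)   # "isn" "imn" "iqn"
all_matches("[1-5]", s)       # "5" "4" "1"
length(all_matches("[lq]", s))     # number of l or q characters
length(all_matches("[^a-z]", s))   # number of characters that are not lower-case letters
```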

One of the R functions that uses regular expressions is the function gsub(). It replaces characters that match a "regex" with other characters. That is useful for our purpose: we can

match all characters that are NOT a letter, and
replace them by - nothing: the empty string "".

This deletes them.
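A minimal sketch of this clean-up, again using just one line of the Mbp1 record as input (the variable names are arbitrary):

```r
# One line of the record, including the line number and the spaces
s <- "     541 isnkegltan eimnqqyeqm miqngtnqhv nssntdlnih vntnnietkn dvnsmvimsp"

# Replace every character that is NOT a lower-case letter with the empty string
clean <- gsub("[^a-z]", "", s)
clean          # "isnkegltaneimnqqyeqmmiqngtnqhvnssntdlnihvntnnietkndvnsmvimsp"
nchar(clean)   # 60
```

Note that [^a-z] would also delete upper-case letters; for sequences that may be written in either case, a set such as [^A-Za-z] is the safer choice.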

Task:

Study the code in the "An excursion into regular expressions" section of the R script.

Notes

↑ Contributed by Jennifer Tsai

Self-evaluation

If in doubt, ask! If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.

About ...

Author: Boris Steipe <boris.steipe@utoronto.ca>
Created: 2017-08-05
Modified: 2017-08-05
Version: 0.1
Version history: 0.1 First stub

This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.
