Regular Expressions




Defining and using regular expressions.



 

Regular Expressions

A Regular Expression is a specification of a pattern in text. Regular expressions have a flexible syntax that allows them to handle a range of tasks - from trivial substring matches to complex nested motifs. The syntax of regular expressions is a programming language in its own right, and is a powerful way of concisely and uniquely defining a pattern.

Regular expressions are examples of deterministic pattern matching: a given string either matches the expression or it does not. This is in contrast to probabilistic pattern matching, in which a string is scored as more or less similar to an example.

  • History and Theory
  • What are they good for
  • When should they not be used; what are alternatives for these cases


 

Regular Expressions in Perl

Many programming languages support their own style of regular expressions. The one we are discussing here is the one that Perl uses, although most of its syntax is the same as that of Unix or PHP regular expressions. The support of regular expressions is one of Perl's main strengths. Regular expressions in Perl can be used

  • to match patterns in strings for use in if() or while() conditions, or to retrieve specific instances of patterns,
  • to substitute occurrences of patterns with strings,
  • to transliterate characters, i.e. to replace each character from one list with the corresponding character from another list, or
  • to split strings into substrings that are delimited by the occurrence of a pattern.

Accordingly, a basic knowledge of regular expressions is needed to read and write code, especially code that parses text.


 

Syntax

Regular expressions are formed of characters (letters, digits and symbols), enclosed in delimiters.

/a/

is a regular expression. The lowercase "a" is the expression, the "/" are delimiters that bound the expression. This expression specifies the single character a exactly.


Specifying symbols

The power of regular expressions lies in their flexible syntax, which allows you to specify character ranges, classes of characters, unspecified characters and much more. This can sometimes be confusing, because the symbols that specify ranges, options, wildcards and the like are of course themselves characters. Characters that specify information about other characters are called metacharacters. They include ".", "*", "[" and "]" and more. The opposite is also possible: some plain characters can be turned into metacharacters to symbolize character classes.

In Perl the "\" - Perl's escape character - distinguishes whether a character is to be taken literally or interpreted as a metacharacter. Note that some symbols have to be escaped to be read literally, while some letters have to be escaped to be read as metacharacters.

Letters whose special meaning as a metacharacter is turned on with the escape character:

Character   Means
w           the letter "w"
\w          a "word" character, i.e. one of A-Z, a-z, 0-9 and "_"
s           the letter "s"
\s          a "space" character, such as " ", tab or newline
b           the letter "b"
\b          a word boundary

Metacharacters whose special meaning is turned off with the escape character:

Character   Means
+           one or more repetitions of the preceding expression
\+          the character "+"
\           the escape character
\\          the character "\"
.           any single character except the newline (\n)
\.          a period

Note that these examples are not exhaustive. A (nearly) complete table of symbols and meanings is given in the appendix.
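
As a quick check of these rules, here is a minimal sketch that tests escaped and unescaped versions of "+" and "." against two made-up strings. The variable names and test strings are only assumptions for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $sum = "1+1";      # made-up test strings
my $ver = "1x2";

my $plus_meta    = ($sum =~ /1+1/)  ? 1 : 0;  # "+" as metacharacter: would need "11"
my $plus_literal = ($sum =~ /1\+1/) ? 1 : 0;  # "\+" matches the character "+"
my $dot_meta     = ($ver =~ /1.2/)  ? 1 : 0;  # "." matches any character, here "x"
my $dot_literal  = ($ver =~ /1\.2/) ? 1 : 0;  # "\." matches only a period

print("$plus_meta $plus_literal $dot_meta $dot_literal\n");   # prints: 0 1 1 0
```

The unescaped "+" fails on "1+1" because it asks for one or more repetitions of "1" followed by another "1", which that string does not contain.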


Character Sets

Square brackets specify when more than one specific character can match at a position.

Expression   Means
[acgtACGT]   Any non-degenerate nucleotide

For example: /[AGR]AATT[CTY]/ matches all occurrences of an ApoI restriction site, either specified explicitly, or through the nucleotide ambiguity codes R (purines) or Y (pyrimidines).
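
As a minimal sketch of how such a character-set pattern might be applied, the following script collects all ApoI sites in a sequence. The sequence in $dna is made up for illustration; it contains two explicit sites and one written with ambiguity codes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical test sequence with the sites GAATTC, AAATTT and RAATTY
my $dna = "CCGAATTCGGAAATTTCCRAATTYGG";

my @sites;
while ($dna =~ /([AGR]AATT[CTY])/g) {   # g: resume after the previous match
   push(@sites, $1);                    # store the matched site
}
print(scalar(@sites), " sites: @sites\n");   # prints: 3 sites: GAATTC AAATTT RAATTY
```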

Within character sets, hyphens can specify character ranges.

Expression      Means
[a-z]           lowercase letters
[0-9]           digits
[0-9+*\/=^-]    digits and arithmetic symbols

Within character sets, most metacharacters that otherwise have special meanings do not need to be escaped. In the example above, only "/" is escaped, because it would otherwise terminate the regular expression. Other characters that need to be escaped include "$" and "@", since Perl would otherwise try to interpolate them as variables.


The complement

As the first character inside the square brackets, the caret character "^" denotes the complement of a character set, i.e. any character that is not in the set.

Expression   Means
[^9]         Everything but the digit "9"
[^ACGT]      Not a nucleotide code letter

Note that outside of character sets, the "^" character denotes "beginning of the string". This can be confusing.

For character classes, the class in upper case denotes the complement. This can also be confusing!

Character   Means
\W          not a word character
\S          not a space character
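
The following minimal sketch shows a class, its upper-case complement and a complemented character set side by side. The input strings are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text     = "seq_1: ACGTNNAC";            # made-up input line
my $sequence = "ACGTNNAC";                   # made-up sequence

my @words    = $text =~ /\w+/g;              # runs of word characters
my @nonwords = $text =~ /\W+/g;              # runs of non-word characters
my @not_acgt = $sequence =~ /[^ACGT]/g;      # single characters outside the set

print(join("|", @words), "\n");              # prints: seq_1|ACGTNNAC
print(scalar(@nonwords), " non-word run\n"); # prints: 1 non-word run
print(scalar(@not_acgt), " other letters\n");# prints: 2 other letters
```

Note that the underscore counts as a word character, so "seq_1" is returned as a single block.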


Specifying quantity

Special characters in regular expressions control how often a pattern must be present in order to match:

Expression   What it means                                Example (meaning)
?            match zero or one times                      "? (there may or may not be a quote mark)
+            match one or more times                      [A-Z]+ (there's at least one uppercase letter)
*            match any number of times, including zero    .* (there may be some characters)
{min,max}    match between min and max times              [acgt]{20,200} (a stretch of between 20 and 200 non-ambiguous bases)

If max is omitted, as in {min,}, there is no upper limit. Write min explicitly (use 0 if needed), since older versions of Perl do not accept an omitted min.

For example: /AAUAAA[ACGU]{10,30}$/ defines a polyadenylation site - an AAUAAA motif, followed by 10 to 30 of any nucleotide before the end of the RNA.
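
To see the quantifier limits at work, here is a minimal sketch that wraps the pattern in a small function. The function name polyA_tail_length and the test sequences are assumptions made for this illustration; a pair of parentheses is added to capture the stretch after the motif.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Returns the number of nucleotides after the AAUAAA motif,
# or undef if the RNA does not match the pattern.
sub polyA_tail_length {
   my ($rna) = @_;
   if ($rna =~ /AAUAAA([ACGU]{10,30})$/) {
      return length($1);
   }
   return;
}

my @tests = (
   "GGC" . "AAUAAA" . "ACGU" x 4,    # 16 nt after the motif: matches
   "GGC" . "AAUAAA" . "ACGU",        #  4 nt after the motif: too short
   "GGC" . "AAUAAA" . "ACGU" x 10,   # 40 nt after the motif: too long
);
foreach my $rna (@tests) {
   my $len = polyA_tail_length($rna);
   print(defined($len) ? "$len nt after motif\n" : "no site\n");
}
```

Only the first test sequence matches, because the "$" anchor forces the 10 to 30 nucleotides to extend exactly to the end of the string.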


Specifying position (anchoring)

If a pattern must be matched at a particular location, special terms denote string anchors.

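Anchoring Term   Meaning
^                Start of the string (with the m modifier: start of any line)
$                End of the string or just before a final newline (with the m modifier: end of any line)
\A               Start of the string
\Z               End of the string or just before a final newline
\G               Position where the previous global (g) match ended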
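
The difference between "^" and "\A" only shows up when a string contains several lines and the m modifier (see the section on modifiers) is used. The following is a minimal sketch with a made-up three-line string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "ATOM\nHETATM\nATOM\n";      # made-up string holding three lines

my @plain = $text =~ /^ATOM/g;          # "^" matches at the start of the string only
my @multi = $text =~ /^ATOM/mg;         # with m, "^" matches at the start of each line
my @abs   = $text =~ /\AATOM/mg;        # "\A" is the start of the string, even with m

print(scalar(@plain), " ", scalar(@multi), " ", scalar(@abs), "\n");   # prints: 1 2 1
```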


Operators that use regular expressions

Of course specifying a regular expression does not yet do anything with it. Below are the most important Perl operators that use regular expressions. Type in the small Perl sample programs that are provided below and test how the operators and regular expressions work.

Matching

Matching is the default behaviour of Perl regular expressions. The matching operator is

m

and the syntax is

m/>Expression</>Modifier<
  • >Expression< is a regular expression.
  • >Modifier< is one or more characters from a list of modifiers detailed below.

Since m is the default behaviour for a regular expression in a Perl program

/>Expression</

works the same way.

There is one difference though: if the m operator is specified the default delimiter "/" can be replaced with any other character, for matching. Thus

/a/
m/a/
m:a:

are all valid regular matching operations, but

:a:

is not.
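
Alternative delimiters are most useful when the pattern itself contains "/" characters. As a minimal sketch (the path is a made-up example), the two matches below are equivalent; the second uses a pair of braces as delimiters, which Perl also accepts after m.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $path = "/usr/bin/perl";                           # made-up test string

my $escaped   = ($path =~ /^\/usr\/bin\//) ? 1 : 0;   # every "/" must be escaped
my $alternate = ($path =~ m{^/usr/bin/})   ? 1 : 0;   # no escaping needed

print("$escaped $alternate\n");                       # prints: 1 1
```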

The matching (binding) operators =~ and !~

The =~ operator makes Perl apply the regular expression on the right to the variable on the left.

$test =~ /\w/;

is TRUE if the variable $test contains at least one word character.

Its inverse is the !~ operator, for example

$line !~ m/^\s*#/;


is TRUE if the string contained in $line does not start with "#", optionally preceded by any amount of whitespace. This would be useful to ignore comment lines.

The regular expression above is decomposed as follows:

  1. m the matching operator (optional)
  2. / the opening delimiter of the regular expression
  3. ^ the beginning of the line
  4. \s any whitespace character ...
  5. * ... repeated 0 or more times
  6. # the hash character
  7. / the closing delimiter of the regular expression

The following example would process a file and store all lines that are not comments in an array:

#!/usr/bin/perl
use strict;
use warnings;

my @input;
my $index = 0;
while (my $line = <STDIN>) {     # while something is being read
   if ($line !~ m/^\s*#/) {      # if it's not a comment ...
      $input[$index] = $line;    # ... store line in array
      $index++;                  # increment index
   }
}
print(@input,"\n");              # print whole array

exit();


Substitution - s

The substitution operator s replaces the text matched by the expression in the first part with the replacement in the second part. By default only the first match in the string is replaced. Its syntax is

s/>Expression</>Replacement</>Modifier<

>Expression< is a regular expression. >Replacement< is the replacement string. >Modifier< is one or more characters from a list of modifiers detailed below.

Example (substitutes the first instance of ugly in a line with pretty):

$line =~ s/ugly/pretty/;

Try the following example:

#!/usr/bin/perl
use strict;
use warnings;

print("input>");
my $line = <STDIN>;
$line =~ s/[^0-9+*\/=^-]//g;  # substitute
print($line,"\n");

exit();


The key is the following command:

$line =~ s/[^0-9+*\/=^-]//g;

The substitution is applied to the contents of the variable $line. It is of the form

s/...//g;

which means: substitute all occurrences (note the g modifier!) of the pattern […] with nothing (because the replacement is empty). This deletes all matching characters from the string.

The expression itself is a character set. It matches any character which is not a digit (0-9), a "+" or "*" character, a "/" character (which has to be preceded with an escape, as "\/", otherwise it would be parsed as the delimiter of the expression), or an "=", "^", or "-" character. Since it is itself a negation, only the characters specified thus are not deleted.

For example the input

aa2bb^4cc,.<>=16....

is changed into the output:

2^4=16


Transliteration - tr

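The transliteration operator tr replaces each character from a list of characters with the corresponding character from a second list. The lists are not regular expressions, although ranges can be given with a hyphen; square brackets are not needed.

$line =~ tr/a-z/A-Z/;

turns all lowercase letters in $line into uppercase.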
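
Two common uses of tr with sequence data are sketched below: complementing a DNA strand and counting characters. The sequence is made up for illustration. The count relies on the fact that tr returns the number of characters it matched, and that an empty replacement list leaves the string unchanged.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $dna = "AAGCTTGC";              # made-up sequence

my $revcomp = reverse($dna);       # reverse the string ...
$revcomp =~ tr/ACGT/TGCA/;         # ... then complement each base

my $gc = ($dna =~ tr/GC//);        # count G and C without changing $dna

print("$revcomp\n");               # prints: GCAAGCTT
print("$gc\n");                    # prints: 4
```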


split()

Another operator that makes use of regular expressions is the split operator. You can split on a regular expression and thus remove unneeded characters from input, as in the following example:

#!/usr/bin/perl -w
use strict;
my $string = "A :colon:delimited: string: with:  random :spaces";
my ( @lines ) = split(/\s*:\s*/, $string);
# splits on colons surrounded by optional spaces
...


@lines now contains each entry in its own array element, without colons or whitespace.

In practice, when should you use matching, and when is split() more appropriate?

Use matching when you know what you want to keep
@words = $input =~ /\w+/g; # captures all blocks of word characters
Use split() when you know what you want to discard
@words = split( /\s+/, $input); # splits on whitespace
                                # and discards it

Consider how punctuation marks would influence the results of these examples.

The most frequent use of the split function is for processing structured input data, such as comma- or tab delimited text:

#!/usr/bin/perl
use strict;
use warnings;

while (my $line = <STDIN>) {           # read one line at a time
   chomp($line);                       # remove the trailing newline
   my @fields = split(/\t/, $line);    # tab separated values
   # ... process fields
}
exit();

Behaviour

Returning values

It is often desirable to group terms together. This is done with various forms of parentheses. By default, grouping values with parentheses captures the actual match into the special variables $1, $2, $3, etc. The groups are numbered in the order in which their opening parentheses appear, from left to right, so an enclosing group is numbered before the groups nested inside it!

Here is one example - the groupings are shown below the parentheses.

This is how it works:

( ( ) ( ( ) ) )
1-------------1
  2-2
      3-----3 
        4-4


This is how it does not work:

( ( ) ( ( ) ) )
1---1
  2-------2
      3-----3 
        4-----4

Grouping Syntax   Meaning                                                Where it occurs in the regex
()                Group what's between the brackets and remember match   Anywhere
(?: … )           Group what's between the brackets, but discard match   Anywhere
(?= … )           must follow the match                                  Usually at the end of a regex
(?! … )           must not follow the match                              Usually at the end of a regex
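
The numbering of nested groups can be checked with a minimal sketch like the following, which uses the same nesting as the first diagram above. The test string is made up for illustration. In list context a match returns the contents of $1, $2, ... as a list.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = "abcdef";   # made-up test string

# Same nesting as the first diagram: ( ( ) ( ( ) ) )
my @groups = $string =~ /((ab)((cd)ef))/;
print(join(" ", @groups), "\n");          # prints: abcdef ab cdef cd

# (?: ... ) groups without capturing, so the next group becomes $1
my ($first) = $string =~ /(?:ab)(cd)/;
print("$first\n");                        # prints: cd
```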


In terms of saved values, also note that string parts are saved to special global variables.

Variable   What it contains
$`         Part of string before match
$&         Part of string matched
$'         Part of string after match

Note the following: if these variables are not used anywhere in your code, Perl does not bother to maintain them when your program is compiled, which makes all regexes faster (the penalty matters mostly in older versions of Perl). It seems sensible to avoid them for all but the quickest and dirtiest of programming work, to use parentheses when you need to capture matches, and never to put these variables in modules.


Modifiers

After the trailing / delimiter of the regular expression, an i makes the match case insensitive (e.g. /foo/i will match FOO too). An x causes Perl to ignore whitespace in the regex (e.g. /foo s?/x will match foo and foos, but not "foo s"). This is useful when an expression is long and may span several lines - just insert linebreaks, tabs, spaces or comments as needed.

For example the following is a valid regular expression in a Perl program that parses a Fasta file into header and sequence.

#!/usr/bin/perl
use strict;
use warnings;

my $fasta ='';
while (my $line = <STDIN>) { $fasta .= $line; }

$fasta =~ /    # Begin regular expression
    (?:.*)     # discard whatever precedes next match
    \s*        # there could be whitespaces
    >(.*\n)    # match the header line and collect its contents
    \s*        # there could again be whitespaces
    ((.*\n)*)  # match everything else to the end
    /x;        # ignore whitespace in the regex

my $header = $1; 
my $sequence = $2;
$sequence =~ s/\s//g;   # remove all whitespace from sequence

print($header,"\n");
print($sequence,"\n");

exit();

Here the "x" modifier makes Perl discard both the comments and all the whitespace inside the regular expression.

Contrast this to the impenetrable expression you would have had to write otherwise!


$fasta =~ /(?:.*)\s*>(.*\n)\s*((.*\n)*)/;


The s modifier treats multi-line strings (with newline characters in them) as a single line: with it, "." also matches a newline, which it otherwise does not (e.g. /fo.o/s will match foo split over two lines). The g modifier is useful in loops, making consecutive attempts to match, starting at the place in the string where the previous match ended (e.g. while($foo =~ /o/g){$o_count++} will give an o_count of two if $foo contains "foo" because there are two o's in "foo").
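
The behaviour of both modifiers can be verified with a minimal sketch such as the one below. The test strings are made up; the pattern /fo.o/ is used so that the difference made by the s modifier becomes visible.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $two_lines = "fo\no";                              # "foo" split over two lines

my $without_s = ($two_lines =~ /fo.o/)  ? 1 : 0;      # "." does not match "\n"
my $with_s    = ($two_lines =~ /fo.o/s) ? 1 : 0;      # with s, "." matches "\n" too

my $foo = "foo";
my $o_count = 0;
while ($foo =~ /o/g) {                                # each pass resumes after the last match
   $o_count++;
}

print("$without_s $with_s $o_count\n");               # prints: 0 1 2
```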

All of the modifiers can be used together. Just type them one after another after the delimiter.


Greed

By default, quantifiers are "greedy", i.e. they will match the largest possible number of characters. For example

/(\w+)(\d+)/

against "abc123" yields "abc12" and "3" for $1 and $2 respectively.

This is because \w+ is greedy and grabs as many alphanumeric characters as it can before \d+ gets a chance to match. A ? after a quantity specifier makes it non-greedy, therefore

/(\w+?)(\d+)/

against "abc123" yields "abc" and "123" for $1 and $2 respectively.
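
Both cases can be reproduced with the following minimal sketch. The third pattern is an addition for illustration: it avoids the problem without a non-greedy quantifier, by using a class (\D, non-digits) that cannot run into the digits.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $input = "abc123";

my @greedy   = $input =~ /(\w+)(\d+)/;    # \w+ takes as much as it can
my @lazy     = $input =~ /(\w+?)(\d+)/;   # \w+? takes as little as it can
my @specific = $input =~ /(\D+)(\d+)/;    # \D+ cannot match digits at all

print("@greedy\n");      # prints: abc12 3
print("@lazy\n");        # prints: abc 123
print("@specific\n");    # prints: abc 123
```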


Regular Expressions in PHP

<?php
$string = "The quick brown fox jumps over a lazy dog";

$words = preg_split('/\s+/', $string);
print_r($words);

preg_match('/.\W./', $string, $matches);
print_r($matches);

preg_match_all('/.\W./', $string, $matches);
print_r($matches);

#indexed preg_replace, iterates over array elements
$pat = array(); #broken
$pat[0] = '/quick brown/';
$pat[1] = '/fox/';
$pat[2] = '/lazy/';
$pat[3] = '/dog/';
$rep = array();
$rep[0] = 'lazy';
$rep[1] = 'dog';
$rep[2] = 'quick brown';
$rep[3] = 'fox';
print(preg_replace($pat, $rep, $string));
print("\n");

$pat = array();
$pat[0] = '/quick brown fox/';
$pat[1] = '/lazy dog/';
$pat[2] = '/foo/';
$pat[3] = '/bar/';
$rep = array();
$rep[0] = 'foo';
$rep[1] = 'bar';
$rep[2] = 'lazy dog';
$rep[3] = 'quick brown fox';
print(preg_replace($pat, $rep, $string));
print("\n");


?>

 

Regular Expressions in R

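By default, R uses its own flavour of extended regular expressions, which differs in some details from Perl's. To get Perl-compatible behaviour it is best to pass the perl=TRUE parameter.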

# R regular expression examples

string <- "The quick brown fox jumps over a lazy dog";
grep("quick", string, perl=TRUE);
grep("quick", string, perl=TRUE, value=TRUE);

s1 <- strsplit(string, split="\\s"); #list
s1
s2 <- as.matrix(unlist(s1))
s2[2:4]

grep("[a-c]", s2, perl=TRUE);             # indices of words containing a, b or c
grep("[a-c]", s2, perl=TRUE, value=TRUE); # the words themselves
grepl("[a-c]", s2, perl=TRUE); # logical vector

#get all word characters adjacent to "o"
regexpr("\\w{0,1}o\\w{0,1}", s2, perl=TRUE); # positions of matches
M <- regexpr("\\w{0,1}o\\w{0,1}", s2, perl=TRUE); # assign the result object
regmatches(s2, M) # use regmatches to process (vector)
regmatches(s2, M)[2]

 

Regular Expressions in POSIX (Unix, the shell)

Use in:

  • grep
  • egrep
  • find
  • sed
  • awk
  • cut


[TBC]


 

Exercises


Counting lines

Write a unix command that returns the number of atoms in a PDB file.

Sample data below ...

HEADER   TEST                                                 0TST      0TST   1
REMARK   ATOM   AND HETATM RECORDS FOR COUNTING                         0TST   2
ATOM      1  N   GLY     1      -6.253  75.745  53.559  1.00 36.34      0TST   3
ATOM      2  CA  GLY     1      -5.789  75.223  52.264  1.00 44.94      0TST   4
ATOM      3  C   GLY     1      -5.592  73.702  52.294  1.00 32.28      0TST   5
ATOM      4  O   GLY     1      -5.140  73.148  53.304  1.00 19.32      0TST   6
TER       5      GLY     1                                              0TST   7
HETATM    6  O   HOH     1      -4.169  60.050  40.145  1.00  3.00      0TST   8
HETATM    7 CA   CA      1      -1.258  -71.579  50.253  1.00  3.00      0TST   9
END                                                                     0TST  10


Hint: grep "ATOM " OR "HETATM" records at the beginning of a line, then pipe the output through wc.

the unix solution
egrep "^ATOM  |^HETATM" test.pdb | wc -l


a Perl solution
#!/usr/bin/perl
use warnings;
use strict;

my $numberOfAtoms = 0;

while (my $line = <STDIN>) {         # read in from STDIN
   
   if ($line =~ /^ATOM  |^HETATM/) { # match on "ATOM  " or
      $numberOfAtoms++;              # "HETATM" at the beginning 
   }                                 # of a line
}
print("Number of atoms in input file: ", $numberOfAtoms, "\n");

exit();

...CA atoms only

Write a unix command that returns the number of C-alpha atoms in a PDB file. Work only with regular expressions. Don't get fooled by calcium atoms!

TBC


TBC


eMail addresses

Write a program in a language of your choice that reads a file from STDIN and prints any e-mail address this file might contain!


What is a valid eMail address ... ?

The protocols that govern the Internet are maintained by the IETF (www.ietf.org). They are developed as so-called RFCs (Requests For Comment) and are an impressive example of voluntary, self-organized technical administration that works. E-mail address formats are specified in RFC2822. The short of section 3.4.1 is the following:

A valid e-mail address (this is slightly simplified from the RFC) consists of
local-part "@" domain
where "local-part" is either
1. a string containing the following characters
any Letter
any Digit
any of !#$%&'*+-/=?^_`{|}~
Or conversely any printable character except ()<>@,;:\".[]
... elements of which can be separated by a period, which must not occur as the first or last element ...


2. or any quoted string (i.e. one enclosed in double-quotes).


"@" is the character @.
"domain" is a valid organizational domain i.e. a string:
  • with at least two elements,
  • containing only letters, digits or hyphens,
  • separated by periods,
  • where the last element is a TLD (Top Level Domain) - assumed here to be either 2 or 3 characters long, although longer TLDs (e.g. .info) also exist,
  • where the domain(s) preceding the TLD are not longer than 63 characters.


Something at a word boundary, followed by "@", followed by something, bounded by whitespace. Group this appropriately. Then return $1, $2, $3.


The code below implements all of the RFC2822 rules, except it does not check that the length of the subdomain does not exceed 63 characters.

#!/usr/bin/perl
use warnings;
use strict;
# Define valid character sets
my $LocalChars = 'a-zA-Z0-9!#$%&*+/=?^_`{|}~\'-';   # "-" comes last so it is literal and does not form a range
my $DomainChars = 'a-zA-Z0-9-';

while (my $line = <STDIN>) {

   # Do a *global* match for e-mail addresses, the inner while loop repeats as long as
   # matches can be found. Omitting the modifier "g" at the end would report only the
   # first match.
   # Elements are parsed in several alternative groupings - only the outer ones are
   # stored, the others are discarded with (?: ...)
   while ($line =~ /                  # do while a match can be found
      (                            # open first grouping
         "[^"]+" |                  # quoted string, (quotes enclosing non-quotes) or ...
         \b                         # ... word boundary, followed by
         (?:[$LocalChars]+)         # at least one group of at least one character and ...
         (?:\.[$LocalChars]+)*      # ... 0 or more additional groups, separated by "."
      )
      @                            # The "@"
      (                            # open second grouping
         (?:[$DomainChars]+)        # at least one subdomain
         (?:\.[$DomainChars]+)*     # 0 or more repetitions
         (?:\.[$DomainChars]{2,3})  # Top Level Domain !
      )
      \s+                          # separated by whitespace
      /gx) {                       # do globally, ignore whitespace in expression
   print($1, "@", $2, "\n");
   }  # while - parse
}  # while - read <STDIN>

exit();

Here is some sample text to test the code[1]:

Hi,
blah blah blah hello joy joy giggle
g2g - alice@wonderland.org cheshire.d'cat@disappear.net
moose nibble on bark@lichens.com
Three valid addresses above. this.one@breaks.
to soooon "within the domain" and this.one@is.an.invalid+domain.com
Domains can h@ve.hy-phens.org but not under@scor_es.dunce.net
quoted strings can contain characters that are normally
disallowed - like this convincing sample: "Yo, :-) so kewl"@hotmail.com
invalid@.this.is , young padawan.
sh@rt.one is good but sh@rt.1 is bad
a.a.a.a.a.a.b.c@com.tw works, as does
user@mailbox.department.faculty.university.ac.uk but
a.@a@b@blah.tv is not valid RFC2822, please pick out the valid part
too.looooong@top.level.domain
oK@top.level.dom.ain
thats@it.end

And this is the output the program produces on the sample text:

alice@wonderland.org
cheshire.d'cat@disappear.net
bark@lichens.com
h@ve.hy-phens.org
"Yo, :-) so kewl"@hotmail.com
sh@rt.one
a.a.a.a.a.a.b.c@com.tw
user@mailbox.department.faculty.university.ac.uk
b@blah.tv
oK@top.level.dom.ain
thats@it.end
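
The program above does not check the 63-character limit for the parts of the domain. One way this check might be added is sketched below: a small helper that could be applied to the captured domain ($2) before printing. The function name labels_ok and the test domains are assumptions made for this illustration; for simplicity the helper checks every element of the domain, including the TLD.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Returns 1 if no element of the domain is longer than 63 characters,
# 0 otherwise. A run of 64 characters without a period means that
# one element is too long.
sub labels_ok {
   my ($domain) = @_;
   return ($domain !~ /[^.]{64}/) ? 1 : 0;
}

print(labels_ok("wonderland.org"), "\n");           # prints: 1
print(labels_ok(("a" x 63) . ".org"), "\n");        # prints: 1
print(labels_ok(("a" x 64) . ".org"), "\n");        # prints: 0
```

In the program above the print statement could then be made conditional, e.g. by printing only if labels_ok($2) is true.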




Multiple sequence alignment

Write a program in a language of your choice that parses a CLUSTAL or MSF formatted multiple sequence alignment into a two-dimensional array where rows are sequences and columns are aligned positions.

Sample input data below ...

CLUSTAL formatted alignment
CLUSTAL multiple sequence alignment by MUSCLE (3.8)


SOK2_SACCE      --NGISVVRRADNDMVNGTKLLN-----VTKMTRGRRDGILKAEKIR----------HVV
PHD1_SACCE      --NGISVVRRADNNMINGTKLLN-----VTKMTRGRRDGILRSEKVR----------EVV
KILA_ESCCO      -IDGEIIHLRAKDGYINATSMCRTAGKLLSDYTRLKTTQEFFDELSRDMGIPISELIQSF
MBP1_SACCE      IHSTGSIMKRKKDDWVNATHILK-----AANFAKAKRTRILEKEVLKETH-------EKV
SWI4_SACCE      ---TKIVMRRTKDDWINITQVFK-----IAQFSKTKRTKILEKESNDMQH-------EKV
                      :  * .:. :* * : .      :. :. .    :  *               .

SOK2_SACCE      KIGSMHLKGVWIPFERALAIAQREKI-
PHD1_SACCE      KIGSMHLKGVWIPFERAYILAQREQI-
KILA_ESCCO      KGGRPENQGTWVHPDIAINLAQ-----
MBP1_SACCE      QGGFGKYQGTWVPLNIAKQLAEKFSVY
SWI4_SACCE      QGGYGRFQGTWIPLDSAKFLVNKYEI-



MSF formatted alignment
PileUp

  MSF: 87  Type: A  Check: 0000  ..

 Name: SOK2_SACCE  Len: 87  Check:  9836  Weight: 0.160458
 Name: PHD1_SACCE  Len: 87  Check:  2117  Weight: 0.160458
 Name: KILA_ESCCO  Len: 87  Check:  6044  Weight: 0.256296
 Name: MBP1_SACCE  Len: 87  Check:  4979  Weight: 0.211395
 Name: SWI4_SACCE  Len: 87  Check:  5197  Weight: 0.211395

//

SOK2_SACCE    ..NGISVVRR ADNDMVNGTK LLN.....VT KMTRGRRDGI LKAEKIR...
PHD1_SACCE    ..NGISVVRR ADNNMINGTK LLN.....VT KMTRGRRDGI LRSEKVR...
KILA_ESCCO    .IDGEIIHLR AKDGYINATS MCRTAGKLLS DYTRLKTTQE FFDELSRDMG
MBP1_SACCE    IHSTGSIMKR KKDDWVNATH ILK.....AA NFAKAKRTRI LEKEVLKETH
SWI4_SACCE    ...TKIVMRR TKDDWINITQ VFK.....IA QFSKTKRTKI LEKESNDMQH

SOK2_SACCE    .......HVV KIGSMHLKGV WIPFERALAI AQREKI.
PHD1_SACCE    .......EVV KIGSMHLKGV WIPFERAYIL AQREQI.
KILA_ESCCO    IPISELIQSF KGGRPENQGT WVHPDIAINL AQ.....
MBP1_SACCE    .......EKV QGGFGKYQGT WVPLNIAKQL AEKFSVY
SWI4_SACCE    .......EKV QGGYGRFQGT WIPLDSAKFL VNKYEI.



TBC


TBC


Screenscraping

Write a PHP program that screenscrapes resolution data for a protein structure file from the PDB.

TBC


TBC



Labeling

Write an R script that creates meaningful labels for data elements from metadata and shows them in a plot.

TBC


TBC







 

Appendix I: Metacharacters and their meaning

Expression   Meaning
\            Escape character
|            Alternation character. Matches either one of specified alternatives. For example, /Asp|Glu/i matches ASP, Asp, asp, GLU, Glu or glu.
^            If the caret occurs at the beginning of an expression it anchors the expression at the beginning of a line or input. For example, /^AT/ does not match the 'AT' in "HETATM" but does match it in "ATOM". If the caret occurs as the first character of a character set as in [^a-z] it specifies the complement of the character set. Elsewhere within a character set, it simply matches the character "^".
$            Matches end of input or line. For example, /t$/ does not match the 't' in "eater", but does match it in "eat".
*            Matches the preceding character 0 or more times. For example, /bo*/ matches 'boooo' in "A ghost booooed" and 'b' in "A bird warbled", but nothing in "A goat grunted".
+            Matches the preceding character 1 or more times. Equivalent to {1,} . For example, /a+/ matches the 'a' in "candy" and all the a's in "caaaaaaandy."
?            Matches the preceding character 0 or 1 time. For example, /e?le?/ matches the 'el' in "angel" and the 'le' in "angle."
.            (The decimal point) matches any single character except the newline character.
(x)          Matches 'x' and remembers the match. For example, /(foo) bar/ matches "foo bar" and stores 'foo' in the special variable $1. /(more) (joy)/ matches "more joy", then stores 'more' in $1 and 'joy' in $2.
{n}          Where n is a positive integer. Matches exactly n occurrences of the preceding character. For example, /a{2}/ doesn't match the 'a' in "candy," but it matches all of the a's in "caandy," and the first two a's in "caaandy."
{n,}         Where n is a positive integer. Matches at least n occurrences of the preceding character. For example, /a{2,}/ doesn't match the 'a' in "candy", but matches all of the a's in "caandy" and in "caaaaaaandy."
{n,m}        Where n and m are positive integers. Matches at least n and at most m occurrences of the preceding character. For example, /a{1,3}/ matches nothing in "cndy", the 'a' in "candy," the first two a's in "caandy," and the first three a's in "caaaaaaandy". Notice that when matching "caaaaaaandy", the match is "aaa", even though the original string had more a's in it.
[xyz]        A character set. Matches any one of the enclosed characters. You can specify a range of characters by using a hyphen. For example, [bcd] is the same as [b-d] . They initially match the 'c' in "cysteine" and the 'c' in "ached".
[^xyz]       A negated or complemented character set. That is, it matches anything that is not enclosed in the brackets. Note that the caret has to be the first character in the bracket set. For example, [^abc] is the same as [^a-c] . They initially match 'l' in "alanine" and 'y' in "cysteine".


Appendix II: Character classes and their meaning

Expression   Meaning
[\b]         Matches a backspace. (Not to be confused with \b .)
\b           Matches a word boundary, i.e. the position between a word character and a non-word character such as a space. (Not to be confused with [\b].) The difference is that one is within a character set and the other is not! For example, /\bn\w/ matches the 'no' in "noonday"; /\wy\b/ matches the 'ly' in "possibly yesterday."
\B           Matches a non-word boundary. For example, /\w\Bn/ matches 'on' in "noonday", and /y\B\w/ matches 'ye' in "possibly yesterday."
\cX          Where X is a letter. Matches the corresponding control character in a string. For example, /\cM/ matches control-M in a string.
\d           Matches a digit character. Equivalent to [0-9] . For example, /\d/ or /[0-9]/ matches '2' in "B2 is the suite number."
\D           Matches any non-digit character. Equivalent to [^0-9] . For example, /\D/ or /[^0-9]/ matches 'B' in "B2 is the suite number."
\f           Matches a form-feed.
\n           Matches a linefeed.
\r           Matches a carriage return.
\s           Matches a single white space character, including space, tab, form feed, line feed and carriage return. Equivalent to [ \f\n\r\t] (recent versions of Perl also include the vertical tab). For example, /\s\w*/ matches ' bar' in "foo bar."
\S           Matches a single character other than white space. Equivalent to [^ \f\n\r\t] . For example, /\S\w*/ matches 'foo' in "foo bar."
\t           Matches a tab.
\v           Matches a vertical whitespace character, e.g. a linefeed, vertical tab, form feed or carriage return.
\w           Matches any alphanumeric character including the underscore. Equivalent to [A-Za-z0-9_] . For example, /\w/ matches 'a' in "apple," '5' in "$5.28," and '3' in "3D."
\W           Matches any non-word character. Equivalent to [^A-Za-z0-9_] . For example, /\W/ or /[^A-Za-z0-9_]/ matches '%' in "50%."


Appendix III: Anchor codes and their meaning

Expression   Meaning
^            If the caret occurs at the beginning of an expression it anchors the expression at the beginning of a line or input. For example, /^AT/ does not match the 'AT' in "HETATM" but does match it in "ATOM".
$            Matches end of input or line. For example, /t$/ does not match the 't' in "eater", but does match it in "eat" as well as in "eat\n".
\b           Matches a word boundary, i.e. the position between a word character and a non-word character such as a space. (Not to be confused with [\b].) The difference is that one is within a character set and the other is not! For example, /\bn\w/ matches the 'no' in "noonday"; /\wy\b/ matches the 'ly' in "possibly yesterday."
\B           Matches a non-word boundary. For example, /\w\Bn/ matches 'on' in "noonday", and /y\B\w/ matches 'ye' in "possibly yesterday."
\A           Matches at the start of a string. Like "^", but not affected by the m modifier. For example, /\AAT/ matches "AT" in "ATOM " but not in "HETATM".
\Z           Matches at the end of a string. Like "$", but not affected by the m modifier. For example, /\t\Z/ matches a tab at the end of the string but not anywhere else.
(?: … )      Group what's between the brackets, but discard match.
(?= … )      The preceding pattern must be followed by this one in order to match.
(?! … )      The preceding pattern must not be followed by this one in order to match.


Appendix IV: Modifiers and their meaning

Expression   Meaning
g            Matches globally - i.e. matches all occurrences of the pattern, one after the other, and does not stop at the first one.
i            Match in a case-insensitive manner. For example, /[ACGT]/i matches any specific nucleotide in upper or lower case.
x            Ignore whitespace and comments in the expression.
o            Evaluate (interpolate and compile) the pattern only once.
m            Treat the whole string as multiple lines, i.e. let "^" and "$" match at the start and end of every line.
s            Treat the whole string as a single line, i.e. let "." match "\n" as well. For example, m{(<table>.*</table>)}s captures everything between the two tags, including newline characters (the m{} delimiters avoid having to escape the "/" in the closing tag). Without the modifier nothing would match if there is even a single newline in between the tags.


 

Notes

  1. Contributed by Jennifer Tsai


 

Further reading and resources