Expected Preparations:
Keywords: Regular expressions
Objectives: This unit will …
Outcomes: After working through this unit you …
Deliverables:
* Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
* Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don’t overlook these.
* Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.
Evaluation: NA: This unit is not evaluated for course marks.
Using regular expressions is a key skill for all aspects of data science, and, by extension, for bioinformatics. There is a bit of a learning curve involved, but once you are even slightly comfortable with their syntax and use in code, you will appreciate their power in automating otherwise unbearably tedious tasks. This unit introduces the principles and the syntax, and provides a fair amount of practice.
A Regular Expression(W) is a specification of a pattern of characters. The typical use of a regular expression is to find occurrences of the pattern in a string. Regular expressions have a flexible syntax that allows them to handle a range of tasks - from trivial substring matches to complex nested motifs. The syntax of regular expressions is a programming language in its own right, and is a powerful way of concisely and uniquely defining a pattern.
Regular expressions are examples of deterministic pattern matching - they either match a particular expression or not. This is in contrast to probabilistic pattern matching in which a pattern is more or less similar to a query.
Truth be told, many programmers have a love-hate relationship with regular expressions. The syntax of regular expressions is very powerful and expressive, but also terse, not always intuitive, and sometimes hard to understand. Let’s try a few simple things:
Here is a string to play with: the sequence of Mbp1, copied from the NCBI Protein database page for yeast Mbp1.
1 msnqiysary sgvdvyefih stgsimkrkk ddwvnathil kaanfakakr trilekevlk
61 ethekvqggf gkyqgtwvpl niakqlaekf svydqlkplf dftqtdgsas pppapkhhha
121 skvdrkkair sastsaimet krnnkkaeen qfqsskilgn ptaaprkrgr pvgstrgsrr
181 klgvnlqrsq sdmgfprpai pnssisttql psirstmgpq sptlgileee rhdsrqqqpq
241 qnnsaqfkei dledglssdv epsqqlqqvf nqntgfvpqq qssliqtqqt esmatsvsss
301 pslptspgdf adsnpfeerf pgggtspiis miprypvtsr pqtsdindkv nkylsklvdy
361 fisnemksnk slpqvllhpp phsapyidap idpelhtafh wacsmgnlpi aealyeagts
421 irstnsqgqt plmrsslfhn sytrrtfpri fqllhetvfd idsqsqtvih hivkrksttp
481 savyyldvvl skikdfspqy rielllntqd kngdtalhia skngdvvffn tlvkmgaltt
541 isnkegltan eimnqqyeqm miqngtnqhv nssntdlnih vntnnietkn dvnsmvimsp
601 vspsdyityp sqiatnisrn ipnvvnsmkq masiyndlhe qhdneikslq ktlksisktk
661 iqvslktlev lkesskdeng eaqtnddfei lsrlqeqntk klrkrliryk rlikqkleyr
721 qtvllnklie detqattnnt vekdnntler lelaqeltml qlqrknklss lvkkfednak
781 ihkyrriire gtemnieevd ssldvilqtl iannnknkga eqiitisnan sha
//
Task…
Navigate to http://regexpal.com and paste the sequence into the lower box. This site is one of a number of online regular expression testers; their immediate, visual feedback is invaluable when you are developing regular expression patterns.
Let's try some expressions:
* Type “a” into the upper box and you will see all “a” characters matched. Then replace a with q.
* Now type “aa” instead. Then krnnkk. Sequences of characters are also matched literally.
* The pipe character | that symbolizes logical OR can be used to define that more than one character should match: i(s|m|q)n matches isn OR imn OR iqn. Note how we can group with parentheses, and try what would happen without them.
* Square brackets define a set of characters: [lq] matches l OR q. [milcwyf] matches hydrophobic amino acids. [1-5] matches digits from 1 to 5.
* Inside square brackets, a leading caret ^ negates the set. [^0-9] matches everything EXCEPT digits. [^a-z] matches everything that is not a lower-case letter. That’s what we would need to remove characters that do not represent amino acids. Note that outside of the square brackets the caret means “beginning of the string”. When you see a caret, you need to consider its context carefully.
Make frequent use of this site to develop your regular expressions step by step.
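The same experiments can be repeated in R. The following is a minimal sketch, assuming we work with just one line of the sequence record above (residues 541 to 570); the variable names are illustrative only.

```r
# One line of the sequence record above, including position number and blanks
raw <- "541 isnkegltan eimnqqyeqm miqngtnqhv"

# [^a-z] matches everything that is not a lower-case letter: remove all of it
s <- gsub("[^a-z]", "", raw)
nchar(s)

# Literal characters: count how often "q" occurs
nQ <- lengths(regmatches(s, gregexpr("q", s)))

# Alternation with grouping: isn OR imn OR iqn
hits <- regmatches(s, gregexpr("i(s|m|q)n", s))[[1]]
hits

# A character set expresses the same alternatives more compactly
hits2 <- regmatches(s, gregexpr("i[smq]n", s))[[1]]
identical(hits, hits2)
```

Developing the pattern in an online tester first and then moving it into code like this is a convenient workflow.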
According to the Chomsky hierarchy(W), regular expressions are a Type-3 (regular) grammar(W), thus their use forms a regular language(W). Therefore, like all Type-3 grammatical expressions, they can be decided by a finite-state machine(W), i.e. a “machine” that is defined by possible states, plus triggering conditions that control transitions between states. Think of such automata as (elaborate) if … else constructs. The “regex” processor translates the search pattern into such an automaton, which is then applied to the search domain - the string in which the occurrence of the pattern is to be sought.
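To make the "elaborate if … else" idea concrete, here is a hand-written sketch of a finite-state machine that decides the single pattern ^i[smq]n$. The function fsm_match() and its state names are hypothetical and written only for illustration; a real regex engine builds a far more general automaton from the pattern.

```r
# A tiny finite-state machine for the pattern "^i[smq]n$".
# States: "start" -> "seen_i" -> "seen_mid" -> "accept"; anything else fails.
fsm_match <- function(x) {
  chars <- strsplit(x, "")[[1]]
  state <- "start"
  for (ch in chars) {
    if (state == "start" && ch == "i") {
      state <- "seen_i"
    } else if (state == "seen_i" && ch %in% c("s", "m", "q")) {
      state <- "seen_mid"
    } else if (state == "seen_mid" && ch == "n") {
      state <- "accept"
    } else {
      state <- "fail"      # no valid transition: stop reading input
      break
    }
  }
  state == "accept"
}

# Compare the hand-written automaton with the regex engine
words <- c("isn", "imn", "iqn", "ian", "isnn", "is")
byHand  <- vapply(words, fsm_match, logical(1), USE.NAMES = FALSE)
byRegex <- grepl("^i[smq]n$", words)
identical(byHand, byRegex)
```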
Regular expressions support virtually all pattern matching tasks in data clean-up, extracting information items, data mining, “screen scraping”, parsing of files, subsetting large tables, etc. etc. This means, they must be part of your everyday toolkit.
Since regular expressions are Type-3 grammars, they must fail when trying to parse more complex grammars - i.e. grammars that can’t be expressed in a regular language. This means you can’t reliably parse XML - and in particular HTML - with regular expressions. There is a long discussion on this particular topic, however: e.g. see here, and many other similar threads on stackoverflow, and see here for a discussion of when regular expressions should not be used. Use a real XML parser instead.
Two dialects of regular expressions exist; they differ in some details of syntax. One is the nearly universal “Perl” dialect (Perl(W) is a programming language), the other one is the “POSIX” standard that nearly no one uses. Except R. Tragically, in R the POSIX standard is the default. Fortunately this often does not make a difference, and we can explicitly turn this nonsense off. But we need to type perl = TRUE much more often than we would like. Somebody, some time, made a wrong design decision and thousands of wasted man- and woman-hours later we are still stuck with the consequences. If you use regular expressions according to the POSIX standard, you have to learn the Perl standard anyway. But then you can just use the Perl standard in the first place. The Wikipedia page on Regular Expressions(W) has a table with a side-by-side comparison of the different ways the two standards express character classes. Also see the help page on regex in R for details.
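A small sketch of where the dialect matters: simple character sets behave the same in both, but a lookahead such as (?=2) is Perl syntax. The gene names are example data, and the assumption that the default engine rejects the lookahead is trapped with tryCatch() rather than relied upon.

```r
genes <- c("CLN1", "CLN2", "SWI4")

# Both dialects understand simple constructs such as character sets
plain_hits <- grepl("CLN[12]", genes)
identical(plain_hits, grepl("CLN[12]", genes, perl = TRUE))

# Lookahead is Perl syntax: "CLN, but only if it is followed by 2"
perl_hits <- grepl("CLN(?=2)", genes, perl = TRUE)
perl_hits

# Without perl = TRUE the default engine is not expected to accept this
# pattern; we trap a possible failure instead of stopping the script.
default_hits <- tryCatch(
  suppressWarnings(grepl("CLN(?=2)", genes)),
  error = function(e) NA
)
default_hits
```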
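Regular expressions in R can be used
* to match patterns in if() or while() conditions, or to retrieve specific instances of patterns with the regexpr() family of functions;
* to substitute matched patterns with gsub();
* to split strings along matched patterns with strsplit();
* … and more.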
Accordingly, a basic knowledge of regular expressions is needed to read and write code, especially code that parses text.
Regular expressions in R are strings, thus they are enclosed in quotation marks. "a" is a regular expression. It specifies the single, literal character a exactly.
The power of regular expressions lies in their flexible syntax that allows us to specify character ranges, classes of characters, unspecified characters, alternatives, and much more. This sometimes can be confusing, because the symbols that specify ranges, options, wildcards and the like are of course themselves characters. Characters that specify information about other characters are called metacharacters; these include “.”, “?”, “+”, “*”, “[” and “]”, “{” and “}”, and more. And the opposite is also possible: some plain characters can be turned into metacharacters to denote character classes.
The “\” - escape character - allows us to distinguish when a character is to be taken literally and when it is to be interpreted as a metacharacter. Note that some symbols have to be escaped to be read literally, while some letters have to be escaped to be read as metacharacters.
But there is a catch in R, relating to when the escape character is interpreted. Remember that “\n” is a linebreak in a string, “\t” is a tab, etc. Obviously if you write “\?” (a literal question mark in a regex), or “\+” (a literal plus sign in a regex) into a regular string, the mechanism that parses the string is going to see the escape character, then it expects an “n” or a “t” or the like - but what it gets instead is something it doesn’t know. So it throws an error.
Try:
"\n" # fine
"\?" # Error: ...
But then how can we write something like “\?” when we need it? That becomes obvious when you consider what happens with the string: it gets sent to the regex engine for interpretation. Thus the regex engine needs to see: character “\”, then character “?”. So it needs two characters. The secret is: we need to prevent “\” from attaching to the next character, and specify it as a single character in its own right. We do that by escaping “\” itself - with a backslash. Thus “\\” is a literal “\” character - and can get sent to the regex engine.
"\\?" # ok
cat("\\?") # that's what the regex engine sees.
The consequence is: you need to double the “\” to “\\” in R when you want a single “\”. That works differently from other programming languages that pass patterns to the regex engine as-is. You need to be aware of this, for example when you develop a pattern in an online regex tool, and then copy it back into your R code. You need to double all occurrences of “\” in your R string.
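A short sketch of the doubled backslash in practice. The raw-string syntax r"(...)" is an addition that is not mentioned in the text above; it assumes R 4.0.0 or later.

```r
patt <- "\\?"          # typed with two backslashes ...
nchar(patt)            # ... but stored as two characters: \ and ?
cat(patt, "\n")        # this is what the regex engine receives

# The pattern matches a literal question mark
isQuestion <- grepl(patt, c("Really?", "Really!"))
isQuestion

# Raw strings (assumption: R >= 4.0.0) take the backslash as-is,
# which helps when pasting a pattern from an online regex tool
raw_patt <- r"(\?)"
identical(raw_patt, patt)

# fixed = TRUE bypasses the regex engine: "?" is then just a character
grepl("?", c("Really?", "Really!"), fixed = TRUE)
```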
Letters whose special meaning as a metacharacter is turned on with the escape character:
Character | Means |
---|---|
w | the letter “w” |
\w | a “word” character, i.e. one of A-Z, a-z, 0-9 and “_” |
s | the letter “s” |
\s | a “space” character, i.e. one of “ ”, tab or newline |
b | the letter “b” |
\b | a word boundary |
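A minimal sketch of these three escaped letters at work in R; the example sentence is made up, and perl = TRUE is passed to be explicit about the dialect.

```r
s <- "Mbp1  binds\tMCB elements"   # two blanks and a tab as separators

# \s+ : runs of whitespace, collapsed into a single blank
collapsed <- gsub("\\s+", " ", s, perl = TRUE)
collapsed

# \w+ : runs of word characters, i.e. the individual words
words <- regmatches(s, gregexpr("\\w+", s, perl = TRUE))[[1]]
words

# \b : word boundaries make "MCB" match only as a whole word
wholeWord <- grepl("\\bMCB\\b", s, perl = TRUE)
partWord  <- grepl("\\bMC\\b",  s, perl = TRUE)
```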
Metacharacters whose special meaning is turned off with the escape character:
Character | Means |
---|---|
+ | One or more repetitions of the preceding expression |
\+ | the literal character “+” |
\ | the escape character |
\\ | the literal character “\” |
. | any single character except the newline (\n) |
\. | a literal period |
Note that these examples are not exhaustive. A (nearly) complete table of symbols and meanings is given in the appendix.
Square brackets specify when more than one specific character can match at a position.
Expression | Means |
---|---|
[acgtACGT] | Any non-degenerate nucleotide |
For example: “[AGR]AATT[CTY]” matches all occurrences of an ApoI restriction site, either specified explicitly, or through the nucleotide ambiguity codes R (purines) or Y (pyrimidines).
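As a sketch, the ApoI pattern can be applied to a DNA string with gregexpr(), which reports where each site starts. The test sequence is invented for this example and contains three sites.

```r
# Invented test sequence with three ApoI sites separated by short spacers
dna <- paste0("CC", "GAATTC", "GG", "AAATTT", "CC", "RAATTY", "AA")

patt <- "[AGR]AATT[CTY]"
m <- gregexpr(patt, dna)

sites  <- regmatches(dna, m)[[1]]   # the matching substrings
starts <- as.integer(m[[1]])        # their start positions in the string
data.frame(start = starts, site = sites)
```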
Within character sets, hyphens can specify character ranges.
Expression | Means |
---|---|
[a-z] | lowercase letters |
[0-9] | digits |
[0-9+*/=^\-] | digits and arithmetic symbols (note the escaped hyphen) |
If you want to match a literal hyphen, you must escape it or place it first or last in the set. Within character sets, some metacharacters that otherwise have special meanings usually do not need to be escaped.
The caret character “^” denotes the complement of a character set; i.e. everything that is not that expression.
Expression | Means |
---|---|
[^9] | Everything but the digit “9” |
[^ACGT] | Not a nucleotide code letter |
Note that outside of square brackets, the “^” character is an “anchoring code” and means “beginning of the string”. And when it is escaped in a pattern it is of course the literal caret: “\^”. This can be confusing.
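The three roles of the caret can be told apart in a short sketch; the input strings are made-up examples.

```r
x <- "ACGNNTRY-ACGT"

# Inside brackets, as first character: complement of the set.
# Remove everything that is not an unambiguous nucleotide.
clean <- gsub("[^ACGT]", "", x)
clean

# Outside brackets: anchor at the beginning of the string
startsACG <- grepl("^ACG", x)
startsCGN <- grepl("^CGN", x)    # "CGN" occurs, but not at the start

# Escaped: the literal caret character
hasCaret <- grepl("\\^", c("2^8", "28"))
```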
For many metacharacters that denote character classes, the metacharacter in upper case denotes the complement. This can also be confusing!
Character | Means |
---|---|
\w | a word character |
\W | not a word character |
\s | a space character |
\S | not a space character |
Special characters in regular expressions control how often a pattern must be present in order to match:
Expression | What it means | Example (meaning) |
---|---|---|
? | match zero or one times | “? (there may or may not be a quote mark) |
+ | match one or more times | [A-Z]+ (there’s at least one uppercase letter) |
* | match any number of times, including zero | .* (there may be some characters) |
{min,max} | match between min and max times (assumes 0 for min, if min is omitted; assumes infinity for max, if max is omitted; note that not every regex engine accepts an omitted min) | [atAT]{20,200} (a stretch of between 20 and 200 upper- or lowercase As or Ts) |
For example: “AAUAAA[ACGU]{10,30}$” defines a polyadenylation site - an AAUAAA motif, followed by 10 to 30 of any nucleotide before the end of the RNA.
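A sketch that tests the polyadenylation pattern against toy RNAs with tails of different length. The helper make_rna() is hypothetical and only builds test strings.

```r
patt <- "AAUAAA[ACGU]{10,30}$"

# Hypothetical helper: a short leader, the AAUAAA motif, then n nucleotides
make_rna <- function(n) paste0("GGC", "AAUAAA", strrep("C", n))

tailLengths <- c(5, 10, 30, 40)
rnas <- vapply(tailLengths, make_rna, character(1))

# Only tails of 10 to 30 nucleotides satisfy the {10,30} quantifier
# together with the end-of-string anchor
hasSite <- grepl(patt, rnas)
data.frame(tail = tailLengths, match = hasSite)
```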
If a pattern must be matched at a particular location in the string, special terms denote string anchors.
Anchoring Term | Meaning |
---|---|
^ | Start of a line or string |
$ | End of a line or string |
\A | Start of the string |
\Z | End of the string |
\G | Last global match end |
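A brief sketch of the two most common anchors, using a few PDB-style record names as made-up input.

```r
recs <- c("ATOM", "HETATM", "ATOMS")

atStart <- grepl("^AT", recs)      # "AT" at the beginning only
atEnd   <- grepl("TM$", recs)      # "TM" at the end only
exact   <- grepl("^ATOM$", recs)   # both anchors: the whole string must match

data.frame(recs, atStart, atEnd, exact)
```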
Of course defining a regular expression pattern does not yet do anything with it. Below are the most important R functions that use regular expressions. Write the small code samples that are provided below, play with variations, and test how the operators and regular expressions work.
Not all pattern searches in strings use (and need) regular expressions. Sometimes simple, exact string-matching is enough. R uses string matching in character equality (==) and, by extension, the set operation functions (union(), intersect() etc.), the match() function, and the %in% operator.
vA <- c("the", "quick", "brown", "fox", "jumps", "over", "a", "lazy", "dog")
vA[2] == "quick" # TRUE
vA[2] == "quack" # FALSE
vA == "fox" # boolean vector
# match tests for string equality
match("fox", vA) # 4, i.e. the 4th element matches the string
match("o", vA) # NA: matches have to be to the WHOLE element
# match("fox", vA) is equivalent to...
which(vA == "fox")
# %in% can be used for creating intersections
# find whether elements from one vector are
# contained in another:
vB <- c("Quacking", "the", "duck", "wings", "over", "my", "cozy", "cot")
vA %in% vB
vB %in% vA # note that the length of the return vector is the same as the
# length of the first argument. So read this as:
# "Which of my vB are also in vA"
# We can use this to subset the vector with elements that are present in
# both:
vB[vB %in% vA]
# which is, of course, the intersection set operation.
intersect(vA, vB)
The general online help page is here.
Remember: R’s default behaviour is extended POSIX. To be sure which regex dialect is used, pass the perl = TRUE parameter.
# grep() is like match(), but uses regular expressions. A variant of grep() that
# returns a boolean vector - like "==" does - is grepl(). That is useful
# because we can & or | the vector, or invert it with ! .
grep("fox", vA)
grep("o", vA) # Aha! now we get all elements that contain an "o" -
# Because we get partial matches with regular expressions.
vA[grep("o", vA)] # subset
grepl("o", vA) # logical
! grepl("o", vA) # its inverse
vA[! grepl("o", vA)] # subset all words without "o"
Consider the following regular expression:
patt <- "^\\s*#"
This matches if the string it is applied to begins with a “#”, which may or may not be preceded by whitespace. This is useful to identify comment lines in a data file so that they can be ignored.
The regular expression above is decomposed as follows:
* “^”: the beginning of the line
* “\s”: any whitespace character …
* “*”: … repeated 0 or more times
* “#”: the hash character
The following example would read a file into a vector of lines, then drop all lines that are empty, and all lines that are comments. This is a straightforward idiom.
IN <- "test.txt"
patt <- "^\\s*#"
myData <- readLines(IN)
myData <- myData[myData != ""] # drop all elements that are the empty string
myData <- myData[! grepl(patt, myData)] # drop all elements that match the pattern
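The idiom can be tried without a file on disk. This sketch assumes a small, invented vector of lines in place of the result of readLines().

```r
# Invented stand-in for the lines of a data file
myLines <- c("# a header comment",
             "   # an indented comment",
             "",
             "gene\tscore",
             "MBP1\t12  # trailing note")

patt <- "^\\s*#"
isComment <- grepl(patt, myLines)

# Keep lines that are neither empty nor comments. The last line is kept:
# its "#" is not at the beginning, so the anchored pattern does not match.
kept <- myLines[myLines != "" & ! isComment]
kept
```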
Think of “gsub” as “global substitution”, and you’ll understand that there exists another function, sub(), that replaces only the first occurrence of a pattern, rather than all of them as gsub() does. I can’t imagine what the use case for that might be and I don’t think I have ever used sub(). I get an intuitive sense that code that needs such a function should probably be reconceived. But gsub() is very useful.
(s <- " 1 MKLAACFLTL LPGFAVA... 17 ") # E-coli Alpha Amylase signal peptide
# Drop everything from this string that is not an amino acid one-letter code.
# We use gsub() to first identify all non-amino acid letters with a character
# class regular expression, then we replace each occurrence with the empty
# string.
gsub("[^ACDEFGHIKLMNPQRSTVWY]", "", s)
# or, with assignment: ...
s <- gsub("[^ACDEFGHIKLMNPQRSTVWY]", "", s)
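To see the difference between the two functions side by side, here is a minimal sketch with a made-up peptide string. The last example is merely one situation in which the two are interchangeable, because an anchored pattern can match at most once.

```r
s <- "MKLAACFLTL"

firstOnly <- sub("L", "l", s)    # replaces only the first "L"
allOf     <- gsub("L", "l", s)   # replaces every "L"
c(firstOnly, allOf)

# With an anchored pattern there is at most one match,
# so sub() and gsub() give the same result
padded <- "  MKL "
identical(sub("^\\s+", "", padded), gsub("^\\s+", "", padded))
```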
Another function that makes use of regular expressions is strsplit(). It takes a vector of strings, and returns a list, one element for each element of the vector, in which each string has been split up by separating it along a regular expression.
x <- c("a b c", "1 2")
strsplit(x, " ")
# [[1]]
# [1] "a" "b" "c"
#
# [[2]]
# [1] "1" "2"
Since even a single string returns a list, you often have to extract the element you want as a vector for further use.
corvidae <- c("crow:jackdaw:jay:magpie:raven:rook")
strsplit(corvidae, ":")
unlist(strsplit(corvidae, ":"))
strsplit(corvidae, ":")[[1]]
# Consider:
length(strsplit(corvidae, ":"))
length(unlist(strsplit(corvidae, ":")))
strsplit() is immensely useful to extract elements from strings with a relatively well defined structure.
s <- "1, 1, 2, 3, 5, 8"
strsplit(s, ", ")[[1]] # split on comma-space
s <- "~`!@#$%^&*()_-=+[{]}\\|;:',<.>/?"   # note: the backslash itself must be escaped
strsplit(s, "")[[1]] # split on empty string
s <- "chronological lifespan:\tincreased\ncold sensitivity:\tincreased\nsporulation:\tnormal"
strsplit(s, "\\t|\\n")[[1]] # split on tab or newline
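Splitting can be applied in two stages to turn such structured text into a named vector. This is a sketch under the assumption that every record has exactly the form "key:<tab>value"; the variable names are illustrative.

```r
s <- "chronological lifespan:\tincreased\ncold sensitivity:\tincreased\nsporulation:\tnormal"

# Stage 1: split into records along newlines
rows <- strsplit(s, "\\n")[[1]]

# Stage 2: split each record into key and value along colon-tab
parts <- strsplit(rows, ":\\t")

pheno <- vapply(parts, function(p) p[2], character(1))
names(pheno) <- vapply(parts, function(p) p[1], character(1))
pheno

pheno[["sporulation"]]
```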
Matches can be captured and used, e.g. in gsub().
# Capture matches by placing them in parentheses. To immediately reuse them,
# refer to them with "backreferences": \\1, \\2, \\3.
# Example 1:
# The beginning and ending three words of some text...
s <- "I know, however, that its precarious and remote villages lie within the lowlands of the Wisla River."
gsub("^((\\S+\\s+){3}).*((\\s\\S+){3})$", "\\1 ... \\3", s)
# Note: matches \\2 and \\4 are the inner parentheses, which are only there to
# group things to be found {3}-times.
# Example 2:
# A binomial species name has a genus, a species, and possibly a strain name.
# We use \\S (not whitespace) and \\s (whitespace) to tease this apart into
# three captured expressions:
s <- "Saccharomyces cerevisiae S288C"
gsub("^(\\S+)\\s(\\S+)\\s*(.*)$",
"genus: \\1; species: \\1 \\2; (strain: \\3)",
s)
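Two more small sketches of backreferences, using invented input: one reorders captured parts in the replacement, the other uses a backreference inside the pattern itself. perl = TRUE is passed to be explicit about the dialect.

```r
# Reorder "Surname, Given" into "Given Surname"
authors <- c("Darwin, Charles", "Franklin, Rosalind")
reordered <- gsub("^(\\w+), (\\w+)$", "\\2 \\1", authors, perl = TRUE)
reordered

# A backreference within the pattern: a word character followed by
# the same character again, i.e. a doubled letter
hasDouble <- grepl("(\\w)\\1", c("yeast", "cell"), perl = TRUE)
hasDouble
```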
Finding and returning matches in R is a two-step process. (1) Find matches with regexpr() (one match), gregexpr() (all matches), or regexec() (sub-expressions in parentheses). All of these return a “match object”. (2) Use the match object to extract the matching substrings from the original string.
# Extracting gene names in text.
# Let's define a valid gene name to be a substring that is bounded by
# word-boundaries, starts with an upper-case character, contains more upper-case
# characters or numbers or a hyphen or underscore, with a minimal length of 3.
# Here is a regex, and we put the part of the string that we want to recover, in
# parentheses:
patt <- "\\b([A-Z][A-Z0-9\\-_]+[A-Z0-9])\\b"
# Test: positives
grepl(patt, "MBP1")
grepl(patt, "AAT")
grepl(patt, " AI1")
grepl(patt, "ASP3-1 ")
grepl(patt, " AI5_ALPHA; ")
grepl(patt, " (TY1B-PR3) ")
# Test: negatives
grepl(patt, "G1") # Too short
grepl(patt, "G1-") # Hyphen at end
grepl(patt, "Cell") # contains lower-case
# Let's apply this to retrieve gene names in text
s <- "Transcriptional activation of CLN1, CLN2, and a putative new G1 cyclin (HCS26) by SWI4, a positive regulator of G1-specific transcription. Cell 66(5):1015-26"
(m <- regexpr(patt, s)) # found a match in position 31
regmatches(s, m) # retrieve it
(m <- gregexpr(patt, s)) # found all matches
regmatches(s, m) # retrieve them (note, this is a list)
# The function of choice however is regexec(). It returns whatever the pattern
# has defined in parentheses, the others return the entire match. The
# parentheses are quite important, because we might want to specify additional
# context for a valid match, but we might not want the context in the match
# itself. In our example we used word boundaries - \\b - for such context; but
# these are zero-length and don't actually match a character, so they don't
# contaminate the substring anyway. But in general we need to be able to
# precisely retrieve only the target substring.
(m <- regexec(patt, s)) # only the parenthesized substring
regmatches(s, m) # retrieve it
# Note that there are two elements: the first is the whole match, the second
# is the substring that is in parentheses. In our example these are the same.
# Here is an example where they are not:
s <- "Find the last word. And tell me."
(m <- regexec("\\s(\\w+)\\.", s))
regmatches(s, m) # retrieve it
# Unfortunately regexec() reports only the first match, and base R versions
# before 4.1.0 lack a corresponding gregexec() to capture multiple matches ...
patt <- "\\b([A-Z][A-Z0-9\\-_]+[A-Z0-9])\\b"
s <- "Transcriptional activation of CLN1, CLN2, and a putative new G1 cyclin (HCS26) by SWI4, a positive regulator of G1-specific transcription. Cell 66(5):1015-26"
# Solution 1 (base R): you can use multiple matches in an sapply()
# statement...
sapply(regmatches(s, gregexpr(patt, s))[[1]],
function(M){regmatches(M, regexec(patt, M))})
# Solution 2 (probably preferred): you can use
# str_match_all() from the very useful library "stringr" ...
if (! requireNamespace("stringr", quietly=TRUE)) {
install.packages("stringr")
}
# Package information:
# library(help = stringr) # basic information
# browseVignettes("stringr") # available vignettes
# data(package = "stringr") # available datasets
stringr::str_match_all(s, patt)
stringr::str_match_all(s, patt)[[1]][,2]
# [1] "CLN1" "CLN2" "HCS26" "SWI4"
# Note that str_match_all() handles the match object internally, no need for
# the two-step code.
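Recent versions of base R offer a third route. This sketch assumes R 4.1.0 or later, where gregexec() is available; it is not part of the course code above.

```r
patt <- "\\b([A-Z][A-Z0-9\\-_]+[A-Z0-9])\\b"
s <- "Transcriptional activation of CLN1, CLN2, and a putative new G1 cyclin (HCS26) by SWI4, a positive regulator of G1-specific transcription. Cell 66(5):1015-26"

# gregexec() combines gregexpr() (all matches) with regexec() (captured groups)
m <- gregexec(patt, s, perl = TRUE)

# For each input string regmatches() returns a matrix: one column per match,
# row 1 holds the whole match, row 2 the first parenthesized sub-expression
mat <- regmatches(s, m)[[1]]
geneNames <- as.vector(mat[2, ])
geneNames
```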
An interesting new alternative/complement to the base R regex libraries is the “ore” R package that uses the Oniguruma(W) libraries and supports multiple character encodings, which you need when you work with Unicode and/or CJK character sets.
if (! requireNamespace("ore", quietly = TRUE)) {
install.packages("ore")
}
# Package information:
# library(help = ore) # basic information
# browseVignettes("ore") # available vignettes
# data(package = "ore") # available datasets
S <- "The quick brown fox jumps over a lazy dog"
ore::ore.search(". .", S)
ore::ore.search(". .", S, all=TRUE)
M <- ore::ore.search(". .", S, all=TRUE)
M$nMatches
M$match[2:4]
According to the author, Jon Clayden, key advantages include that there is no need for regmatches() or similar to extract the matches themselves.
A number of modifiers can be applied as arguments to regular expression functions that may be useful. Here are the two most important ones.
# Option "ignore.case" allows case-insensitive matches. This is usually
# poor programming style, a more explicit (= better) way is to define your
# character classes appropriately.
patt <- "\\b([A-Z][A-Z0-9\\-_]+[A-Z0-9])\\b"
s <- "The MBP1 gene encodes the Mbp1 protein."
m <- gregexpr(patt, s)
regmatches(s, m)[[1]]
m <- gregexpr(patt, s, ignore.case = TRUE)
regmatches(s, m)[[1]]
# For regex functions in the stringr package, you can compile the pattern
# with the regex() function, and include the option "comments = TRUE". This
# allows you to insert whitespace and # characters into the pattern
# which will be ignored by the regex engine. Thus you can comment
# complex regular expressions inline.
myRegex <- stringr::regex("\\b # word boundary
( # begin capture
[A-Z] # one uppercase letter
[A-Z0-9\\-_]+ # one or more letters, numbers, hyphen or
# underscore
[A-Z0-9] # one letter or number.
# Note: this captured subexpression has a minimum length of 3.
) # end capture
\\b", # word boundary
comments = TRUE)
stringr::str_match_all(s, myRegex)[[1]][2]
By default, quantifiers are “greedy”, i.e. they will match the largest possible number of characters. For example:
s <- "abc123"
patt <- "(\\w+)(\\d+)" # word characters, followed by digits. This pattern ...
stringr::str_match_all(s, patt)[[1]][-1]
# ...yields "abc12" and "3" . This is because \w+ is greedy and grabs as many
# alphanumeric characters as it can before \d+ gets a chance to match. A "?"
# after a quantity specifier makes it non-greedy, therefore ...
patt <- "(\\w+?)(\\d+)" # Note the questionmark in (\\w+?)
stringr::str_match_all(s, patt)[[1]][-1]
# ... now \d+ gets a chance to match as many digits as possible
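Greedy and non-greedy matching can also be compared with base R alone. This sketch uses an invented annotation string and regexec() with perl = TRUE; the negated character set at the end is a common alternative to a non-greedy quantifier.

```r
s <- "gene=MBP1; gene=SWI4;"

# Greedy: .+ runs as far as possible, up to the LAST semicolon
greedy <- regmatches(s, regexec("gene=(.+);", s, perl = TRUE))[[1]]

# Non-greedy: .+? stops at the FIRST semicolon
lazy <- regmatches(s, regexec("gene=(.+?);", s, perl = TRUE))[[1]]

# Alternative: say explicitly what may not be part of the match
explicit <- regmatches(s, regexec("gene=([^;]+);", s, perl = TRUE))[[1]]

c(greedy = greedy[2], lazy = lazy[2], explicit = explicit[2])
```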
<?php
$string = "The quick brown fox jumps over a lazy dog";
$words = preg_split('/\s+/', $string);
print_r($words);
preg_match('/.\W./', $string, $matches);
print_r($matches);
preg_match_all('/.\W./', $string, $matches);
print_r($matches);
# indexed preg_replace, iterates over array elements
$pat = array(); #broken
$pat[0] = '/quick brown/';
$pat[1] = '/fox/';
$pat[2] = '/lazy/';
$pat[3] = '/dog/';
$rep = array();
$rep[0] = 'lazy';
$rep[1] = 'dog';
$rep[2] = 'quick brown';
$rep[3] = 'fox';
print(preg_replace($pat, $rep, $string));
print("\n");
$pat = array();
$pat[0] = '/quick brown fox/';
$pat[1] = '/lazy dog/';
$pat[2] = '/foo/';
$pat[3] = '/bar/';
$rep = array();
$rep[0] = 'foo';
$rep[1] = 'bar';
$rep[2] = 'lazy dog';
$rep[3] = 'quick brown fox';
print(preg_replace($pat, $rep, $string));
print("\n");
?>
Python regular expressions are provided through the module re. See here for documentation.
re functions in general operate on a string and return a MatchObject. The MatchObject is then further analyzed by supplied methods.
The most frequently used functions are:
* re.match(pattern, string) matches only at the beginning of the string.
* re.search(pattern, string) matches anywhere in the string.
* re.split(pattern, string) returns the split string as a list.
* re.findall(pattern, string) returns all matches in a list.
Download this .svg file to experiment.
# parse_SVG_example.py
# Read an svg file line by line and process path data
# to write commands separately to an output file, line by line.
import re

filePath = "/my/working/directory/whatever/"
myIn = filePath + "sample.svg"
myOut = filePath + "test.svg"

IN = open(myIn)
OUT = open(myOut, "w")
for line in IN:
    # raw strings (r'...') pass the backslashes unchanged to the regex engine
    path = re.search(r'\sd="(.*?)"', line)  # returns the MatchObject "path"
    if path:
        # Found. Process the result with a second regex.
        # path.group() is a method of the MatchObject
        pathData = re.findall(r'([aAcChHlLmMqQsStTvVzZ]|-?\d*\.?\d+)',
                              path.group(1))
        # Write it nicely formatted to output, one command per line
        OUT.write("d=\"")
        s = ""  # we accumulate output lines in this variable
        for token in pathData:
            if re.match('[aAcChHlLmMqQsStTvVzZ]', token):
                # it's a letter:
                OUT.write("\n " + s)  # flush s to output
                s = token + " "  # new s
            else:
                s = s + token + " "  # append to s
        OUT.write("\n " + s + "\"\n")  # flush s, close string, and add \n
    else:
        OUT.write(line)
IN.close()
OUT.close()
Javascript is probably the most convenient choice if you want to interact directly with the DOM (Document Object Model) of a Web page. Here is example code for a “bookmarklet” that rewrites the URL of a journal-article page for access from outside the UofT network.
javascript:(function(){
var url=window.location.href;
var re=/\/([\w.]+)\/(.*$)/;
var match=url.match(re);
var newURL="http://"+match[1]+".myaccess.library.utoronto.ca/"+match[2];
window.location.href=newURL;
})();
void 0
Put this into the body of an arbitrary bookmark on your browser, then click it to be redirected to our library’s free access system of a paywalled journal article.
Use in:
* grep: grep finds patterns in files. Patterns are regular expressions and can come in basic or extended flavors. In GNU grep there is no difference in available functionality between these; in implementations where there is, you switch from basic to extended syntax with the grep -E flag, which is the same as invoking egrep. Example: what daemons run on your system? ps -ax | egrep -o "/([^A-Z]+d)" | sort -u
Other uses of regular expressions in:
* find
* sed
* awk
* cut
… see the man pages.
Task…
* Open the ABC-units R project. If you have loaded it before, choose File ▹ Recent projects ▹ ABC-Units. If you have not loaded it before, follow the instructions in the RPR-Introduction unit.
* Type init() if requested.
* Open the file RPR-RegEx.R and follow the instructions.
Note: take care that you understand all of the code in the script. Evaluation in this course is cumulative and you may be asked to explain any part of code.
Expression | Meaning |
---|---|
\ | Escape character |
“|” | Alternation character. Matches either one of the specified alternatives. For example, /Asp|Glu/i matches ASP, Asp, asp, GLU, Glu or glu. |
^ | If the caret occurs at the beginning of an expression it anchors the expression at the beginning of a line or input. For example, /^AT/ does not match the ‘AT’ in “HETATM” but does match it in “ATOM”. If the caret occurs as the first character of a character set as in [^a-z] it specifies the complement of the character set. Everywhere else, it simply matches the character “^”. |
$ | Matches end of input or line. For example, /t$/ does not match the ‘t’ in “eater”, but does match it in “eat”. |
* | Matches the preceding character 0 or more times. For example, /bo*/ matches ‘boooo’ in “A ghost booooed” and ‘b’ in “A bird warbled”, but nothing in “A goat grunted”. |
+ | Matches the preceding character 1 or more times. Equivalent to {1,}. For example, /a+/ matches the ‘a’ in “candy” and all the a’s in “caaaaaaandy.” |
? | Matches the preceding character 0 or 1 time. For example, /e?le?/ matches the ‘el’ in “angel” and the ‘le’ in “angle.” |
. | (The decimal point) matches any single character except the newline character. |
(x) | Matches ‘x’ and remembers the match. For example, /(foo) bar/ matches “foo bar” and stores ‘foo’ in the special variable $1. /(more) (joy)/ matches “more joy”, then stores ‘more’ in $1 and ‘joy’ in $2. |
{n} | Where n is a positive integer. Matches exactly n occurrences of the preceding character. For example, /a{2}/ doesn’t match the ‘a’ in “candy,” but it matches all of the a’s in “caandy,” and the first two a’s in “caaandy.” |
{n,} | Where n is a positive integer. Matches at least n occurrences of the preceding character. For example, /a{2,}/ doesn’t match the ‘a’ in “candy”, but matches all of the a’s in “caandy” and in “caaaaaaandy.” |
{n,m} | Where n and m are positive integers. Matches at least n and at most m occurrences of the preceding character. For example, /a{1,3}/ matches nothing in “cndy”, the ‘a’ in “candy,” the first two a’s in “caandy,” and the first three a’s in “caaaaaaandy”. Notice that when matching “caaaaaaandy”, the match is “aaa”, even though the original string had more a’s in it. |
[xyz] | A character set. Matches any one of the enclosed characters. You can specify a range of characters by using a hyphen. For example, [bcd] is the same as [b-d]. They initially match the ‘c’ in “cysteine” and the ‘c’ in “ached”. |
[^xyz] | A negated or complemented character set. That is, it matches anything that is not enclosed in the brackets. Note that the caret has to be the first character in the bracket set. For example, [^abc] is the same as [^a-c]. They initially match ‘l’ in “alanine” and ‘y’ in “cysteine”. |
Expression | Meaning |
---|---|
[\b] | Matches a backspace. (Not to be confused with \b.) |
\b | Matches a word boundary, such as a space. (Not to be confused with [\b].) The difference is that one is within a character set and the other is not! For example, /\bn\w/ matches the ‘no’ in “noonday”; /\wy\b/ matches the ‘ly’ in “possibly yesterday.” |
\B | Matches a non-word boundary. For example, /\w\Bn/ matches ‘on’ in “noonday”, and /y\B\w/ matches ‘ye’ in “possibly yesterday.” |
\cX | Where X is a control character. Matches a control character in a string. For example, /\cM/ matches control-M in a string. |
\d | Matches a digit character. Equivalent to [0-9]. For example, /\d/ or /[0-9]/ matches ‘2’ in “B2 is the suite number.” |
\D | Matches any non-digit character. Equivalent to [^0-9]. For example, /\D/ or /[^0-9]/ matches ‘B’ in “B2 is the suite number.” |
\f | Matches a form-feed. |
\n | Matches a linefeed. |
\r | Matches a carriage return. |
\s | Matches a single white space character, including space, tab, form feed, line feed. Equivalent to [ \f\n\r\t\v]. For example, /\s\w*/ matches ‘ bar’ in “foo bar.” |
\S | Matches a single character other than white space. Equivalent to [^ \f\n\r\t\v]. For example, /\S\w*/ matches ‘foo’ in “foo bar.” |
\t | Matches a tab. |
\v | Matches a vertical tab. |
\w | Matches any alphanumeric character including the underscore. Equivalent to [A-Za-z0-9_]. For example, /\w/ matches ‘a’ in “apple,” ‘5’ in “$5.28,” and ‘3’ in “3D.” |
\W | Matches any non-word character. Equivalent to [^A-Za-z0-9_]. For example, /\W/ or /[^A-Za-z0-9_]/ matches ‘%’ in “50%.” |
Expression | Meaning |
---|---|
^ | If the caret occurs at the beginning of an expression it anchors the expression at the beginning of a line or input. For example, /^AT/ does not match the ‘AT’ in “HETATM” but does match it in “ATOM”. |
$ | Matches end of input or line. For example, /t$/ does not match the ‘t’ in “eater”, but does match it in “eat” as well as in “eat\n”. |
\b | Matches a word boundary, such as a space. (Not to be confused with [\b].) The difference is that one is within a character set and the other is not! For example, /\bn\w/ matches the ‘no’ in “noonday”; /\wy\b/ matches the ‘ly’ in “possibly yesterday.” |
\B | Matches a non-word boundary. For example, /\w\Bn/ matches ‘on’ in “noonday”, and /y\B\w/ matches ‘ye’ in “possibly yesterday.” |
\A | Matches at the start of a string. Like “^”. For example, /\AAT/ matches “AT” in “ATOM” but not in “HETATM”. |
\Z | Matches at the end of a string. Like “$”. For example, /\t\Z/ matches a tab at the end of the string but not anywhere else. |
(?: … ) | Group what’s between the parentheses, but do not capture the match. |
(?= … ) | The preceding pattern must be followed by this one in order to match. |
(?! … ) | The preceding pattern must not be followed by this one in order to match. |
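The grouping and lookahead constructs in the last three rows are Perl syntax, so in R they need perl = TRUE. Here is a minimal sketch with invented example strings.

```r
s <- "fragments of 120 bp, 35 kb and 900 bp"

# (?= ... ) positive lookahead: numbers that are followed by " bp".
# The lookahead itself is not part of the match.
bpSizes <- regmatches(s, gregexpr("\\d+(?= bp)", s, perl = TRUE))[[1]]
bpSizes

# (?! ... ) negative lookahead: whole numbers that are NOT followed by " bp"
otherSizes <- regmatches(s, gregexpr("\\b\\d+\\b(?! bp)", s, perl = TRUE))[[1]]
otherSizes

# (?: ... ) groups the alternatives without capturing them, so the only
# captured sub-expression is the residue number
res <- regmatches("Glu-42", regexec("(?:Asp|Glu)-(\\d+)", "Glu-42", perl = TRUE))[[1]]
res
```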
Expression | Meaning |
---|---|
g | Matches globally - i.e. matches all occurrences of pattern, one after the other, do not stop at the first one. |
i | Match in a case-insensitive manner. For example, /[ACGT]/i matches any specific nucleotide in upper or lower case. |
x | Ignore whitespace in the expression. |
o | Evaluate pattern only once. |
m | Treat the whole string as multiple lines. |
s | Treat the whole string as a single line, i.e. don’t treat “\n” as line separators. For example, /(<table>.*?</table>)/s captures an entire table, including newline characters. Without the modifier nothing would match if there is even a single newline in between the tags. |
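R has no /.../ delimiters to which such modifier letters could be attached. As a sketch of the nearest equivalents: gsub() plays the role of "g", the ignore.case argument that of "i", and with perl = TRUE the modifiers can be written inline as (?i) or (?s) at the start of the pattern. The input strings are invented.

```r
x <- c("ASP", "Glu", "ala")

# "i": case-insensitive matching, as function argument or inline modifier
viaArg    <- grepl("asp|glu", x, ignore.case = TRUE)
viaInline <- grepl("(?i)(asp|glu)", x, perl = TRUE)
identical(viaArg, viaInline)

# "s": let the dot match newline characters too
html <- "<table>\n<tr><td>1</td></tr>\n</table>"
withoutS <- grepl("<table>.*</table>", html, perl = TRUE)
withS    <- grepl("(?s)<table>.*</table>", html, perl = TRUE)
c(withoutS = withoutS, withS = withS)
```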
Visit the stackoverflow thread on regex and HTML parsing. What’s your opinion on the OP’s question?
If in doubt, ask! If anything about this content is not clear to you, do not proceed but ask for clarification. If you have ideas about how to make this material better, let’s hear them. We are aiming to compile a list of FAQs for all learning units, and your contributions will count towards your participation marks.
Improve this page! If you have questions or comments, please post them on the Quercus Discussion board with a subject line that includes the name of the unit.
[END]