Revision as of 19:32, 26 January 2018
Regular Expressions (regex) with R
(Regular expressions)
Abstract:
Using regular expressions is a key skill for all aspects of data science, and, by extension, for bioinformatics. There is a bit of a learning curve involved, but once you are even slightly comfortable with their syntax and use in code, you will appreciate their power in automating otherwise unbearably tedious tasks. This unit introduces the principles and the syntax, and provides a fair amount of practice.
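Objectives:
This unit will ...
- ... introduce regular expressions;
- ... demonstrate their use in R functions;
- ... teach how to apply them in common tasks.
Outcomes:
After working through this unit you ...
- ... can express pattern-matching tasks as regular expressions and correctly use a variety of functions that use them;
- ... are familiar with online regex testing sites that help you troubleshoot your expressions during development;
- ... have written code that uses regular expressions for a variety of purposes.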
Deliverables:
- Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
- Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.
- Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.
Prerequisites:
This unit builds on material covered in the following prerequisite units:
- RPR-Introduction (Introduction to R)
Contents
- 1 Contents
- 2 First steps
- 3 Regular Expressions in R
- 4 Syntax
- 5 Behaviour
- 6 Regular Expressions in other languages
- 7 Practice
- 8 Appendix I: Metacharacters and their meaning
- 9 Appendix II: Character classes and their meaning
- 10 Appendix III: Anchor codes and their meaning
- 11 Appendix IV: Modifiers and their meaning
- 12 Self-evaluation
- 13 Notes
- 14 Further reading, links and resources
Contents
First steps
A Regular Expression is a specification of a pattern of characters. The typical use of a regular expression is to find occurrences of the pattern in a string. Regular expressions have a flexible syntax that allows them to handle a range of tasks - from trivial substring matches to complex nested motifs. The syntax of regular expressions is a programming language in its own right, and is a powerful way of concisely and uniquely defining a pattern.
Regular expressions are examples of deterministic pattern matching - they either match a particular expression or not. This is in contrast to probabilistic pattern matching in which a pattern is more or less similar to a query.
Truth be told, many programmers have a love-hate relationship with regular expressions. The syntax of regular expressions is very powerful and expressive, but also terse, not always intuitive, and sometimes hard to understand. Let's try a few simple things:
Here is a string to play with: the sequence of Mbp1, copied from the NCBI Protein database page for yeast Mbp1.
  1 msnqiysary sgvdvyefih stgsimkrkk ddwvnathil kaanfakakr trilekevlk
 61 ethekvqggf gkyqgtwvpl niakqlaekf svydqlkplf dftqtdgsas pppapkhhha
121 skvdrkkair sastsaimet krnnkkaeen qfqsskilgn ptaaprkrgr pvgstrgsrr
181 klgvnlqrsq sdmgfprpai pnssisttql psirstmgpq sptlgileee rhdsrqqqpq
241 qnnsaqfkei dledglssdv epsqqlqqvf nqntgfvpqq qssliqtqqt esmatsvsss
301 pslptspgdf adsnpfeerf pgggtspiis miprypvtsr pqtsdindkv nkylsklvdy
361 fisnemksnk slpqvllhpp phsapyidap idpelhtafh wacsmgnlpi aealyeagts
421 irstnsqgqt plmrsslfhn sytrrtfpri fqllhetvfd idsqsqtvih hivkrksttp
481 savyyldvvl skikdfspqy rielllntqd kngdtalhia skngdvvffn tlvkmgaltt
541 isnkegltan eimnqqyeqm miqngtnqhv nssntdlnih vntnnietkn dvnsmvimsp
601 vspsdyityp sqiatnisrn ipnvvnsmkq masiyndlhe qhdneikslq ktlksisktk
661 iqvslktlev lkesskdeng eaqtnddfei lsrlqeqntk klrkrliryk rlikqkleyr
721 qtvllnklie detqattnnt vekdnntler lelaqeltml qlqrknklss lvkkfednak
781 ihkyrriire gtemnieevd ssldvilqtl iannnknkga eqiitisnan sha
//
Task:
Navigate to http://regexpal.com and paste the sequence into the lower box. This site is one of a number of online regular expression testers; their immediate, visual feedback is invaluable when you are developing regular expression patterns.
Let's try some expressions:
- Most characters are matched literally.
  - Type "a" into the upper box and you will see all "a" characters matched. Then replace a with q.
  - Now type "aa" instead. Then krnnkk. Sequences of characters are also matched literally.
- The pipe character | that symbolizes logical OR can be used to define that more than one character should match.
  - i(s|m|q)n matches isn OR imn OR iqn. Note how we can group with parentheses, and try what would happen without them.
- We can more conveniently specify more than one character to match if we place it in square brackets. This is a "character class". We will encounter those frequently.
  - [lq] matches l OR q.
  - [milcwyf] matches hydrophobic amino acids.
- Within square brackets, we can specify "ranges".
  - [1-5] matches digits from 1 to 5.
- Within square brackets, we can specify characters that should NOT be matched, with the "caret", ^.
  - [^0-9] matches everything EXCEPT digits.
  - [^a-z] matches everything that is not a lower-case letter. That's what we would need to remove characters that do not represent amino acids.
Make frequent use of this site to develop your regular expressions step by step.
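The same experiments can be repeated in R itself. The following is a minimal sketch that uses only one line of the Mbp1 record (the line starting at position 541); the variable name mbp1 and the helper function allMatches() are illustrative and not part of this unit's code. gregexpr() and regmatches() are base R functions that are discussed further below.

```r
# One line of the Mbp1 record, with position number and blanks as they
# appear in the database entry (variable name is illustrative).
mbp1 <- "541 isnkegltan eimnqqyeqm miqngtnqhv nssntdlnih vntnnietkn dvnsmvimsp"

# Illustrative helper: return all substrings of s that match patt
allMatches <- function(patt, s) {
  regmatches(s, gregexpr(patt, s))[[1]]
}

allMatches("qq", mbp1)          # a literal sequence of characters
allMatches("i(s|m|q)n", mbp1)   # alternatives, grouped with parentheses
allMatches("[lq]", mbp1)        # character class
allMatches("[1-5]", mbp1)       # range
gsub("[^a-z]", "", mbp1)        # drop everything that is not a lower-case letter
```

The last expression removes the position number and the blanks, leaving only the 60 amino acid letters of this line.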
Theory
According to the Chomsky hierarchy, regular expressions are a Type-3 (regular) grammar, thus their use forms a regular language. Therefore, like all Type-3 grammatical expressions, they can be decided by a finite-state machine, i.e. a "machine" that is defined by possible states, plus triggering conditions that control transitions between states. Think of such automata as (elaborate) if ... else constructs. The "regex" processor translates the search pattern into such an automaton, which is then applied to the search domain - the string in which the occurrence of the pattern is to be sought.
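To make the idea of a finite-state machine concrete, here is a hand-written automaton that accepts the same strings as the regular expression "^ab*c$". This is only a sketch for illustration: the function name matchesABC() and the state names are made up, and it is not meant to show how R's regex engine is implemented internally.

```r
# A hand-coded finite-state machine for the pattern "^ab*c$".
# State names and the function name are illustrative only.
matchesABC <- function(s) {
  state <- "start"
  for (ch in strsplit(s, "")[[1]]) {
    if (state == "start" && ch == "a") {
      state <- "seenA"             # the first character must be "a"
    } else if (state == "seenA" && ch == "b") {
      state <- "seenA"             # any number of "b" keeps us in this state
    } else if (state == "seenA" && ch == "c") {
      state <- "accept"            # "c" completes the pattern
    } else {
      return(FALSE)                # no valid transition: reject
    }
  }
  state == "accept"                # accept only if we end in the final state
}

matchesABC("abbbc")
matchesABC("ac")
matchesABC("abca")
grepl("^ab*c$", c("abbbc", "ac", "abca"))   # compare with the regex engine
```

Each if ... else branch corresponds to one allowed transition; any character for which no transition is defined rejects the string.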
What are they good for
Regular expressions support virtually all pattern matching tasks in data clean-up, extracting information items, data mining, "screen scraping", parsing of files, subsetting large tables, etc. etc. This means, they must be part of your everyday toolkit.
When should they not be used
Since regular expressions are Type-3 grammars, they must fail when trying to parse more complex grammars - i.e. grammars that can't be expressed in a regular language. This means you can't reliably parse XML - and in particular HTML - with regular expressions. There is a long discussion on this particular topic however, e.g. see here, and many other similar threads on stackoverflow, and see here for a discussion of when regular expressions should not be used. Use a real XML parser instead.
Perl and POSIX
Two dialects of regular expressions exist; they differ in some details of syntax. One is the nearly universal "Perl" dialect (Perl is a programming language), the other one is the "POSIX" standard that nearly no one uses. Except R. Tragically, in R the POSIX standard is the default. Fortunately this often does not make a difference, and we can explicitly turn this nonsense off. But just like the utterly annoying stringsAsFactors = FALSE, we need to type perl = TRUE much more often than we would like. Somebody, some time, made a wrong design decision and thousands of wasted man- and woman-hours later we are still stuck with the consequences. If you use regular expressions according to the POSIX standard, you have to learn the Perl standard anyway. But then you can just use the Perl standard in the first place. The Wikipedia page on Regular Expressions has a table with a side-by-side comparison of the different ways the two standards express character classes. Also see the help page on regex in R for details.
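A short sketch of how the two dialects compare in practice. The input vector is invented for illustration. For simple character classes both engines give the same answer; lookahead is used here as one example of a feature that, as far as the R documentation describes it, is available only with perl = TRUE.

```r
x <- c("gene1", "geneA", "42")      # invented example strings

# Character classes: POSIX bracket notation vs. Perl-style shorthand
grepl("[[:digit:]]", x)             # POSIX class, default engine
grepl("\\d", x, perl = TRUE)        # Perl shorthand, Perl-compatible engine

# Lookahead: match "gene" only if a digit follows
grepl("gene(?=\\d)", x, perl = TRUE)

# The same pattern is expected to be rejected by the default engine.
# tryCatch() returns the error message instead of stopping the script.
tryCatch(grepl("gene(?=\\d)", x),
         error = function(e) conditionMessage(e))
```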
Regular Expressions in R
Regular expressions in R can be used
- to match patterns in strings for use in if() or while() conditions, or to retrieve specific instances of patterns with the regexpr() family of functions;
- to substitute occurrences of patterns in strings with other strings with gsub();
- to split strings into substrings that are delimited by the occurrence of a pattern with strsplit();
...and more.
Accordingly, a basic knowledge of regular expressions is needed to read and write code, especially code that parses text.
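As a quick preview, the three uses listed above might look like this. The input string is invented for illustration; all functions are base R and are explained in more detail in the sections that follow.

```r
s <- "MBP1_YEAST; SWI4_YEAST; SWI6_YEAST"   # invented input string

# 1. Match: use a pattern in a condition ...
if (grepl("SWI[0-9]", s)) {
  message("found a SWI gene")
}
# ... or retrieve the first matching substring
regmatches(s, regexpr("SWI[0-9]", s))

# 2. Substitute: remove the "_YEAST" suffix everywhere
gsub("_YEAST", "", s)

# 3. Split: break the string into identifiers at each semicolon
strsplit(s, ";\\s*")[[1]]
```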
Syntax
Regular expressions in R are strings, thus they are enclosed in quotation marks. "a" is a regular expression. It specifies the single, literal character a exactly.
Specifying symbols
The power of regular expressions lies in their flexible syntax that allows you to specify character ranges, classes of characters, unspecified characters, alternatives, and much more. This sometimes can be confusing, because the symbols that specify ranges, options, wildcards and the like are of course themselves characters. Characters that specify information about other characters are called metacharacters; these include ".", "?", "+", "*", "[" and "]", "{" and "}" and more. And the opposite is also possible: some plain characters can be turned into metacharacters to denote character classes.
The "\" - escape character - allows you to distinguish when a character is to be taken literally and when it is to be interpreted as a metacharacter. Note that some symbols have to be escaped to be read literally, while some letters have to be escaped to be read as metacharacters.
But there is a catch in R, relating to when the escape character is interpreted. Remember that "\n" is a linebreak in a string, "\t" is a tab, etc. Obviously if you write "\?" (a literal question mark in a regex), or "\+" (a literal plus sign in a regex) into a regular string, the mechanism that parses the string is going to see the escape character, then it expects an "n" or a "t" or the like - but what it gets instead is something it doesn't know. So it throws an error. Try:
"\n" # fine
"\?" # Error: ...
But then how can we write something like "\?" when we need it? That becomes obvious when you consider what happens with the string: it gets sent to the regex engine for interpretation. Thus the regex engine needs to see: character "\", then character "?". So it needs two characters. The secret is: we need to prevent "\" from attaching to the next character, and specify it as a single character in its own right. We do that by escaping "\" itself - with a backslash. Thus "\\" is a literal "\" character - and can get sent to the regex engine.
"\\?" # ok
cat("\\?") # that's what the regex engine sees.
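A small check can confirm what the R string actually contains and how the regex engine uses it. This is a sketch with invented test strings; the second pattern stands for any expression that was developed in an online tester and then copied into R code.

```r
patt <- "\\?"          # typed with two backslashes in R source code
nchar(patt)            # the string itself holds two characters: \ and ?

grepl(patt, c("Is this a question?", "No."))   # matches a literal "?"

# A pattern written as  \d+\.\d+  in an online tester has to be typed
# with every backslash doubled in an R string:
grepl("\\d+\\.\\d+", c("3.14", "314"))
```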
The consequence is: you need to write a double "\\" in R when you want a single "\". That works differently from other programming languages that pass patterns to the regex engine as-is. You need to be aware of this, for example when you develop a pattern in an online regex tool, and then copy it back into your R code. You need to double all occurrences of "\" in your R string.
Letters whose special meaning as a metacharacter is turned on with the escape character:
Character | Means |
---|---|
w | the letter "w" |
\w | a "word" character, i.e. one of A-Z, a-z, 0-9 and "_" |
s | the letter "s" |
\s | a "space" character, i.e. one of " ", tab or newline |
b | the letter "b" |
\b | a word boundary |
Metacharacters whose special meaning is turned off with the escape character:
Character | Means |
---|---|
+ | One or more repetitions of the preceding expression |
\+ | the literal character "+" |
\ | the escape character |
\\ | the literal character "\" |
. | any single character except the newline (\n) |
\. | a literal period |
Note that these examples are not exhaustive. A (nearly) complete table of symbols and meanings is given in the appendix.
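The effect of the escape character in the two tables above can be tried out directly. The test string below is invented; the word-boundary example passes perl = TRUE, which is a choice made for this sketch and not a requirement stated in this unit.

```r
s <- "a+b = 3.5 w"     # invented test string

regmatches(s, gregexpr("w", s))[[1]]      # the letter "w"
regmatches(s, gregexpr("\\w", s))[[1]]    # all "word" characters
regmatches(s, gregexpr("\\+", s))[[1]]    # a literal "+"
regmatches(s, gregexpr("\\.", s))[[1]]    # a literal period
nchar(gsub(".", "", s))                   # unescaped "." matches every character
gsub("\\bw\\b", "W", s, perl = TRUE)      # "w" only where it is a whole word
```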
Character Classes
Square brackets specify when more than one specific character can match at a position.
Expression | Means |
---|---|
[acgtACGT] | Any non-degenerate nucleotide |
For example:
"[AGR]AATT[CTY]"
matches all occurrences of an ApoI restriction site, either specified explicitly, or through the nucleotide ambiguity codes R (purines) or Y (pyrimidines).
Within character sets, hyphens can specify character ranges.
Expression | Means |
---|---|
[a-z] | lowercase letters |
[0-9] | digits |
[0-9+*/=^\\-] | digits and arithmetic symbols (Note the escaped hyphen) |
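If you want to match a literal hyphen within a character set, you must escape it, or place it first or last in the set where it cannot be read as part of a range. Within character sets, some metacharacters that otherwise have special meanings usually do not need to be escaped.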
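Here is a brief sketch of character classes and ranges at work. Both input strings are made up for illustration. In the last pattern the hyphen is placed last in the set so that it is read literally.

```r
dna <- "TTGAATTCAARAATTYGGAAATTTCC"    # made-up sequence

# ApoI sites, spelled out or written with the ambiguity codes R and Y
regmatches(dna, gregexpr("[AGR]AATT[CTY]", dna))[[1]]

expr <- "x = 12*3-4"                   # made-up arithmetic expression

# Runs of digits
regmatches(expr, gregexpr("[0-9]+", expr))[[1]]

# Runs of digits and arithmetic symbols (hyphen last: taken literally)
regmatches(expr, gregexpr("[0-9+*/=^-]+", expr))[[1]]
```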
The complement
The caret character "^" denotes the complement of a character set; i.e. everything that is not that expression.
Expression | Means |
---|---|
[^9] | Everything but the digit "9" |
[^ACGT] | Not a nucleotide code letter |
Note that outside of square brackets, the "^" character is an "anchoring code" and means "beginning of the string". This can be confusing.
For many metacharacters that denote character classes, the metacharacter in upper case denotes the complement. This can also be confusing!
Character | Means |
---|---|
\w | a word character |
\W | not a word character |
\s | a space character |
\S | not a space character |
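The complement notation is particularly handy together with gsub(), because removing everything that is "not X" keeps exactly X. A minimal sketch with an invented string:

```r
s <- "ACGT-NNN acgt 99"     # invented test string

gsub("[^ACGT]", "", s)      # keep only upper-case nucleotide letters
gsub("[^9]", "", s)         # keep only the digit "9"
gsub("\\W", "", s)          # remove everything that is not a word character
gsub("\\S", "", s)          # remove everything that is not a space character
grepl("^ACGT", s)           # outside of square brackets, "^" is an anchor
```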
Specifying quantity
Special characters in regular expressions control how often a pattern must be present in order to match:
Expression | What it means | Example (meaning) |
---|---|---|
? | match zero or one times | "? (there may or may not be a quote mark) |
+ | match one or more | [A-Z]+ (there's at least one uppercase letter) |
* | match any number | .* (there may be some characters) |
{min,max} | match between min and max times (assumes 0 for min, if min is omitted; assumes infinity for max, if max is omitted). | [atAT]{20,200} (a stretch of between 20 and 200 upper- or lowercase As or Ts) |
For example:
"AAUAAA[ACGU]{10,30}$"
defines a polyadenylation site - an AAUAAA motif, followed by 10 to 30 of any nucleotide before the end of the RNA.
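The quantifiers can be compared side by side in a small test. The RNA strings and the word list below are invented for illustration only.

```r
# Invented RNA: the AAUAAA motif followed by 16 nucleotides
rna <- paste0("GGCAUCAAUAAA", "ACGUACGUACGUACGU")

grepl("AAUAAA[ACGU]{10,30}$", rna)                  # 16 nt follow the motif
grepl("AAUAAA[ACGU]{10,30}$", "GGCAUCAAUAAAACGU")   # only 4 nt follow

x <- c("color", "colour", "colouur")
grepl("^colou?r$", x)      # "?": zero or one "u"
grepl("^colou+r$", x)      # "+": one or more "u"
grepl("^colou*r$", x)      # "*": any number of "u", including none
grepl("^colou{2,}r$", x)   # "{2,}": at least two "u"
```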
Specifying position (anchoring)
If a pattern must be matched at a particular location in the string, special terms denote string anchors.
Anchoring Term | Meaning |
---|---|
^ | Start of a line or string |
$ | End of a line or string |
\A | Start of the string |
\Z | End of the string |
\G | Last global match end |
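A short sketch of anchors in use, with invented strings. The multi-line examples pass perl = TRUE and use the (?m) modifier of the Perl dialect; this is an assumption of the sketch, since \A and line-wise anchoring are features of that dialect.

```r
x <- c("# comment", "data # trailing", "value")   # invented strings

grepl("^#", x)       # "#" at the start of the string
grepl("e$", x)       # "e" at the end of the string
sub("^", "> ", x)    # anchors have zero length: this inserts at the start

# In the Perl dialect, "^" matches after line breaks inside a string only
# if the (?m) modifier is set; \A always refers to the start of the string.
s <- "first\nsecond"
regmatches(s, gregexpr("(?m)^\\w+", s, perl = TRUE))[[1]]
regmatches(s, regexpr("\\A\\w+", s, perl = TRUE))
```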
Of course defining a regular expression pattern does not yet do anything with it. Below are the most important R functions that use regular expressions. Write the small code samples that are provided below, play with variations, and test how the operators and regular expressions work.
Functions that don't use regular expressions
Not all pattern searches in strings use (and need) regular expressions. Sometimes simple, exact string-matching is enough. R uses string matching in character equality (==) and by extension, the set operation functions (union(), intersect() etc.), the match() function, and the %in% operator.
vA <- c("the", "quick", "brown", "fox", "jumps", "over", "a", "lazy", "dog")
vA[2] == "quick" # TRUE
vA[2] == "quack" # FALSE
vA == "fox" # boolean vector
# match tests for string equality
match("fox", vA) # 4, i.e. the 4th element matches the string
match("o", vA) # NA: matches have to be to the WHOLE element
# match("fox", vA) is equivalent to...
which(vA == "fox")
# %in% can be used for creating intersections
# find whether elements from one vector are
# contained in another:
vB <- c("Quacking", "the", "duck", "wings", "over", "my", "cozy", "cot")
vA %in% vB
vB %in% vA # note that the length of the return vector is the same as the
# length of the first argument. So read this as:
# "Which of my vB are also in vA"
# We can use this to subset the vector with elements that are present in
# both:
vB[vB %in% vA]
# which is, of course, the intersection set operation.
intersect(vA, vB)
Functions that use regular expressions
The general online help page is here. Remember: R's default behaviour is extended POSIX. To be sure which regex dialect is used, pass the perl = TRUE parameter.
grep()
# grep() is like match(), but uses regular expressions. A variant of grep() that
# returns a boolean vector - like "==" does - is grepl(). That is useful
# because we can & or | the vector, or invert it with ! .
grep("fox", vA)
grep("o", vA) # Aha! now we get all elements that contain an "o" -
# Because we get partial matches with regular expressions.
vA[grep("o", vA)] # subset
grepl("o", vA) # logical
! grepl("o", vA) # its inverse
vA[! grepl("o", vA)] # subset all words without "o"
Subsetting example
Consider the following regular expression:
patt <- "^\\s*#"
This matches if the string it is applied to begins with a "#", which may or may not be preceded by whitespace. This would be useful to ignore comment lines in a data file.
The regular expression above is decomposed as follows:
- ^ the beginning of the line
- \\s any whitespace character ...
- * ... repeated 0 or more times
- # the hash character
The following example would read a file into a vector of lines, then drop all lines that are empty, and all lines that are comments. This is a straightforward idiom.
IN <- "test.txt"
patt <- "^\\s*#"
myData <- readLines(IN)
myData <- myData[myData != ""] # drop all elements that are the empty string
myData <- myData[! grepl(patt, myData)] # drop all elements that match the pattern
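If you do not have a suitable file at hand, the same idiom can be tried on a character vector. The lines below are invented and merely stand in for what readLines() would return.

```r
# Invented lines, standing in for the result of readLines("test.txt")
myData <- c("# header comment",
            "",
            "gene\tvalue",
            "   # indented comment",
            "MBP1\t0.5",
            "SWI4\t1.2 # hash not at the start of the line")

patt <- "^\\s*#"
grepl(patt, myData)                       # TRUE only for the two comment lines

myData <- myData[myData != ""]            # drop empty lines
myData <- myData[! grepl(patt, myData)]   # drop comment lines
myData
```

Note that the last line is kept: its "#" is not at the beginning of the line, so the anchored pattern does not match.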
Substitution - gsub()
Think of "gsub" as "global substitution", and you'll understand that there exists another function, sub(), that replaces only the first occurrence of a pattern, rather than all of them as gsub() does. I can't imagine what the use case for that might be and I don't think I have ever used sub(). I get an intuitive sense that code that needs such a function should probably be reconceived. But gsub() is very useful.
(s <- " 1 MKLAACFLTL LPGFAVA... 17 ") # E-coli Alpha Amylase signal peptide
# Drop everything from this string that is not an amino acid one-letter code.
# We use gsub() to first identify all non-amino acid letters with a character
# class regular expression, then we replace each occurrence with the empty
# string.
gsub("[^ACDEFGHIKLMNPQRSTVWY]", "", s)
# or, with assignment: ...
s <- gsub("[^ACDEFGHIKLMNPQRSTVWY]", "", s)
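To see the difference between the two functions, compare their results on the same input. The string below is invented for illustration.

```r
s <- "  id: YDL056W: Mbp1  "    # invented test string

sub(":", " =", s)               # only the first colon is replaced
gsub(":", " =", s)              # every colon is replaced
gsub("^\\s+|\\s+$", "", s)      # trim leading and trailing whitespace
```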
strsplit()
Another function that makes use of regular expressions is strsplit(). It takes a vector of strings, and returns a list, one element for each element of the vector, in which each string has been split up by separating it along a regular expression.
x <- c("a b c", "1 2")
strsplit(x, " ")
# [[1]]
# [1] "a" "b" "c"
#
# [[2]]
# [1] "1" "2"
Since even a single string returns a list, you often have to extract the element you want as a vector for further use.
corvidae <- c("crow:jackdaw:jay:magpie:raven:rook")
strsplit(corvidae, ":")
unlist(strsplit(corvidae, ":"))
strsplit(corvidae, ":")[[1]]
# Consider:
length(strsplit(corvidae, ":"))
length(unlist(strsplit(corvidae, ":")))
strsplit() is immensely useful to extract elements from strings with a relatively well defined structure.
s <- "1, 1, 2, 3, 5, 8"
strsplit(s, ", ")[[1]] # split on comma-space
s <- "~`!@#$%^&*()_-=+[{]}\\|;:',<.>/?"   # note: a backslash must be doubled in an R string
strsplit(s, "")[[1]] # split on empty string
s <- "chronological lifespan:\tincreased\ncold sensitivity:\tincreased\nsporulation:\tnormal"
strsplit(s, "\\t|\\n")[[1]] # split on tab or newline
Behaviour
Capturing matches
Finding and returning matches in R is a two-step process. (1) Find matches with regexpr() (one match), gregexpr() (all matches), or regexec() (sub-expressions in parentheses). All of these return a "match object". (2) Use the match object to extract the matching substrings from the original string.
# Extracting gene names in text.
# Let's define a valid gene name to be a substring that is bounded by
# word-boundaries, starts with an upper-case character, contains more upper-case
# characters or numbers or a hyphen or underscore, with a minimal length of 3.
# Here is a regex, and we put the part of the string that we want to recover, in
# parentheses:
patt <- "\\b([A-Z][A-Z0-9\\-_]+[A-Z0-9])\\b"
# Test: positives
grepl(patt, "MBP1")
grepl(patt, "AAT")
grepl(patt, " AI1")
grepl(patt, "ASP3-1 ")
grepl(patt, " AI5_ALPHA; ")
grepl(patt, " (TY1B-PR3) ")
# Test: negatives
grepl(patt, "G1") # Too short
grepl(patt, "G1-") # Hyphen at end
grepl(patt, "Cell") # contains lower-case
# Let's apply this to retrieve gene names in text
s <- "Transcriptional activation of CLN1, CLN2, and a putative new G1 cyclin (HCS26) by SWI4, a positive regulator of G1-specific transcription. Cell 66(5):1015-26"
(m <- regexpr(patt, s)) # found a match in position 31
regmatches(s, m) # retrieve it
(m <- gregexpr(patt, s)) # found all matches
regmatches(s, m) # retrieve them (note, this is a list)
# The function of choice however is regexec(). It returns whatever the pattern
# has defined in parentheses, the others return the entire match. The
# parentheses are quite important, because we might want to specify additional
# context for a valid match, but we might not want the context in the match
# itself. In our example we used word boundaries - \\b - for such context; but
# these are zero-length and don't actually match a character, so they don't
# contaminate the substring anyway. But in general we need to be able to
# precisely retrieve only the target substring.
(m <- regexec(patt, s)) # only the parenthesized substring
regmatches(s, m) # retrieve it
# Note that there are two elements: the first is the whole match, the second
# is the substring that is in parentheses. In our example these are the same.
# Here is an example where they are not:
s <- "Find the last word. And tell me."
(m <- regexec("\\s(\\w+)\\.", s))
regmatches(s, m) # retrieve it
# regexec() captures only the first match. A corresponding gregexec() for
# multiple matches was added to base R only in version 4.1.0; without it ...
patt <- "\\b([A-Z][A-Z0-9\\-_]+[A-Z0-9])\\b"
s <- "Transcriptional activation of CLN1, CLN2, and a putative new G1 cyclin (HCS26) by SWI4, a positive regulator of G1-specific transcription. Cell 66(5):1015-26"
# Solution 1 (base R): you can use multiple matches in an sapply()
# statement...
sapply(regmatches(s, gregexpr(patt, s))[[1]],
function(M){regmatches(M, regexec(patt, M))})
# Solution 2 (probably preferred): you can use
# str_match_all() from the very useful library "stringr" ...
if (! require(stringr, quietly=TRUE)) {
install.packages("stringr")
library(stringr)
}
# Package information:
# library(help = stringr) # basic information
# browseVignettes("stringr") # available vignettes
# data(package = "stringr") # available datasets
str_match_all(s, patt)
str_match_all(s, patt)[[1]][,2]
# [1] "CLN1" "CLN2" "HCS26" "SWI4"
# Note that str_match_all() handles the match object internally, no need for
# the two-step code.
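In recent versions of R (4.1.0 and later) the function gregexec() is available in base R and returns the captured sub-expressions for all matches. The following sketch assumes such a version and guards the call accordingly; the input sentence is a shortened, invented variant of the one above, and the hyphen in the character class is placed last instead of being escaped.

```r
patt <- "\\b([A-Z][A-Z0-9_-]+[A-Z0-9])\\b"
s <- "Activation of CLN1, CLN2, and a new G1 cyclin (HCS26) by SWI4."

if (exists("gregexec")) {                 # base R 4.1.0 or later
  m <- regmatches(s, gregexec(patt, s, perl = TRUE))[[1]]
  print(m)                                # matrix: one column per match
  print(m[2, ])                           # second row: the captured groups
}
```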
An interesting new alternative/complement to the base R regex libraries is the package "ore" that uses the Oniguruma libraries and supports multiple character encodings, which you need when you work with Unicode and/or CJK character sets.
if (! require(ore, quietly = TRUE)) {
install.packages("ore")
library(ore)
}
# Package information:
# library(help = ore) # basic information
# browseVignettes("ore") # available vignettes
# data(package = "ore") # available datasets
S <- "The quick brown fox jumps over a lazy dog"
ore.search(". .", S)
ore.search(". .", S, all=TRUE)
M <- ore.search(". .", S, all=TRUE)
M$nMatches
M$match[2:4]
According to the author Jon Clayden, key advantages include:
- Search results focus around the matched substrings (including parenthesised groups), rather than the locations of matches. This saves extra work with regmatches() or similar to extract the matches themselves.
- Substantially better performance, especially when matching against long strings.
- Substitutions can be functions as well as strings.
- Matches can be efficiently obtained over only part of the strings.
- Fewer core functions, with more consistent names.
Modifiers
A number of modifiers can be applied as arguments to regular expression functions that may be useful. Here are the two most important ones.
# Option "ignore.case" allows to have case-insensitive matches. This is usually
# poor programming style, a more explicit (= better) way is to define your
# character classes appropriately.
patt <- "\\b([A-Z][A-Z0-9\\-_]+[A-Z0-9])\\b"
s <- "The MBP1 gene encodes the Mbp1 protein."
m <- gregexpr(patt, s)
regmatches(s, m)[[1]]
m <- gregexpr(patt, s, ignore.case = TRUE)
regmatches(s, m)[[1]]
# For regex functions in the stringr package, you can compile the pattern
# with the regex() function, and include the option "comments = TRUE". This
# allows you to insert whitespace and # characters into the pattern
# which will be ignored by the regex engine. Thus you can comment
# complex regular expressions inline.
library(stringr)
myRegex <- regex("\\b # word boundary
( # begin capture
[A-Z] # one uppercase letter
[A-Z0-9\\-_]+ # one or more letters, numbers, hyphen or
# underscore
[A-Z0-9] # one letter or number.
# Note: this captured subexpression has a minimum length of 3.
) # end capture
\\b", # word boundary
comments = TRUE)
str_match_all(s, myRegex)[[1]][2]
Greed
By default, quantifiers are "greedy", i.e. they will match the largest possible number of characters. For example:
s <- "abc123"
patt <- "(\\w+)(\\d+)" # word characters, followed by digits. This pattern ...
str_match_all(s, patt)[[1]][-1]
# ...yields "abc12" and "3" . This is because \w+ is greedy and grabs as many
# alphanumeric characters as it can before \d+ gets a chance to match. A "?"
# after a quantity specifier makes it non-greedy, therefore ...
patt <- "(\\w+?)(\\d+)" # Note the question mark in (\\w+?)
str_match_all(s, patt)[[1]][-1]
# ... now \d+ gets a chance to match as many digits as possible: "abc" and "123"
Regular Expressions in other languages
PHP
<?php
$string = "The quick brown fox jumps over a lazy dog";
$words = preg_split('/\s+/', $string);
print_r($words);
preg_match('/.\W./', $string, $matches);
print_r($matches);
preg_match_all('/.\W./', $string, $matches);
print_r($matches);
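# preg_replace() with arrays: the pattern/replacement pairs are applied one after the other
$pat = array(); # broken swap: later patterns also match text inserted by earlier replacements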
$pat[0] = '/quick brown/';
$pat[1] = '/fox/';
$pat[2] = '/lazy/';
$pat[3] = '/dog/';
$rep = array();
$rep[0] = 'lazy';
$rep[1] = 'dog';
$rep[2] = 'quick brown';
$rep[3] = 'fox';
print(preg_replace($pat, $rep, $string));
print("\n");
# working swap: route the replacements through the placeholders "foo" and "bar"
$pat = array();
$pat[0] = '/quick brown fox/';
$pat[1] = '/lazy dog/';
$pat[2] = '/foo/';
$pat[3] = '/bar/';
$rep = array();
$rep[0] = 'foo';
$rep[1] = 'bar';
$rep[2] = 'lazy dog';
$rep[3] = 'quick brown fox';
print(preg_replace($pat, $rep, $string));
print("\n");
?>
Python
Python regular expressions are provided through the module re. See the re module documentation for details.
The re functions operate on a string. re.match() and re.search() return a MatchObject (or None if nothing matched); the MatchObject is then further analyzed with the methods it supplies.
The most frequently used functions are:
- re.match(pattern, string) matches only at the beginning of the string.
- re.search(pattern, string) matches anywhere in the string.
- re.split(pattern, string) returns the split string as a list.
- re.findall(pattern, string) returns all matches in a list.
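The difference between the four functions listed above is easiest to see side by side. The following is a minimal sketch; the example sentence and the variable names are made up for illustration.

```python
import re

text = "MBP1 binds DNA; Mbp1 is its protein product"

# re.match() only succeeds if the pattern matches at the start of the string.
at_start = re.match(r"[A-Z0-9]+", text)
not_at_start = re.match(r"DNA", text)    # None: "DNA" is not at position 0

# re.search() scans the whole string and returns the first match.
found = re.search(r"DNA", text)

# re.split() and re.findall() return plain lists, not match objects.
words = re.split(r"\W+", text)
gene_like = re.findall(r"\b[A-Z][A-Za-z]+\d\b", text)

print(at_start.group())   # MBP1
print(not_at_start)       # None
print(found.span())       # (11, 14)
print(words)
print(gene_like)          # ['MBP1', 'Mbp1']
```

Because re.match() and re.search() return None when nothing matches, it is safest to test the result before calling methods such as group() on it.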
Python example
Download this .svg file to experiment.
# parse_SVG_example.py
# Read an svg file line by line and process path data
# to write commands separately to an output file, line by line.
import re

filePath = "/my/working/directory/whatever/"
myIn = filePath + "sample.svg"
myOut = filePath + "test.svg"
IN = open(myIn)
OUT = open(myOut, "w")

for line in IN:
    # Raw strings (r'...') keep the backslashes intact for the regex engine.
    path = re.search(r'\sd="(.*?)"', line)  # returns the MatchObject "path"
    if path:
        # Found. Process the result with a second regex.
        # path.group() is a method of the MatchObject
        pathData = re.findall(r'([aAcChHlLmMqQsStTvVzZ]|-?\d*\.?\d+)',
                              path.group(1))
        # Write it nicely formatted to output, one command per line
        OUT.write("d=\"")
        s = ""  # we accumulate output lines in this variable
        for token in pathData:
            if re.match('[aAcChHlLmMqQsStTvVzZ]', token):
                # it's a letter:
                OUT.write("\n   " + s)  # flush s to output
                s = token + " "  # new s
            else:
                s = s + token + " "  # append to s
        OUT.write("\n   " + s + "\"\n")  # flush s, close string, and add \n
    else:
        OUT.write(line)

IN.close()
OUT.close()
Javascript
Javascript is probably the most convenient choice if you want to interact directly with the DOM (Document Object Model) of a Web page. Here is example code for a "bookmarklet" that rewrites the URL of a journal-article page for access from outside the UofT network.
javascript:(function(){
var url=window.location.href;
var re=/\/([\w.]+)\/(.*$)/;
var match=url.match(re);
var newURL="http://"+match[1]+".myaccess.library.utoronto.ca/"+match[2];
window.location.href=newURL;
})();
void 0
Put this into the body of an arbitrary bookmark in your browser. When you are on the page of a paywalled journal article, click the bookmark to be redirected to the same article via our library's access system.
POSIX (Unix, the bash shell)
Use in:
grep
grep finds patterns in files. Patterns are regular expressions and can come in basic or extended flavors. In GNU grep there is no difference in available functionality between these; in implementations where there is, you switch from basic to extended syntax with the grep -E flag, which is the same as invoking egrep.
- Example: what daemons run on your system?
ps -ax | egrep -o "/([^A-Z]\w+d)\b" | sort -u
Other uses of regular expressions in:
- find
- sed
- awk
- cut
... see the man pages.
Practice
Task:
- Open RStudio and load the ABC-units R project. If you have loaded it before, choose File → Recent projects → ABC-Units. If you have not loaded it before, follow the instructions in the RPR-Introduction unit.
- Choose Tools → Version Control → Pull Branches to fetch the most recent version of the project from its GitHub repository with all changes and bug fixes included.
- Type init() if requested.
- Open the file RPR-RegEx.R and follow the instructions.
Note: take care that you understand all of the code in the script. Evaluation in this course is cumulative and you may be asked to explain any part of code.
Appendix I: Metacharacters and their meaning
Expression | Meaning |
---|---|
\ | Escape character |
| | Alternation character. Matches either one of specified alternatives. For example, /Asp|Glu/i matches ASP, Asp, asp, GLU, Glu or glu. |
^ | If the caret occurs at the beginning of an expression it anchors the expression at the beginning of a line or input. For example, /^AT/ does not match the 'AT' in "HETATM" but does match it in "ATOM". If the caret occurs as the first character of a character set as in [^a-z] it specifies the complement of the character set. Everywhere else, it simply matches the character "^". |
$ | Matches end of input or line. For example, /t$/ does not match the 't' in "eater", but does match it in "eat" |
* | Matches the preceding character 0 or more times. For example, /bo*/ matches 'boooo' in "A ghost booooed" and 'b' in "A bird warbled", but nothing in "A goat grunted". |
+ | Matches the preceding character 1 or more times. Equivalent to {1,} . For example, /a+/ matches the 'a' in "candy" and all the a's in "caaaaaaandy." |
? | Matches the preceding character 0 or 1 time. For example, /e?le?/ matches the 'el' in "angel" and the 'le' in "angle." |
. | (The decimal point) matches any single character except the newline character. |
(x) | Matches 'x' and remembers the match. For example, /(foo) bar/ matches "foo bar" and stores 'foo' in the special variable $1. /(more) (joy)/ matches "more joy", then stores 'more' in $1 and 'joy' in $2. |
{n} | Where n is a positive integer. Matches exactly n occurrences of the preceding character. For example, /a{2}/ doesn't match the 'a' in "candy," but it matches all of the a's in "caandy," and the first two a's in "caaandy." |
{n,} | Where n is a positive integer. Matches at least n occurrences of the preceding character. For example, /a{2,}/ doesn't match the 'a' in "candy", but matches all of the a's in "caandy" and in "caaaaaaandy." |
{n,m} | Where n and m are positive integers. Matches at least n and at most m occurrences of the preceding character. For example, /a{1,3}/ matches nothing in "cndy", the 'a' in "candy," the first two a's in "caandy," and the first three a's in "caaaaaaandy" Notice that when matching "caaaaaaandy", the match is "aaa", even though the original string had more a's in it. |
[xyz] | A character set. Matches any one of the enclosed characters. You can specify a range of characters by using a hyphen. For example, [bcd] is the same as [b-d] . They match the 'c' in "cysteine" and the 'c' and 'd' in "ached" . |
[^xyz] | A negated or complemented character set. That is, it matches anything that is not enclosed in the brackets. Note that the caret has to be the first character in the bracket set. For example, [^abc] is the same as [^a-c] . They initially match 'l' in "alanine" and 'y' in "cysteine" |
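The examples in the table above can be checked mechanically. The sketch below does this with Python's re module; the helper first() and the list name checks are made up for illustration, and the /pattern/ notation of the table is written as Python raw strings.

```python
import re

def first(pattern, text, flags=0):
    """Return the first matched substring, or None if there is no match."""
    m = re.search(pattern, text, flags)
    return m.group() if m else None

many_a = "c" + "a" * 7 + "ndy"   # a "candy" with seven a's

# (observed result, result stated in the table)
checks = [
    (first(r"bo*", "A ghost booooed"), "boooo"),
    (first(r"bo*", "A bird warbled"), "b"),
    (first(r"bo*", "A goat grunted"), None),
    (first(r"a+", many_a), "a" * 7),
    (first(r"e?le?", "angel"), "el"),
    (first(r"e?le?", "angle"), "le"),
    (first(r"a{2}", "candy"), None),
    (first(r"a{1,3}", many_a), "aaa"),
    (first(r"[^a-c]", "alanine"), "l"),
    (first(r"^AT", "HETATM"), None),
    (first(r"^AT", "ATOM"), "AT"),
    (first(r"Asp|Glu", "glu", re.IGNORECASE), "glu"),
]
all_agree = all(got == expected for got, expected in checks)

# Parentheses capture: Python exposes $1, $2, ... as match groups.
groups = re.search(r"(more) (joy)", "more joy").groups()

print(all_agree)   # True
print(groups)      # ('more', 'joy')
```

Other regex engines may differ in details, so it is worth running such checks in the language you actually use.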
Appendix II: Character classes and their meaning
Expression | Meaning |
---|---|
[\b] | Matches a backspace. (Not to be confused with \b .) |
\b | Matches a word boundary, i.e. the position between a word character and a non-word character (such as a space) or the start or end of the string. (Not to be confused with [\b].) The difference is that one is within a character set and the other is not! For example, /\bn\w/ matches the 'no' in "noonday"; /\wy\b/ matches the 'ly' in "possibly yesterday." |
\B | Matches a non-word boundary. For example, /\w\Bn/ matches 'on' in "noonday", and /y\B\w/ matches 'ye' in "possibly yesterday." |
\cX | Where X is a letter from A to Z. Matches the corresponding control character in a string. For example, /\cM/ matches control-M (carriage return) in a string. |
\d | Matches a digit character. Equivalent to [0-9] . For example, /\d/ or /[0-9]/ matches '2' in "B2 is the suite number." |
\D | Matches any non-digit character. Equivalent to [^0-9] . For example, /\D/ or /[^0-9]/ matches 'B' in "B2 is the suite number." |
\f | Matches a form-feed. |
\n | Matches a linefeed. |
\r | Matches a carriage return. |
\s | Matches a single white space character, including space, tab, form feed, line feed. Equivalent to [ \f\n\r\t\v] . For example, /\s\w*/ matches ' bar' in "foo bar." |
\S | Matches a single character other than white space. Equivalent to [^ \f\n\r\t\v] . For example, /\S\w*/ matches 'foo' in "foo bar." |
\t | Matches a tab |
\v | Matches a vertical tab. |
\w | Matches any alphanumeric character including the underscore. Equivalent to [A-Za-z0-9_] . For example, /\w/ matches 'a' in "apple," '5' in "$5.28," and '3' in "3D." |
\W | Matches any non-word character. Equivalent to [^A-Za-z0-9_] . For example, /\W/ or /[^A-Za-z0-9_]/ matches '%' in "50%." |
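As a quick check of the character classes above, here is a minimal sketch using Python's re module (variable names are made up for illustration). One caveat that applies to Python 3 specifically: for text strings, \w, \d and \s are Unicode-aware by default, so the ASCII equivalents given in the table only hold exactly when the re.ASCII flag is set.

```python
import re

suite = "B2 is the suite number."

digit = re.search(r"\d", suite).group()                  # '2'
non_digit = re.search(r"\D", suite).group()              # 'B'
space_word = re.search(r"\s\w*", "foo bar").group()      # ' bar'
nonspace_word = re.search(r"\S\w*", "foo bar").group()   # 'foo'
non_word = re.search(r"\W", "50%").group()               # '%'

# \b and \B match positions between characters, not characters themselves.
at_boundary = re.search(r"\bn\w", "noonday").group()     # 'no'
inside_word = re.search(r"\w\Bn", "noonday").group()     # 'on'

# Unicode awareness: an accented letter counts as a word character
# unless the pattern is restricted to ASCII.
unicode_word = re.search(r"\w", "\u00e9") is not None           # True
ascii_word = re.search(r"\w", "\u00e9", re.ASCII) is not None   # False

print(digit, non_digit, repr(space_word), nonspace_word, non_word)
print(at_boundary, inside_word, unicode_word, ascii_word)
```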
Appendix III: Anchor codes and their meaning
Expression | Meaning |
---|---|
^ | If the caret occurs at the beginning of an expression it anchors the expression at the beginning of a line or input. For example, /^AT/ does not match the 'AT' in "HETATM" but does match it in "ATOM". |
$ | Matches end of input or line. For example, /t$/ does not match the 't' in "eater", but does match it in "eat" as well as in "eat\n". |
\b | Matches a word boundary, i.e. the position between a word character and a non-word character (such as a space) or the start or end of the string. (Not to be confused with [\b].) The difference is that one is within a character set and the other is not! For example, /\bn\w/ matches the 'no' in "noonday"; /\wy\b/ matches the 'ly' in "possibly yesterday." |
\B | Matches a non-word boundary. For example, /\w\Bn/ matches 'on' in "noonday", and /y\B\w/ matches 'ye' in "possibly yesterday." |
\A | Matches at the start of a string. Like "^". For example, /\AAT/ matches "AT" in "ATOM " but not in "HETATM" |
\Z | Matches at the end of a string. Like "$". For example, /\t\Z/ matches a tab at the end of the string but not anywhere else. |
(?: … ) | Group what's between the parentheses, but do not capture the match. |
(?= … ) | The preceding pattern must be followed by this one in order to match. |
(?! … ) | The preceding pattern must not be followed by this one in order to match. |
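The grouping and lookahead constructs above are easier to grasp with a concrete case. The following is a minimal sketch in Python; the test strings and variable names are made up for illustration. Note that anchor details vary between engines: in Python, $ also matches just before a trailing newline, while \Z matches only at the very end of the string.

```python
import re

s = "Mbp1 and Mbpx"

# (?= ... ): "Mbp" only counts if a digit follows; the digit is not consumed.
followed_by_digit = [m.start() for m in re.finditer(r"Mbp(?=\d)", s)]      # [0]
# (?! ... ): "Mbp" only counts if NO digit follows.
not_followed_by_digit = [m.start() for m in re.finditer(r"Mbp(?!\d)", s)]  # [9]

# (?: ... ) groups the alternation without creating a capture group.
capturing = re.search(r"(Asp|Glu)-(\d+)", "Glu-42").groups()        # ('Glu', '42')
non_capturing = re.search(r"(?:Asp|Glu)-(\d+)", "Glu-42").groups()  # ('42',)

# ^ can be made to match at every line start; \A never does.
lines = "ATOM\nATOM"
caret_multiline = re.findall(r"^AT", lines, re.MULTILINE)    # ['AT', 'AT']
string_start = re.findall(r"\AAT", lines, re.MULTILINE)      # ['AT']

# $ tolerates a trailing newline, \Z (in Python) does not.
dollar_ok = re.search(r"t$", "eat\n") is not None      # True
z_ok = re.search(r"t\Z", "eat\n") is not None          # False

print(followed_by_digit, not_followed_by_digit)
print(capturing, non_capturing)
print(caret_multiline, string_start, dollar_ok, z_ok)
```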
Appendix IV: Modifiers and their meaning
Expression | Meaning |
---|---|
g | Matches globally - i.e. matches all occurrences of pattern, one after the other, do not stop at the first one. |
i | Match in a case-insensitive manner. For example, /[ACGT]/i matches any specific nucleotide in upper or lower case. |
x | Ignore whitespace in the expression |
o | Evaluate pattern only once. |
m | Treat the whole string as multiple lines, i.e. ^ and $ also match at the start and end of each line within the string. |
s | Treat the whole string as a single line, i.e. "." also matches "\n". For example, /(<table>.*?</table>)/s captures an entire table, including newline characters. Without the modifier nothing would match if there is even a single newline in between the tags. |
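The modifiers in this table use Perl-style single-letter notation. Other languages expose the same ideas differently; as a sketch, here is how they map onto Python's re module, where i, x, m and s correspond to the flags re.IGNORECASE, re.VERBOSE, re.MULTILINE and re.DOTALL, and "global" matching is obtained by calling re.findall() or re.finditer() instead of re.search(). The test strings and variable names are made up for illustration.

```python
import re

# i: case-insensitive matching; findall() plays the role of the g modifier.
nucleotides = re.findall(r"[ACGT]", "acgTn", re.IGNORECASE)   # ['a', 'c', 'g', 'T']

# x: whitespace and comments inside the pattern are ignored.
gene_name = re.compile(r"""
    \b              # word boundary
    (               # begin capture
      [A-Z]         # one uppercase letter
      [A-Z0-9_-]+   # one or more letters, numbers, underscore or hyphen
      [A-Z0-9]      # one letter or number
    )               # end capture
    \b              # word boundary
    """, re.VERBOSE)
names = gene_name.findall("The MBP1 gene encodes the Mbp1 protein.")   # ['MBP1']

# m: ^ and $ match at every line, not only at the ends of the string.
records = "HEADER\nATOM\nATOM"
single_line = re.findall(r"^ATOM", records)                   # []
multi_line = re.findall(r"^ATOM", records, re.MULTILINE)      # ['ATOM', 'ATOM']

# s: "." also matches newline characters.
html = "<p>intro</p>\n<table>\n<tr><td>1</td></tr>\n</table>\n"
without_s = re.search(r"(<table>.*?</table>)", html)              # None
with_s = re.search(r"(<table>.*?</table>)", html, re.DOTALL)

print(nucleotides, names, single_line, multi_line)
print(without_s, with_s.group(1).count("\n"))   # None 2
```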
Self-evaluation
Notes
Further reading, links and resources
If in doubt, ask! If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.
About ...
Author:
- Boris Steipe <boris.steipe@utoronto.ca>
Created:
- 2017-08-05
Modified:
- 2017-10-01
Version:
- 1.0
Version history:
- 1.0 First live version, translated from Perl examples in old version
- 0.1 First stub
This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.