Difference between revisions of "RPR-RegEx"

From "A B C"
 
(6 intermediate revisions by the same user not shown)
Line 1: Line 1:
<div id="BIO">
+
<div id="ABC">
  <div class="b1">
+
<div style="padding:5px; border:1px solid #000000; background-color:#b3dbce; font-size:300%; font-weight:400; color: #000000; width:100%;">
 
Regular Expressions (regex) with R
 
Regular Expressions (regex) with R
  </div>
+
<div style="padding:5px; margin-top:20px; margin-bottom:10px; background-color:#b3dbce; font-size:30%; font-weight:200; color: #000000; ">
 
+
(Regular expressions)
  {{Vspace}}
+
</div>
 
 
<div class="keywords">
 
<b>Keywords:</b>&nbsp;
 
Regular expressions
 
 
</div>
 
</div>
  
{{Vspace}}
+
{{Smallvspace}}
  
  
__TOC__
+
<div style="padding:5px; border:1px solid #000000; background-color:#b3dbce33; font-size:85%;">
 
+
<div style="font-size:118%;">
{{Vspace}}
+
<b>Abstract:</b><br />
 
 
 
 
{{LIVE}}
 
 
 
{{Vspace}}
 
 
 
 
 
</div>
 
<div id="ABC-unit-framework">
 
== Abstract ==
 
 
<section begin=abstract />
 
<section begin=abstract />
<!-- included from "./components/RPR-RegEx.components.txt", section: "abstract" -->
 
 
Using regular expressions is a key skill for all aspects of data science, and, by extension, for bioinformatics. There is a bit of a learning curve involved, but once you are even slightly comfortable with their syntax and use in code, you will appreciate their power in automating otherwise unbearably tedious tasks. This unit introduces the principles and the syntax, and provides a fair amount of practice.
 
Using regular expressions is a key skill for all aspects of data science, and, by extension, for bioinformatics. There is a bit of a learning curve involved, but once you are even slightly comfortable with their syntax and use in code, you will appreciate their power in automating otherwise unbearably tedious tasks. This unit introduces the principles and the syntax, and provides a fair amount of practice.
 
<section end=abstract />
 
<section end=abstract />
 
+
</div>
{{Vspace}}
+
<!-- ============================ -->
 
+
<hr>
 
+
<table>
== This unit ... ==
+
<tr>
=== Prerequisites ===
+
<td style="padding:10px;">
<!-- included from "./components/RPR-RegEx.components.txt", section: "prerequisites" -->
+
<b>Objectives:</b><br />
<!-- included from "./data/ABC-unit_components.txt", section: "notes-prerequisites" -->
 
You need to complete the following units before beginning this one:
 
*[[RPR-Introduction|RPR-Introduction (Introduction to R)]]
 
 
 
{{Vspace}}
 
 
 
 
 
=== Objectives ===
 
<!-- included from "./components/RPR-RegEx.components.txt", section: "objectives" -->
 
 
This unit will ...
 
This unit will ...
 
* ... introduce regular expressions;
 
* ... introduce regular expressions;
 
* ... demonstrate their use in R functions;
 
* ... demonstrate their use in R functions;
 
* ... teach how to apply them in common tasks.
 
* ... teach how to apply them in common tasks.
 
+
</td>
{{Vspace}}
+
<td style="padding:10px;">
 
+
<b>Outcomes:</b><br />
 
 
=== Outcomes ===
 
<!-- included from "./components/RPR-RegEx.components.txt", section: "outcomes" -->
 
 
After working through this unit you ...
 
After working through this unit you ...
 
* ... can express pattern-matching tasks as regular expressions and correctly use a variety of functions that use them;
 
* ... can express pattern-matching tasks as regular expressions and correctly use a variety of functions that use them;
 
* ... are familiar with online regex testing sites that help you troubleshoot your expressions during development;
 
* ... are familiar with online regex testing sites that help you troubleshoot your expressions during development;
 
* ... have written code that uses regular expressions for a variety of purposes.
 
* ... have written code that uses regular expressions for a variety of purposes.
 +
</td>
 +
</tr>
 +
</table>
 +
<!-- ============================  -->
 +
<hr>
 +
<b>Deliverables:</b><br />
 +
<section begin=deliverables />
 +
<li><b>Time management</b>: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.</li>
 +
<li><b>Journal</b>: Document your progress in your [[FND-Journal|Course Journal]]. Some tasks may ask you to include specific items in your journal. Don't overlook these.</li>
 +
<li><b>Insights</b>: If you find something particularly noteworthy about this unit, make a note in your [[ABC-Insights|'''insights!''' page]].</li>
 +
<section end=deliverables />
 +
<!-- ============================  -->
 +
<hr>
 +
<section begin=prerequisites />
 +
<b>Prerequisites:</b><br />
 +
This unit builds on material covered in the following prerequisite units:<br />
 +
*[[RPR-Introduction|RPR-Introduction (Introduction to R)]]
 +
<section end=prerequisites />
 +
<!-- ============================  -->
 +
</div>
 +
 +
{{Smallvspace}}
 +
 +
  
{{Vspace}}
+
{{Smallvspace}}
  
  
=== Deliverables ===
+
__TOC__
<!-- included from "./components/RPR-RegEx.components.txt", section: "deliverables" -->
 
<!-- included from "./data/ABC-unit_components.txt", section: "deliverables-time_management" -->
 
*<b>Time management</b>: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
 
<!-- included from "./data/ABC-unit_components.txt", section: "deliverables-journal" -->
 
*<b>Journal</b>: Document your progress in your [[FND-Journal|Course Journal]]. Some tasks may ask you to include specific items in your journal. Don't overlook these.
 
<!-- included from "./data/ABC-unit_components.txt", section: "deliverables-insights" -->
 
*<b>Insights</b>: If you find something particularly noteworthy about this unit, make a note in your [[ABC-Insights|'''insights!''' page]].
 
  
 
{{Vspace}}
 
{{Vspace}}
  
  
</div>
+
=== Evaluation ===
<div id="BIO">
+
<b>Evaluation: NA</b><br />
 +
<div style="margin-left: 2rem;">This unit is not evaluated for course marks.</div>
 
== Contents ==
 
== Contents ==
<!-- included from "./components/RPR-RegEx.components.txt", section: "contents" -->
 
  
 
==First steps==
 
==First steps==
Line 129: Line 119:
  
 
;Within square brackets, we can specify characters that should NOT be matched, with the "caret", <code>^</code>.
 
;Within square brackets, we can specify characters that should NOT be matched, with the "caret", <code>^</code>.
:<code>[^0-9]</code> matches everything EXCEPT digits. <code>[^a-z]</code> matches everything that is not a lower-case letter. That's what we would need to remove characters that do not represent amino acids.
+
:<code>[^0-9]</code> matches everything EXCEPT digits. <code>[^a-z]</code> matches everything that is not a lower-case letter. That's what we would need to remove characters that do not represent amino acids. Note that '''outside''' of the square brackets the caret means "beginning of the string". When you see a caret, you need to consider its context carefully.
  
 
}}
 
}}
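
The two meanings of the caret can be compared side by side in R. The following is only a minimal sketch in base R; the string <code>s</code> is made up for illustration and is not part of the course material.

```r
# A made-up example string (an assumption, for illustration only)
s <- "mkl 123 AVA"

# Inside square brackets, ^ negates the character class:
# remove everything that is NOT a lower-case letter
onlyLower <- gsub("[^a-z]", "", s)
print(onlyLower)

# Outside square brackets, ^ anchors the match to the beginning of the string
startsWithM <- grepl("^m", s)   # the string begins with "m"
startsWithA <- grepl("^A", s)   # "A" occurs in s, but not at the beginning
print(c(startsWithM, startsWithA))
```

The same character thus negates in one context and anchors in the other; only its position relative to the square brackets tells them apart.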
Line 158: Line 148:
 
===Perl and POSIX===
 
===Perl and POSIX===
  
Two dialects of regular expressions exist; they differ in some details of syntax. One is the nearly universal "Perl" dialect ({{WP|Perl}} is a programming language), the other one is the "POSIX" standard that nearly no one uses. Except R. Tragically, in R the POSIX standard is the default. Fortunately this often does not make a difference, and we can explicitly turn this nonsense off. But just like the utterly annoying <code>stringsAsFactors = FALSE</code>, we need to type <code>perl = TRUE</code> much more often than we would like. Somebody, some time, made a wrong design decision and thousands of wasted man- and woman-hours later we are still stuck with the consequences. If you use regular expressions according to the POSIX standard, you have to learn the Perl standard anyway. But then you can just use the Perl standard in the first place. The {{WP|Regular expression|Wikipedia page on Regular Expressions}} has a table with a side-by-side comparison of the different ways the two standards express character classes. Also see the help page on <code>regex</code> in R for details.
+
Two dialects of regular expressions exist; they differ in some details of syntax. One is the nearly universal "Perl" dialect ({{WP|Perl}} is a programming language), the other one is the "POSIX" standard that nearly no one uses. Except R. Tragically, in R the POSIX standard is the default. Fortunately this often does not make a difference, and we can explicitly turn this nonsense off. But we need to type <code>perl = TRUE</code> much more often than we would like. Somebody, some time, made a wrong design decision and thousands of wasted man- and woman-hours later we are still stuck with the consequences. If you use regular expressions according to the POSIX standard, you have to learn the Perl standard anyway. But then you can just use the Perl standard in the first place. The {{WP|Regular expression|Wikipedia page on Regular Expressions}} has a table with a side-by-side comparison of the different ways the two standards express character classes. Also see the help page on <code>regex</code> in R for details.
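
To see the two dialects next to each other, here is a small sketch in base R. The example vector <code>x</code> is made up; the sketch assumes nothing beyond the <code>perl = TRUE</code> argument mentioned above.

```r
x <- c("Mbp1", "Swi4", "abc")   # made-up example strings

# POSIX-style character class, used with R's default engine
posixHits <- grepl("[[:digit:]]", x)

# Perl-style shorthand for the same class, with the Perl dialect switched on
perlHits <- grepl("\\d", x, perl = TRUE)

print(identical(posixHits, perlHits))

# Lookbehind is a Perl-dialect construct that is not part of the POSIX
# standard: replace a digit only if it directly follows a lower-case letter
y <- gsub("(?<=[a-z])\\d", "#", x, perl = TRUE)
print(y)
```

For simple character classes the choice of dialect makes no difference to the result; it starts to matter once a pattern uses Perl-only constructs such as lookbehind.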
  
 
{{Vspace}}
 
{{Vspace}}
Line 179: Line 169:
 
Regular expressions in R are strings, thus they are enclosed in quotation marks.
 
Regular expressions in R are strings, thus they are enclosed in quotation marks.
  
<source lang="rsplus">
+
<pre>
 
"a"
 
"a"
</source>
+
</pre>
  
 
is a regular expression. It specifies the single, literal character <code>a</code> exactly.
 
is a regular expression. It specifies the single, literal character <code>a</code> exactly.
Line 193: Line 183:
 
But there is a catch in R, relating to '''when''' the escape character is interpreted. Remember that "<code>\n</code>" is a linebreak in a string, "<code>\t</code>" is a tab, etc. Obviously if you write "<code>\?</code>" (a literal question mark in a regex), or "<code>\+</code>" (a literal plus sign in a regex) into a regular string, the mechanism that parses the string is going to see the escape character, then it expects an "n" or a "t" or the like - but what it gets instead is something it doesn't know. So it throws an error. Try:
 
But there is a catch in R, relating to '''when''' the escape character is interpreted. Remember that "<code>\n</code>" is a linebreak in a string, "<code>\t</code>" is a tab, etc. Obviously if you write "<code>\?</code>" (a literal question mark in a regex), or "<code>\+</code>" (a literal plus sign in a regex) into a regular string, the mechanism that parses the string is going to see the escape character, then it expects an "n" or a "t" or the like - but what it gets instead is something it doesn't know. So it throws an error. Try:
  
<source lang="rsplus">
+
<pre>
 
"\n" # fine
 
"\n" # fine
 
"\?" # Error: ...
 
"\?" # Error: ...
</source>
+
</pre>
  
 
But then how can we write something like "<code>\?</code>" when we need it? That becomes obvious when you consider what happens with the string: it gets sent to the regex engine for interpretation. Thus the regex engine needs to see: character "\", then character "?". So it needs two characters. The secret is: we need to prevent "\" from attaching to the next character, and specify it as a single character in its own right. We do that by '''escaping''' "\" itself - '''with a backslash'''. Thus "<code>\\</code>" is a literal "\" character - and can get sent to the regex engine.
 
But then how can we write something like "<code>\?</code>" when we need it? That becomes obvious when you consider what happens with the string: it gets sent to the regex engine for interpretation. Thus the regex engine needs to see: character "\", then character "?". So it needs two characters. The secret is: we need to prevent "\" from attaching to the next character, and specify it as a single character in its own right. We do that by '''escaping''' "\" itself - '''with a backslash'''. Thus "<code>\\</code>" is a literal "\" character - and can get sent to the regex engine.
  
<source lang="rsplus">
+
<pre>
 
"\\?" # ok
 
"\\?" # ok
 
cat("\\?") # that's what the regex engine sees.
 
cat("\\?") # that's what the regex engine sees.
</source>
+
</pre>
  
 
The consequence is: you need to write a doubled "\\" in R when you want a single "\". This works differently from programming languages that pass patterns to the regex engine as-is. You need to be aware of this, for example when you develop a pattern in an online regex tool, and then copy it back into your R code. You need to double all occurrences of "<code>\</code>" in your R string.
 
The consequence is: you need to write a doubled "\\" in R when you want a single "\". This works differently from programming languages that pass patterns to the regex engine as-is. You need to be aware of this, for example when you develop a pattern in an online regex tool, and then copy it back into your R code. You need to double all occurrences of "<code>\</code>" in your R string.
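
The doubling rule can be checked directly. The following is a hedged sketch in base R: the test strings are made up, and the raw-string syntax in the last step assumes R 4.0.0 or later.

```r
# The R string "\\?" contains two characters: a backslash and a question mark
p <- "\\?"
nChars <- nchar(p)
print(nChars)

# The regex engine receives \? and matches a literal question mark
hasQ <- grepl(p, c("why?", "why"))
print(hasQ)

# A pattern developed in an online tester as  \d+\.\d+  has to be written
# with every backslash doubled in the R string:
txt <- "pH 7.4"                      # made-up example text
num <- regmatches(txt, regexpr("\\d+\\.\\d+", txt))
print(num)

# Assumption: R >= 4.0.0. Raw strings r"(...)" are taken as-is, so a
# pattern can be pasted without doubling the backslashes.
same <- identical(r"(\d+\.\d+)", "\\d+\\.\\d+")
print(same)
```

If your R version predates raw strings, leave out the last step and double the backslashes by hand.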
Line 320: Line 310:
 
simple, exact string-matching is enough. R uses string matching in character equality (<code>==</code>) and by extension, the set operation functions (<code>union(), intersect()</code> etc.), the <code>match()</code> function, and the <code>%in%</code> operator.
 
simple, exact string-matching is enough. R uses string matching in character equality (<code>==</code>) and by extension, the set operation functions (<code>union(), intersect()</code> etc.), the <code>match()</code> function, and the <code>%in%</code> operator.
  
<source lang="R">
+
<pre>
  
 
vA <- c("the", "quick", "brown", "fox", "jumps", "over", "a", "lazy", "dog")
 
vA <- c("the", "quick", "brown", "fox", "jumps", "over", "a", "lazy", "dog")
Line 355: Line 345:
 
# which is, of course, the intersection set operation.
 
# which is, of course, the intersection set operation.
 
intersect(vA, vB)
 
intersect(vA, vB)
</source>
+
</pre>
  
 
{{Vspace}}
 
{{Vspace}}
Line 369: Line 359:
 
<!-- for updates, see code in R_Exercise -Bioinformatics "Sequence.R script -->
 
<!-- for updates, see code in R_Exercise -Bioinformatics "Sequence.R script -->
  
<source lang="R">
+
<pre>
  
 
# grep() is like match(), but uses regular expressions. A variant of grep() that
 
# grep() is like match(), but uses regular expressions. A variant of grep() that
 
# returns a boolean vector - like "==" does - is grepl(). That is useful
 
# returns a boolean vector - like "==" does - is grepl(). That is useful
# becoause we can & or | the vector, or invert it with ! .
+
# because we can & or | the vector, or invert it with ! .
  
 
grep("fox", vA)
 
grep("fox", vA)
Line 385: Line 375:
 
vA[! grepl("o", vA)] # subset all words without "o"
 
vA[! grepl("o", vA)] # subset all words without "o"
  
</source>
+
</pre>
  
 
{{Vspace}}
 
{{Vspace}}
Line 393: Line 383:
 
Consider the following regular expression:
 
Consider the following regular expression:
  
<source lang="R">
+
<pre>
  
 
patt <- "^\\s*#"
 
patt <- "^\\s*#"
  
</source>
+
</pre>
  
  
Line 412: Line 402:
 
The following example would read a file into a vector of lines, then drop all lines that are empty, and all lines that are comments. This is a straightforward idiom.
 
The following example would read a file into a vector of lines, then drop all lines that are empty, and all lines that are comments. This is a straightforward idiom.
  
<source lang="R">
+
<pre>
  
 
IN <- "test.txt"
 
IN <- "test.txt"
Line 421: Line 411:
 
myData <- myData[! grepl(patt, myData)]  # drop all elements that match the pattern
 
myData <- myData[! grepl(patt, myData)]  # drop all elements that match the pattern
  
</source>
+
</pre>
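
If you have no suitable file at hand, the same idiom can be tried on a character vector. This is only a sketch: the "file contents" below are made up and stand in for what <code>readLines()</code> would return.

```r
# Made-up file contents, one element per line (an assumption for illustration)
myData <- c("# a comment",
            "",
            "gene\tvalue",
            "   # an indented comment",
            "MBP1\t42")

patt <- "^\\s*#"                          # optional whitespace, then "#"

myData <- myData[myData != ""]            # drop all empty lines
myData <- myData[! grepl(patt, myData)]   # drop all lines that match the pattern
print(myData)
```

Only the two data lines remain; the indented comment is caught because the pattern allows leading whitespace before the hash.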
  
 
{{Vspace}}
 
{{Vspace}}
Line 429: Line 419:
 
Think of "gsub" as "global substitution", and you'll understand that there exists another function, <code>sub()</code>, that replaces only the first occurrence of a pattern, rather than all of them as <code>gsub()</code> does. I can't imagine what the use case for that might be and I don't think I have ever used <code>sub()</code>. I get an intuitive sense that code that needs such a function should probably be reconceived. But <code>gsub()</code> is very useful.
 
Think of "gsub" as "global substitution", and you'll understand that there exists another function, <code>sub()</code>, that replaces only the first occurrence of a pattern, rather than all of them as <code>gsub()</code> does. I can't imagine what the use case for that might be and I don't think I have ever used <code>sub()</code>. I get an intuitive sense that code that needs such a function should probably be reconceived. But <code>gsub()</code> is very useful.
  
<source lang="R">
+
<pre>
  
 
(s <- "  1 MKLAACFLTL LPGFAVA... 17  ") # E. coli alpha-amylase signal peptide
 
(s <- "  1 MKLAACFLTL LPGFAVA... 17  ") # E. coli alpha-amylase signal peptide
Line 443: Line 433:
 
s <- gsub("[^ACDEFGHIKLMNPQRSTVWY]", "", s)
 
s <- gsub("[^ACDEFGHIKLMNPQRSTVWY]", "", s)
  
</source>
+
</pre>
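
The difference between <code>sub()</code> and <code>gsub()</code> is easiest to see when both are applied to the same string. A minimal sketch; the string is a shortened, made-up variant of the signal peptide used in this section.

```r
s <- "MKLAACFLTL LPGFAVA"        # made-up example string

firstOnly <- sub("A", "-", s)    # replaces only the first "A"
allOfThem <- gsub("A", "-", s)   # replaces every "A"

print(firstOnly)   # "MKL-ACFLTL LPGFAVA"
print(allOfThem)   # "MKL--CFLTL LPGF-V-"
```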
  
 
{{Vspace}}
 
{{Vspace}}
Line 451: Line 441:
 
Another function that makes use of regular expressions is <code>strsplit()</code>. It takes a vector of strings, and returns a list, one element for each element of the vector, in which each string has been split up by separating it along a regular expression.
 
Another function that makes use of regular expressions is <code>strsplit()</code>. It takes a vector of strings, and returns a list, one element for each element of the vector, in which each string has been split up by separating it along a regular expression.
  
<source lang="R">
+
<pre>
 
x <- c("a b c", "1 2")
 
x <- c("a b c", "1 2")
 
strsplit(x, " ")
 
strsplit(x, " ")
Line 459: Line 449:
 
# [[2]]
 
# [[2]]
 
# [1] "1" "2"
 
# [1] "1" "2"
</source>
+
</pre>
  
 
Since even a single string returns a list, you often have to extract the element you want as a vector for further use.
 
Since even a single string returns a list, you often have to extract the element you want as a vector for further use.
  
<source lang="R">
+
<pre>
 
corvidae <- c("crow:jackdaw:jay:magpie:raven:rook")
 
corvidae <- c("crow:jackdaw:jay:magpie:raven:rook")
 
strsplit(corvidae, ":")
 
strsplit(corvidae, ":")
Line 473: Line 463:
 
length(strsplit(corvidae, ":"))
 
length(strsplit(corvidae, ":"))
 
length(unlist(strsplit(corvidae, ":")))
 
length(unlist(strsplit(corvidae, ":")))
</source>
+
</pre>
  
  
 
<code>strsplit()</code> is immensely useful to extract elements from strings with a relatively well defined structure.
 
<code>strsplit()</code> is immensely useful to extract elements from strings with a relatively well defined structure.
  
<source lang="R">
+
<pre>
 
s <- "1, 1, 2, 3, 5, 8"
 
s <- "1, 1, 2, 3, 5, 8"
 
strsplit(s, ", ")[[1]] # split on comma-space
 
strsplit(s, ", ")[[1]] # split on comma-space
Line 488: Line 478:
 
strsplit(s, "\\t|\\n")[[1]]  # split on tab or newline
 
strsplit(s, "\\t|\\n")[[1]]  # split on tab or newline
  
</source>
+
</pre>
  
  
Line 499: Line 489:
  
  
====Capturing matches ====
+
====Capturing and using  matches ====
 +
 
 +
Matches can be captured and used, e.g. in <code>gsub()</code>.
 +
 
 +
<pre>
 +
# Capture matches by placing them in parentheses. To immediately reuse them,
# refer to them with "backreferences": \\1, \\2, \\3.
 +
 
 +
# Example 1:
 +
# The beginning and ending three words of some text...
 +
s <- "I know, however, that its precarious and remote villages lie within the lowlands of the Wisla River."
 +
gsub("^((\\S+\\s+){3}).*((\\s\\S+){3})$", "\\1 ... \\3", s)
 +
 
 +
# Note: matches \\2 and \\4 are the inner parentheses that are there to
 +
# group things to be found {3}-times.
 +
 
 +
 
 +
# Example 2:
 +
# A binomial species name has a genus, a species, and possibly a strain name.
 +
# We use \\S (not whitespace) and \\s (whitespace) to tease this apart into
 +
# three captured expressions:
 +
s <- "Saccharomyces cerevisiae S288C"
 +
gsub("^(\\S+)\\s(\\S+)\\s*(.*)$",
 +
    "genus: \\1; species: \\1 \\2; (strain: \\3)",
 +
    s)
 +
 +
 
 +
</pre>
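
Backreferences are also handy for rearranging the parts of a string. The following sketch reorders the fields of ISO-style dates; the dates and the target format are made up for illustration.

```r
d <- c("2020-09-22", "2017-08-05")   # made-up example dates (year-month-day)

# Three captures: year (\\1), month (\\2), day (\\3).
# The replacement string puts them back in reverse order.
reordered <- gsub("^(\\d{4})-(\\d{2})-(\\d{2})$", "\\3.\\2.\\1", d)
print(reordered)   # "22.09.2020" "05.08.2017"
```

Because <code>gsub()</code> is vectorized, the pattern is applied to every element of <code>d</code> in one call.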
 +
 
 +
====Capturing and returning matches ====
  
 
Finding and '''returning''' matches in R is a two-step process. (1) find matches with <code>regexpr()</code> (one match), <code>gregexpr()</code> (all matches), or <code>regexec()</code> (sub-expressions in parentheses). All of these return a "match object". (2) use the match object to extract the matching substrings from the original string.
 
Finding and '''returning''' matches in R is a two-step process. (1) find matches with <code>regexpr()</code> (one match), <code>gregexpr()</code> (all matches), or <code>regexec()</code> (sub-expressions in parentheses). All of these return a "match object". (2) use the match object to extract the matching substrings from the original string.
  
  
<source lang="R">
+
<pre>
  
  
Line 574: Line 592:
 
# Solution 2 (probably preferred): you can use
 
# Solution 2 (probably preferred): you can use
 
# str_match_all() from the very useful library "stringr" ...
 
# str_match_all() from the very useful library "stringr" ...
if (! require(stringr, quietly=TRUE)) {
+
if (! requireNamespace("stringr", quietly=TRUE)) {
 
   install.packages("stringr")
 
   install.packages("stringr")
  library(stringr)
 
 
}
 
}
 
# Package information:
 
# Package information:
Line 584: Line 601:
  
  
str_match_all(s, patt)
+
stringr::str_match_all(s, patt)
str_match_all(s, patt)[[1]][,2]
+
stringr::str_match_all(s, patt)[[1]][,2]
 
# [1] "CLN1"  "CLN2"  "HCS26" "SWI4"
 
# [1] "CLN1"  "CLN2"  "HCS26" "SWI4"
  
Line 591: Line 608:
 
# the two-step code.
 
# the two-step code.
  
</source>
+
</pre>
  
  
Line 597: Line 614:
  
  
<source lang="R">
+
<pre>
if (! require(ore, quietly = TRUE)) {
+
if (! requireNamespace("ore", quietly = TRUE)) {
 
     install.packages("ore")
 
     install.packages("ore")
    library(ore)
 
 
}
 
}
 
# Package information:
 
# Package information:
Line 610: Line 626:
 
S <- "The quick brown fox jumps over a lazy dog"
 
S <- "The quick brown fox jumps over a lazy dog"
  
ore.search(". .", S)
+
ore::ore.search(". .", S)
ore.search(". .", S, all=TRUE)
+
ore::ore.search(". .", S, all=TRUE)
M <- ore.search(". .", S, all=TRUE)
+
M <- ore::ore.search(". .", S, all=TRUE)
 
M$nMatches
 
M$nMatches
 
M$match[2:4]
 
M$match[2:4]
</source>
+
</pre>
  
 
According to the author John Clayden, key advantages include:
 
According to the author John Clayden, key advantages include:
Line 631: Line 647:
 
A number of modifiers can be applied as arguments to regular expression functions that may be useful. Here are the two most important ones.
 
A number of modifiers can be applied as arguments to regular expression functions that may be useful. Here are the two most important ones.
  
<source lang="R">
+
<pre>
  
 
# Option "ignore.case" allows case-insensitive matches. This is usually
 
# Option "ignore.case" allows case-insensitive matches. This is usually
Line 654: Line 670:
 
# complex regular expressions inline.
 
# complex regular expressions inline.
  
library(stringr)
 
  
myRegex <- regex("\\b            # word boundary
+
myRegex <- stringr::regex("\\b            # word boundary
                (              # begin capture
+
                          (              # begin capture
                [A-Z]          # one uppercase letter
+
                          [A-Z]          # one uppercase letter
                [A-Z0-9\\-_]+  # one or more letters, numbers, hyphen or
+
                          [A-Z0-9\\-_]+  # one or more letters, numbers, hyphen or
                                #  underscore
+
                                          #  underscore
                [A-Z0-9]        # one letter or number.
+
                          [A-Z0-9]        # one letter or number.
                # Note: this captured subexpression has a minimum length of 3.
+
                          # Note: this captured subexpression has a minimum length of 3.
                )              # end capture
+
                          )              # end capture
                \\b",          # word boundary
+
                          \\b",          # word boundary
                comments = TRUE)
+
                          comments = TRUE)
  
str_match_all(s, myRegex)[[1]][2]
+
stringr::str_match_all(s, myRegex)[[1]][2]
  
</source>
+
</pre>
  
  
Line 676: Line 691:
 
By default, quantitative matches except zero/one (i.e. the ? character) are "greedy", i.e. they will match the largest possible number of characters.  For example:
 
By default, quantitative matches except zero/one (i.e. the ? character) are "greedy", i.e. they will match the largest possible number of characters.  For example:
  
<source lang="R">
+
<pre>
 
s <- "abc123"
 
s <- "abc123"
  
 
patt <- "(\\w+)(\\d+)"  # word characters, followed by digits. This pattern ...
 
patt <- "(\\w+)(\\d+)"  # word characters, followed by digits. This pattern ...
str_match_all(s, patt)[[1]][-1]
+
stringr::str_match_all(s, patt)[[1]][-1]
  
 
# ...yields "abc12" and "3" . This is because \w+ is greedy and grabs as many
 
# ...yields "abc12" and "3" . This is because \w+ is greedy and grabs as many
Line 687: Line 702:
  
 
patt <- "(\\w+?)(\\d+)"  # Note the questionmark in (\\w+?)
 
patt <- "(\\w+?)(\\d+)"  # Note the questionmark in (\\w+?)
str_match_all(s, patt)[[1]][-1]
+
stringr::str_match_all(s, patt)[[1]][-1]
  
 
# ... now \d+ gets a chance to match as many digits as possible
 
# ... now \d+ gets a chance to match as many digits as possible
  
</source>
+
</pre>
  
 
{{Vspace}}
 
{{Vspace}}
Line 701: Line 716:
 
===PHP===
 
===PHP===
  
<source lang="PHP">
+
<pre>
 
<?php
 
<?php
 
$string = "The quick brown fox jumps over a lazy dog";
 
$string = "The quick brown fox jumps over a lazy dog";
Line 743: Line 758:
  
 
?>
 
?>
</source>
+
</pre>
  
 
{{Vspace}}
 
{{Vspace}}
Line 765: Line 780:
  
  
<source lang="python">
+
<pre>
 
# parse_SVG_example.py
 
# parse_SVG_example.py
 
# Read an svg file line by line and process path data
 
# Read an svg file line by line and process path data
Line 805: Line 820:
 
OUT.close()
 
OUT.close()
  
</source>
+
</pre>
  
 
{{Vspace}}
 
{{Vspace}}
Line 813: Line 828:
 
Javascript is probably the most convenient choice if you want to interact directly with the DOM (Document Object Model) of a Web page. Here is example code for a "bookmarklet" that rewrites the URL of a journal-article page for access from outside the UofT network.
 
Javascript is probably the most convenient choice if you want to interact directly with the DOM (Document Object Model) of a Web page. Here is example code for a "bookmarklet" that rewrites the URL of a journal-article page for access from outside the UofT network.
  
<source lang="javascript">
+
<pre>
 
javascript:(function(){
 
javascript:(function(){
 
   var url=window.location.href;
 
   var url=window.location.href;
Line 823: Line 838:
 
void 0
 
void 0
  
</source>
+
</pre>
  
 
Put this into the body of an arbitrary bookmark in your browser, then click it to be redirected to a paywalled journal article via our library's free access system.
 
Put this into the body of an arbitrary bookmark in your browser, then click it to be redirected to a paywalled journal article via our library's free access system.
Line 927: Line 942:
  
 
{{Vspace}}
 
{{Vspace}}
 
 
{{Vspace}}
 
 
  
 
== Further reading, links and resources ==
 
== Further reading, links and resources ==
Line 943: Line 954:
 
{{Vspace}}
 
{{Vspace}}
  
 
== Notes ==
 
<!-- included from "./components/RPR-RegEx.components.txt", section: "notes" -->
 
<!-- included from "./data/ABC-unit_components.txt", section: "notes" -->
 
<references />
 
 
{{Vspace}}
 
 
 
</div>
 
<div id="ABC-unit-framework">
 
== Self-evaluation ==
 
<!-- included from "./components/RPR-RegEx.components.txt", section: "self-evaluation" -->
 
<!--
 
=== Question 1===
 
 
Question ...
 
 
<div class="toccolours mw-collapsible mw-collapsed" style="width:800px">
 
Answer ...
 
<div class="mw-collapsible-content">
 
Answer ...
 
 
</div>
 
  </div>
 
 
  {{Vspace}}
 
 
-->
 
 
{{Vspace}}
 
 
 
 
{{Vspace}}
 
 
 
<!-- included from "./data/ABC-unit_components.txt", section: "ABC-unit_ask" -->
 
 
----
 
 
{{Vspace}}
 
 
<b>If in doubt, ask!</b> If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.
 
 
----
 
 
{{Vspace}}
 
  
 
<div class="about">
 
<div class="about">
Line 1,000: Line 963:
 
:2017-08-05
 
:2017-08-05
 
<b>Modified:</b><br />
 
<b>Modified:</b><br />
:2017-10-01
+
:2020-09-22
 
<b>Version:</b><br />
 
<b>Version:</b><br />
:1.0
+
:1.2
 
<b>Version history:</b><br />
 
<b>Version history:</b><br />
 +
*1.2 2020 Maintenance, added gsub() capture and backreference
 +
*1.1 Change from require() to requireNamespace() and use &lt;package&gt;::&lt;function&gt;() idiom.
 
*1.0 First live version, translated from Perl examples in old version
 
*1.0 First live version, translated from Perl examples in old version
 
*0.1 First stub
 
*0.1 First stub
 
</div>
 
</div>
[[Category:ABC-units]]
 
<!-- included from "./data/ABC-unit_components.txt", section: "ABC-unit_footer" -->
 
  
 
{{CC-BY}}
 
{{CC-BY}}
  
 +
[[Category:ABC-units]]
 +
{{UNIT}}
 +
{{LIVE}}
 
</div>
 
</div>
 
<!-- [END] -->
 
<!-- [END] -->

Latest revision as of 09:29, 25 September 2020

Regular Expressions (regex) with R

(Regular expressions)


 


Abstract:

Using regular expressions is a key skill for all aspects of data science, and, by extension, for bioinformatics. There is a bit of a learning curve involved, but once you are even slightly comfortable with their syntax and use in code, you will appreciate their power in automating otherwise unbearably tedious tasks. This unit introduces the principles and the syntax, and provides a fair amount of practice.


Objectives:
This unit will ...

  • ... introduce regular expressions;
  • ... demonstrate their use in R functions;
  • ... teach how to apply them in common tasks.

Outcomes:
After working through this unit you ...

  • ... can express pattern-matching tasks as regular expressions and correctly use a variety of functions that use them;
  • ... are familiar with online regex testing sites that help you troubleshoot your expressions during development;
  • ... have written code that uses regular expressions for a variety of purposes.

Deliverables:

  • Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
  • Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.
  • Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.

  • Prerequisites:
    This unit builds on material covered in the following prerequisite units:

      • RPR-Introduction (Introduction to R)

    Evaluation

    Evaluation: NA

    This unit is not evaluated for course marks.

    Contents

    First steps

    A Regular Expression is a specification of a pattern of characters. The typical use of a regular expression is to find occurrences of the pattern in a string. Regular expressions have a flexible syntax that allows them to handle a range of tasks - from trivial substring matches to complex nested motifs. The syntax of regular expressions is a programming language in its own right, and is a powerful way of concisely and uniquely defining a pattern.

    Regular expressions are examples of deterministic pattern matching - they either match a particular expression or not. This is in contrast to probabilistic pattern matching in which a pattern is more or less similar to a query.

    Truth be told, many programmers have a love-hate relationship with regular expressions. The syntax of regular expressions is very powerful and expressive, but also terse, not always intuitive, and sometimes hard to understand. Let's try a few simple things:

    Here is a string to play with: the sequence of Mbp1, copied from the NCBI Protein database page for yeast Mbp1.

           1 msnqiysary sgvdvyefih stgsimkrkk ddwvnathil kaanfakakr trilekevlk
          61 ethekvqggf gkyqgtwvpl niakqlaekf svydqlkplf dftqtdgsas pppapkhhha
         121 skvdrkkair sastsaimet krnnkkaeen qfqsskilgn ptaaprkrgr pvgstrgsrr
         181 klgvnlqrsq sdmgfprpai pnssisttql psirstmgpq sptlgileee rhdsrqqqpq
         241 qnnsaqfkei dledglssdv epsqqlqqvf nqntgfvpqq qssliqtqqt esmatsvsss
         301 pslptspgdf adsnpfeerf pgggtspiis miprypvtsr pqtsdindkv nkylsklvdy
         361 fisnemksnk slpqvllhpp phsapyidap idpelhtafh wacsmgnlpi aealyeagts
         421 irstnsqgqt plmrsslfhn sytrrtfpri fqllhetvfd idsqsqtvih hivkrksttp
         481 savyyldvvl skikdfspqy rielllntqd kngdtalhia skngdvvffn tlvkmgaltt
         541 isnkegltan eimnqqyeqm miqngtnqhv nssntdlnih vntnnietkn dvnsmvimsp
         601 vspsdyityp sqiatnisrn ipnvvnsmkq masiyndlhe qhdneikslq ktlksisktk
         661 iqvslktlev lkesskdeng eaqtnddfei lsrlqeqntk klrkrliryk rlikqkleyr
         721 qtvllnklie detqattnnt vekdnntler lelaqeltml qlqrknklss lvkkfednak
         781 ihkyrriire gtemnieevd ssldvilqtl iannnknkga eqiitisnan sha
    //
    


    Task:
    Navigate to http://regexpal.com and paste the sequence into the lower box. This site is one of a number of online regular expression testers; their immediate, visual feedback is invaluable when you are developing regular expression patterns.

    Let's try some expressions:

    Most characters are matched literally.
    Type "a" into the upper box and you will see all "a" characters matched. Then replace a with q.
    Now type "aa" instead. Then krnnkk. Sequences of characters are also matched literally.
    The pipe character | that symbolizes logical OR can be used to specify that any one of several alternatives should match.
    i(s|m|q)n matches isn OR imn OR iqn. Note how we can group with parentheses, and try what happens without them.
    We can more conveniently specify more than one character to match if we place them in square brackets. This is a "character class". We will encounter those frequently.
    [lq] matches l OR q. [milcwyf] matches hydrophobic amino acids.
    Within square brackets, we can specify "ranges".
    [1-5] matches digits from 1 to 5.
    Within square brackets, we can specify characters that should NOT be matched, with the "caret", ^.
    [^0-9] matches everything EXCEPT digits. [^a-z] matches everything that is not a lower-case letter. That's what we would need to remove characters that do not represent amino acids. Note that outside of the square brackets the caret means "beginning of the string". When you see a caret, you need to consider its context carefully.
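
    The same patterns can also be tried directly in R. The following is a minimal sketch; it assumes a 60-residue fragment of the sequence above (residues 541 to 600, with blanks removed), and the variable names are only illustrative.

```r
# Residues 541-600 of the Mbp1 sequence shown above, blanks removed
s <- "isnkegltaneimnqqyeqmmiqngtnqhvnssntdlnihvntnnietkndvnsmvimsp"

hasQQ <- grepl("qq", s)                              # literal match
hits  <- regmatches(s, gregexpr("i(s|m|q)n", s))[[1]] # alternation in a group
n_lq  <- lengths(regmatches(s, gregexpr("[lq]", s)))  # count of l or q

# Character class complement: drop everything that is not a lowercase letter
clean <- gsub("[^a-z]", "", "     541 isnkegltan eimnqqyeqm")

hasQQ   # TRUE
hits    # "isn" "imn" "iqn"
n_lq
clean   # "isnkegltaneimnqqyeqm"
```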


     

    Make frequent use of this site to develop your regular expressions step by step.


     

    Theory

    According to the Chomsky hierarchy regular expressions are a Type-3 (regular) grammar, thus their use forms a regular language. Therefore, like all Type-3 grammatical expressions they can be decided by a finite-state machine, i.e. a "machine" that is defined by possible states, plus triggering conditions that control transitions between states. Think of such automata as (elaborate) if ... else constructs. The "regex" processor translates the search pattern into such an automaton, which is then applied to the search domain - the string in which the occurrence of the pattern is to be sought.
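
    To make the idea of a finite-state machine concrete, here is a small hand-written automaton for the pattern "^ab*c$". This is only an illustrative sketch: the state names and the function runFSM() are made up for this example, and real regex engines build and run their automata quite differently.

```r
# States: "start", "seenA", "accept". Each state lists the characters it
# accepts and the state they lead to; anything else rejects the string.
transitions <- list(
  start  = c(a = "seenA"),
  seenA  = c(b = "seenA", c = "accept"),
  accept = character(0)
)

runFSM <- function(s) {
  state <- "start"
  for (ch in strsplit(s, "")[[1]]) {
    nxt <- transitions[[state]][ch]
    if (length(nxt) == 0 || is.na(nxt)) {   # no transition defined: reject
      return(FALSE)
    }
    state <- unname(nxt)
  }
  state == "accept"                         # must end in the accepting state
}

tests   <- c("ac", "abbbc", "abca", "bc")
fsmSays <- vapply(tests, runFSM, logical(1), USE.NAMES = FALSE)
reSays  <- grepl("^ab*c$", tests)
identical(fsmSays, reSays)                  # both decide the same strings
```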


     

    What are they good for

    Regular expressions support virtually all pattern matching tasks in data clean-up, extracting information items, data mining, "screen scraping", parsing of files, subsetting large tables, etc. etc. This means, they must be part of your everyday toolkit.


     

    When should they not be used

    Since regular expressions are Type-3 grammars, they must fail when trying to parse more complex grammars - i.e. grammars that can't be expressed in a regular language. This means you can't reliably parse XML - and in particular HTML - with regular expressions. There is a long discussion on this particular topic however, e.g. see here, and many other similar threads on stackoverflow, and see here for a discussion of when regular expressions should not be used. Use a real XML parser instead.


     

    Perl and POSIX

    Two dialects of regular expressions exist; they differ in some details of syntax. One is the nearly universal "Perl" dialect (Perl is a programming language), the other one is the "POSIX" standard that nearly no one uses. Except R. Tragically, in R the POSIX standard is the default. Fortunately this often does not make a difference, and we can explicitly turn this nonsense off. But we need to type perl = TRUE much more often than we would like. Somebody, some time, made a wrong design decision and thousands of wasted man- and woman-hours later we are still stuck with the consequences. If you use regular expressions according to the POSIX standard, you have to learn the Perl standard anyway. But then you can just use the Perl standard in the first place. The Wikipedia page on Regular Expressions has a table with a side-by-side comparison of the different ways the two standards express character classes. Also see the help page on regex in R for details.
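
    Two features that are only available in the Perl dialect illustrate why perl = TRUE is needed so often. This is a small sketch with made-up example strings, not an exhaustive comparison of the dialects.

```r
txt <- "saccharomyces cerevisiae"

# "\\U" in the replacement (convert what follows to upper case) is Perl-only:
capitalized <- gsub("\\b([a-z])", "\\U\\1", txt, perl = TRUE)
capitalized    # "Saccharomyces Cerevisiae"

# Lookahead assertions are Perl-only too: match "G1" only if a hyphen follows.
lookahead <- grepl("G1(?=-)", c("G1 cyclin", "G1-specific"), perl = TRUE)
lookahead      # FALSE TRUE
```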


     

    Regular Expressions in R

    Regular expressions in R can be used

    • to match patterns in strings for use in if() or while() conditions, or to retrieve specific instances of patterns with the regexpr() family of functions;
    • to substitute occurrences of patterns in strings with other strings with gsub();
    • to split strings into substrings that are delimited by the occurrence of a pattern with strsplit();

    ...and more.

    Accordingly, a basic knowledge of regular expressions is needed to read and write code, especially code that parses text.


     

    Syntax

    Regular expressions in R are strings, thus they are enclosed in quotation marks.

    "a"
    

    is a regular expression. It specifies the single, literal character a exactly.


    Specifying symbols

    The power of regular expressions lies in their flexible syntax, which allows us to specify character ranges, classes of characters, unspecified characters, alternatives, and much more. This can sometimes be confusing, because the symbols that specify ranges, options, wildcards and the like are of course themselves characters. Characters that specify information about other characters are called metacharacters; these include ".", "?", "+", "*", "[" and "]", "{" and "}" and more. And the opposite is also possible: some plain characters can be turned into metacharacters to denote character classes.

    The "\" - escape character - allows us to distinguish when a character is to be taken literally and when it is to be interpreted as a metacharacter. Note that some symbols have to be escaped to be read literally, while some letters have to be escaped to be read as metacharacters.

    But there is a catch in R, relating to when the escape character is interpreted. Remember that "\n" is a linebreak in a string, "\t" is a tab, etc. Obviously if you write "\?" (a literal question mark in a regex), or "\+" (a literal plus sign in a regex) into a regular string, the mechanism that parses the string is going to see the escape character, then it expects an "n" or a "t" or the like - but what it gets instead is something it doesn't know. So it throws an error. Try:

    "\n" # fine
    "\?" # Error: ...
    

    But then how can we write something like "\?" when we need it? That becomes obvious when you consider what happens with the string: it gets sent to the regex engine for interpretation. Thus the regex engine needs to see: character "\", then character "?". So it needs two characters. The secret is: we need to prevent "\" from attaching to the next character, and specify it as a single character in its own right. We do that by escaping "\" itself - with a backslash. Thus "\\" is a literal "\" character - and can get sent to the regex engine.

    "\\?" # ok
    cat("\\?") # that's what the regex engine sees.
    

    The consequence is: you need to write "\\" in your R string whenever the regex engine should see a single "\". That works differently from other programming languages that pass patterns to the regex engine as-is. You need to be aware of this, for example when you develop a pattern in an online regex tool, and then copy it back into your R code. You need to double all occurrences of "\" in your R string.
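
    A short sketch of what the doubling means in practice. The raw-string syntax in the last line is an assumption about your R version: it is only available in R 4.0.0 and later.

```r
n <- nchar("\\?")       # the R string holds two characters: "\" and "?"
n                       # 2

# A literal question mark, written as a regex ...
isQuestion <- grepl("\\?", c("Why?", "Because."))
isQuestion              # TRUE FALSE

# ... or without any regex interpretation at all, using fixed = TRUE
isQuestionFixed <- grepl("?", c("Why?", "Because."), fixed = TRUE)

# Raw strings (R >= 4.0.0) let you paste a pattern from a regex tester as-is
sameString <- identical(r"(\?)", "\\?")
sameString              # TRUE
```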

    Letters whose special meaning as a metacharacter is turned on with the escape character:

    Character    Means
    w            the letter "w"
    \w           a "word" character, i.e. one of A-Z, a-z, 0-9 and "_"
    s            the letter "s"
    \s           a "space" character, i.e. one of " ", tab or newline
    b            the letter "b"
    \b           a word boundary
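
    A brief sketch of these escaped letters in use; remember that each "\" is doubled in the R string. The example strings are made up.

```r
s <- "The  quick\tbrown fox"

oneSpace <- gsub("\\s+", " ", s)                    # collapse runs of whitespace
oneSpace                                            # "The quick brown fox"

words <- regmatches(s, gregexpr("\\w+", s))[[1]]    # runs of word characters
words                                               # "The" "quick" "brown" "fox"

# \\b restricts the match to whole words: "foxes" is left alone
wholeWord <- gsub("\\bfox\\b", "cat", "fox foxes fox")
wholeWord                                           # "cat foxes cat"
```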


     

    Metacharacters whose special meaning is turned off with the escape character:

    Character    Means
    +            One or more repetitions of the preceding expression
    \+           the literal character "+"
    \            the escape character
    \\           the literal character "\"
    .            any single character except the newline (\n)
    \.           a literal period

    Note that these examples are not exhaustive. A (nearly) complete table of symbols and meanings is given in the appendix.


     

    Character Classes

    Square brackets specify when more than one specific character can match at a position.

    Expression    Means
    [acgtACGT]    Any non-degenerate nucleotide

    For example: "[AGR]AATT[CTY]" matches all occurrences of an ApoI restriction site, either specified explicitly, or through the nucleotide ambiguity codes R (purines) or Y (pyrimidines).
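
    As a sketch, the ApoI pattern can be applied to a short, made-up DNA string to find the sites and their start positions:

```r
patt <- "[AGR]AATT[CTY]"
dna  <- "TTGAATTCGGCAAATTTCCRAATTYGG"    # hypothetical test sequence

m      <- gregexpr(patt, dna)
sites  <- regmatches(dna, m)[[1]]        # the matching substrings
starts <- as.integer(m[[1]])             # their start positions

sites     # "GAATTC" "AAATTT" "RAATTY"
starts    # 3 12 20
```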

    Within character sets, hyphens can specify character ranges.

    Expression       Means
    [a-z]            lowercase letters
    [0-9]            digits
    [0-9+*/=^\\-]    digits and arithmetic symbols (Note the escaped hyphen)

    If you want to match a literal hyphen, you must escape it. Within character sets, some metacharacters that otherwise have special meanings usually do not need to be escaped.


    The complement

    The caret character "^" denotes the complement of a character set; i.e. everything that is not that expression.

    Expression    Means
    [^9]          Everything but the digit "9"
    [^ACGT]       Not a nucleotide code letter

    Note that outside of square brackets, the "^" character is an "anchoring code" and means "beginning of the string". This can be confusing.

    For many metacharacters that denote character classes, the metacharacter in upper case denotes the complement. This can also be confusing!

    Character    Means
    \w           a word character
    \W           not a word character
    \s           a space character
    \S           not a space character
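
    Complements are most often used for clean-up and validation. A minimal sketch with made-up input strings:

```r
# Remove everything that is not one of the four nucleotide letters
cleaned <- gsub("[^ACGT]", "", "ACGT NNAC-GT\n")
cleaned     # "ACGTACGT"

# Test whether a string contains anything other than A, C, G or T
hasOther <- grepl("[^ACGT]", c("ACGT", "ACGU"))
hasOther    # FALSE TRUE

# \\W is the complement of \\w: drop all non-word characters
noPunct <- gsub("\\W", "", "YAL001C; (TFC3)")
noPunct     # "YAL001CTFC3"
```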


    Specifying quantity

    Special characters in regular expressions control how often a pattern must be present in order to match:

    Expression    What it means                Example (meaning)
    ?             match zero or one times      "? (there may or may not be a quote mark)
    +             match one or more            [A-Z]+ (there's at least one uppercase letter)
    *             match any number             .* (there may be some characters)
    {min,max}     match between min and max times (assumes 0 for min, if min is omitted; assumes infinity for max, if max is omitted).    [atAT]{20,200} (a stretch of between 20 and 200 upper- or lowercase As or Ts)

    For example: "AAUAAA[ACGU]{10,30}$" defines a polyadenylation site - a AAUAAA motif, followed by 10 to 30 of any nucleotide before the end of the RNA.
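
    A sketch of how the polyadenylation pattern behaves. The two RNA strings are constructed for this example only: one has 20 nucleotides after the motif, the other has 40.

```r
patt <- "AAUAAA[ACGU]{10,30}$"

rnaNear <- paste0("GGCC", "AAUAAA", strrep("CU", 10))   # 20 nt after the motif
rnaFar  <- paste0("GGCC", "AAUAAA", strrep("CU", 20))   # 40 nt after the motif

isPolyA <- grepl(patt, c(rnaNear, rnaFar))
isPolyA     # TRUE FALSE: the second motif is too far from the end
```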


    Specifying position (anchoring)

    If a pattern must be matched at a particular location in the string, special terms denote string anchors.

    Anchoring Term    Meaning
    ^                 Start of a line or string
    $                 End of a line or string
    \A                Start of the string
    \Z                End of the string
    \G                Last global match end
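
    The effect of anchoring can be sketched with a few made-up, PDB-style record lines. The \A anchor is used with perl = TRUE here, on the assumption that it is safest to request the Perl dialect for it.

```r
recs <- c("ATOM      1  N   MET A   1",
          "HETATM 1001  O   HOH A 201",
          "REMARK see ATOM records")

anywhere <- grepl("AT", recs)           # unanchored: matches in all three
atStart  <- grepl("^AT", recs)          # only at the beginning of the string
atEnd    <- grepl("records$", recs)     # only at the end of the string
atStartA <- grepl("\\AAT", recs, perl = TRUE)

anywhere    # TRUE TRUE TRUE
atStart     # TRUE FALSE FALSE
atEnd       # FALSE FALSE TRUE
```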


     

    Of course defining a regular expression pattern does not yet do anything with it. Below are the most important R functions that use regular expressions. Write the small code samples that are provided below, play with variations, and test how the operators and regular expressions work.


     

    Functions that don't use regular expressions

    Not all pattern searches in strings use (and need) regular expressions. Sometimes simple, exact string-matching is enough. R uses string matching in character equality (==) and by extension, the set operation functions (union(), intersect() etc.), the match() function, and the %in% operator.

    
    vA <- c("the", "quick", "brown", "fox", "jumps", "over", "a", "lazy", "dog")
    
    vA[2] == "quick"  # TRUE
    vA[2] == "quack"  # FALSE
    
    vA == "fox"  # boolean vector
    
    # match tests for string equality
    match("fox", vA)   # 4, i.e. the 4th element matches the string
    match("o", vA)     # NA: matches have to be to the WHOLE element
    
    # match("fox", vA) is equivalent to...
    which(vA == "fox")
    
    # %in% can be used for creating intersections
    # find whether elements from one vector are
    # contained in another:
    
    vB <- c("Quacking", "the", "duck", "wings", "over", "my", "cozy", "cot")
    
    
    vA %in% vB
    vB %in% vA  # note that the length of the return vector is the same as the
                # length of the first argument. So read this as:
                # "Which of my vB are also in vA"
    
    # We can use this to subset the vector with elements that are present in
    # both:
    
    vB[vB %in% vA]
    
    # which is, of course, the intersection set operation.
    intersect(vA, vB)
    


     

    Functions that use regular expressions

    The general online help page is here. Remember: R's default behaviour is extended POSIX. To be sure which regex dialect is used, pass the perl = TRUE parameter.


     

    grep()

    
    # grep() is like match(), but uses regular expressions. A variant of grep() that
    # returns a boolean vector - like "==" does - is grepl(). That is useful
    # because we can & or | the vector, or invert it with ! .
    
    grep("fox", vA)
    grep("o", vA) # Aha! now we get all elements that contain an "o" -
                  # Because we get partial matches with regular expressions.
    vA[grep("o", vA)] # subset
    
    grepl("o", vA)    # logical
    ! grepl("o", vA)  # its inverse
    
    vA[! grepl("o", vA)] # subset all words without "o"
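
    Two common variations, sketched with the same word vector: grep() can return the matching elements themselves with value = TRUE, and the logical vectors from grepl() can be combined.

```r
vA <- c("the", "quick", "brown", "fox", "jumps", "over", "a", "lazy", "dog")

withO <- grep("o", vA, value = TRUE)     # the elements, not their indices
withO                                    # "brown" "fox" "over" "dog"

# contains an "o" AND begins with one of the letters a to f
both <- vA[grepl("o", vA) & grepl("^[a-f]", vA)]
both                                     # "brown" "fox" "dog"

# exactly three characters long
three <- vA[grepl("^.{3}$", vA)]
three                                    # "the" "fox" "dog"
```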
    
    


     

    Subsetting example

    Consider the following regular expression:

    
    patt <- "^\\s*#"
    
    


    This matches if the string it is applied to begins with a "#", which may or may not be preceded by whitespace. This is useful to identify (and then ignore) comment lines in a data file.

    The regular expression above is decomposed as follows:

    1. ^   the beginning of the line
    2. \\s   any whitespace character ...
    3. *    ... repeated 0 or more times
    4. #    the hash character


    The following example would read a file into a vector of lines, then drop all lines that are empty, and all lines that are comments. This is a straightforward idiom.

    
    IN <- "test.txt"
    patt <- "^\\s*#"
    
    myData <- readLines(IN)
    myData <- myData[myData != ""]  # drop all elements that are the empty string
    myData <- myData[! grepl(patt, myData)]  # drop all elements that match the pattern
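
    To try this idiom without a data file at hand, the following sketch writes a few made-up lines to a temporary file first. The one-step variant at the end is an alternative pattern that is not part of the original idiom; it also treats lines as empty if they contain only whitespace.

```r
tmp <- tempfile(fileext = ".txt")
writeLines(c("# header comment",
             "gene\tvalue",
             "",
             "   # indented comment",
             "MBP1\t42",
             "SWI4\t17 # trailing text is kept"),
           tmp)

patt <- "^\\s*#"
myData <- readLines(tmp)
myData <- myData[myData != ""]             # drop empty strings
myData <- myData[! grepl(patt, myData)]    # drop comment lines
myData

# Variant: one pattern for "comment line" OR "nothing but whitespace"
oneStep <- readLines(tmp)
oneStep <- oneStep[! grepl("^\\s*(#|$)", oneStep)]

unlink(tmp)                                # clean up the temporary file
```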
    
    


     

    Substitution - gsub()

    Think of "gsub" as "global substitution", and you'll understand that there exists another function, sub(), that replaces only the first occurrence of a pattern, rather than all of them as gsub() does. I can't imagine what the use case for that might be and I don't think I have ever used sub(). I get an intuitive sense that code that needs such a function should probably be reconceived. But gsub() is very useful.

    
    (s <- "   1 MKLAACFLTL LPGFAVA... 17   ") # E-coli Alpha Amylase signal peptide
    
    # Drop everything from this string that is not an amino acid one-letter code.
    # We use gsub() to first identify all non-amino acid letters with a character
    # class regular expression, then we replace each occurrence with the empty
    # string.
    
    gsub("[^ACDEFGHIKLMNPQRSTVWY]", "", s)
    
    # or, with assignment: ...
    s <- gsub("[^ACDEFGHIKLMNPQRSTVWY]", "", s)
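
    The difference between sub() and gsub() is easiest to see side by side. This is a small sketch on a made-up string; the last line shows a common gsub() idiom for trimming whitespace at both ends.

```r
s <- "  MKLAACFLTL LPGFAVA  "

firstOnly <- sub("\\s+", "", s)             # only the first run of whitespace
allOfThem <- gsub("\\s+", "", s)            # every run of whitespace
trimmed   <- gsub("^\\s+|\\s+$", "", s)     # leading OR trailing whitespace

firstOnly    # "MKLAACFLTL LPGFAVA  "
allOfThem    # "MKLAACFLTLLPGFAVA"
trimmed      # "MKLAACFLTL LPGFAVA"
```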
    
    


     

    strsplit()

    Another function that makes use of regular expressions is strsplit(). It takes a vector of strings, and returns a list, one element for each element of the vector, in which each string has been split up by separating it along a regular expression.

    x <- c("a b c", "1 2")
    strsplit(x, " ")
    # [[1]]
    # [1] "a" "b" "c"
    #
    # [[2]]
    # [1] "1" "2"
    

    Since even a single string returns a list, you often have to extract the element you want as a vector for further use.

    corvidae <- c("crow:jackdaw:jay:magpie:raven:rook")
    strsplit(corvidae, ":")
    
    unlist(strsplit(corvidae, ":"))
    strsplit(corvidae, ":")[[1]]
    
    # Consider:
    length(strsplit(corvidae, ":"))
    length(unlist(strsplit(corvidae, ":")))
    


    strsplit() is immensely useful to extract elements from strings with a relatively well defined structure.

    s <- "1, 1, 2, 3, 5, 8"
    strsplit(s, ", ")[[1]] # split on comma-space
    
    s <- "~`!@#$%^&*()_-=+[{]}\\|;:',<.>/?"  # note: the backslash must be doubled
    strsplit(s, "")[[1]]  # split on empty string
    
    s <- "chronological lifespan:\tincreased\ncold sensitivity:\tincreased\nsporulation:\tnormal"
    strsplit(s, "\\t|\\n")[[1]]  # split on tab or newline
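
    Because strsplit() is vectorized, a two-level split can turn such a string into a named vector. The following sketch uses illustrative variable names; the last line shows that the separator itself can be a more flexible pattern.

```r
s <- "chronological lifespan:\tincreased\ncold sensitivity:\tincreased\nsporulation:\tnormal"

recs  <- strsplit(s, "\\n")[[1]]         # one element per line
parts <- strsplit(recs, ":\\t")          # list: c(key, value) for each line

phenotypes <- setNames(sapply(parts, `[`, 2),    # values ...
                       sapply(parts, `[`, 1))    # ... named by their keys
phenotypes["sporulation"]

# Split on a comma with any amount of surrounding whitespace
nums <- strsplit("1,1 ,2,  3", "\\s*,\\s*")[[1]]
nums     # "1" "1" "2" "3"
```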
    
    



     

    Behaviour

     


    Capturing and using matches

    Matches can be captured and used, e.g. in gsub().

    # Capture matches by placing them in parentheses. To immediately reuse them,
    # refer to them with "backreferences": \\1, \\2, \\3.
    
    # Example 1:
    # The beginning and ending three words of some text...
    s <- "I know, however, that its precarious and remote villages lie within the lowlands of the Wisla River."
    gsub("^((\\S+\\s+){3}).*((\\s\\S+){3})$", "\\1 ... \\3", s)
    
    # Note: matches \\2 and \\4 are the inner parentheses, which are only there to
    # group the things that are to be found {3}-times.
    
    
    # Example 2:
    # A binomial species name has a genus, a species, and possibly a strain name.
    # We use \\S (not whitespace) and \\s (whitespace) to tease this apart into
    # three captured expressions:
    s <- "Saccharomyces cerevisiae S288C"
    gsub("^(\\S+)\\s(\\S+)\\s*(.*)$",
         "genus: \\1; species: \\1 \\2; (strain: \\3)",
         s)
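
    Two more sketches of backreferences in a replacement string, and one of a backreference inside the pattern itself. The input strings are made up; perl = TRUE in the last call is a cautious choice for this example, not a requirement stated above.

```r
# Reorder the parts of an ISO date
isoToDMY <- gsub("^(\\d{4})-(\\d{2})-(\\d{2})$", "\\3.\\2.\\1", "2021-03-07")
isoToDMY      # "07.03.2021"

# Swap "Last, First" into "First Last"
swapped <- gsub("^(\\w+), (\\w+)$", "\\2 \\1", "Darwin, Charles")
swapped       # "Charles Darwin"

# In the pattern, \\1 means "the same text that group 1 just matched":
# here, any doubled word character.
hasDouble <- grepl("(\\w)\\1", c("yeast", "coffee"), perl = TRUE)
hasDouble     # FALSE TRUE
```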
    
    

    Capturing and returning matches

    Finding and returning matches in R is a two-step process. (1) find matches with regexpr() (one match), gregexpr() (all matches), or regexec() (sub-expressions in parentheses). All of these return a "match object". (2) use the match object to extract the matching substrings from the original string.


    
    
    # Extracting gene names in text.
    
    # Let's define a valid gene name to be a substring that is bounded by
    # word-boundaries, starts with an upper-case character, contains more upper-case
    # characters or numbers or a hyphen or underscore, with a minimal length of 3.
    # Here is a regex, and we put the part of the string that we want to recover, in
    # parentheses:
    
    patt <- "\\b([A-Z][A-Z0-9\\-_]+[A-Z0-9])\\b"
    
    # Test: positives
    grepl(patt, "MBP1")
    grepl(patt, "AAT")
    grepl(patt, " AI1")
    grepl(patt, "ASP3-1 ")
    grepl(patt, " AI5_ALPHA; ")
    grepl(patt, " (TY1B-PR3) ")
    # Test: negatives
    grepl(patt, "G1") # Too short
    grepl(patt, "G1-") # Hyphen at end
    grepl(patt, "Cell") # contains lower-case
    
    # Let's apply this to retrieve gene names in text
    
    s <- "Transcriptional activation of CLN1, CLN2, and a putative new G1 cyclin (HCS26) by SWI4, a positive regulator of G1-specific transcription. Cell 66(5):1015-26"
    
    (m <- regexpr(patt, s)) # found a match in position 31
    regmatches(s, m)        # retrieve it
    
    (m <- gregexpr(patt, s)) # found all matches
    regmatches(s, m)         # retrieve them (note, this is a list)
    
    # The function of choice however is regexec(). It returns whatever the pattern
    # has defined in parentheses, the others return the entire match. The
    # parentheses are quite important, because we might want to specify additional
    # context for a valid match, but we might not want the context in the match
    # itself. In our example we used word boundaries - \\b - for such context; but
    # these are zero-length and don't actually match a character, so they don't
    # contaminate the substring anyway. But in general we need to be able to
    # precisely retrieve only the target substring.
    
    (m <- regexec(patt, s)) # only the parenthesized substring
    regmatches(s, m)        # retrieve it
    
    # Note that there are two elements: the first is the whole match, the second
    # is the substring that is in parentheses. In our example these are the same.
    # Here is an example where they are not:
    s <- "Find the last word. And tell me."
    (m <- regexec("\\s(\\w+)\\.", s))
    regmatches(s, m)        # retrieve it
    
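    # Note: older versions of base R had no option to capture multiple matches
    # together with their sub-expressions: regexec() lacked a corresponding
    # gregexec(). (R 4.1.0 and later provide gregexec().) The following two
    # solutions work in either case: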
    
    patt <- "\\b([A-Z][A-Z0-9\\-_]+[A-Z0-9])\\b"
    
    s <- "Transcriptional activation of CLN1, CLN2, and a putative new G1 cyclin (HCS26) by SWI4, a positive regulator of G1-specific transcription. Cell 66(5):1015-26"
    
    
    # Solution 1 (base R): you can use multiple matches in an sapply()
    # statement...
    sapply(regmatches(s, gregexpr(patt, s))[[1]],
           function(M){regmatches(M, regexec(patt, M))})
    
    
    # Solution 2 (probably preferred): you can use
    # str_match_all() from the very useful library "stringr" ...
    if (! requireNamespace("stringr", quietly=TRUE)) {
      install.packages("stringr")
    }
    # Package information:
    #  library(help = stringr)       # basic information
    #  browseVignettes("stringr")    # available vignettes
    #  data(package = "stringr")     # available datasets
    
    
    stringr::str_match_all(s, patt)
    stringr::str_match_all(s, patt)[[1]][,2]
    # [1] "CLN1"  "CLN2"  "HCS26" "SWI4"
    
    # Note that str_match_all() handles the match object internally, no need for
    # the two-step code.
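
    In more recent versions of R (4.1.0 and later, as far as I am aware) base R also provides gregexec(), which returns all matches together with their parenthesized sub-expressions. The sketch below guards the call with exists() so that it does nothing on older versions; the sentence in s is a shortened version of the example above.

```r
patt <- "\\b([A-Z][A-Z0-9\\-_]+[A-Z0-9])\\b"
s <- "Activation of CLN1, CLN2, and a putative new G1 cyclin (HCS26) by SWI4."

genes <- NULL
if (exists("gregexec")) {             # assumption: only defined in R >= 4.1.0
  m     <- gregexec(patt, s)
  hits  <- regmatches(s, m)[[1]]      # a matrix with one column per match
  genes <- unname(hits[2, ])          # row 1: whole match, row 2: the capture
}
genes
```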
    
    


    An interesting new alternative/complement to the base R regex libraries is the package "ore" that uses the Oniguruma libraries and supports multiple character encodings, which you need when you work with Unicode and/or CJK character sets.


    if (! requireNamespace("ore", quietly = TRUE)) {
        install.packages("ore")
    }
    # Package information:
    #  library(help = ore)       # basic information
    #  browseVignettes("ore")    # available vignettes
    #  data(package = "ore")     # available datasets
    
    
    S <- "The quick brown fox jumps over a lazy dog"
    
    ore::ore.search(". .", S)
    ore::ore.search(". .", S, all=TRUE)
    M <- ore::ore.search(". .", S, all=TRUE)
    M$nMatches
    M$match[2:4]
    

    According to the author Jon Clayden, key advantages include:

    • Search results focus around the matched substrings (including parenthesised groups), rather than the locations of matches. This saves extra work with regmatches() or similar to extract the matches themselves.

    • Substantially better performance, especially when matching against long strings.
    • Substitutions can be functions as well as strings.
    • Matches can be efficiently obtained over only part of the strings.
    • Fewer core functions, with more consistent names.


     

    Modifiers

    A number of modifiers can be applied as arguments to regular expression functions that may be useful. Here are the two most important ones.

    
    # Option "ignore.case" allows case-insensitive matches. This is usually
    # poor programming style, a more explicit (= better) way is to define your
    # character classes appropriately.
    
    patt <- "\\b([A-Z][A-Z0-9\\-_]+[A-Z0-9])\\b"
    
    s <- "The MBP1 gene encodes the Mbp1 protein."
    
    m <- gregexpr(patt, s)
    regmatches(s, m)[[1]]
    
    m <- gregexpr(patt, s, ignore.case = TRUE)
    regmatches(s, m)[[1]]
    
    
    # For regex functions in the stringr package, you can compile the pattern
    # with the regex() function, and include the option "comments = TRUE". This
    # allows you to insert whitespace and # characters into the pattern
    # which will be ignored by the regex engine. Thus you can comment
    # complex regular expressions inline.
    
    
    myRegex <- stringr::regex("\\b            # word boundary
                              (               # begin capture
                              [A-Z]           # one uppercase letter
                              [A-Z0-9\\-_]+   # one or more letters, numbers, hyphen or
                                              #   underscore
                              [A-Z0-9]        # one letter or number.
                              # Note: this captured subexpression has a minimum length of 3.
                              )               # end capture
                              \\b",           # word boundary
                              comments = TRUE)
    
    stringr::str_match_all(s, myRegex)[[1]][2]
    
    


    Greed

    By default, quantifiers are "greedy", i.e. they will match the largest possible number of characters that still allows the overall pattern to match. This is most noticeable with +, * and {min,max}. For example:

    s <- "abc123"
    
    patt <- "(\\w+)(\\d+)"   # word characters, followed by digits. This pattern ...
    stringr::str_match_all(s, patt)[[1]][-1]
    
    # ...yields "abc12" and "3" . This is because \w+ is greedy and grabs as many
    # alphanumeric characters as it can before \d+ gets a chance to match.  A "?"
    # after a quantity specifier makes it non-greedy, therefore ...
    
    patt <- "(\\w+?)(\\d+)"   # Note the questionmark in (\\w+?)
    stringr::str_match_all(s, patt)[[1]][-1]
    
    # ... now \d+ gets a chance to match as many digits as possible
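
    Greedy matching is a frequent source of surprises when a pattern is bracketed by delimiters. The following sketch uses a made-up string with two quoted values, and also shows the common alternative of using a negated character class instead of a non-greedy quantifier.

```r
s <- 'name="Mbp1" class="APSES"'

# Greedy: .+ runs from the first quote all the way to the last one
greedy <- regmatches(s, gregexpr('".+"', s))[[1]]
greedy     # one match: "\"Mbp1\" class=\"APSES\""

# Non-greedy: .+? stops at the first closing quote it can find
lazy <- regmatches(s, gregexpr('".+?"', s, perl = TRUE))[[1]]
lazy       # two matches: "\"Mbp1\"" "\"APSES\""

# Alternative: "anything but a quote" cannot run past the closing quote
negated <- regmatches(s, gregexpr('"[^"]+"', s))[[1]]
```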
    
    


     

    Regular Expressions in other languages

     

    PHP

    <?php
    $string = "The quick brown fox jumps over a lazy dog";
    
    $words = preg_split('/\s+/', $string);
    print_r($words);
    
    preg_match('/.\W./', $string, $matches);
    print_r($matches);
    
    preg_match_all('/.\W./', $string, $matches);
    print_r($matches);
    
    #indexed preg_replace, iterates over array elements
    $pat = array(); #broken
    $pat[0] = '/quick brown/';
    $pat[1] = '/fox/';
    $pat[2] = '/lazy/';
    $pat[3] = '/dog/';
    $rep = array();
    $rep[0] = 'lazy';
    $rep[1] = 'dog';
    $rep[2] = 'quick brown';
    $rep[3] = 'fox';
    print(preg_replace($pat, $rep, $string));
    print("\n");
    
    $pat = array();
    $pat[0] = '/quick brown fox/';
    $pat[1] = '/lazy dog/';
    $pat[2] = '/foo/';
    $pat[3] = '/bar/';
    $rep = array();
    $rep[0] = 'foo';
    $rep[1] = 'bar';
    $rep[2] = 'lazy dog';
    $rep[3] = 'quick brown fox';
    print(preg_replace($pat, $rep, $string));
    print("\n");
    
    
    ?>
    


     

    Python

    Python regular expressions are provided through the module re. See here for documentation.

    re functions in general operate on a string and return a MatchObject. The MatchObject is then further analyzed by supplied methods.

    The most frequently used functions are:

    • re.match(pattern, string) matches only at the beginning of the string.
    • re.search(pattern, string) matches anywhere in the string.
    • re.split(pattern, string) returns the split string as a list.
    • re.findall(pattern, string) returns all matches in a list.


    Python example

    Download this .svg file to experiment.


    # parse_SVG_example.py
    # Read an svg file line by line and process path data
    # to write commands separately to an output file, line by line.
    
    import re
    
    filePath = "/my/working/directory/whatever/"
    
    myIn  = filePath + "sample.svg"
    myOut = filePath + "test.svg"
    
    IN  = open(myIn)
    OUT = open(myOut, "w")
    
    for line in IN:
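       # Patterns are written as raw strings (r'...') so that Python passes
       # backslashes such as \s and \d to the regex engine unchanged.
       path = re.search(r'\sd="(.*?)"', line) # returns the MatchObject "path"
       if path:
           # Found. Process the result with a second regex.
           # path.group() is a method of the MatchObject
           pathData = re.findall(r'([aAcChHlLmMqQsStTvVzZ]|-?\d*\.?\d+)',
                                 path.group(1))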
           # Write it nicely formatted to output, one command per line
           OUT.write("d=\"")
           s = ""    # we accumulate output lines in this variable
           for token in pathData:
               if re.match('[aAcChHlLmMqQsStTvVzZ]', token):
                   # it's a letter:
                   OUT.write("\n    "+s)     # flush s to output
                   s = token + " "       # new s
               else:
                   s = s + token + " "   # append to s
           OUT.write("\n    " + s + "\"\n")  # flush s, close string, and add \n
    
       else:
           OUT.write(line)
    
    IN.close()
    OUT.close()
    
    


     

    Javascript

    Javascript is probably the most convenient choice if you want to interact directly with the DOM (Document Object Model) of a Web page. Here is example code for a "bookmarklet" that rewrites the URL of a journal-article page for access from outside the UofT network.

    javascript:(function(){
       var url=window.location.href;
       var re=/\/([\w.]+)\/(.*$)/;
       var match=url.match(re);
       var newURL="http://"+match[1]+".myaccess.library.utoronto.ca/"+match[2];
       window.location.href=newURL;
    })();
    void 0
    
    

    Put this into the body of an arbitrary bookmark on your browser, then click it to be redirected to our library's free access system of a paywalled journal article.


     

    POSIX (Unix, the bash shell)

    Use in:

    • grep
    grep finds patterns in files. Patterns are regular expressions and can come in basic or extended flavors. In GNU grep there is no difference between these; in implementations where there is, you switch from basic to extended syntax with the grep -E flag which is the same as invoking egrep.
    Example: what daemons run on your system?
    ps -ax | egrep -o "/([^A-Z]\w+d)\b" | sort -u
    

    Other uses of regular expressions in:

    • find
    • sed
    • awk
    • cut

    ... see the man pages.


     

    Practice

    Task:

     
    • Open RStudio and load the ABC-units R project. If you have loaded it before, choose File → Recent projects → ABC-Units. If you have not loaded it before, follow the instructions in the RPR-Introduction unit.
    • Choose Tools → Version Control → Pull Branches to fetch the most recent version of the project from its GitHub repository with all changes and bug fixes included.
    • Type init() if requested.
    • Open the file RPR-RegEx.R and follow the instructions.


     

    Note: take care that you understand all of the code in the script. Evaluation in this course is cumulative and you may be asked to explain any part of code.


     


     

    Appendix I: Metacharacters and their meaning

    Expression  Meaning
    \           Escape character.
    |           Alternation character. Matches either one of the specified alternatives. For example, /Asp|Glu/i matches ASP, Asp, asp, GLU, Glu or glu.
    ^           If the caret occurs at the beginning of an expression, it anchors the expression at the beginning of a line or input. For example, /^AT/ does not match the 'AT' in "HETATM" but does match it in "ATOM". If the caret occurs as the first character of a character set, as in [^a-z], it specifies the complement of the character set. Elsewhere within a character set, it simply matches the character "^".
    $           Matches end of input or line. For example, /t$/ does not match the 't' in "eater", but does match it in "eat".
    *           Matches the preceding character 0 or more times. For example, /bo*/ matches 'boooo' in "A ghost booooed" and 'b' in "A bird warbled", but nothing in "A goat grunted".
    +           Matches the preceding character 1 or more times. Equivalent to {1,}. For example, /a+/ matches the 'a' in "candy" and all the a's in "caaaaaaandy".
    ?           Matches the preceding character 0 or 1 time. For example, /e?le?/ matches the 'el' in "angel" and the 'le' in "angle".
    .           (The decimal point) matches any single character except the newline character.
    (x)         Matches 'x' and remembers the match. For example, /(foo) bar/ matches "foo bar" and stores 'foo' in the special variable $1. /(more) (joy)/ matches "more joy", then stores 'more' in $1 and 'joy' in $2.
    {n}         Where n is a positive integer. Matches exactly n occurrences of the preceding character. For example, /a{2}/ doesn't match the 'a' in "candy", but it matches all of the a's in "caandy", and the first two a's in "caaandy".
    {n,}        Where n is a positive integer. Matches at least n occurrences of the preceding character. For example, /a{2,}/ doesn't match the 'a' in "candy", but matches all of the a's in "caandy" and in "caaaaaaandy".
    {n,m}       Where n and m are positive integers. Matches at least n and at most m occurrences of the preceding character. For example, /a{1,3}/ matches nothing in "cndy", the 'a' in "candy", the first two a's in "caandy", and the first three a's in "caaaaaaandy". Notice that when matching "caaaaaaandy", the match is "aaa", even though the original string had more a's in it.
    [xyz]       A character set. Matches any one of the enclosed characters. You can specify a range of characters by using a hyphen. For example, [bcd] is the same as [b-d]. They first match the 'c' in "cysteine" and the 'c' in "ached".
    [^xyz]      A negated or complemented character set. That is, it matches anything that is not enclosed in the brackets. Note that the caret has to be the first character in the bracket set. For example, [^abc] is the same as [^a-c]. They initially match 'l' in "alanine" and 'y' in "cysteine".
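    Several of the examples in this table can be checked directly, since JavaScript uses the same /pattern/ notation. The following is a small sketch for experimenting; firstMatch is a helper made up for this purpose and not part of any library.

```javascript
// Return the first match of `re` in `s`, or null if there is none.
// (`firstMatch` is a helper made up for this sketch.)
function firstMatch(re, s) {
  const m = s.match(re);
  return m === null ? null : m[0];
}

const results = {
  star:      firstMatch(/bo*/, "A ghost booooed"),   // b followed by all its o's
  starNone:  firstMatch(/bo*/, "A goat grunted"),    // no b at all: null
  plus:      firstMatch(/a+/, "caaaaaaandy"),        // the whole run of a's
  optional:  firstMatch(/e?le?/, "angel"),
  range:     firstMatch(/a{1,3}/, "caaaaaaandy"),    // at most three a's
  set:       firstMatch(/[b-d]/, "ached"),
  negated:   firstMatch(/[^abc]/, "alanine"),
  alternate: firstMatch(/Asp|Glu/i, "residue GLU")
};
console.log(results);

// Parentheses capture: groups are numbered from 1 in the match array
// and are available as $1, $2, ... in a replacement string.
const groups = "more joy".match(/(more) (joy)/);
console.log(groups[1], groups[2]);
const swapped = "more joy".replace(/(more) (joy)/, "$2 $1");
console.log(swapped);
```

    Note that match() without the g modifier reports only the first match, which is why the character set example returns a single letter.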


     

    Appendix II: Character classes and their meaning

    Expression  Meaning
    [\b]        Matches a backspace. (Not to be confused with \b.)
    \b          Matches a word boundary, i.e. the position between a word character and a non-word character such as a space, or the start or end of the string. (Not to be confused with [\b].) The difference is that one is within a character set and the other is not! For example, /\bn\w/ matches the 'no' in "noonday"; /\wy\b/ matches the 'ly' in "possibly yesterday".
    \B          Matches a non-word boundary. For example, /\w\Bn/ matches 'on' in "noonday", and /y\B\w/ matches 'ye' in "possibly yesterday".
    \cX         Where X is a letter from A to Z. Matches the corresponding control character in a string. For example, /\cM/ matches control-M in a string.
    \d          Matches a digit character. Equivalent to [0-9]. For example, /\d/ or /[0-9]/ matches '2' in "B2 is the suite number."
    \D          Matches any non-digit character. Equivalent to [^0-9]. For example, /\D/ or /[^0-9]/ matches 'B' in "B2 is the suite number."
    \f          Matches a form-feed.
    \n          Matches a linefeed.
    \r          Matches a carriage return.
    \s          Matches a single white space character, including space, tab, form feed, line feed. Equivalent to [ \f\n\r\t\v]. For example, /\s\w*/ matches ' bar' in "foo bar".
    \S          Matches a single character other than white space. Equivalent to [^ \f\n\r\t\v]. For example, /\S\w*/ matches 'foo' in "foo bar".
    \t          Matches a tab.
    \v          Matches a vertical tab.
    \w          Matches any alphanumeric character including the underscore. Equivalent to [A-Za-z0-9_]. For example, /\w/ matches 'a' in "apple", '5' in "$5.28", and '3' in "3D".
    \W          Matches any non-word character. Equivalent to [^A-Za-z0-9_]. For example, /\W/ or /[^A-Za-z0-9_]/ matches '%' in "50%".
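    The character class examples can be tried out in the same way. This is a sketch for experimenting, not a complete reference; the string in the variable sample is invented, and the equivalences are checked only for that string.

```javascript
const suite = "B2 is the suite number.";

// First match of each class in the example strings from the table.
const classes = {
  digit:    suite.match(/\d/)[0],
  nonDigit: suite.match(/\D/)[0],
  space:    "foo bar".match(/\s\w*/)[0],
  nonSpace: "foo bar".match(/\S\w*/)[0],
  word:     "$5.28".match(/\w/)[0],
  nonWord:  "50%".match(/\W/)[0]
};
console.log(classes);

// A shorthand class and its explicit character set select the same
// characters: deleting either from the (invented) sample leaves the same rest.
const sample = "Ala_12 Gly-7\tb";
const sameWord = sample.replace(/\w/g, "") === sample.replace(/[A-Za-z0-9_]/g, "");
const sameDigit = sample.replace(/\d/g, "") === sample.replace(/[0-9]/g, "");
console.log(sameWord, sameDigit);
```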


     

    Appendix III: Anchor codes and their meaning

    Expression  Meaning
    ^           If the caret occurs at the beginning of an expression, it anchors the expression at the beginning of a line or input. For example, /^AT/ does not match the 'AT' in "HETATM" but does match it in "ATOM".
    $           Matches end of input or line. For example, /t$/ does not match the 't' in "eater", but does match it in "eat" as well as in "eat\n".
    \b          Matches a word boundary, i.e. the position between a word character and a non-word character such as a space, or the start or end of the string. (Not to be confused with [\b].) The difference is that one is within a character set and the other is not! For example, /\bn\w/ matches the 'no' in "noonday"; /\wy\b/ matches the 'ly' in "possibly yesterday".
    \B          Matches a non-word boundary. For example, /\w\Bn/ matches 'on' in "noonday", and /y\B\w/ matches 'ye' in "possibly yesterday".
    \A          Matches at the start of a string. Like "^". For example, /\AAT/ matches "AT" in "ATOM " but not in "HETATM".
    \Z          Matches at the end of a string. Like "$". For example, /\t\Z/ matches a tab at the end of the string but not anywhere else.
    (?: … )     Groups what is between the parentheses, but does not capture the match.
    (?= … )     The preceding pattern must be followed by this one in order to match.
    (?! … )     The preceding pattern must not be followed by this one in order to match.
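    Anchors and lookaheads match positions rather than characters, which is easiest to see by testing them. The sketch below uses JavaScript, which differs from the Perl-style table in two respects that matter here: it has no \A and \Z anchors, and without the m modifier its "$" matches only at the very end of the input, not before a final "\n". The sequence and residue strings are invented for illustration.

```javascript
const anchors = {
  caretHit:   /^AT/.test("ATOM"),
  caretMiss:  /^AT/.test("HETATM"),
  dollarHit:  /t$/.test("eat"),
  dollarMiss: /t$/.test("eater"),
  // JavaScript-specific behaviour: "$" needs the m flag to match before "\n".
  dollarNewline:  /t$/.test("eat\n"),
  dollarNewlineM: /t$/m.test("eat\n"),
  boundary:    "noonday".match(/\bn\w/)[0],
  nonBoundary: "noonday".match(/\w\Bn/)[0]
};
console.log(anchors);

// (?: ) groups the alternatives without capturing them,
// so the digits are the first (and only) captured group.
const nonCapturing = "Glu-42".match(/(?:Asp|Glu)-(\d+)/);
console.log(nonCapturing[1]);

// Lookaheads constrain what follows but are not part of the match.
const followedByA = "ATGCATGA".match(/ATG(?=A)/);     // the second ATG
const notFollowedByA = "ATGCATGA".match(/ATG(?!A)/);  // the first ATG
console.log(followedByA.index, notFollowedByA.index);
```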


     

    Appendix IV: Modifiers and their meaning

    Expression  Meaning
    g           Matches globally, i.e. matches all occurrences of the pattern, one after the other, and does not stop at the first one.
    i           Match in a case-insensitive manner. For example, /[ACGT]/i matches any specific nucleotide in upper or lower case.
    x           Ignore whitespace in the expression.
    o           Evaluate pattern only once.
    m           Treat the whole string as multiple lines, i.e. let "^" and "$" match at the start and end of every line.
    s           Treat the whole string as a single line, i.e. don't treat "\n" as line separators, so that "." also matches a newline. For example, /(<table>.*?<\/table>)/s captures an entire table, including newline characters. Without the modifier nothing would match if there is even a single newline in between the tags.
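    The effect of a modifier is easiest to see by running the same pattern with and without it. The sketch below covers g, i, m and s, which JavaScript shares with the Perl-style notation of this table; x and o have no JavaScript counterpart and are left out. All example strings are invented for illustration.

```javascript
// g and i: count nucleotide letters with and without case-insensitivity.
const seq = "acgtACGT";
const upperOnly = seq.match(/[ACGT]/g).length;
const anyCase = seq.match(/[ACGT]/gi).length;
console.log(upperOnly, anyCase);

// m: "^" matches only at the start of the input unless m is set.
const records = "HETATM 1\nATOM 2\nATOM 3";
const withoutM = records.match(/^ATOM/g);        // null: input starts with HETATM
const withM = records.match(/^ATOM/gm).length;   // one match per ATOM line
console.log(withoutM, withM);

// s: "." matches a newline only if s is set.
const html = "<p>x</p><table>\n<tr><td>1</td></tr>\n</table>";
const tableWithoutS = /(<table>.*?<\/table>)/.test(html);
const tableWithS = /(<table>.*?<\/table>)/s.test(html);
console.log(tableWithoutS, tableWithS);
```

    The s flag requires a JavaScript engine that implements ECMAScript 2018 or later.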


     

    Further reading, links and resources

    Visit the Stack Overflow thread on regex and HTML parsing. What's your opinion on the OP's question?


     


    About ...
     
    Author:

    Boris Steipe <boris.steipe@utoronto.ca>

    Created:

    2017-08-05

    Modified:

    2020-09-22

    Version:

    1.2

    Version history:

    • 1.2 2020 Maintenance, added gsub() capture and backreference
    • 1.1 Change from require() to requireNamespace() and use <package>::<function>() idiom.
    • 1.0 First live version, translated from Perl examples in old version
    • 0.1 First stub

    This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.