Difference between revisions of "RPR-RegEx"
m |
m |
||
Line 19: | Line 19: | ||
− | {{ | + | {{DEV}} |
{{Vspace}} | {{Vspace}} | ||
Line 29: | Line 29: | ||
<section begin=abstract /> | <section begin=abstract /> | ||
<!-- included from "../components/RPR-RegEx.components.wtxt", section: "abstract" --> | <!-- included from "../components/RPR-RegEx.components.wtxt", section: "abstract" --> | ||
− | ... | + | Using regular expressions is a key skill for all aspects of data science, and, by extension, for bioinformatics. There is a bit of a learning curve involved, but once you are even slightly comfortable with their syntax and use in code, you will appreciate their power in automating otherwise unbearably tedious tasks. This unit introduces the principles and the syntax, and provides a fair amount of practice. |
<section end=abstract /> | <section end=abstract /> | ||
Line 40: | Line 40: | ||
<!-- included from "ABC-unit_components.wtxt", section: "notes-prerequisites" --> | <!-- included from "ABC-unit_components.wtxt", section: "notes-prerequisites" --> | ||
You need to complete the following units before beginning this one: | You need to complete the following units before beginning this one: | ||
− | *[[RPR-Introduction]] | + | *[[RPR-Introduction|RPR-Introduction (Introduction to R)]] |
{{Vspace}} | {{Vspace}} | ||
Line 47: | Line 47: | ||
=== Objectives === | === Objectives === | ||
<!-- included from "../components/RPR-RegEx.components.wtxt", section: "objectives" --> | <!-- included from "../components/RPR-RegEx.components.wtxt", section: "objectives" --> | ||
− | ... | + | This unit will ... |
+ | * ... introduce regular expressions; | ||
+ | * ... demonstrate their use in R functions; | |
+ | * ... teach how to apply them in common tasks. | ||
{{Vspace}} | {{Vspace}} | ||
Line 54: | Line 57: | ||
=== Outcomes === | === Outcomes === | ||
<!-- included from "../components/RPR-RegEx.components.wtxt", section: "outcomes" --> | <!-- included from "../components/RPR-RegEx.components.wtxt", section: "outcomes" --> | ||
− | ... | + | After working through this unit you ... |
+ | * ... can express pattern-matching tasks as regular expressions; | ||
+ | * ... are familiar with online regex testing sites that help you troubleshoot your expressions; | |
+ | * ... have begun to write code that uses them for a variety of purposes. | |
{{Vspace}} | {{Vspace}} | ||
Line 85: | Line 91: | ||
<!-- included from "../components/RPR-RegEx.components.wtxt", section: "contents" --> | <!-- included from "../components/RPR-RegEx.components.wtxt", section: "contents" --> | ||
− | == | + | ==First steps== |
A {{WP|Regular expression|Regular Expression}} is a specification of a pattern of characters. The typical use of a regular expression is to find occurrences of the pattern in a string. Regular expressions have a flexible syntax that allows them to handle a range of tasks - from trivial substring matches to complex nested motifs. The syntax of regular expressions is a programming language in its own right, and is a powerful way of concisely and uniquely defining a pattern. | A {{WP|Regular expression|Regular Expression}} is a specification of a pattern of characters. The typical use of a regular expression is to find occurrences of the pattern in a string. Regular expressions have a flexible syntax that allows them to handle a range of tasks - from trivial substring matches to complex nested motifs. The syntax of regular expressions is a programming language in its own right, and is a powerful way of concisely and uniquely defining a pattern. | ||
− | Regular expressions are examples of '''deterministic pattern matching''' - they either match a particular expression or not. This is in contrast to probabilistic pattern matching in which a pattern is more or less similar to | + | Regular expressions are examples of '''deterministic pattern matching''' - they either match a particular expression or not. This is in contrast to probabilistic pattern matching in which a pattern is ''more or less'' similar to a query. |
− | + | Truth be told, many programmers have a love-hate relationship with regular expressions. The syntax of regular expressions is very powerful and expressive, but also terse, not always intuitive, and sometimes hard to understand. Let's try a few simple things: | |
− | |||
− | + | Here is a string to play with: the sequence of Mbp1, copied from the [https://www.ncbi.nlm.nih.gov/protein/NP_010227 NCBI Protein database page for yeast Mbp1]. |
− | |||
− | + | 1 msnqiysary sgvdvyefih stgsimkrkk ddwvnathil kaanfakakr trilekevlk | |
− | + | 61 ethekvqggf gkyqgtwvpl niakqlaekf svydqlkplf dftqtdgsas pppapkhhha | |
+ | 121 skvdrkkair sastsaimet krnnkkaeen qfqsskilgn ptaaprkrgr pvgstrgsrr | ||
+ | 181 klgvnlqrsq sdmgfprpai pnssisttql psirstmgpq sptlgileee rhdsrqqqpq | ||
+ | 241 qnnsaqfkei dledglssdv epsqqlqqvf nqntgfvpqq qssliqtqqt esmatsvsss | ||
+ | 301 pslptspgdf adsnpfeerf pgggtspiis miprypvtsr pqtsdindkv nkylsklvdy | ||
+ | 361 fisnemksnk slpqvllhpp phsapyidap idpelhtafh wacsmgnlpi aealyeagts | ||
+ | 421 irstnsqgqt plmrsslfhn sytrrtfpri fqllhetvfd idsqsqtvih hivkrksttp | ||
+ | 481 savyyldvvl skikdfspqy rielllntqd kngdtalhia skngdvvffn tlvkmgaltt | ||
+ | 541 isnkegltan eimnqqyeqm miqngtnqhv nssntdlnih vntnnietkn dvnsmvimsp | ||
+ | 601 vspsdyityp sqiatnisrn ipnvvnsmkq masiyndlhe qhdneikslq ktlksisktk | ||
+ | 661 iqvslktlev lkesskdeng eaqtnddfei lsrlqeqntk klrkrliryk rlikqkleyr | ||
+ | 721 qtvllnklie detqattnnt vekdnntler lelaqeltml qlqrknklss lvkkfednak | ||
+ | 781 ihkyrriire gtemnieevd ssldvilqtl iannnknkga eqiitisnan sha | ||
+ | // | ||
+ | |||
+ | |||
+ | {{task|1= | ||
+ | |||
+ | Navigate to http://regexpal.com and paste the sequence into the '''lower''' box. This site is one of a number of online regular expression testers; their immediate, visual feedback is invaluable when you are developing regular expression patterns. | ||
+ | |||
+ | Let's try some expressions: | |
+ | |||
+ | ;Most characters are matched literally. | ||
+ | :Type "<code>a</code>" into the '''upper''' box and you will see all "<code>a</code>" characters matched. Then replace <code>a</code> with <code>q</code>. | |
+ | : Now type "<code>aa</code>" instead. Then <code>krnnkk</code>. ''Sequences'' of characters are also matched literally. | ||
+ | |||
+ | ;The pipe character {{pipe}} that symbolizes logical OR can be used to define that more than one character should match: | ||
+ | :<code>i(s{{pipe}}m{{pipe}}q)n</code> matches <code>isn</code> OR <code>imn</code> OR <code>iqn</code>. Note how we can group with parentheses, and try what would happen without them. | ||
+ | |||
+ | ;We can more conveniently specify more than one character to match if we place them in square brackets. This is a "character class". We will encounter those frequently. | |
+ | :<code>[lq]</code> matches <code>l</code> OR <code>q</code>. <code>[milcwyf]</code> matches hydrophobic amino acids. | ||
+ | |||
+ | ;Within square brackets, we can specify "ranges". | ||
+ | :<code>[1-5]</code> matches digits from 1 to 5. | ||
+ | |||
+ | ;Within square brackets, we can specify characters that should NOT be matched, with the "caret", <code>^</code>. | ||
+ | :<code>[^0-9]</code> matches everything EXCEPT digits. <code>[^a-z]</code> matches everything that is not a lower-case letter. That's what we need to remove characters that do not represent amino acids (try it). | |
+ | |||
+ | }} | ||
+ | |||
+ | {{Vspace}} | ||
+ | |||
+ | Make frequent use of this site to develop your regular expressions step by step. | ||
{{Vspace}} | {{Vspace}} | ||
+ | |||
+ | ===Theory=== | ||
+ | |||
+ | According to the {{WP|Chomsky hierarchy}} regular expressions are a {{WP|Regular grammar|Type-3 (regular) grammar}}, thus their use forms a {{WP|regular language}}. Therefore, like all Type-3 grammatical expressions they can be decided by a {{WP|finite-state machine}}, ''i.e.'' a "machine" that is defined by possible states, and triggering conditions that control transitions between states. Think of such automata as a (possibly elaborate) <code>if ... else</code> construct. The "regex" processor translates the search pattern into such an automaton, which is then applied to the search domain - the string in which the occurrence of the pattern is to be sought. | ||
{{Vspace}} | {{Vspace}} | ||
− | == | + | ===What are they good for=== |
− | + | Regular expressions support virtually all pattern matching tasks in data clean-up, extracting information items, data mining, "screen scraping", parsing of files, subsetting large tables, ''etc. etc.'' This means they must be part of your everyday toolkit. |
− | + | {{Vspace}} | |
− | |||
− | |||
− | |||
− | + | ===When should they not be used=== | |
+ | Since regular expressions are Type-3 grammars, they will fail when trying to parse any more complex grammar. This means you can't reliably parse XML - and in particular HTML - with regular expressions. There is a long discussion on this particular topic however, e.g. see [http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags '''here'''], and many other similar threads on stackoverflow, and see [http://programmers.stackexchange.com/questions/113237/when-you-should-not-use-regular-expressions '''here'''] for a discussion of when regular expressions should '''not''' be used. Use a real XML parser instead. | |
{{Vspace}} | {{Vspace}} | ||
+ | |||
+ | ===Perl and POSIX=== | ||
+ | |||
+ | Two dialects of regular expressions exist; they differ in some details of syntax. One is the nearly universal "Perl" dialect ({{WP|Perl}} is a programming language), the other one is the "POSIX" standard that nearly no one uses. Except R. Tragically, in R the POSIX standard is the default. Fortunately this often does not make a difference, and we can explicitly turn this nonsense off. But just like the utterly annoying <code>stringsAsFactors = FALSE</code>, we need to type <code>perl = TRUE</code> much more often than we would like. Somebody, some time, made a wrong design decision and thousands of wasted man- and woman-hours later we are still stuck with the consequences. If you use regular expressions according to the POSIX standard, you have to learn the Perl standard anyway. But then you can just use the Perl standard in the first place. The {{WP|Regular expression|Wikipedia page on Regular Expressions}} has a table with a side-by-side comparison of the different ways the two standards express character classes. | |
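
To make the difference between the two dialects concrete, here is a small sketch. The vector <code>x</code> and the result names are made up for illustration; the lookahead in the last call is Perl syntax, which is why <code>perl = TRUE</code> is passed there.

```r
x <- c("Mbp1", "Swi4", "abc")   # hypothetical example strings

# POSIX-style bracket class, understood by R's default engine:
posix_hits <- grepl("[[:digit:]]", x)

# Perl-style shorthand, with the Perl dialect requested explicitly:
perl_hits <- grepl("\\d", x, perl = TRUE)

# Both approaches flag the elements that contain a digit.
identical(posix_hits, perl_hits)

# A lookahead "(?=...)" is Perl syntax: replace "p" only where a digit follows.
looked <- gsub("p(?=\\d)", "P", x, perl = TRUE)
looked
```

For simple patterns like the first two, the dialect makes no difference; it starts to matter once a pattern relies on Perl-only constructs such as the lookahead.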
+ | |||
+ | {{Vspace}} | ||
+ | |||
+ | ==Regular Expressions in R== | ||
+ | |||
+ | Regular expressions in R can be used | ||
+ | |||
+ | * to match patterns in strings for use in <code>if()</code> or <code>while()</code> conditions, or to retrieve specific instances of patterns with <code>grep()</code>; | ||
+ | * to substitute occurrences of patterns in strings with other strings with <code>gsub()</code>; | ||
+ | * to split strings into substrings that are delimited by the occurrence of a pattern with <code>strsplit()</code>; and more | ||
+ | |||
+ | Accordingly, a basic knowledge of regular expressions is needed to read and write code, especially code that parses text. | |
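
The three uses listed above can be sketched in a few lines. The input lines and variable names here are invented for illustration only.

```r
# Hypothetical input: one comment line and two tab-separated records.
lines <- c("# yeast genes", "MBP1\tYDL056W", "SWI4\tYER111C")

# 1. Match: use a pattern in a condition.
isComment <- logical(length(lines))
for (i in seq_along(lines)) {
  if (grepl("^#", lines[i])) {      # does the line start with "#"?
    isComment[i] <- TRUE
  }
}

# 2. Substitute: replace every tab with " = ".
subbed <- gsub("\t", " = ", lines[2])

# 3. Split: break a record into its fields at the tab.
fields <- strsplit(lines[3], "\t")[[1]]

isComment
subbed
fields
```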
{{Vspace}} | {{Vspace}} | ||
==Syntax== | ==Syntax== | ||
− | Regular expressions are | + | Regular expressions in R are strings enclosed in quotation marks. |
− | + | <source lang="rsplus"> | |
+ | "a" | ||
+ | </source> | ||
− | is a regular expression. | + | is a regular expression. It specifies the single, literal character <code>a</code> exactly. |
===Specifying symbols=== | ===Specifying symbols=== | ||
− | The power of regular expressions lies in their flexible syntax that allows to specify character ranges, classes of characters, unspecified characters and much more. This sometimes can be confusing, because the symbols that specify ranges, options, wildcards and the like are of course themselves characters. Characters that specify information about other characters are called metacharacters, | + | The power of regular expressions lies in their flexible syntax that allows us to specify character ranges, classes of characters, unspecified characters, alternatives, and much more. This sometimes can be confusing, because the symbols that specify ranges, options, wildcards and the like are of course themselves characters. Characters that specify information about other characters are called '''metacharacters'''; these include "<code>.</code>", "<code>?</code>", "<code>+</code>", "<code>*</code>", "<code>[</code>" and "<code>]</code>", "<code>{</code>" and "<code>}</code>" and more. And the opposite is also possible: some plain characters can be turned into metacharacters to symbolize character classes. |
+ | |||
+ | The "<code>\</code>" - '''escape character''' - allows us to distinguish when a character is to be taken literally and when it is to be interpreted as a metacharacter. Note that some symbols have to be escaped to be read literally, while some letters have to be escaped to be read as metacharacters. | |
− | + | But there is a catch in R, relating to when the escape character is interpreted. Remember that "<code>\n</code>" is a linebreak in a string, "<code>\t</code>" is a tab, etc. Obviously if you write "<code>\?</code>" (a literal question mark in a regex), or "<code>\+</code>" (a literal plus-sign in a regex) into a regular string, the mechanism that parses the string is going to see the escape character, then it expects an "n" or a "t" or the like - but what it gets instead is something it doesn't know. So it throws an error. Try: |
− | Letters whose special meaning as a metacharacter is turned on with the escape character: | + | <source lang="rsplus"> |
+ | "\n" # fine | ||
+ | "\?" # Error: ... | ||
+ | </source> | ||
+ | |||
+ | But then how can we write something like "<code>\?</code>" when we need it? That becomes obvious when you consider what happens with the string: it gets sent to the regex engine for interpretation. Thus the regex engine needs to see: character "\", then character "?". So it needs two characters. The secret is: we need to prevent "\" from attaching to the next character, and specify it as a single character in its own right. We do that by '''escaping''' "\" itself - '''with a backslash'''. Thus "<code>\\</code>" is a literal "\" character - and can get sent to the regex engine. | ||
+ | |||
+ | <source lang="rsplus"> | ||
+ | "\\?" # ok | ||
+ | cat("\\?") # that's what the regex engine sees. | ||
+ | </source> | ||
+ | |||
+ | The consequence is: you need to double the "\\" in R when you want a single "\". That works differently from other programming languages that pass patterns to the regex engine as-is. | |
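
A short sketch may help to see the two layers of escaping at work. The test strings are made up, and the raw-string syntax in the last lines assumes R version 4.0.0 or later.

```r
# The R string "\\?" holds two characters: a backslash and a question mark.
nEscaped <- nchar("\\?")
nNewline <- nchar("\n")      # whereas "\n" is a single character

s <- c("Is this a question?", "No.")   # hypothetical test strings

# The regex engine receives \? and matches a literal question mark.
hasQ <- grepl("\\?", s)

# With fixed = TRUE the pattern is not a regex at all; nothing needs escaping.
hasQfixed <- grepl("?", s, fixed = TRUE)

# A literal period, replaced everywhere.
banged <- gsub("\\.", "!", s)

# Assuming R >= 4.0.0: raw strings avoid the doubling.
sameString <- identical(r"(\?)", "\\?")

nEscaped; nNewline; hasQ; hasQfixed; banged; sameString
```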
+ | |||
+ | Letters whose special meaning as a metacharacter is turned '''on''' with the escape character: | ||
<table border="1" cellpadding="5"> | <table border="1" cellpadding="5"> | ||
Line 144: | Line 226: | ||
</table> | </table> | ||
− | Metacharacters whose special meaning is turned off with the escape character: | + | Metacharacters whose special meaning is turned '''off''' with the escape character: |
<table border="1" cellpadding="5"> | <table border="1" cellpadding="5"> | ||
<tr><th>Character</th><th>Means</th></tr> | <tr><th>Character</th><th>Means</th></tr> | ||
<tr><td><code>+</code></td><td>One or more repetitions of the preceding expression</td></tr> | <tr><td><code>+</code></td><td>One or more repetitions of the preceding expression</td></tr> | |
− | <tr><td><code>\+</code></td><td>the character "+"</td></tr> | + | <tr><td><code>\+</code></td><td>the literal character "+"</td></tr> |
<tr><td><code>\</code></td><td>the escape character</td></tr> | <tr><td><code>\</code></td><td>the escape character</td></tr> | ||
− | <tr><td><code>\\</code></td><td>the character "\"</td></tr> | + | <tr><td><code>\\</code></td><td>the literal character "\"</td></tr> |
<tr><td><code>.</code></td><td>any single character except the newline (\n)</td></tr> | <tr><td><code>.</code></td><td>any single character except the newline (\n)</td></tr> | ||
− | <tr><td><code>\.</code></td><td>a period</td></tr> | + | <tr><td><code>\.</code></td><td>a literal period</td></tr> |
</table> | </table> | ||
Note that these examples are not exhaustive. A (nearly) complete table of symbols and meanings is given in the appendix. | Note that these examples are not exhaustive. A (nearly) complete table of symbols and meanings is given in the appendix. | ||
+ | {{Vspace}} | ||
− | ===Character | + | ===Character Classes=== |
Square brackets specify when more than one specific character can match at a position. | Square brackets specify when more than one specific character can match at a position. | ||
Line 168: | Line 251: | ||
For example: | For example: | ||
− | <code> | + | <code>"[AGR]AATT[CTY]"</code> matches all occurrences of an ApoI restriction site, either specified explicitly, or through the nucleotide ambiguity codes R (purines) or Y (pyrimidines). |
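
As a quick check of this pattern in R, the sketch below searches an invented sequence that contains three such sites; <code>dna</code> and the result names are assumptions made for the example.

```r
# Hypothetical sequence with three sites: AAATTC, GAATTT and RAATTY.
dna <- "CCAAATTCGGGAATTTCCRAATTYGG"
patt <- "[AGR]AATT[CTY]"

m <- gregexpr(patt, dna)            # match object for all occurrences
starts <- as.integer(m[[1]])        # start positions of the matches
sites <- regmatches(dna, m)[[1]]    # the matching substrings

starts
sites
```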
Within character sets, hyphens can specify character ranges. | Within character sets, hyphens can specify character ranges. | ||
Line 174: | Line 257: | ||
<table border="1" cellpadding="5"> | <table border="1" cellpadding="5"> | ||
<tr><th>Expression</th><th>Means</th></tr> | <tr><th>Expression</th><th>Means</th></tr> | ||
− | <tr><td><code>[a-z]</code></td><td>letters</td></tr> | + | <tr><td><code>[a-z]</code></td><td>lowercase letters</td></tr> |
<tr><td><code>[0-9]</code></td><td>digits</td></tr> | <tr><td><code>[0-9]</code></td><td>digits</td></tr> | ||
− | <tr><td><code>[0-9+* | + | <tr><td><code>[0-9+*/=^\\-]</code></td><td>digits and arithmetic symbols (Note the escaped hyphen)</td></tr> |
</table> | </table> | ||
− | Within character sets, some metacharacters that otherwise have special meanings do not need to be escaped | + | To match a literal hyphen, place it first or last in the set or, with <code>perl = TRUE</code>, escape it. Within character sets, some metacharacters that otherwise have special meanings usually do not need to be escaped. |
===The complement=== | ===The complement=== | ||
− | The caret character "^" denotes the ''complement'' of a character set; i.e. everything that is not that expression. | + | The caret character "^" denotes the ''complement'' of a character set; i.e. everything that is '''not''' that expression. |
<table border="1" cellpadding="5"> | <table border="1" cellpadding="5"> | ||
Line 191: | Line 274: | ||
</table> | </table> | ||
− | Note that outside of | + | Note that outside of square brackets, the "^" character is an "anchoring code" and means "beginning of the string". This can be confusing. |
For character classes, the class in upper case denotes the complement. This can also be confusing ! | For character classes, the class in upper case denotes the complement. This can also be confusing ! | ||
Line 197: | Line 280: | ||
<table border="1" cellpadding="5"> | <table border="1" cellpadding="5"> | ||
<tr><th>Character</th><th>Means</th></tr> | <tr><th>Character</th><th>Means</th></tr> | ||
− | <tr><td><code>\W</code></td><td>not a word character</td></tr> | + | <tr><td><code>\w</code></td><td>a word character</td></tr> |
− | <tr><td><code>\S</code></td><td>not a space character</td></tr> | + | <tr><td><code>\W</code></td><td>'''not''' a word character</td></tr> |
+ | <tr><td><code>\s</code></td><td>a space character</td></tr> | ||
+ | <tr><td><code>\S</code></td><td>'''not''' a space character</td></tr> | ||
</table> | </table> | ||
Line 207: | Line 292: | ||
<table border="1" cellpadding="5"> | <table border="1" cellpadding="5"> | ||
<tr><th>Expression</th><th>What it means</th><th>Example (meaning)</th></tr> | <tr><th>Expression</th><th>What it means</th><th>Example (meaning)</th></tr> | ||
− | <tr><td><code>?</code></td><td>match zero or one times</td><td>"? (there may or may not be a quote mark)</td></tr> | + | <tr><td><code>?</code></td><td>match zero or one times</td><td><code>"?</code> (there may or may not be a quote mark)</td></tr> |
− | <tr><td><code>+</code></td><td>match one or more</td><td> | + | <tr><td><code>+</code></td><td>match one or more</td><td><code>[A-Z]+</code> (there's at least one uppercase letter)</td></tr> |
− | <tr><td><code>*</code></td><td>match any number</td><td>.* (there may be some characters)</td></tr> | + | <tr><td><code>*</code></td><td>match any number</td><td><code>.*</code> (there may be some characters)</td></tr> |
− | <tr><td><code>{min,max}</code></td><td>match between min and max times (assumes 0 and infinity respectively if not specified)</td><td>[ | + | <tr><td><code>{min,max}</code></td><td>match between min and max times (assumes 0 and infinity respectively if not specified)</td><td><code>[atAT]{20,200}</code> (a stretch of between 20 and 200 upper- or lowercase As or Ts)</td></tr> |
</table> | </table> | ||
For example: | For example: | ||
− | <code> | + | <code>"AAUAAA[ACGU]{10,30}$"</code> defines a polyadenylation site - a AAUAAA motif, followed by 10 to 30 of any nucleotide before the end of the RNA. |
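
The quantifiers can be tried out with a few made-up strings. In this sketch the three RNA sequences are invented so that 16, 4 and 40 nucleotides follow the AAUAAA motif; only the first falls into the 10 to 30 range.

```r
# Hypothetical RNAs with 16, 4 and 40 nucleotides after the motif.
rna <- c(paste0("GGC", "AAUAAA", strrep("CU", 8)),
         paste0("GGC", "AAUAAA", "CUCU"),
         paste0("GGC", "AAUAAA", strrep("CU", 20)))

polyA <- grepl("AAUAAA[ACGU]{10,30}$", rna)

# "?" makes the preceding character optional.
optionalU <- grepl("colou?r", c("color", "colour", "colouur"))

# "{3,}" leaves the maximum open: three or more.
threePlus <- grepl("^A{3,}$", c("AA", "AAA", "AAAAA"))

polyA
optionalU
threePlus
```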
Line 229: | Line 314: | ||
</table> | </table> | ||
+ | {{Vspace}} | ||
− | + | Of course defining a regular expression pattern does not yet do anything with it. Below are the most important R functions that use regular expressions. Write the small code samples that are provided below and test how the operators and regular expressions work. | |
− | + | {{Vspace}} | |
+ | |||
+ | ===Functions that don't use regular expressions=== | ||
+ | |||
+ | Not all pattern searches in strings use (and need) regular expressions. Sometimes | ||
+ | simple, exact string-matching is enough. R uses string matching in character equality (<code>==</code>) and, by extension, in the set operation functions (<code>union()</code>, <code>intersect()</code> etc.), in <code>match()</code>, and in the <code>%in%</code> operator. | |
+ | |||
+ | <source lang="R"> | ||
+ | |||
+ | vA <- c("the", "quick", "brown", "fox", "jumps", "over", "a", "lazy", "dog") | ||
− | ==== | + | vA[2] == "quick" # TRUE |
+ | vA[2] == "quack" # FALSE | ||
− | + | vA == "fox" | |
− | + | # match tests for string equality | |
+ | match("fox", vA) # 4, i.e. the 4th element matches the string | ||
+ | match("o", vA) # NA: matches have to be to the WHOLE element | ||
− | + | # match("fox", vA) is equivalent to... | |
+ | which(vA == "fox") | ||
− | + | # %in% can be used for creating intersections | |
+ | # find whether elements from one vector are | ||
+ | # contained in another: | ||
− | + | vB <- c("Quacking", "the", "duck", "wings", "over", "my", "cozy", "cot") | |
− | |||
− | |||
− | + | vA %in% vB | |
+ | vB %in% vA # note that the length of the return vector is the same as the | ||
+ | # length of the first argument. So read this as: | ||
+ | # "Which of my vB are also in vA" | ||
− | + | # We can use this to subset the vector with elements that are present in | |
+ | # both: | ||
− | + | vB[vB %in% vA] | |
− | + | # which is, of course, the intersection set operation. | |
− | + | intersect(vA, vB) | |
− | |||
− | |||
</source> | </source> | ||
− | ... | + | {{Vspace}} |
+ | |||
+ | ===Functions that use regular expressions=== | ||
+ | |||
+ | The general online help page is [http://stat.ethz.ch/R-manual/R-patched/library/base/html/regex.html '''here''']. Remember: R's default behaviour is extended POSIX. To be sure of the regex dialect pass the <code>perl = TRUE</code> parameter. | ||
+ | |||
+ | |||
+ | ====grep()==== | ||
− | + | <!-- for updates, see code in R_Exercise -Bioinformatics "Sequence.R script --> | |
− | + | <source lang="R"> | |
− | == | + | # grep() is like match(), but uses regular expressions. A variant of grep() that |
+ | # returns a boolean vector - like "==" does - is grepl(). That is useful | ||
+ | # because we can & or | the vector, or invert it with ! . | |
− | + | grep("fox", vA) | |
+ | grep("o", vA) # Aha! now we get all elements that contain an "o" - | ||
+ | # Because we get partial matches with regular expressions. | ||
+ | vA[grep("o", vA)] # subset | ||
+ | |||
+ | grepl("o", vA) # logical | ||
+ | ! grepl("o", vA) # its inverse | ||
+ | |||
+ | vA[! grepl("o", vA)] # subset all words without "o" | ||
− | |||
− | |||
</source> | </source> | ||
− | + | ====Subsetting example==== | |
+ | |||
+ | Consider the following regular expression: | ||
+ | |||
+ | <source lang="R"> | ||
− | + | patt <- "^\\s*#" | |
− | |||
− | |||
</source> | </source> | ||
− | + | This matches if the string it is applied to begins with a "#", which may or may not be preceded by whitespace. This is useful for identifying (and then ignoring) comment lines in a data file. |
The regular expression above is decomposed as follows: | The regular expression above is decomposed as follows: | ||
− | |||
− | |||
#<code>^</code> the beginning of the line | #<code>^</code> the beginning of the line | ||
− | #<code>\s</code> any whitespace character ... | + | #<code>\\s</code> any whitespace character ... |
#<code>*</code> ... repeated 0 or more times | #<code>*</code> ... repeated 0 or more times | ||
#<code>#</code> the hash character | #<code>#</code> the hash character | ||
− | |||
− | |||
− | <source lang=" | + | The following example would read a file into a vector of lines, then drop all lines that are empty, and all lines that are comments. This is a straightforward idiom. |
− | + | ||
− | + | <source lang="R"> | |
− | + | ||
+ | IN <- "test.txt" | ||
+ | patt <- "^\\s*#" | ||
− | + | myData <- readLines(IN) | |
− | + | myData <- myData[myData != ""] # drop all elements that are the empty string | |
− | + | myData <- myData[! grepl(patt, myData)] # drop all elements that match the pattern |
− | |||
− | |||
− | |||
− | |||
− | |||
</source> | </source> | ||
− | ====Substitution - | + | {{Vspace}} |
+ | |||
+ | ==== Substitution - gsub() ==== | ||
− | + | Think of "gsub" as "global substitution", and you'll understand that there exists another function, <code>sub()</code>, that replaces only the first occurrence of a pattern, rather than all of them as <code>gsub()</code> does. I can't imagine what the use case for that might be and I don't think I have ever used that. I get an intuitive sense that code that needs such a function should probably be reconceived. But <code>gsub()</code> is very useful. |
− | + | <source lang="R"> | |
− | + | (s <- " 1 MKLAACFLTL LPGFAVA... 17 ") # E-coli Alpha Amylase signal peptide | |
− | |||
− | |||
− | + | # Drop everything from this string that is not an amino acid one-letter code. | |
+ | # We use gsub() to first identify all non-amino acid letters with a character | ||
+ | # class regular expression, then we replace each occurrence with the empty | ||
+ | # string. | ||
+ | |||
+ | gsub("[^ACDEFGHIKLMNPQRSTVWY]", "", s) | ||
+ | |||
+ | # or, with assignment: ... | ||
+ | s <- gsub("[^ACDEFGHIKLMNPQRSTVWY]", "", s) | ||
− | |||
− | |||
</source> | </source> | ||
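
To see the difference between the two functions side by side, here is a minimal sketch. The record string is invented; it shows one situation where replacing only the first occurrence may be what is wanted, namely when the separator can also appear inside the value.

```r
# Hypothetical "key=value" record whose value itself contains a "=".
rec <- "description=EC=3.2.1.1 alpha-amylase"

firstOnly <- sub("=", ": ", rec)    # only the first "=" is replaced
allOfThem <- gsub("=", ": ", rec)   # every "=" is replaced

firstOnly
allOfThem
```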
− | |||
− | + | ====strsplit() ==== | |
− | |||
− | |||
− | |||
− | + | Another function that makes use of regular expressions is <code>strsplit()</code>. It takes a vector of strings, and returns a list, one element for each element of the vector, in which each string has been split up by separating it along a regular expression. | |
− | |||
− | |||
− | |||
− | + | <source lang="R"> | |
+ | x <- c("a b c", "1 2") | ||
+ | strsplit(x, " ") | ||
+ | # [[1]] | ||
+ | # [1] "a" "b" "c" | ||
+ | # | ||
+ | # [[2]] | ||
+ | # [1] "1" "2" | ||
</source> | </source> | ||
+ | Since even a single string returns a list, you often have to extract the element you want as a vector for further use. | ||
− | + | <source lang="R"> | |
+ | corvidae <- c("crow:jackdaw:jay:magpie:raven:rook") | ||
+ | strsplit(corvidae, ":") | ||
+ | |||
+ | unlist(strsplit(corvidae, ":")) | ||
+ | strsplit(corvidae, ":")[[1]] | ||
− | + | # Consider: | |
− | + | length(strsplit(corvidae, ":")) | |
+ | length(unlist(strsplit(corvidae, ":"))) | ||
</source> | </source> | ||
− | |||
− | + | <code>strsplit()</code> is immensely useful to extract elements from strings with a relatively well defined structure. | |
− | + | <source lang="R"> | |
+ | s <- "1, 1, 2, 3, 5, 8" | ||
+ | strsplit(s, ", ")[[1]] # split on comma-space | ||
+ | s <- "~`!@#$%^&*()_-=+[{]}\\|;:',<.>/?" | |
+ | strsplit(s, "")[[1]] # split on empty string | ||
− | + | s <- "chronological lifespan:\tincreased\ncold sensitivity:\tincreased\nsporulation:\tnormal" | |
+ | strsplit(s, "\\t|\\n")[[1]] # split on tab or newline | ||
− | + | </source> | |
− | |||
− | |||
+ | {{Vspace}} | ||
− | == | + | ==Behaviour== |
− | + | {{Vspace}} | |
− | |||
− | |||
− | |||
− | + | ====Capturing matches ==== | |
+ | Finding and '''returning''' matches in R is a two-step process. (1) find matches with <code>regexpr()</code> (one match), <code>gregexpr()</code> (all matches), or <code>regexec()</code> (sub-expressions in parentheses). All of these return a "match object". (2) use the match object to extract the matching substrings from the original string. | ||
− | |||
− | + | <source lang="R"> | |
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
+ | # Extracting gene names in text. | ||
− | + | # Let's define a valid gene name to be a substring that is bounded by | |
+ | # word-boundaries, starts with an upper-case character, contains more upper-case | ||
+ | # characters or numbers or a hyphen or underscore, with a minimal length of 3. | ||
+ | # Here is a regex, and we put the part of the string that we want to recover in | ||
+ | # parentheses: | ||
− | + | patt <- "\\b([A-Z][A-Z0-9\\-_]+[A-Z0-9])\\b" | |
− | ; | + | # Test: positives |
− | + | grepl(patt, "MBP1") | |
− | + | grepl(patt, "AAT") | |
− | + | grepl(patt, " AI1") | |
+ | grepl(patt, "ASP3-1 ") | ||
+ | grepl(patt, " AI5_ALPHA; ") | ||
+ | grepl(patt, " (TY1B-PR3) ") | ||
+ | # Test: negatives | ||
+ | grepl(patt, "G1") # Too short | ||
+ | grepl(patt, "G1-") # Hyphen at end | ||
+ | grepl(patt, "Cell") # contains lower-case | ||
− | + | # Let's apply this to retrieve gene names in text | |
− | |||
− | |||
− | |||
− | |||
− | + | s <- "Transcriptional activation of CLN1, CLN2, and a putative new G1 cyclin (HCS26) by SWI4, a positive regulator of G1-specific transcription. Cell 66(5):1015-26" | |
− | + | (m <- regexpr(patt, s)) # found a match in position 31 | |
+ | regmatches(s, m) # retrieve it | |
− | < | + | (m <- gregexpr(patt, s)) # found all matches; the first is at position 31 |
− | # | + | regmatches(s, m) # retrieve them (note, this is a list) |
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | + | # The function of choice however is regexec(). It returns whatever the pattern | |
+ | # has defined in parentheses, the others return the entire match. The | ||
+ | # parentheses are quite important, because we might want to specify additional | |
+ | # context for a valid match, but we might not want the context in the match | ||
+ | # itself. In our example we used word boundaries - \\b - for such context; but | ||
+ | # these are zero-length and don't actually match a character, so they don't | ||
+ | # contaminate the substring anyway. But in general we need to be able to | ||
+ | # precisely define what we want. | ||
− | + | (m <- regexec(patt, s)) # the whole match and the parenthesized substring |
+ | regmatches(s, m) # retrieve it | ||
− | + | # Note that there are two elements: the first is the whole match, the second | |
+ | # is the substring that is in parentheses. In our example these are the same. | ||
+ | # Here is an example where they are not: | ||
+ | s <- "Find the last word. And tell me." | ||
+ | (m <- regexec("\\s(\\w+)\\.", s)) | ||
+ | regmatches(s, m) # retrieve it | ||
− | + | # Unfortunately, in versions of R before 4.1.0 there is no option to capture |
− | + | # multiple matches in base R: regexec() lacks a corresponding gregexec() there... |
− | + | patt <- "\\b([A-Z][A-Z0-9\\-_]+[A-Z0-9])\\b" | |
− | + | s <- "Transcriptional activation of CLN1, CLN2, and a putative new G1 cyclin (HCS26) by SWI4, a positive regulator of G1-specific transcription. Cell 66(5):1015-26" | |
− | |||
− | |||
− | |||
− | |||
− | |||
− | + | # Solution 1 (base R): you can use multiple matches in an sapply() | |
− | + | # statement... | |
− | + | sapply(regmatches(s, gregexpr(patt, s))[[1]], | |
− | + | function(M){regmatches(M, regexec(patt, M))}) | |
+ | # Solution 2 (probably preferred): you can use | ||
+ | # str_match_all() from the very useful library "stringr" ... | ||
+ | if (!require(stringr, quietly=TRUE)) { | ||
+ | install.packages("stringr") | ||
+ | library(stringr) | ||
+ | } | ||
− | + | str_match_all(s, patt) | |
+ | str_match_all(s, patt)[[1]][,2] | ||
+ | # [1] "CLN1" "CLN2" "HCS26" "SWI4" | ||
− | + | # Note that str_match_all() handles the match object internally, no need for | |
− | + | # the two-step code. | |
− | |||
− | |||
− | |||
− | |||
− | + | </source> | |
− | |||
− | |||
+ | An interesting new alternative/complement to the base '''R''' regex libraries is the {{R|ore|ore()|package "'''ore'''"}} that uses the {{WP|Oniguruma}} libraries and supports multiple character encodings. | ||
− | |||
− | |||
− | |||
− | |||
− | + | <source lang="R"> | |
− | + | if (!require(ore)) { | |
− | + | install.packages("ore") | |
− | + | library(ore) | |
− | + | } | |
− | + | S <- "The quick brown fox jumps over a lazy dog" | |
− | |||
− | + | ore.search(". .", S) | |
+ | ore.search(". .", S, all=TRUE) | ||
+ | M <- ore.search(". .", S, all=TRUE) | ||
+ | M$nMatches | ||
+ | M$match[2:4] | ||
</source> | </source> | ||
+ | According to the author Jon Clayden, key advantages include: | |
+ | * Search results focus around the matched substrings (including parenthesised groups), rather than the locations of matches. This saves extra work with "substr" or similar to extract the matches themselves. | |
+ | * [http://rpubs.com/jonclayden/regex-performance Substantially better performance], especially when matching against long strings. | |
+ | * Substitutions can be functions as well as strings. | ||
+ | * Matches can be efficiently obtained over only part of the strings. | ||
+ | * Fewer core functions, with more consistent names. | ||
− | + | {{Vspace}} | |
− | + | ===Modifiers=== | |
− | |||
− | |||
− | |||
− | + | A number of modifiers can be applied as arguments to regular expression functions that may be useful. Here are the two most important ones. | |
− | |||
− | |||
− | |||
− | + | <source lang="R"> | |
− | |||
− | |||
− | |||
− | + | # Option "ignore.case" allows case-insensitive matches. This is usually | |
− | + | # poor programming style, a more explicit (= better) way is to define your | |
+ | # character classes appropriately. | ||
+ | patt <- "\\b([A-Z][A-Z0-9\\-_]+[A-Z0-9])\\b" | ||
− | + | s <- "The MBP1 gene encodes the Mbp1 protein." | |
− | + | m <- gregexpr(patt, s) | |
− | + | regmatches(s, m)[[1]] | |
− | + | m <- gregexpr(patt, s, ignore.case = TRUE) | |
+ | regmatches(s, m)[[1]] | ||
− | |||
− | |||
− | |||
− | |||
− | + | # For regex functions in the stringr package, you can compile the pattern | |
− | + | # with the regex() function, and include the option "comments = TRUE". This | |
+ | # allows you to insert whitespace and # characters into the pattern | ||
+ | # which will be ignored by the regex engine. Thus you can comment | |
+ | # complex regular expressions inline. | ||
− | + | library(stringr) | |
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | + | myRegex <- regex("\\b # word boundary | |
− | + | ( # begin capture | |
− | + | [A-Z] # one uppercase letter | |
+ | [A-Z0-9\\-_]+ # one or more letters, numbers, hyphen or | ||
+ | # underscore | ||
+ | [A-Z0-9] # one letter or number. | ||
+ | # Note: this captured subexpression has a minimum length of 3. | ||
+ | ) # end capture | ||
+ | \\b", # word boundary | ||
+ | comments = TRUE) | ||
− | + | str_match_all(s, myRegex)[[1]][2] | |
− | |||
− | |||
</source> | </source> | ||
− | |||
− | + | ===Greed=== | |
+ | By default, quantifiers are "greedy", i.e. they will match the largest possible number of characters. For example: | |
− | <source lang=" | + | <source lang="R"> |
− | + | s <- "abc123" | |
− | + | ||
+ | patt <- "(\\w+)(\\d+)" # word characters, followed by digits | ||
+ | str_match_all(s, patt)[[1]][-1] | ||
+ | # yields "abc12" and "3" . This is because \w+ is greedy and grabs as many alphanumeric characters as it can before \d+ gets a chance to match. A "?" after a quantity specifier makes it non-greedy, therefore ... | ||
− | + | patt <- "(\\w+?)(\\d+)" | |
+ | str_match_all(s, patt)[[1]][-1] | ||
− | + | # ... now \d+ gets a chance to match as many digits as possible | |
+ | </source> | ||
− | + | {{Vspace}} | |
− | + | ==Regular Expressions in other languages== | |
− | |||
− | |||
− | + | {{Vspace}} | |
− | |||
− | |||
+ | ===PHP=== | ||
− | |||
<source lang="PHP"> | <source lang="PHP"> | ||
<?php | <?php | ||
Line 619: | Line 738: | ||
{{Vspace}} | {{Vspace}} | ||
− | + | ===Python=== | |
− | |||
− | == | ||
Python regular expressions are provided through the module <code>re</code>. See [https://docs.python.org/2/library/re.html '''here''' for documentation]. | Python regular expressions are provided through the module <code>re</code>. See [https://docs.python.org/2/library/re.html '''here''' for documentation]. | |
Line 634: | Line 751: | ||
− | + | ====Example==== | |
− | |||
− | ===Example=== | ||
Download [http://biochemistry.utoronto.ca/steipe/abc/CourseMaterials/BCB410/sample.svg '''this <code>.svg</code> file'''] to experiment. | Download [http://biochemistry.utoronto.ca/steipe/abc/CourseMaterials/BCB410/sample.svg '''this <code>.svg</code> file'''] to experiment. | ||
Line 685: | Line 800: | ||
{{Vspace}} | {{Vspace}} | ||
− | + | ===Javascript=== | |
− | |||
− | == | ||
Javascript is probably the most convenient choice if you want to interact directly with the DOM (Document Object Model) of a Web page. Here is example code for a "bookmarklet" that rewrites the URL of a journal-article page for access from outside the UofT network. | Javascript is probably the most convenient choice if you want to interact directly with the DOM (Document Object Model) of a Web page. Here is example code for a "bookmarklet" that rewrites the URL of a journal-article page for access from outside the UofT network. | ||
Line 857: | Line 818: | ||
{{Vspace}} | {{Vspace}} | ||
− | + | ===POSIX (Unix, the bash shell)=== | |
− | |||
− | == | ||
Use in: | Use in: | ||
*<code>grep</code> | *<code>grep</code> | ||
Line 875: | Line 834: | ||
{{Vspace}} | {{Vspace}} | ||
− | + | ==Practice== | |
− | |||
− | == | ||
− | + | {{ABC-unit|RPR-RegEx.R}} | |
{{Vspace}} | {{Vspace}} | ||
==Appendix I: Metacharacters and their meaning== | ==Appendix I: Metacharacters and their meaning== | ||
Line 1,487: | Line 860: | ||
</table> | </table> | ||
− | + | {{Vspace}} | |
==Appendix II: Character classes and their meaning== | ==Appendix II: Character classes and their meaning== | ||
Line 1,510: | Line 883: | ||
</table> | </table> | ||
− | + | {{Vspace}} | |
==Appendix III: Anchor codes and their meaning== | ==Appendix III: Anchor codes and their meaning== | ||
Line 1,527: | Line 900: | ||
</table> | </table> | ||
− | + | {{Vspace}} | |
==Appendix IV: Modifiers and their meaning== | ==Appendix IV: Modifiers and their meaning== | ||
Line 1,541: | Line 914: | ||
</table> | </table> | ||
− | + | {{Vspace}} | |
− | |||
====A Brief First Encounter of Regular Expressions==== | ====A Brief First Encounter of Regular Expressions==== | ||
Line 1,548: | Line 920: | ||
{{Vspace}} | {{Vspace}} | ||
One of the '''R''' functions that uses regular expressions is the function <code>gsub()</code>. It replaces characters that match a "regex" with other characters. That is useful for our purpose: we can | One of the '''R''' functions that uses regular expressions is the function <code>gsub()</code>. It replaces characters that match a "regex" with other characters. That is useful for our purpose: we can | ||
Line 1,612: | Line 938: | ||
== Further reading, links and resources == | == Further reading, links and resources == | ||
+ | |||
<div class="reference-box">[https://en.wikipedia.org/wiki/Regular_expression Regular expressions (Wikipedia)]</div> | <div class="reference-box">[https://en.wikipedia.org/wiki/Regular_expression Regular expressions (Wikipedia)]</div> | ||
<div class="reference-box">[http://stat.ethz.ch/R-manual/R-patched/library/base/html/regex.html '''R''' regular expressions]</div> | <div class="reference-box">[http://stat.ethz.ch/R-manual/R-patched/library/base/html/regex.html '''R''' regular expressions]</div> | ||
<div class="reference-box">[http://regexpal.com/ '''RegexPal''' - a javascript regex tester]</div> | <div class="reference-box">[http://regexpal.com/ '''RegexPal''' - a javascript regex tester]</div> | ||
+ | <div class="reference-box">Visit [http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags the stackoverflow thread on regex and HTML parsing]. What's your opinion on the OP's question?</div> | ||
<div class="reference-box">[http://xkcd.com/208/ '''XKCD''']</div> | <div class="reference-box">[http://xkcd.com/208/ '''XKCD''']</div> | ||
+ | |||
{{Vspace}} | {{Vspace}} | ||
Line 1,676: | Line 1,005: | ||
:2017-08-05 | :2017-08-05 | ||
<b>Modified:</b><br /> | <b>Modified:</b><br /> | ||
− | :2017- | + | :2017-10-01 |
<b>Version:</b><br /> | <b>Version:</b><br /> | ||
− | :0 | + | :1.0 |
<b>Version history:</b><br /> | <b>Version history:</b><br /> | ||
+ | *1.0 First live version, translated from Perl examples in old version | ||
*0.1 First stub | *0.1 First stub | ||
</div> | </div> |
Revision as of 16:33, 2 October 2017
Regular Expressions (regex) with R
Keywords: Regular expressions
Contents
- 1 Abstract
- 2 This unit ...
- 3 Contents
- 4 First steps
- 5 Regular Expressions in R
- 6 Syntax
- 7 Behaviour
- 8 Regular Expressions in other languages
- 9 Practice
- 10 Appendix I: Metacharacters and their meaning
- 11 Appendix II: Character classes and their meaning
- 12 Appendix III: Anchor codes and their meaning
- 13 Appendix IV: Modifiers and their meaning
- 14 Further reading, links and resources
- 15 Notes
- 16 Self-evaluation
This unit is under development. There is some contents here but it is incomplete and/or may change significantly: links may lead to nowhere, the contents is likely going to be rearranged, and objectives, deliverables etc. may be incomplete or missing. Do not work with this material until it is updated to "live" status.
Abstract
Using regular expressions is a key skill for all aspects of data science, and, by extension, for bioinformatics. There is a bit of a learning curve involved, but once you are even slightly comfortable with their syntax and use in code, you will appreciate their power in automating otherwise unbearably tedious tasks. This unit introduces the principles, the syntax, and provides a fair amount of practice.
This unit ...
Prerequisites
You need to complete the following units before beginning this one:
Objectives
This unit will ...
- ... introduce regular expressions;
- ... demonstrate their use in R functions;
- ... teach how to apply them in common tasks.
Outcomes
After working through this unit you ...
- ... can express pattern-matching tasks as regular expressions;
- ... are familiar with online regex testing sites that help you troubleshoot your expressions;
- ... have begun to write code that uses them for a variety of purposes.
Deliverables
- Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
- Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.
- Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.
Evaluation
Evaluation: NA
- This unit is not evaluated for course marks.
Contents
First steps
A Regular Expression is a specification of a pattern of characters. The typical use of a regular expression is to find occurrences of the pattern in a string. Regular expressions have a flexible syntax that allows them to handle a range of tasks - from trivial substring matches to complex nested motifs. The syntax of regular expressions is a programming language in its own right, and is a powerful way of concisely and uniquely defining a pattern.
Regular expressions are examples of deterministic pattern matching - they either match a particular expression or not. This is in contrast to probabilistic pattern matching in which a pattern is more or less similar to a query.
Truth be told, many programmers have a love-hate relationship with regular expressions. The syntax of regular expressions is very powerful and expressive, but also terse, not always intuitive, and sometimes hard to understand. Let's try a few simple things:
Here is a string to play with: the sequence of Mbp1, copied from the NCBI Protein database page for yeast Mbp1.
  1 msnqiysary sgvdvyefih stgsimkrkk ddwvnathil kaanfakakr trilekevlk
 61 ethekvqggf gkyqgtwvpl niakqlaekf svydqlkplf dftqtdgsas pppapkhhha
121 skvdrkkair sastsaimet krnnkkaeen qfqsskilgn ptaaprkrgr pvgstrgsrr
181 klgvnlqrsq sdmgfprpai pnssisttql psirstmgpq sptlgileee rhdsrqqqpq
241 qnnsaqfkei dledglssdv epsqqlqqvf nqntgfvpqq qssliqtqqt esmatsvsss
301 pslptspgdf adsnpfeerf pgggtspiis miprypvtsr pqtsdindkv nkylsklvdy
361 fisnemksnk slpqvllhpp phsapyidap idpelhtafh wacsmgnlpi aealyeagts
421 irstnsqgqt plmrsslfhn sytrrtfpri fqllhetvfd idsqsqtvih hivkrksttp
481 savyyldvvl skikdfspqy rielllntqd kngdtalhia skngdvvffn tlvkmgaltt
541 isnkegltan eimnqqyeqm miqngtnqhv nssntdlnih vntnnietkn dvnsmvimsp
601 vspsdyityp sqiatnisrn ipnvvnsmkq masiyndlhe qhdneikslq ktlksisktk
661 iqvslktlev lkesskdeng eaqtnddfei lsrlqeqntk klrkrliryk rlikqkleyr
721 qtvllnklie detqattnnt vekdnntler lelaqeltml qlqrknklss lvkkfednak
781 ihkyrriire gtemnieevd ssldvilqtl iannnknkga eqiitisnan sha
//
Task:
Navigate to http://regexpal.com and paste the sequence into the lower box. This site is one of a number of online regular expression testers; their immediate, visual feedback is invaluable when you are developing regular expression patterns.
Let's try some expressions:
- Most characters are matched literally.
  - Type "a" into the upper box and you will see all "a" characters matched. Then replace a with q.
  - Now type "aa" instead. Then krnnkk. Sequences of characters are also matched literally.
- The pipe character | that symbolizes logical OR can be used to define that more than one character should match.
  - i(s|m|q)n matches isn OR imn OR iqn. Note how we can group with parentheses, and try what would happen without them.
- We can more conveniently specify more than one character to match if we place it in square brackets. This is a "character class". We will encounter those frequently.
  - [lq] matches l OR q.
  - [milcwyf] matches hydrophobic amino acids.
- Within square brackets, we can specify "ranges".
  - [1-5] matches digits from 1 to 5.
- Within square brackets, we can specify characters that should NOT be matched, with the "caret", ^.
  - [^0-9] matches everything EXCEPT digits.
  - [^a-z] matches everything that is not a lower-case letter. That's what we need to remove characters that do not represent amino acids (try it).
Make frequent use of this site to develop your regular expressions step by step.
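The same patterns can also be tried directly in R. The following is a minimal sketch, not part of the original exercise; the test string is an assumption (the first ten residues of the sequence above), and the variable names are only for illustration.

```r
# First ten residues of the Mbp1 sequence shown above
s <- "msnqiysary"

# Literal match: does the string contain "iy"?
hasIY <- grepl("iy", s)

# Alternation with grouping: "q" followed by "i" OR "a"
alt <- regmatches(s, regexpr("q(i|a)", s))

# Character class: all hydrophobic residues, one match per residue
hyd <- regmatches(s, gregexpr("[milcwyf]", s))[[1]]

# Negated class: remove everything that is not a lowercase letter
clean <- gsub("[^a-z]", "", "  1 msnqiysary sgvdvyefih")

hasIY   # TRUE
alt     # "qi"
hyd     # "m" "i" "y" "y"
clean   # "msnqiysarysgvdvyefih"
```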
Theory
According to the Chomsky hierarchy regular expressions are a Type-3 (regular) grammar, thus their use forms a regular language. Therefore, like all Type-3 grammatical expressions they can be decided by a finite-state machine, i.e. a "machine" that is defined by possible states, and triggering conditions that control transitions between states. Think of such automata as a (possibly elaborate) if ... else construct. The "regex" processor translates the search pattern into such an automaton, which is then applied to the search domain - the string in which the occurrence of the pattern is to be sought.
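To make the idea of a finite-state machine concrete, here is a small hand-written automaton for one simple pattern, compared against R's regex engine. This is only a sketch for illustration: the function matchesABC() is hypothetical, and real regex engines build their automata quite differently.

```r
# Hand-written finite-state machine that decides the pattern "^ab*c$":
# one "a", any number of "b", then exactly one "c" at the end.
# (matchesABC is a hypothetical helper, for illustration only.)
matchesABC <- function(s) {
  state <- "start"
  for (ch in strsplit(s, "")[[1]]) {
    state <- switch(state,
      start  = if (ch == "a") "seenA" else "fail",
      seenA  = if (ch == "b") "seenA" else if (ch == "c") "accept" else "fail",
      accept = "fail",   # any character after the final "c" is one too many
      fail   = "fail")   # once failed, always failed
  }
  state == "accept"
}

x <- c("ac", "abbbc", "abca", "bc")
fsm <- vapply(x, matchesABC, logical(1), USE.NAMES = FALSE)
rgx <- grepl("^ab*c$", x)

fsm   # TRUE TRUE FALSE FALSE
rgx   # the regex engine gives the same answers
```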
What are they good for
Regular expressions support virtually all pattern matching tasks in data clean-up, extracting information items, data mining, "screen scraping", parsing of files, subsetting large tables, etc. etc. This means, they must be part of your everyday toolkit.
When should they not be used
Since regular expressions are Type-3 grammars, they will fail when trying to parse any more complex grammar. This means, you can't reliably parse XML - and in particular HTML - with regular expressions. There is a long discussion on this particular topic however, e.g. see here, and many other similar threads on stackoverflow, and see here for a discussion of when regular expressions should not be used. Use a real XML parser instead.
Perl and POSIX
Two dialects of regular expressions exist; they differ in some details of syntax. One is the nearly universal "Perl" dialect (Perl is a programming language), the other one is the "POSIX" standard that nearly no one uses. Except R. Tragically, in R the POSIX standard is the default. Fortunately this often does not make a difference, and we can explicitly turn this nonsense off. But just like the utterly annoying stringsAsFactors = FALSE, we need to type perl = TRUE much more often than we would like. Somebody, some time, made a wrong design decision and thousands of wasted man- and woman-hours later we are still stuck with the consequences. If you use regular expressions according to the POSIX standard, you have to learn the Perl standard anyway. But then you can just use the Perl standard in the first place. The Wikipedia page on Regular Expressions has a table with a side-by-side comparison of the different ways the two standards express character classes.
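One place where the dialect does make a difference is Perl-only syntax such as lookahead. The sketch below assumes a short example string; it shows a lookahead pattern working with perl = TRUE, and wraps the same call without perl = TRUE in tryCatch(), since the default engine is expected to reject the pattern.

```r
s <- "MBP1"

# A lookahead "(?=...)" is Perl syntax: match "MBP" only if a digit follows,
# without making the digit part of the match.
perlResult <- grepl("MBP(?=[0-9])", s, perl = TRUE)

# The same pattern passed to the default engine: the call is wrapped in
# tryCatch() so that a pattern-compilation error is reported as a string
# rather than stopping the script.
defaultResult <- tryCatch(grepl("MBP(?=[0-9])", s),
                          error = function(e) "error")

perlResult      # TRUE
defaultResult   # not TRUE - the default engine does not understand lookahead
```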
Regular Expressions in R
Regular expressions in R can be used
- to match patterns in strings for use in if() or while() conditions, or to retrieve specific instances of patterns with grep();
- to substitute occurrences of patterns in strings with other strings with gsub();
- to split strings into substrings that are delimited by the occurrence of a pattern with strsplit(); and more.
Accordingly, a basic knowledge of regular expressions is needed to read and write code, especially code that parses text.
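The three uses listed above can be sketched in a few lines. The example string and variable names are assumptions chosen for illustration.

```r
s <- "CLN1, CLN2, and SWI4"

# 1. Match: use grepl() to drive a condition
if (grepl("SWI[0-9]", s)) {
  found <- "SWI gene mentioned"
} else {
  found <- "no SWI gene"
}

# 2. Substitute: remove all digits
noDigits <- gsub("[0-9]", "", s)

# 3. Split: break the string at ", " optionally followed by "and "
parts <- strsplit(s, ", (and )?")[[1]]

found      # "SWI gene mentioned"
noDigits   # "CLN, CLN, and SWI"
parts      # "CLN1" "CLN2" "SWI4"
```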
Syntax
Regular expressions in R are strings enclosed in quotation marks. "a" is a regular expression. It specifies the single, literal character a exactly.
Specifying symbols
The power of regular expressions lies in their flexible syntax that allows us to specify character ranges, classes of characters, unspecified characters, alternatives, and much more. This sometimes can be confusing, because the symbols that specify ranges, options, wildcards and the like are of course themselves characters. Characters that specify information about other characters are called metacharacters; these include ".", "?", "+", "*", "[" and "]", "{" and "}" and more. And the opposite is also possible: some plain characters can be turned into metacharacters to symbolize character classes.
The "\" - escape character - allows us to distinguish when a character is to be taken literally and when it is to be interpreted as a metacharacter. Note that some symbols have to be escaped to be read literally, while some letters have to be escaped to be read as metacharacters.
But there is a catch in R, relating to when the escape character is interpreted. Remember that "\n" is a linebreak in a string, "\t" is a tab, etc. Obviously if you write "\?" (a literal question mark in a regex), or "\+" (a literal plus-sign in a regex) into a regular string, the mechanism that parses the string is going to see the escape character, then it expects an "n" or a "t" or the like - but what it gets instead is something it doesn't know. So it throws an error. Try:
"\n" # fine
"\?" # Error: ...
But then how can we write something like "\?" when we need it? That becomes obvious when you consider what happens with the string: it gets sent to the regex engine for interpretation. Thus the regex engine needs to see: character "\", then character "?". So it needs two characters. The secret is: we need to prevent "\" from attaching to the next character, and specify it as a single character in its own right. We do that by escaping "\" itself - with a backslash. Thus "\\" is a literal "\" character - and can get sent to the regex engine.
"\\?" # ok
cat("\\?") # that's what the regex engine sees.
The consequence is: you need to write "\\" in R when you want the regex engine to see a single "\". That works differently from other programming languages that pass patterns to the regex engine as-is.
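A short sketch of what the doubled backslash does in practice. The test strings are assumptions; fixed = TRUE is shown as an alternative that switches regular-expression interpretation off altogether.

```r
patt <- "\\?"               # typed with two backslashes ...
pattLength <- nchar(patt)   # ... but the string holds only "\" and "?"

x <- c("Why?", "Why not")

# Escaped: "?" is matched literally
literal <- grepl(patt, x)

# Unescaped: "?" is a quantifier; "y?" (zero or one "y") matches every string
quantifier <- grepl("y?", x)

# fixed = TRUE treats the whole pattern as a literal string, no escaping needed
fixedMatch <- grepl("?", x, fixed = TRUE)

pattLength   # 2
literal      # TRUE FALSE
quantifier   # TRUE TRUE
fixedMatch   # TRUE FALSE
```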
Letters whose special meaning as a metacharacter is turned on with the escape character:
Character | Means |
---|---|
w | the letter "w" |
\w | a "word" character, i.e. one of A-Z, a-z, 0-9 and "_" |
s | the letter "s" |
\s | a "space" character, i.e. one of " ", tab or newline |
b | the letter "b" |
\b | a word boundary |
Metacharacters whose special meaning is turned off with the escape character:
Character | Means |
---|---|
+ | One or more repetitions of the preceding expression |
\+ | the literal character "+" |
\ | the escape character |
\\ | the literal character "\" |
. | any single character except the newline (\n) |
\. | a literal period |
Note that these examples are not exhaustive. A (nearly) complete table of symbols and meanings is given in the appendix.
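The effect of turning special meanings on and off can be checked with a few calls. This is a minimal sketch with an assumed test string.

```r
s <- "v1.2 + v1x2"

# "\\." is a literal period: only the real "." is replaced
literalDot <- gsub("\\.", "_", s)

# "." is a metacharacter: every single character is replaced
anyChar <- gsub(".", "_", s)

# "\\w+" turns the letter w into the "word character" class
words <- regmatches(s, gregexpr("\\w+", s))[[1]]

# "\\+" is a literal plus sign
hasPlus <- grepl("\\+", s)

literalDot   # "v1_2 + v1x2"
anyChar      # "___________"  (eleven underscores)
words        # "v1" "2" "v1x2"
hasPlus      # TRUE
```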
Character Classes
Square brackets specify when more than one specific character can match at a position.
Expression | Means |
---|---|
[acgtACGT] | Any non-degenerate nucleotide |
For example:
"[AGR]AATT[CTY]"
matches all occurrences of an ApoI restriction site, either specified explicitly, or through the nucleotide ambiguity codes R (purines) or Y (pyrimidines).
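As a sketch of how this pattern behaves, the following code searches a made-up DNA string (an assumption, not real sequence data) that contains two ApoI sites and one near-miss.

```r
# Made-up test sequence: AAATTC and GAATTT are ApoI sites, GAATTG is not
dna <- "GGAAATTCCTTGAATTTAAGAATTGC"

m <- gregexpr("[AGR]AATT[CTY]", dna)
sites  <- regmatches(dna, m)[[1]]   # the matched substrings
starts <- as.integer(m[[1]])        # their start positions

sites    # "AAATTC" "GAATTT"
starts   # 3 12
```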
Within character sets, hyphens can specify character ranges.
Expression | Means |
---|---|
[a-z] | lowercase letters |
[0-9] | digits |
[0-9+*/=^\\-] | digits and arithmetic symbols (Note the escaped hyphen) |
If you want to match a literal hyphen, you must escape it. Within character sets, some metacharacters that otherwise have special meanings usually do not need to be escaped.
The complement
The caret character "^" denotes the complement of a character set; i.e. everything that is not that expression.
Expression | Means |
---|---|
[^9] | Everything but the digit "9" |
[^ACGT] | Not a nucleotide code letter |
Note that outside of square brackets, the "^" character is an "anchoring code" and means "beginning of the string". This can be confusing.
For character classes, the class in upper case denotes the complement. This can also be confusing!
Character | Means |
---|---|
\w | a word character |
\W | not a word character |
\s | a space character |
\S | not a space character |
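Both kinds of complement, and the different role of "^" outside square brackets, can be seen in a short sketch. The test strings are assumptions.

```r
s <- "ACGTNNACGT-ACGU"

# "^" inside square brackets: everything that is NOT A, C, G or T
nonNuc <- regmatches(s, gregexpr("[^ACGT]", s))[[1]]

# Upper-case class: split at runs of non-word characters
words <- strsplit("the quick, brown fox", "\\W+")[[1]]

# "^" outside square brackets is an anchor: the string must START
# with a character that is not A, C, G or T
startsOdd <- grepl("^[^ACGT]", c("NACGT", "ACGTN"))

nonNuc      # "N" "N" "-" "U"
words       # "the" "quick" "brown" "fox"
startsOdd   # TRUE FALSE
```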
Specifying quantity
Special characters in regular expressions control how often a pattern must be present in order to match:
Expression | What it means | Example (meaning) |
---|---|---|
? | match zero or one times | "? (there may or may not be a quote mark) |
+ | match one or more | [A-Z]+ (there's at least one uppercase letter) |
* | match any number | .* (there may be some characters) |
{min,max} | match between min and max times (assumes 0 and infinity respectively if not specified) | [atAT]{20,200} (a stretch of between 20 and 200 upper- or lowercase As or Ts) |
For example:
"AAUAAA[ACGU]{10,30}$"
defines a polyadenylation site - a AAUAAA motif, followed by 10 to 30 of any nucleotide before the end of the RNA.
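A quick way to see the {min,max} quantifier at work is to test this pattern against sequences with different tail lengths. The three RNA strings below are constructed for illustration only.

```r
patt <- "AAUAAA[ACGU]{10,30}$"

# Constructed test sequences: the motif followed by 16, 4 and 40 nucleotides
rna <- c(paste0("GGCC", "AAUAAA", strrep("CU", 8)),    # 16 nt tail
         paste0("GGCC", "AAUAAA", "CUCU"),             #  4 nt tail: too short
         paste0("GGCC", "AAUAAA", strrep("CU", 20)))   # 40 nt tail: too long

hasSite <- grepl(patt, rna)
hasSite   # TRUE FALSE FALSE
```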
Specifying position (anchoring)
If a pattern must be matched at a particular location, special terms denote string anchors.
Anchoring Term | Meaning |
---|---|
^ | Start of a line or string |
$ | End of a line or string |
\A | Start of the string |
\Z | End of the string |
\G | Last global match end |
Of course defining a regular expression pattern does not yet do anything with it. Below are the most important R functions that use regular expressions. Write the small code samples that are provided below and test how the operators and regular expressions work.
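As a first small sample, here is a sketch of how the "^" and "$" anchors restrict where a pattern may match. The short sequences are made up for the example.

```r
x <- c("ATGAAATAG", "CCATGAAA", "ATG")

startsWithATG <- grepl("^ATG", x)    # "ATG" must be at the start
endsWithTAG   <- grepl("TAG$", x)    # "TAG" must be at the end
isExactlyATG  <- grepl("^ATG$", x)   # the whole string must be "ATG"

startsWithATG   # TRUE FALSE TRUE
endsWithTAG     # TRUE FALSE FALSE
isExactlyATG    # FALSE FALSE TRUE
```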
Functions that don't use regular expressions
Not all pattern searches in strings use (and need) regular expressions. Sometimes simple, exact string-matching is enough. R uses string matching in character equality (==) and, by extension, the set operation functions (union(), intersect() etc.), match(), and the %in% operator.
vA <- c("the", "quick", "brown", "fox", "jumps", "over", "a", "lazy", "dog")
vA[2] == "quick" # TRUE
vA[2] == "quack" # FALSE
vA == "fox"
# match tests for string equality
match("fox", vA) # 4, i.e. the 4th element matches the string
match("o", vA) # NA: matches have to be to the WHOLE element
# match("fox", vA) is equivalent to...
which(vA == "fox")
# %in% can be used for creating intersections
# find whether elements from one vector are
# contained in another:
vB <- c("Quacking", "the", "duck", "wings", "over", "my", "cozy", "cot")
vA %in% vB
vB %in% vA # note that the length of the return vector is the same as the
# length of the first argument. So read this as:
# "Which of my vB are also in vA"
# We can use this to subset the vector with elements that are present in
# both:
vB[vB %in% vA]
# which is, of course, the intersection set operation.
intersect(vA, vB)
Functions that use regular expressions
The general online help page is here. Remember: R's default behaviour is extended POSIX. To be sure of the regex dialect pass the perl = TRUE parameter.
grep()
# grep() is like match(), but uses regular expressions. A variant of grep() that
# returns a boolean vector - like "==" does - is grepl(). That is useful
# because we can & or | the vector, or invert it with ! .
grep("fox", vA)
grep("o", vA) # Aha! now we get all elements that contain an "o" -
# Because we get partial matches with regular expressions.
vA[grep("o", vA)] # subset
grepl("o", vA) # logical
! grepl("o", vA) # its inverse
vA[! grepl("o", vA)] # subset all words without "o"
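The string anchors introduced earlier combine naturally with grep(). The following sketch repeats the definition of vA so that it runs on its own; the argument value = TRUE makes grep() return the matching elements rather than their indices.

```r
vA <- c("the", "quick", "brown", "fox", "jumps", "over", "a", "lazy", "dog")

grep("^o", vA, value = TRUE)   # elements that start with "o": "over"
grep("x$", vA, value = TRUE)   # elements that end with "x":   "fox"

# Anchoring both ends forces a match of the WHOLE element, which makes
# grep() behave like match() for this simple case:
grep("^a$", vA)                # 7
match("a", vA)                 # 7

# Logical vectors from grepl() can be combined: elements that contain an "o"
# but do not start with one.
vA[grepl("o", vA) & ! grepl("^o", vA)]   # "brown" "fox" "dog"
```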
Subsetting example
Consider the following regular expression:
patt <- "^\\s*#"
This matches if the string it is applied to begins with a "#", which may or may not be preceded by whitespace. This is useful for recognizing, and then ignoring, comment lines in a data file.
The regular expression above is decomposed as follows:
- ^ the beginning of the line
- \\s any whitespace character ...
- * ... repeated 0 or more times
- # the hash character
The following example reads a file into a vector of lines, then drops all lines that are empty, and all lines that are comments. This is a straightforward idiom.
IN <- "test.txt"
patt <- "^\\s*#"
myData <- readLines(IN)
myData <- myData[myData != ""] # drop all elements that are the empty string
myData <- myData[! grepl(patt, myData)] # drop all elements that match the pattern
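To try this idiom without a file on disk, the lines can be supplied as a character vector. The sample lines below are invented for illustration. The sketch also shows one possible variation (an assumption, not part of the idiom above): a single pattern that drops comment lines and lines that are empty or contain only whitespace.

```r
# Stand-in for readLines(IN): a few made-up lines
myLines <- c("# a header comment",
             "",
             "   ",
             "gene\tvalue",
             "   # an indented comment",
             "MBP1\t0.5 # trailing note")

patt <- "^\\s*#"

# Two-step idiom: drop empty strings, then drop comment lines
twoStep <- myLines[myLines != ""]
twoStep <- twoStep[! grepl(patt, twoStep)]
twoStep    # the whitespace-only line "   " is still there

# Variation: after optional whitespace, require either "#" or end-of-line
patt2 <- "^\\s*(#|$)"
oneStep <- myLines[! grepl(patt2, myLines)]
oneStep    # only the two data lines remain
```

Note that the line with the trailing note is kept in both cases: the pattern is anchored to the start of the line, so a "#" later in the line does not make it a comment line.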
Substitution - gsub()
Think of "gsub" as "global substitution", and you'll understand that there exists another function, sub(), that replaces only the first occurrence of a pattern, rather than all of them as gsub() does. I can't imagine what the use case for that might be and I don't think I have ever used it. I get an intuitive sense that code that needs such a function should probably be reconceived. But gsub() is very useful.
(s <- " 1 MKLAACFLTL LPGFAVA... 17 ") # E-coli Alpha Amylase signal peptide
# Drop everything from this string that is not an amino acid one-letter code.
# We use gsub() to first identify all non-amino acid letters with a character
# class regular expression, then we replace each occurrence with the empty
# string.
gsub("[^ACDEFGHIKLMNPQRSTVWY]", "", s)
# or, with assignment: ...
s <- gsub("[^ACDEFGHIKLMNPQRSTVWY]", "", s)
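For comparison, here is a small sketch of how sub() and gsub() differ on the same input. The strings are made up for illustration.

```r
x <- "banana"
sub("a", "A", x)     # "bAnana"  - only the first match is replaced
gsub("a", "A", x)    # "bAnAnA"  - all matches are replaced

# The difference matters when only the first separator is meaningful:
y <- "ratio: 1:2"
sub(":", " =", y)    # "ratio = 1:2"
gsub(":", " =", y)   # "ratio = 1 =2"
```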
strsplit()
Another function that makes use of regular expressions is strsplit(). It takes a vector of strings, and returns a list, one element for each element of the vector, in which each string has been split up by separating it along a regular expression.
x <- c("a b c", "1 2")
strsplit(x, " ")
# [[1]]
# [1] "a" "b" "c"
#
# [[2]]
# [1] "1" "2"
Since even a single string returns a list, you often have to extract the element you want as a vector for further use.
corvidae <- c("crow:jackdaw:jay:magpie:raven:rook")
strsplit(corvidae, ":")
unlist(strsplit(corvidae, ":"))
strsplit(corvidae, ":")[[1]]
# Consider:
length(strsplit(corvidae, ":"))
length(unlist(strsplit(corvidae, ":")))
strsplit() is immensely useful to extract elements from strings with a relatively well defined structure.
s <- "1, 1, 2, 3, 5, 8"
strsplit(s, ", ")[[1]] # split on comma-space
s <- "~`!@#$%^&*()_-=+[{]}\\|;:',<.>/?"   # note: the backslash must be escaped
strsplit(s, "")[[1]] # split on empty string
s <- "chronological lifespan:\tincreased\ncold sensitivity:\tincreased\nsporulation:\tnormal"
strsplit(s, "\\t|\\n")[[1]] # split on tab or newline
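Because strsplit() is vectorized and returns a list, structured text can be taken apart in two passes: first into records, then into fields. The following is a minimal sketch that turns the phenotype string from above into a named vector; the variable names are chosen for illustration only.

```r
s <- "chronological lifespan:\tincreased\ncold sensitivity:\tincreased\nsporulation:\tnormal"

# Pass 1: one record per line
recs <- strsplit(s, "\n")[[1]]

# Pass 2: split every record on colon-tab. The result is a list with one
# character vector (key, value) per record.
fields <- strsplit(recs, ":\t")

# Collect the second field of each record, and name it with the first
phenotypes <- sapply(fields, function(x) x[2])
names(phenotypes) <- sapply(fields, function(x) x[1])

phenotypes["sporulation"]     # "normal"
```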
Behaviour
Capturing matches
Finding and returning matches in R is a two-step process. (1) find matches with regexpr() (one match), gregexpr() (all matches), or regexec() (sub-expressions in parentheses). All of these return a "match object". (2) use the match object to extract the matching substrings from the original string.
# Extracting gene names in text.
# Let's define a valid gene name to be a substring that is bounded by
# word-boundaries, starts with an upper-case character, contains more upper-case
# characters or numbers or a hyphen or underscore, with a minimal length of 3.
# Here is a regex, and we put the part of the string that we want to recover in
# parentheses:
patt <- "\\b([A-Z][A-Z0-9\\-_]+[A-Z0-9])\\b"
# Test: positives
grepl(patt, "MBP1")
grepl(patt, "AAT")
grepl(patt, " AI1")
grepl(patt, "ASP3-1 ")
grepl(patt, " AI5_ALPHA; ")
grepl(patt, " (TY1B-PR3) ")
# Test: negatives
grepl(patt, "G1") # Too short
grepl(patt, "G1-") # Hyphen at end
grepl(patt, "Cell") # contains lower-case
# Let's apply this to retrieve gene names in text
s <- "Transcriptional activation of CLN1, CLN2, and a putative new G1 cyclin (HCS26) by SWI4, a positive regulator of G1-specific transcription. Cell 66(5):1015-26"
(m <- regexpr(patt, s)) # found a match in position 31
regmatches(s, m) # retrieve it
(m <- gregexpr(patt, s)) # found all matches
regmatches(s, m) # retrieve them (note, this is a list)
# The function of choice however is regexec(). It returns whatever the pattern
# has defined in parentheses, the others return the entire match. The
# parentheses are quite important, because we might want to specify additional
# context for a valid match, but we might not want the context in the match
# itself. In our example we used word boundaries - \\b - for such context; but
# these are zero-length and don't actually match a character, so they don't
# contaminate the substring anyway. But in general we need to be able to
# precisely define what we want.
(m <- regexec(patt, s)) # only the parenthesized substring
regmatches(s, m) # retrieve it
# Note that there are two elements: the first is the whole match, the second
# is the substring that is in parentheses. In our example these are the same.
# Here is an example where they are not:
s <- "Find the last word. And tell me."
(m <- regexec("\\s(\\w+)\\.", s))
regmatches(s, m) # retrieve it
# Unfortunately, in older versions of base R there is no option to capture
# multiple matches: before R 4.1.0, regexec() lacks a corresponding gregexec()...
patt <- "\\b([A-Z][A-Z0-9\\-_]+[A-Z0-9])\\b"
s <- "Transcriptional activation of CLN1, CLN2, and a putative new G1 cyclin (HCS26) by SWI4, a positive regulator of G1-specific transcription. Cell 66(5):1015-26"
# Solution 1 (base R): you can use multiple matches in an sapply()
# statement...
sapply(regmatches(s, gregexpr(patt, s))[[1]],
function(M){regmatches(M, regexec(patt, M))})
# Solution 2 (probably preferred): you can use
# str_match_all() from the very useful library "stringr" ...
if (!require(stringr, quietly=TRUE)) {
install.packages("stringr")
library(stringr)
}
str_match_all(s, patt)
str_match_all(s, patt)[[1]][,2]
# [1] "CLN1" "CLN2" "HCS26" "SWI4"
# Note that str_match_all() handles the match object internally, no need for
# the two-step code.
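R 4.1.0 added a base R function gregexec() that returns all matches together with their parenthesized sub-expressions. The following is a sketch under that assumption; it falls back to the regexec() approach if gregexec() is not available in your installation. The sentence is shortened from the example above, and the hyphen is placed last in the character class so that it does not need to be escaped.

```r
patt <- "\\b([A-Z][A-Z0-9_-]+[A-Z0-9])\\b"
s <- "Activation of CLN1, CLN2, and a G1 cyclin (HCS26) by SWI4."

if (exists("gregexec")) {
  # regmatches() on a gregexec() result gives one matrix per input string:
  # row 1 holds the whole matches, the following rows hold the captured
  # sub-expressions; each column is one match.
  hits <- regmatches(s, gregexec(patt, s))[[1]]
  genes <- as.character(hits[2, ])
} else {
  # Fallback for older R versions: find the whole matches first, then apply
  # regexec() to each of them and keep the first captured sub-expression.
  whole <- regmatches(s, gregexpr(patt, s))[[1]]
  genes <- vapply(regmatches(whole, regexec(patt, whole)),
                  function(x) x[2], character(1))
}
genes
```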
An interesting new alternative/complement to the base R regex libraries is the package "ore" that uses the Oniguruma libraries and supports multiple character encodings.
if (!require(ore)) {
install.packages("ore")
library(ore)
}
S <- "The quick brown fox jumps over a lazy dog"
ore.search(". .", S)
ore.search(". .", S, all=TRUE)
M <- ore.search(". .", S, all=TRUE)
M$nMatches
M$match[2:4]
According to the author Jon Clayden, key advantages include:
- Search results focus around the matched substrings (including parenthesised groups), rather than the locations of matches. This saves extra work with "substr" or similar to extract the matches themselves.
- Substantially better performance, especially when matching against long strings.
- Substitutions can be functions as well as strings.
- Matches can be efficiently obtained over only part of the strings.
- Fewer core functions, with more consistent names.
Modifiers
A number of modifiers can be applied as arguments to regular expression functions that may be useful. Here are the two most important ones.
# Option "ignore.case" allows case-insensitive matches. This is usually
# poor programming style, a more explicit (= better) way is to define your
# character classes appropriately.
patt <- "\\b([A-Z][A-Z0-9\\-_]+[A-Z0-9])\\b"
s <- "The MBP1 gene encodes the Mbp1 protein."
m <- gregexpr(patt, s)
regmatches(s, m)[[1]]
m <- gregexpr(patt, s, ignore.case = TRUE)
regmatches(s, m)[[1]]
# For regex functions in the stringr package, you can compile the pattern
# with the regex() function, and include the option "comments = TRUE". This
# allows you to insert whitespace and # characters into the pattern
# which will be ignored by the regex engine. Thus you can comment
# complex regular expressions inline.
library(stringr)
myRegex <- regex("\\b # word boundary
( # begin capture
[A-Z] # one uppercase letter
[A-Z0-9\\-_]+ # one or more letters, numbers, hyphen or
# underscore
[A-Z0-9] # one letter or number.
# Note: this captured subexpression has a minimum length of 3.
) # end capture
\\b", # word boundary
comments = TRUE)
str_match_all(s, myRegex)[[1]][2]
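To illustrate the remark about ignore.case, here is a sketch that compares three variants on the same sentence: the strict upper-case pattern, a pattern whose character classes explicitly allow lower-case letters after the first character, and the strict pattern with ignore.case = TRUE. The names strict and mixed are made up for this example; the hyphen is placed last in the character class so it does not need escaping, and perl = TRUE is passed to be sure of the dialect.

```r
s <- "The MBP1 gene encodes the Mbp1 protein."

strict <- "\\b([A-Z][A-Z0-9_-]+[A-Z0-9])\\b"        # upper-case only
mixed  <- "\\b([A-Z][A-Za-z0-9_-]+[A-Za-z0-9])\\b"  # upper-case first letter,
                                                    # then either case

regmatches(s, gregexpr(strict, s, perl = TRUE))[[1]]
# "MBP1"

regmatches(s, gregexpr(mixed, s, perl = TRUE))[[1]]
# "The" "MBP1" "Mbp1"

regmatches(s, gregexpr(strict, s, perl = TRUE, ignore.case = TRUE))[[1]]
# every word of three or more characters
```

The explicit character classes state exactly where lower-case letters are tolerated. Neither relaxed variant distinguishes "The" from a gene name, which is a limitation of the pattern, not of the modifier.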
Greed
By default, quantifiers (*, +, {n,m}, and also ?) are "greedy", i.e. they will match the largest possible number of characters. For example:
s <- "abc123"
patt <- "(\\w+)(\\d+)" # word characters, followed by digits
str_match_all(s, patt)[[1]][-1]
# yields "abc12" and "3" . This is because \w+ is greedy and grabs as many alphanumeric characters as it can before \d+ gets a chance to match. A "?" after a quantity specifier makes it non-greedy, therefore ...
patt <- "(\\w+?)(\\d+)"
str_match_all(s, patt)[[1]][-1]
# ... now \d+ gets a chance to match as many digits as possible
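The same behaviour can be observed with base R functions alone. This is a minimal sketch; perl = TRUE is passed to be sure of the dialect, and the HTML-like string is made up for illustration.

```r
s <- "abc123"

# Greedy: \w+ takes as much as it can and gives back only one digit
regmatches(s, regexec("(\\w+)(\\d+)", s, perl = TRUE))[[1]]
# "abc123" "abc12" "3"

# Non-greedy: \w+? takes as little as it can
regmatches(s, regexec("(\\w+?)(\\d+)", s, perl = TRUE))[[1]]
# "abc123" "abc" "123"

# A typical case where greed surprises: matching a single tag
x <- "<b>bold</b>"
regmatches(x, regexpr("<.+>",  x, perl = TRUE))   # "<b>bold</b>"
regmatches(x, regexpr("<.+?>", x, perl = TRUE))   # "<b>"
```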
Regular Expressions in other languages
PHP
<?php
$string = "The quick brown fox jumps over a lazy dog";
$words = preg_split('/\s+/', $string);
print_r($words);
preg_match('/.\W./', $string, $matches);
print_r($matches);
preg_match_all('/.\W./', $string, $matches);
print_r($matches);
#indexed preg_replace, iterates over array elements
$pat = array(); #broken
$pat[0] = '/quick brown/';
$pat[1] = '/fox/';
$pat[2] = '/lazy/';
$pat[3] = '/dog/';
$rep = array();
$rep[0] = 'lazy';
$rep[1] = 'dog';
$rep[2] = 'quick brown';
$rep[3] = 'fox';
print(preg_replace($pat, $rep, $string));
print("\n");
$pat = array();
$pat[0] = '/quick brown fox/';
$pat[1] = '/lazy dog/';
$pat[2] = '/foo/';
$pat[3] = '/bar/';
$rep = array();
$rep[0] = 'foo';
$rep[1] = 'bar';
$rep[2] = 'lazy dog';
$rep[3] = 'quick brown fox';
print(preg_replace($pat, $rep, $string));
print("\n");
?>
Python
Python regular expressions are provided through the module re. See here for documentation.
The re matching functions operate on a string and return a MatchObject (or None if nothing matched). The MatchObject is then further analyzed by supplied methods.
The most frequently used functions are:
- re.match(pattern, string) matches only at the beginning of the string.
- re.search(pattern, string) matches anywhere in the string.
- re.split(pattern, string) returns the split string as a list.
- re.findall(pattern, string) returns all matches in a list.
Example
Download this .svg file to experiment.
# parse_SVG_example.py
# Read an svg file line by line and process path data
# to write commands separately to an output file, line by line.
import re
filePath = "/my/working/directory/whatever/"
myIn = filePath + "sample.svg"
myOut = filePath + "test.svg"
IN = open(myIn)
OUT = open(myOut, "w")
for line in IN:
    # Raw strings (r'...') keep the backslashes intact for the regex engine.
    path = re.search(r'\sd="(.*?)"', line)  # returns the MatchObject "path"
    if path:
        # Found. Process the result with a second regex.
        # path.group() is a method of the MatchObject
        pathData = re.findall(r'([aAcChHlLmMqQsStTvVzZ]|-?\d*\.?\d+)',
                              path.group(1))
        # Write it nicely formatted to output, one command per line
        OUT.write("d=\"")
        s = ""  # we accumulate output lines in this variable
        for token in pathData:
            if re.match('[aAcChHlLmMqQsStTvVzZ]', token):
                # it's a letter:
                OUT.write("\n    " + s)  # flush s to output
                s = token + " "  # new s
            else:
                s = s + token + " "  # append to s
        OUT.write("\n    " + s + "\"\n")  # flush s, close string, and add \n
    else:
        OUT.write(line)
IN.close()
OUT.close()
Javascript
Javascript is probably the most convenient choice if you want to interact directly with the DOM (Document Object Model) of a Web page. Here is example code for a "bookmarklet" that rewrites the URL of a journal-article page for access from outside the UofT network.
javascript:(function(){
var url=window.location.href;
var re=/\/([\w.]+)\/(.*$)/;
var match=url.match(re);
var newURL="http://"+match[1]+".myaccess.library.utoronto.ca/"+match[2];
window.location.href=newURL;
})();
void 0
POSIX (Unix, the bash shell)
Use in:
grep
grep finds patterns in files. Patterns are regular expressions and can come in basic or extended flavors. In GNU grep there is no difference in available functionality between these; in implementations where there is, you switch from basic to extended syntax with the grep -E flag, which is the same as invoking egrep.
- Example: what daemons run on your system?
ps -ax | egrep -o "/([^A-Z]\w+d)\b" | sort -u
Other uses of regular expressions in:
find
sed
awk
cut
... see the man pages.
Practice
Task:
- Open RStudio and load the ABC-units R project. If you have loaded it before, choose File → Recent projects → ABC-Units. If you have not loaded it before, follow the instructions in the RPR-Introduction unit.
- Choose Tools → Version Control → Pull Branches to fetch the most recent version of the project from its GitHub repository with all changes and bug fixes included.
- Type init() if requested.
- Open the file RPR-RegEx.R and follow the instructions.
Note: take care that you understand all of the code in the script. Evaluation in this course is cumulative and you may be asked to explain any part of code.
Appendix I: Metacharacters and their meaning
Expression | Meaning |
---|---|
\ | Escape character |
| | Alternation character. Matches either one of specified alternatives. For example, /Asp|Glu/i matches ASP, Asp, asp, GLU, Glu or glu. |
^ | If the caret occurs at the beginning of an expression it anchors the expression at the beginning of a line or input. For example, /^AT/ does not match the 'AT' in "HETATM" but does match it in "ATOM". If the caret occurs as the first character of a character set as in [^a-z] it specifies the complement of the character set. Everywhere else, it simply matches the character "^". |
$ | Matches end of input or line. For example, /t$/ does not match the 't' in "eater", but does match it in "eat" |
* | Matches the preceding character 0 or more times. For example, /bo*/ matches 'boooo' in "A ghost booooed" and 'b' in "A bird warbled", but nothing in "A goat grunted". |
+ | Matches the preceding character 1 or more times. Equivalent to {1,} . For example, /a+/ matches the 'a' in "candy" and all the a's in "caaaaaaandy." |
? | Matches the preceding character 0 or 1 time. For example, /e?le?/ matches the 'el' in "angel" and the 'le' in "angle." |
. | (The decimal point) matches any single character except the newline character. |
(x) | Matches 'x' and remembers the match. For example, /(foo) bar/ matches "foo bar" and stores 'foo' in the special variable $1. /(more) (joy)/ matches "more joy", then stores 'more' in $1 and 'joy' in $2. |
{n} | Where n is a positive integer. Matches exactly n occurrences of the preceding character. For example, /a{2}/ doesn't match the 'a' in "candy," but it matches all of the a's in "caandy," and the first two a's in "caaandy." |
{n,} | Where n is a positive integer. Matches at least n occurrences of the preceding character. For example, /a{2,}/ doesn't match the 'a' in "candy", but matches all of the a's in "caandy" and in "caaaaaaandy." |
{n,m} | Where n and m are positive integers. Matches at least n and at most m occurrences of the preceding character. For example, /a{1,3}/ matches nothing in "cndy", the 'a' in "candy," the first two a's in "caandy," and the first three a's in "caaaaaaandy" Notice that when matching "caaaaaaandy", the match is "aaa", even though the original string had more a's in it. |
[xyz] | A character set. Matches any one of the enclosed characters. You can specify a range of characters by using a hyphen. For example, [bcd] is the same as [b-d] . They match the 'c' in "cysteine" and the 'd' in "ached" . |
[^xyz] | A negated or complemented character set. That is, it matches anything that is not enclosed in the brackets. Note that the caret has to be the first character in the bracket set. For example, [^abc] is the same as [^a-c] . They initially match 'l' in "alanine" and 'y' in "cysteine" |
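The examples in the table are written in the /pattern/ notation of Perl and JavaScript. The following sketch checks a few of them in R. The helper function firstMatch() is made up for this illustration and is not part of any package; it returns the first match in each string, or NA if there is none.

```r
# Hypothetical helper: first match of patt in each element of x, NA if none
firstMatch <- function(patt, x) {
  m <- regexpr(patt, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[as.integer(m) > 0] <- regmatches(x, m)  # regmatches() skips non-matches
  out
}

# "c", followed by 0, 1, 2 or 7 "a", followed by "ndy"
x <- c("cndy", "candy", "caandy", paste0("c", strrep("a", 7), "ndy"))

firstMatch("a{2}", x)      # NA  NA  "aa" "aa"
firstMatch("a{2,}", x)     # NA  NA  "aa" "aaaaaaa"
firstMatch("a{1,3}", x)    # NA  "a" "aa" "aaa"
firstMatch("a+", x)        # NA  "a" "aa" "aaaaaaa"

firstMatch("[^a-c]", c("alanine", "cysteine"))   # "l" "y"
firstMatch("^AT", c("HETATM", "ATOM"))           # NA  "AT"
```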
Appendix II: Character classes and their meaning
Expression | Meaning |
---|---|
[\b] | Matches a backspace. (Not to be confused with \b .) |
\b | Matches a word boundary, i.e. the zero-length position between a word character and a non-word character such as a space, or the start or end of the string. (Not to be confused with [\b].) The difference is that one is within a character set and the other is not! For example, /\bn\w/ matches the 'no' in "noonday"; /\wy\b/ matches the 'ly' in "possibly yesterday." |
\B | Matches a non-word boundary. For example, /\w\Bn/ matches 'on' in "noonday", and /y\B\w/ matches 'ye' in "possibly yesterday." |
\cX | Where X is a control character. Matches a control character in a string. For example, /\cM/ matches control-M in a string. |
\d | Matches a digit character. Equivalent to [0-9] . For example, /\d/ or /[0-9]/ matches '2' in "B2 is the suite number." |
\D | Matches any non-digit character. Equivalent to [^0-9] . For example, /\D/ or /[^0-9]/ matches 'B' in "B2 is the suite number." |
\f | Matches a form-feed. |
\n | Matches a linefeed. |
\r | Matches a carriage return. |
\s | Matches a single white space character, including space, tab, form feed, line feed. Equivalent to [ \f\n\r\t\v] . For example, /\s\w*/ matches ' bar' in "foo bar." |
\S | Matches a single character other than white space. Equivalent to [^ \f\n\r\t\v] . For example, /\S\w*/ matches 'foo' in "foo bar." |
\t | Matches a tab |
\v | Matches a vertical tab. |
\w | Matches any alphanumeric character including the underscore. Equivalent to [A-Za-z0-9_] . For example, /\w/ matches 'a' in "apple," '5' in "$5.28," and '3' in "3D." |
\W | Matches any non-word character. Equivalent to [^A-Za-z0-9_] . For example, /\W/ or /[^A-Za-z0-9_]/ matches '%' in "50%." |
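When these character classes are used in R, remember that the pattern is an R string first: every backslash has to be doubled so that a single backslash reaches the regex engine. The following sketch reproduces some of the table's examples; perl = TRUE is passed to be sure of the dialect.

```r
nchar("\\d")   # 2: the R string contains one backslash and the letter d

x <- "B2 is the suite number."
regmatches(x, regexpr("\\d", x, perl = TRUE))      # "2"
regmatches(x, regexpr("\\D", x, perl = TRUE))      # "B"

y <- "foo bar."
regmatches(y, regexpr("\\s\\w*", y, perl = TRUE))  # " bar"
regmatches(y, regexpr("\\S\\w*", y, perl = TRUE))  # "foo"

z <- "possibly yesterday."
regmatches(z, regexpr("\\wy\\b", z, perl = TRUE))  # "ly"
regmatches(z, regexpr("y\\B\\w", z, perl = TRUE))  # "ye"

regmatches("50%.", regexpr("\\W", "50%.", perl = TRUE))   # "%"
```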
Appendix III: Anchor codes and their meaning
Expression | Meaning |
---|---|
^ | If the caret occurs at the beginning of an expression it anchors the expression at the beginning of a line or input. For example, /^AT/ does not match the 'AT' in "HETATM" but does match it in "ATOM". |
$ | Matches end of input or line. For example, /t$/ does not match the 't' in "eater", but does match it in "eat" as well as in "eat\n". |
\b | Matches a word boundary, i.e. the zero-length position between a word character and a non-word character such as a space, or the start or end of the string. (Not to be confused with [\b].) The difference is that one is within a character set and the other is not! For example, /\bn\w/ matches the 'no' in "noonday"; /\wy\b/ matches the 'ly' in "possibly yesterday." |
\B | Matches a non-word boundary. For example, /\w\Bn/ matches 'on' in "noonday", and /y\B\w/ matches 'ye' in "possibly yesterday." |
\A | Matches at the start of a string. Like "^". For example, /\AAT/ matches "AT" in "ATOM" but not in "HETATM" |
\Z | Matches at the end of a string. Like "$". For example, /\t\Z/ matches a tab at the end of the string but not anywhere else. |
(?: … ) | Group what's between the brackets, but do not capture the match. |
(?= … ) | The preceding pattern must be followed by this one in order to match. |
(?! … ) | The preceding pattern must not be followed by this one in order to match. |
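The last three constructs are Perl extensions, so in R they should be used with perl = TRUE. The following is a minimal sketch; the string of gene names is chosen for illustration only.

```r
x <- "MBP1 SWI4 SWI6"

# Lookahead: "SWI" only if a "4" follows. The "4" is not part of the match.
m <- gregexpr("SWI(?=4)", x, perl = TRUE)
regmatches(x, m)[[1]]       # "SWI"
as.integer(m[[1]])          # 6: the match starts at the "S" of "SWI4"

# Negative lookahead: "SWI" plus one digit, but only if that digit is not "4"
regmatches(x, gregexpr("SWI(?!4)\\d", x, perl = TRUE))[[1]]   # "SWI6"

# Non-capturing group: (?:MBP|SWI) groups the alternatives but only the
# digit in (\\d) is reported as a captured sub-expression.
regmatches(x, regexec("(?:MBP|SWI)(\\d)", x, perl = TRUE))[[1]]
# "MBP1" "1"
```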
Appendix IV: Modifiers and their meaning
Expression | Meaning |
---|---|
g | Matches globally - i.e. matches all occurrences of pattern, one after the other, do not stop at the first one. |
i | Match in a case-insensitive manner. For example, /[ACGT]/i matches any specific nucleotide in upper or lower case. |
x | Ignore whitespace in the expression |
o | Evaluate pattern only once. |
m | Treat the whole string as multiple lines. |
s | Treat the whole string as a single line, i.e. don't treat "\n" as line separators. For example, /(<table>.*?</table>)/s captures an entire table, including newline characters. Without the modifier nothing would match if there is even a single newline in between the tags. |
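R has no trailing modifier letters like /.../i. As a rough guide, and as a sketch rather than a complete mapping: "g" corresponds to choosing gsub() or gregexpr() over sub() or regexpr(), "i" corresponds to the argument ignore.case = TRUE, and with perl = TRUE the modifiers i, s, m and x can also be written inline at the start of the pattern as (?i), (?s), (?m) and (?x). The strings below are made up for illustration.

```r
# i: case-insensitive
grepl("[ACGT]", "acgt")                          # FALSE
grepl("[ACGT]", "acgt", ignore.case = TRUE)      # TRUE
grepl("(?i)[ACGT]", "acgt", perl = TRUE)         # TRUE

# s: let "." match newline characters too
html <- "<table>\n<tr><td>1</td></tr>\n</table>"
grepl("<table>.*?</table>", html, perl = TRUE)       # FALSE
grepl("(?s)<table>.*?</table>", html, perl = TRUE)   # TRUE

# m: let ^ and $ match at the start and end of every line
txt <- "ATOM\nHETATM"
grepl("^\\w+$", txt, perl = TRUE)                            # FALSE
regmatches(txt, gregexpr("(?m)^\\w+$", txt, perl = TRUE))[[1]]
# "ATOM" "HETATM"

# x: ignore whitespace in the pattern
grepl("(?x) A T G", "ATG", perl = TRUE)          # TRUE
```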
A Brief First Encounter of Regular Expressions
One of the R functions that uses regular expressions is the function gsub(). It replaces characters that match a "regex" with other characters. That is useful for our purpose: we can
- match all characters that are NOT a letter, and
- replace them by - nothing: the empty string "".
This deletes them.
Task:
- study the code in the "An excursion into regular expressions" section of the R script
Further reading, links and resources
Notes
Self-evaluation
If in doubt, ask! If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.
About ...
Author:
- Boris Steipe <boris.steipe@utoronto.ca>
Created:
- 2017-08-05
Modified:
- 2017-10-01
Version:
- 1.0
Version history:
- 1.0 First live version, translated from Perl examples in old version
- 0.1 First stub
This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.