Difference between revisions of "RPR-RegEx"

From "A B C"
 
(12 intermediate revisions by the same user not shown)
Line 1: Line 1:
<div id="BIO">
+
<div id="ABC">
  <div class="b1">
+
<div style="padding:5px; border:1px solid #000000; background-color:#b3dbce; font-size:300%; font-weight:400; color: #000000; width:100%;">
 
Regular Expressions (regex) with R
 
Regular Expressions (regex) with R
  </div>
+
<div style="padding:5px; margin-top:20px; margin-bottom:10px; background-color:#b3dbce; font-size:30%; font-weight:200; color: #000000; ">
 +
(Regular expressions)
 +
</div>
 +
</div>
 +
 
 +
{{Smallvspace}}
 +
 
  
  {{Vspace}}
+
<div style="padding:5px; border:1px solid #000000; background-color:#b3dbce33; font-size:85%;">
 
+
<div style="font-size:118%;">
<div class="keywords">
+
<b>Abstract:</b><br />
<b>Keywords:</b>&nbsp;
+
<section begin=abstract />
Regular expressions
+
Using regular expressions is a key skill for all aspects of data science, and, by extension, for bioinformatics. There is a bit of a learning curve involved, but once you are even slightly comfortable with their syntax and use in code, you will appreciate their power in automating otherwise unbearably tedious tasks. This unit introduces the principles and the syntax, and provides a fair amount of practice.
 +
<section end=abstract />
 +
</div>
 +
<!-- ============================  -->
 +
<hr>
 +
<table>
 +
<tr>
 +
<td style="padding:10px;">
 +
<b>Objectives:</b><br />
 +
This unit will ...
 +
* ... introduce regular expressions;
 +
* ... demonstrate their use in R functions;
 +
* ... teach how to apply them in common tasks.
 +
</td>
 +
<td style="padding:10px;">
 +
<b>Outcomes:</b><br />
 +
After working through this unit you ...
 +
* ... can express pattern-matching tasks as regular expressions and correctly use a variety of functions that use them;
 +
* ... are familiar with online regex testing sites that help you troubleshoot your expressions during development;
 +
* ... have written code that uses regular expressions for a variety of purposes.
 +
</td>
 +
</tr>
 +
</table>
 +
<!-- ============================  -->
 +
<hr>
 +
<b>Deliverables:</b><br />
 +
<section begin=deliverables />
 +
<li><b>Time management</b>: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.</li>
 +
<li><b>Journal</b>: Document your progress in your [[FND-Journal|Course Journal]]. Some tasks may ask you to include specific items in your journal. Don't overlook these.</li>
 +
<li><b>Insights</b>: If you find something particularly noteworthy about this unit, make a note in your [[ABC-Insights|'''insights!''' page]].</li>
 +
<section end=deliverables />
 +
<!-- ============================  -->
 +
<hr>
 +
<section begin=prerequisites />
 +
<b>Prerequisites:</b><br />
 +
This unit builds on material covered in the following prerequisite units:<br />
 +
*[[RPR-Introduction|RPR-Introduction (Introduction to R)]]
 +
<section end=prerequisites />
 +
<!-- ============================  -->
 
</div>
 
</div>
  
{{Vspace}}
+
{{Smallvspace}}
 +
 
 +
 
 +
 
 +
{{Smallvspace}}
  
  
Line 19: Line 67:
  
  
{{STUB}}
+
=== Evaluation ===
 +
<b>Evaluation: NA</b><br />
 +
<div style="margin-left: 2rem;">This unit is not evaluated for course marks.</div>
 +
== Contents ==
 +
 
 +
==First steps==
 +
 
 +
A {{WP|Regular expression|Regular Expression}} is a specification of a pattern of characters. The typical use of a regular expression is to find occurrences of the pattern in a string. Regular expressions have a flexible syntax that allows them to handle a range of tasks - from trivial substring matches to complex nested motifs. The syntax of regular expressions is a programming language in its own right, and is a powerful way of concisely and uniquely defining a pattern.
  
{{Vspace}}
+
Regular expressions are examples of '''deterministic pattern matching''' - they either match a particular expression or not. This is in contrast to probabilistic pattern matching in which a pattern is ''more or less'' similar to a query.
 +
 
 +
Truth be told, many programmers have a love-hate relationship with regular expressions. The syntax of regular expressions is very powerful and expressive, but also terse, not always intuitive, and sometimes hard to understand. Let's try a few simple things:
  
 +
Here is a string to play with: the sequence of Mbp1, copied from the [https://www.ncbi.nlm.nih.gov/protein/NP_010227 NCBI Protein database page for yeast Mbp1].
 +
 +
        1 msnqiysary sgvdvyefih stgsimkrkk ddwvnathil kaanfakakr trilekevlk
 +
      61 ethekvqggf gkyqgtwvpl niakqlaekf svydqlkplf dftqtdgsas pppapkhhha
 +
      121 skvdrkkair sastsaimet krnnkkaeen qfqsskilgn ptaaprkrgr pvgstrgsrr
 +
      181 klgvnlqrsq sdmgfprpai pnssisttql psirstmgpq sptlgileee rhdsrqqqpq
 +
      241 qnnsaqfkei dledglssdv epsqqlqqvf nqntgfvpqq qssliqtqqt esmatsvsss
 +
      301 pslptspgdf adsnpfeerf pgggtspiis miprypvtsr pqtsdindkv nkylsklvdy
 +
      361 fisnemksnk slpqvllhpp phsapyidap idpelhtafh wacsmgnlpi aealyeagts
 +
      421 irstnsqgqt plmrsslfhn sytrrtfpri fqllhetvfd idsqsqtvih hivkrksttp
 +
      481 savyyldvvl skikdfspqy rielllntqd kngdtalhia skngdvvffn tlvkmgaltt
 +
      541 isnkegltan eimnqqyeqm miqngtnqhv nssntdlnih vntnnietkn dvnsmvimsp
 +
      601 vspsdyityp sqiatnisrn ipnvvnsmkq masiyndlhe qhdneikslq ktlksisktk
 +
      661 iqvslktlev lkesskdeng eaqtnddfei lsrlqeqntk klrkrliryk rlikqkleyr
 +
      721 qtvllnklie detqattnnt vekdnntler lelaqeltml qlqrknklss lvkkfednak
 +
      781 ihkyrriire gtemnieevd ssldvilqtl iannnknkga eqiitisnan sha
 +
//
  
</div>
 
<div id="ABC-unit-framework">
 
== Abstract ==
 
<!-- included from "../components/RPR-RegEx.components.wtxt", section: "abstract" -->
 
...
 
  
{{Vspace}}
+
{{task|1=
  
 +
Navigate to http://regexpal.com and paste the sequence into the '''lower''' box. This site is one of a number of online regular expression testers; their immediate, visual feedback is invaluable when you are developing regular expression patterns.
  
== This unit ... ==
+
Let's try some expressions:
=== Prerequisites ===
 
<!-- included from "../components/RPR-RegEx.components.wtxt", section: "prerequisites" -->
 
<!-- included from "ABC-unit_components.wtxt", section: "notes-prerequisites" -->
 
You need to complete the following units before beginning this one:
 
*[[RPR-Introduction]]
 
  
{{Vspace}}
+
;Most characters are matched literally.
 +
:Type "<code>a</code>" into the '''upper''' box and you will see all "<code>a</code>" characters matched. Then replace <code>a</code> with <code>q</code>.
 +
: Now type "<code>aa</code>" instead. Then <code>krnnkk</code>. ''Sequences'' of characters are also matched literally.
  
 +
;The pipe character {{pipe}} that symbolizes logical OR can be used to define that more than one character should match:
 +
:<code>i(s{{pipe}}m{{pipe}}q)n</code> matches <code>isn</code> OR <code>imn</code> OR <code>iqn</code>. Note how we can group with parentheses, and try what would happen without them.
  
=== Objectives ===
+
;We can more conveniently specify more than one character to match if we place them in square brackets. This is a "character class". We will encounter those frequently.
<!-- included from "../components/RPR-RegEx.components.wtxt", section: "objectives" -->
+
:<code>[lq]</code> matches <code>l</code> OR <code>q</code>. <code>[milcwyf]</code> matches hydrophobic amino acids.
...
 
  
{{Vspace}}
+
;Within square brackets, we can specify "ranges".
 +
:<code>[1-5]</code> matches digits from 1 to 5.
  
 +
;Within square brackets, we can specify characters that should NOT be matched, with the "caret", <code>^</code>.
 +
:<code>[^0-9]</code> matches everything EXCEPT digits. <code>[^a-z]</code> matches everything that is not a lower-case letter. That's what we would need to remove characters that do not represent amino acids. Note that '''outside''' of the square brackets the caret means "beginning of the string". When you see a caret, you need to consider its context carefully.
  
=== Outcomes ===
+
}}
<!-- included from "../components/RPR-RegEx.components.wtxt", section: "outcomes" -->
 
...
 
  
 
{{Vspace}}
 
{{Vspace}}
  
 
+
Make frequent use of this site to develop your regular expressions step by step.
=== Deliverables ===
 
<!-- included from "../components/RPR-RegEx.components.wtxt", section: "deliverables" -->
 
<!-- included from "ABC-unit_components.wtxt", section: "deliverables-time_management" -->
 
*<b>Time management</b>: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
 
<!-- included from "ABC-unit_components.wtxt", section: "deliverables-journal" -->
 
*<b>Journal</b>: Document your progress in your [[FND-Journal|course journal]].
 
<!-- included from "ABC-unit_components.wtxt", section: "deliverables-insights" -->
 
*<b>Insights</b>: If you find something particularly noteworthy about this unit, make a note in your [[ABC-Insights|insights! page]].
 
  
 
{{Vspace}}
 
{{Vspace}}
  
 +
===Theory===
  
=== Evaluation ===
+
According to the {{WP|Chomsky hierarchy}} regular expressions are a {{WP|Regular grammar|Type-3 (regular) grammar}}, thus their use forms a {{WP|regular language}}. Therefore, like all Type-3 grammatical expressions they can be decided by a {{WP|finite-state machine}}, ''i.e.'' a "machine" that is defined by possible states, plus triggering conditions that control transitions between states. Think of such automata as (elaborate) <code>if ... else</code> constructs. The "regex" processor translates the search pattern into such an automaton, which is then applied to the search domain - the string in which the occurrence of the pattern is to be sought.
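To make the "elaborate <code>if ... else</code>" picture concrete, here is a minimal sketch of a hand-written state machine for the pattern <code>^ab*c$</code>. The function name <code>matchesABC</code> and the state labels are made up for this illustration; an actual regex engine constructs its automaton internally, and not necessarily in this form.

```r
# Hypothetical hand-coded finite-state machine for the pattern "^ab*c$".
# States: "start" -> "seenA" -> "done"; any unexpected character rejects.
matchesABC <- function(s) {
  state <- "start"
  for (ch in strsplit(s, "")[[1]]) {
    if (state == "start" && ch == "a") {
      state <- "seenA"
    } else if (state == "seenA" && ch == "b") {
      state <- "seenA"          # loop: any number of "b"
    } else if (state == "seenA" && ch == "c") {
      state <- "done"
    } else {
      return(FALSE)             # no valid transition from this state
    }
  }
  state == "done"               # accept only if we ended in the final state
}

tests <- c("ac", "abbbc", "abca", "bc", "ab")
sapply(tests, matchesABC)
grepl("^ab*c$", tests)          # the regex engine should agree
```

Every character triggers exactly one transition, and the string is accepted only if the last transition leaves the machine in the <code>"done"</code> state.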
<!-- included from "../components/RPR-RegEx.components.wtxt", section: "evaluation" -->
 
<!-- included from "ABC-unit_components.wtxt", section: "eval-none" -->
 
<b>Evaluation: NA</b><br />
 
:This unit is not evaluated for course marks.
 
  
 
{{Vspace}}
 
{{Vspace}}
  
 +
===What are they good for===
  
</div>
+
Regular expressions support virtually all pattern matching tasks in data clean-up, extracting information items, data mining, "screen scraping", parsing of files, subsetting large tables, ''etc. etc.'' This means, they must be part of your everyday toolkit.
<div id="BIO">
 
== Contents ==
 
<!-- included from "../components/RPR-RegEx.components.wtxt", section: "contents" -->
 
  
==Regular Expressions==
+
{{Vspace}}
 
 
A {{WP|Regular expression|Regular Expression}} is a specification of a pattern of characters. The typical use of a regular expression is to find occurrences of the pattern in a string. Regular expressions have a flexible syntax that allows them to handle a range of tasks - from trivial substring matches to complex nested motifs. The syntax of regular expressions is a programming language in its own right, and is a powerful way of concisely and uniquely defining a pattern.
 
  
Regular expressions are examples of '''deterministic pattern matching''' - they either match a particular expression or not. This is in contrast to probabilistic pattern matching in which a pattern is more or less similar to an example.
+
===When should they not be used===
 +
Since regular expressions are Type-3 grammars, they must fail when trying to parse more complex grammars - i.e. grammars that can't be expressed in a regular language. This means you can't reliably parse XML - and in particular HTML - with regular expressions. There is a long discussion on this particular topic however, e.g. see [http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags '''here'''], and many other similar threads on stackoverflow, and see [http://programmers.stackexchange.com/questions/113237/when-you-should-not-use-regular-expressions '''here'''] for a discussion of when regular expressions should '''not''' be used. Use a real XML parser instead.
  
*Theory
+
{{Vspace}}
** According to the {{WP|Chomsky hierarchy}} regular expressions are a {{WP|Regular grammar|Type-3 (regular) grammar}}, thus their use forms a {{WP|regular language}}. Therefore, like all Type-3 grammatical expressions they can be decided by a {{WP|finite-state machine}}, ''i.e.'' a "machine" that is defined by possible states, and triggering conditions that control transitions between states. Think of such automata as a (possibly elaborate) <code>if ... else</code> construct. The "regex" processor translates the search pattern into such an automaton, which is then applied to the search domain - the string in which the occurrence of the pattern is to be sought.
 
  
*What are they good for?
+
===Perl and POSIX===
** Most pattern matching tasks in screen scraping, data reformatting, simple parsing of log files, search through large tables, ''etc. etc.'' This means, they ought to be part of your everyday toolkit.
 
  
*When should they not be used; what are alternatives for these cases?
+
Two dialects of regular expressions exist; they differ in some details of syntax. One is the nearly universal "Perl" dialect ({{WP|Perl}} is a programming language), the other is the "POSIX" standard that nearly no one uses. Except R. Tragically, in R the POSIX standard is the default. Fortunately this often does not make a difference, and we can explicitly turn this nonsense off. But we need to type <code>perl = TRUE</code> much more often than we would like. Somebody, some time, made a wrong design decision and thousands of wasted man- and woman-hours later we are still stuck with the consequences. If you use regular expressions according to the POSIX standard, you have to learn the Perl standard anyway. But then you can just use the Perl standard in the first place. The {{WP|Regular expression|Wikipedia page on Regular Expressions}} has a table with a side-by-side comparison of the different ways the two standards express character classes. Also see the help page on <code>regex</code> in R for details.
** Since they are Type-3 grammars, they will fail when trying to parse any more complex grammar. In particular, you can't reliably parse HTML with regular expressions. Use a real XML parser instead. There is a long discussion on this particular topic however, e.g. see [http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags '''here'''], and many other similar threads on stackoverflow, and see [http://programmers.stackexchange.com/questions/113237/when-you-should-not-use-regular-expressions '''here'''] for a discussion of when regular expressions should '''not''' be used.
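As a small illustration of the difference between the Perl and POSIX dialects in R, the following sketch uses a lookahead, which is Perl syntax. The example strings are made up. The assumption here is that R's default engine rejects this pattern with an error; <code>tryCatch()</code> is used so that the script runs either way.

```r
x <- c("ab", "ac")

# Lookahead "(?=...)" is Perl syntax: match "a" only if it is followed by "b".
grepl("a(?=b)", x, perl = TRUE)

# With the default engine the same pattern is expected to fail; tryCatch()
# turns a possible error into NA so that the script keeps running.
tryCatch(grepl("a(?=b)", x),
         error = function(e) NA)

# Many everyday patterns behave the same in both dialects:
grepl("^a[bc]$", x)
grepl("^a[bc]$", x, perl = TRUE)
```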
 
  
 
{{Vspace}}
 
{{Vspace}}
  
{{Vspace}}
+
==Regular Expressions in R==
  
==Regular Expressions in Perl==
+
Regular expressions in R can be used
  
Many programming languages support their own style of regular expressions - the one we are discussing here is the one that Perl uses - although most of its syntax would be the same as that of Unix or PHP regular expressions. The support of regular expressions in Perl is one of its main strengths. Regular expressions in Perl can be used
+
* to match patterns in strings for use in <code>if()</code> or <code>while()</code> conditions, or to retrieve specific instances of patterns with the <code>regexpr()</code> family of functions;
 +
* to substitute occurrences of patterns in strings with other strings with <code>gsub()</code>;
 +
* to split strings into substrings that are delimited by the occurrence of a pattern with <code>strsplit()</code>;
  
* to match patterns in strings for use in <code>if()</code> or <code>while()</code> conditions, or to retrieve specific instances of patterns,
+
...and more.
* to substitute occurrences of patterns with strings,
 
* to translate all occurrences of a pattern into different characters, or
 
* to split strings into substrings that are delimited by the occurrence of a pattern.
 
  
 
Accordingly, a basic knowledge of regular expressions is needed to read and write code, especially code that parses text.
 
Accordingly, a basic knowledge of regular expressions is needed to read and write code, especially code that parses text.
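The typical uses of regular expressions in R (matching, retrieving, substituting, splitting) can be sketched in a few lines. The example string is made up for illustration; the functions are all part of base R.

```r
s <- "Mbp1 binds MCB elements; Swi4 binds SCB elements"

# 1. Match: a logical result for use in if() or while() conditions
if (grepl("Swi[0-9]", s)) {
  cat("found a Swi protein\n")
}

# 2. Retrieve: regexpr() locates the first match, regmatches() extracts it
m <- regexpr("[A-Z][a-z]+[0-9]", s)
regmatches(s, m)

# 3. Substitute: gsub() replaces every occurrence of the pattern
gsub("binds", "recognizes", s)

# 4. Split: strsplit() cuts the string wherever the pattern occurs
strsplit(s, ";\\s*")[[1]]
```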
 
{{Vspace}}
 
  
 
{{Vspace}}
 
{{Vspace}}
  
 
==Syntax==
 
==Syntax==
Regular expressions are formed of characters and/or numbers, enclosed in special quotation marks.
+
Regular expressions in R are strings, thus they are enclosed in quotation marks.
  
/a/
+
<pre>
 +
"a"
 +
</pre>
  
is a regular expression. The lowercase "<code>a</code>" is the expression, the "<code>/</code>" are delimiters that bound the expression. This expression specifies the single character <code>a</code> exactly.
+
is a regular expression. It specifies the single, literal character <code>a</code> exactly.
  
  
 
===Specifying symbols===
 
===Specifying symbols===
The power of regular expressions lies in their flexible syntax that allows to specify character ranges, classes of characters, unspecified characters and much more. This sometimes can be confusing, because the symbols that specify ranges, options, wildcards and the like are of course themselves characters. Characters that specify information about other characters are called metacharacters, they include "<code>.</code>", "<code>*</code>", "<code>[</code>" and "<code>]</code>" and more. And the opposite is also possible: some plain characters can be turned into metacharacters to symbolize character classes.
+
The power of regular expressions lies in their flexible syntax that allows you to specify character ranges, classes of characters, unspecified characters, alternatives, and much more. This sometimes can be confusing, because the symbols that specify ranges, options, wildcards and the like are of course themselves characters. Characters that specify information about other characters are called '''metacharacters'''; these include "<code>.</code>", "<code>?</code>", "<code>+</code>", "<code>*</code>", "<code>[</code>" and "<code>]</code>", "<code>{</code>" and "<code>}</code>" and more. And the opposite is also possible: some plain characters can be turned into metacharacters to denote character classes.
 +
 
 +
The "<code>\</code>" - '''escape character''' - allows us to distinguish when a character is to be taken literally and when it is to be interpreted as a metacharacter. Note that some symbols have to be escaped to be read literally, while some letters have to be escaped to be read as metacharacters.
 +
 
 +
But there is a catch in R, relating to '''when''' the escape character is interpreted. Remember that "<code>\n</code>" is a linebreak in a string, "<code>\t</code>" is a tab, etc. Obviously if you write "<code>\?</code>" (a literal question mark in a regex), or "<code>\+</code>" (a literal plus-sign in a regex) into a regular string, the mechanism that parses the string is going to see the escape character, then it expects an "n" or a "t" or the like - but what it gets instead is something it doesn't know. So it throws an error. Try:
 +
 
 +
<pre>
 +
"\n" # fine
 +
"\?" # Error: ...
 +
</pre>
 +
 
 +
But then how can we write something like "<code>\?</code>" when we need it? That becomes obvious when you consider what happens with the string: it gets sent to the regex engine for interpretation. Thus the regex engine needs to see: character "\", then character "?". So it needs two characters. The secret is: we need to prevent "\" from attaching to the next character, and specify it as a single character in its own right. We do that by '''escaping''' "\" itself - '''with a backslash'''. Thus "<code>\\</code>" is a literal "\" character - and can get sent to the regex engine.
 +
 
 +
<pre>
 +
"\\?" # ok
 +
cat("\\?") # that's what the regex engine sees.
 +
</pre>
  
In Perl the "<code>\</code>" - Perl's escape character - allows to distinguish when a character is to be taken literally and when it is to be interpreted as a metacharacter. Note that some symbols have to be escaped to be read literally, while some letters have to be escaped to be read as metacharacters.
+
The consequence is: in R you need to type "<code>\\</code>" whenever the regex engine should see a single "<code>\</code>". This differs from programming languages that pass patterns to the regex engine as-is. You need to be aware of this, for example when you develop a pattern in an online regex tool and then copy it into your R code: you need to double all occurrences of "<code>\</code>" in your R string.
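The following sketch shows the doubled backslash at work on a small, made-up vector. The raw-string line assumes R version 4.0.0 or later; on older versions that line will not parse and should be removed.

```r
nchar("\\?")                 # the R string holds two characters: \ and ?
cat("\\.", "\n")             # what the regex engine receives

x <- c("done.", "done?", "done")
grepl("\\?", x)              # literal question mark
grepl("\\.", x)              # literal period
grepl(".", x)                # unescaped "." matches any character

# Assumption: R >= 4.0.0, which supports raw strings.
# In a raw string the backslash does not need to be doubled.
grepl(r"(\?)", x)

# fixed = TRUE treats the whole pattern as literal text; no escaping needed.
grepl("?", x, fixed = TRUE)
```

Raw strings can make it easier to copy a pattern from an online tester into R code, since the pattern can then be pasted unchanged.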
  
Letters whose special meaning as a metacharacter is turned on with the escape character:
+
Letters whose special meaning as a metacharacter is turned '''on''' with the escape character:
  
 
<table border="1" cellpadding="5">
 
<table border="1" cellpadding="5">
 
<tr><th>Character</th><th>Means</th></tr>
 
<tr><th>Character</th><th>Means</th></tr>
<tr><td><code></code>w the letter "w"</td></tr>
+
<tr><td><code>w</code></td><td>the letter "w"</td></tr>
<tr><td><code></code>\w a "word" character, ie one of A-Z, a-z, 0-9 and "_"</td></tr>
+
<tr><td><code>\w</code></td><td>a "word" character, i.e. one of A-Z, a-z, 0-9 and "_"</td></tr>
<tr><td><code></code>s the letter "s"</td></tr>
+
<tr><td><code>s</code></td><td>the letter "s"</td></tr>
<tr><td><code></code>\s a "space" character, i.e. one of " ", tab or newline</td></tr>
+
<tr><td><code>\s</code></td><td>a "space" character, i.e. one of " ", tab or newline</td></tr>
<tr><td><code></code>b the letter "b"</td></tr>
+
<tr><td><code>b</code></td><td>the letter "b"</td></tr>
<tr><td><code></code>\b a word boundary</td></tr>
+
<tr><td><code>\b</code></td><td>a word boundary</td></tr>
 
</table>
 
</table>
  
Metacharacters whose special meaning is turned off with the escape character:
+
{{Smallvspace}}
 +
 
 +
Metacharacters whose special meaning is turned '''off''' with the escape character:
  
 
<table border="1" cellpadding="5">
 
<table border="1" cellpadding="5">
 
<tr><th>Character</th><th>Means</th></tr>
 
<tr><th>Character</th><th>Means</th></tr>
 
<tr><td><code>+</code></td><td>One or more repetitions of the preceding expression</td></tr>
 
<tr><td><code>+</code></td><td>One or more repetitions of the preceding expression</td></tr>
<tr><td><code>\+</code></td><td>the character "+"</td></tr>
+
<tr><td><code>\+</code></td><td>the literal character "+"</td></tr>
 
<tr><td><code>\</code></td><td>the escape character</td></tr>
 
<tr><td><code>\</code></td><td>the escape character</td></tr>
<tr><td><code>\\</code></td><td>the character "\"</td></tr>
+
<tr><td><code>\\</code></td><td>the literal character "\"</td></tr>
 
<tr><td><code>.</code></td><td>any single character except the newline (\n)</td></tr>
 
<tr><td><code>.</code></td><td>any single character except the newline (\n)</td></tr>
<tr><td><code>\.</code></td><td>a period</td></tr>
+
<tr><td><code>\.</code></td><td>a literal period</td></tr>
 
</table>
 
</table>
  
 
Note that these examples are not exhaustive. A (nearly) complete table of symbols and meanings is given in the appendix.
 
Note that these examples are not exhaustive. A (nearly) complete table of symbols and meanings is given in the appendix.
  
 +
{{Vspace}}
  
===Character Sets===
+
===Character Classes===
 
Square brackets specify when more than one specific character can match at a position.
 
Square brackets specify when more than one specific character can match at a position.
  
Line 166: Line 236:
  
 
For example:
 
For example:
<code>/[AGR]AATT[CTY]/</code> matches all occurrences of an ApoI restriction site, either specified explicitly, or through the nucleotide ambiguity codes R (purines) or Y (pyrimidines).
+
<code>"[AGR]AATT[CTY]"</code> matches all occurrences of an ApoI restriction site, either specified explicitly, or through the nucleotide ambiguity codes R (purines) or Y (pyrimidines).
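As a minimal sketch, the ApoI pattern can be applied in R with <code>gregexpr()</code> and <code>regmatches()</code>. The DNA fragment below is made up for illustration.

```r
# Made-up DNA fragment containing two ApoI sites (AAATTC and GAATTT)
dna <- "CCGAAATTCGGTTAGAATTTCC"

m <- gregexpr("[AGR]AATT[CTY]", dna)   # all matches, not just the first
regmatches(dna, m)[[1]]                # the matched substrings
as.integer(m[[1]])                     # their start positions in the string
```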
  
 
Within character sets, hyphens can specify character ranges.
 
Within character sets, hyphens can specify character ranges.
Line 172: Line 242:
 
<table border="1" cellpadding="5">
 
<table border="1" cellpadding="5">
 
<tr><th>Expression</th><th>Means</th></tr>
 
<tr><th>Expression</th><th>Means</th></tr>
<tr><td><code>[a-z]</code></td><td>letters</td></tr>
+
<tr><td><code>[a-z]</code></td><td>lowercase letters</td></tr>
 
<tr><td><code>[0-9]</code></td><td>digits</td></tr>
 
<tr><td><code>[0-9]</code></td><td>digits</td></tr>
<tr><td><code>[0-9+*\/=^-]</code></td><td>digits and arithmetic symbols</td></tr>
+
<tr><td><code>[0-9+*/=^\\-]</code></td><td>digits and arithmetic symbols (Note the escaped hyphen)</td></tr>
 
</table>
 
</table>
  
Within character sets, some metacharacters that otherwise have special meanings do not need to be escaped. In the example above, only "/" is escaped, it would otherwise terminate the regular expression. Other characters that need to be escaped include "$", "%" and "@" since the Perl compiler would try to interpolate them as variables.
+
If you want to match a literal hyphen, escape it or, more portably, place it first or last within the square brackets. Within character sets, some metacharacters that otherwise have special meanings usually do not need to be escaped.
  
  
 
===The complement===
 
===The complement===
The caret character "^" denotes the ''complement'' of a character set; i.e. everything that is not that expression.
+
The caret character "^" denotes the ''complement'' of a character set; i.e. everything that is '''not''' that expression.
  
 
<table border="1" cellpadding="5">
 
<table border="1" cellpadding="5">
Line 189: Line 259:
 
</table>
 
</table>
  
Note that outside of character sets, the "^" character denotes "beginning of the string". This can be confusing.
+
Note that outside of square brackets, the "^" character is an "anchoring code" and means "beginning of the string". This can be confusing.
  
For character classes, the class in upper case denotes the complement. This can also be confusing !
+
For many metacharacters that denote character classes, the metacharacter in upper case denotes the complement. This can also be confusing!
  
 
<table border="1" cellpadding="5">
 
<table border="1" cellpadding="5">
 
<tr><th>Character</th><th>Means</th></tr>
 
<tr><th>Character</th><th>Means</th></tr>
<tr><td><code>\W</code></td><td>not a word character</td></tr>
+
<tr><td><code>\w</code></td><td>a word character</td></tr>
<tr><td><code>\S</code></td><td>not a space character</td></tr>
+
<tr><td><code>\W</code></td><td>'''not''' a word character</td></tr>
 +
<tr><td><code>\s</code></td><td>a space character</td></tr>
 +
<tr><td><code>\S</code></td><td>'''not''' a space character</td></tr>
 
</table>
 
</table>
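A short sketch of complements in use: removing everything that is not a lower-case letter from a line in the GenBank layout, and removing non-word characters from a made-up sentence.

```r
# One line in the GenBank layout, with a position number and blanks
raw <- "        1 msnqiysary sgvdvyefih stgsimkrkk"

gsub("[^a-z]", "", raw)                 # keep only lowercase letters
gsub("\\S", "", raw)                    # remove all non-space characters
gsub("\\W", "", "R_2 is (not) bad!")    # keep only word characters
```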
  
Line 205: Line 277:
 
<table border="1" cellpadding="5">
 
<table border="1" cellpadding="5">
 
<tr><th>Expression</th><th>What it means</th><th>Example (meaning)</th></tr>
 
<tr><th>Expression</th><th>What it means</th><th>Example (meaning)</th></tr>
<tr><td><code>?</code></td><td>match zero or one times</td><td>"? (there may or may not be a quote mark)</td></tr>
+
<tr><td><code>?</code></td><td>match zero or one times</td><td><code>"?</code> (there may or may not be a quote mark)</td></tr>
<tr><td><code>+</code></td><td>match one or more</td><td>[A-Z]+ (there's at least one uppercase letter)</td></tr>
+
<tr><td><code>+</code></td><td>match one or more</td><td><code>[A-Z]+</code> (there's at least one uppercase letter)</td></tr>
<tr><td><code>*</code></td><td>match any number</td><td>.* (there may be some characters)</td></tr>
+
<tr><td><code>*</code></td><td>match any number</td><td><code>.*</code> (there may be some characters)</td></tr>
<tr><td><code>{min,max}</code></td><td>match between min and max times (assumes 0 and infinity respectively if not specified)</td><td>[acgt]{20,200} (a stretch of between 20 and 200 non-ambiguous bases)</td></tr>
+
<tr><td><code>{min,max}</code></td><td>match between min and max times (assumes 0 for min, if min is omitted; assumes infinity for max, if max is omitted).</td><td><code>[atAT]{20,200}</code> (a stretch of between 20 and 200 upper- or lowercase As or Ts)</td></tr>
 
</table>
 
</table>
  
 
For example:
 
For example:
<code>/AAUAAA[ACGU]{10,30}$/</code> defines a polyadenylation site - a AAUAAA motif, followed by 10 to 30 of any nucleotide before the end of the RNA.
+
<code>"AAUAAA[ACGU]{10,30}$"</code> defines a polyadenylation site - an AAUAAA motif, followed by 10 to 30 of any nucleotide before the end of the RNA.
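The quantifiers can be tried out in R as follows. This is only a sketch; the RNA strings and the small word vector are made up for illustration.

```r
# Made-up RNA: the AAUAAA signal followed by 12 nucleotides up to the 3' end
rna <- "GGCUAAUAAAGCUAGCUAGCUA"
grepl("AAUAAA[ACGU]{10,30}$", rna)

# Only 4 nucleotides follow the signal: {10,30} cannot be satisfied
grepl("AAUAAA[ACGU]{10,30}$", "GGCUAAUAAAGCUA")

# ?, + and * applied to the letter "u"
x <- c("color", "colour", "colouur")
grepl("^colou?r$", x)    # zero or one "u"
grepl("^colou+r$", x)    # one or more "u"
grepl("^colou*r$", x)    # any number of "u"
```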
  
  
 
===Specifying position (anchoring)===
 
===Specifying position (anchoring)===
If a pattern must be matched at a particular location, special terms denote string anchors.
+
If a pattern must be matched at a particular location in the string, special terms denote string anchors.
  
 
<table border="1" cellpadding="5">
 
<table border="1" cellpadding="5">
Line 227: Line 299:
 
</table>
 
</table>
  
 +
{{Vspace}}
 +
 +
Of course defining a regular expression pattern does not yet do anything with it. Below are the most important R functions that use regular expressions. Write the small code samples that are provided below, play with variations, and test how the operators and regular expressions work.
  
===Operators that use regular expressions===
+
{{Vspace}}
  
Of course specifying a regular expression does not yet do anything with it. Below are the most important Perl operators that use regular expressions. Write the small Perl program samples that are provided below and test how the operators and regular expressions work.
+
===Functions that don't use regular expressions===
  
====Matching====
+
Not all pattern searches in strings use (and need) regular expressions. Sometimes
 +
simple, exact string-matching is enough. R uses string matching in character equality (<code>==</code>) and by extension, the set operation functions (<code>union(), intersect()</code> etc.), the <code>match()</code> function, and the <code>%in%</code> operator.
  
Matching is the default behaviour of Perl regular expressions. The matching operator is
+
<pre>
  
m
+
vA <- c("the", "quick", "brown", "fox", "jumps", "over", "a", "lazy", "dog")
  
and the syntax is
+
vA[2] == "quick"  # TRUE
 +
vA[2] == "quack"  # FALSE
  
  m/&gt;Expression&lt;/&gt;Modifier&lt;
+
vA == "fox" # boolean vector
  
*&gt;Expression&lt; is a regular expression.
+
# match tests for string equality
*&gt;Modifier&lt; is one or more characters from a list of modifiers detailed below.
+
match("fox", vA)  # 4, i.e. the 4th element matches the string
 +
match("o", vA)    # NA: matches have to be to the WHOLE element
  
Since m is the default behaviour for a regular expression in a Perl program ...
+
# match("fox", vA) is equivalent to...
 +
which(vA == "fox")
  
/&gt;Expression&lt;/
+
# %in% can be used for creating intersections
 +
# find whether elements from one vector are
 +
# contained in another:
  
... works the same way.
+
vB <- c("Quacking", "the", "duck", "wings", "over", "my", "cozy", "cot")
  
There is one difference though:  if the m operator is specified the default delimiter "/" can be replaced with any other character, for matching. Thus ...
 
  
<source lang="perl">
+
vA %in% vB
/a/
+
vB %in% vA  # note that the length of the return vector is the same as the
m/a/
+
            # length of the first argument. So read this as:
m:a:
+
            # "Which of my vB are also in vA"
</source>
 
  
... are all valid regular matching operations, but ...
+
# We can use this to subset the vector with elements that are present in
 +
# both:
  
:a:
+
vB[vB %in% vA]
  
... is not.
+
# which is, of course, the intersection set operation.
 +
intersect(vA, vB)
 +
</pre>
  
====The matching (binding) operators =~  and  !~ ====
+
{{Vspace}}
  
The =~ operator makes Perl apply the regular expression on the right to the variable on the left. It returns TRUE if the variable contains the pattern, FALSE otherwise. This can be used in conditional expressions (<code>if (...) { }</code>) while matching (<code>m//</code>), substituting  (<code>s/abc/xyz/</code>) or transposing  (<code>tr/[A-Z]/[a-z]/</code>).
+
===Functions that use regular expressions===
  
<source lang="perl">
+
The general online help page is [http://stat.ethz.ch/R-manual/R-patched/library/base/html/regex.html '''here''']. Remember: R's default behaviour is extended POSIX. To be sure which regex dialect is used, pass the <code>perl = TRUE</code> parameter.
$test =~ /\w/;
 
</source>
 
  
is '''TRUE''' if the variable $test contains word-characters.
+
{{Vspace}}
  
Its inverse is the !~ operator, for example
+
====grep()====
  
<source lang="perl">
+
<!-- for updates, see code in R_Exercise -Bioinformatics "Sequence.R script -->
$line !~ m/^\s*#/;
 
</source>
 
  
 +
<pre>
  
is '''TRUE''' if the string contained in $line does not start with "#", which may or may not be preceded by whitespace. This would be useful to ignore comment lines.
+
# grep() is like match(), but uses regular expressions. A variant of grep() that
 +
# returns a boolean vector - like "==" does - is grepl(). That is useful
 +
# because we can & or | the vector, or invert it with ! .
 +
 
 +
grep("fox", vA)
 +
grep("o", vA) # Aha! now we get all elements that contain an "o" -
 +
              # Because we get partial matches with regular expressions.
 +
vA[grep("o", vA)] # subset
 +
 
 +
grepl("o", vA)    # logical
 +
! grepl("o", vA)  # its inverse
 +
 
 +
vA[! grepl("o", vA)] # subset all words without "o"
 +
 
 +
</pre>
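
The remark about combining logical vectors with <code>&</code> and <code>|</code> can be sketched as follows. The vector is redefined so that the block runs on its own; <code>value = TRUE</code> is a standard <code>grep()</code> argument that returns the matching elements rather than their indices.

<pre>
vA <- c("the", "quick", "brown", "fox", "jumps", "over", "a", "lazy", "dog")

grep("o", vA, value = TRUE)             # the matching elements themselves:
                                        # "brown" "fox" "over" "dog"

# words that contain an "o" AND do not contain an "r"
vA[grepl("o", vA) & ! grepl("r", vA)]   # "fox" "dog"

# words that contain a "q" OR a "z"
vA[grepl("q", vA) | grepl("z", vA)]     # "quick" "lazy"
</pre>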
 +
 
 +
{{Vspace}}
 +
 
 +
====Subsetting example====
 +
 
 +
Consider the following regular expression:
 +
 
 +
<pre>
 +
 
 +
patt <- "^\\s*#"
 +
 
 +
</pre>
 +
 
 +
 
 +
This matches if the string it is applied to begins with a "#", which may or may not be preceded by whitespace. Inverting the match is therefore useful for ignoring comment lines in a data file.
  
 
The regular expression above is decomposed as follows:
 
  
#<code>m</code>&nbsp;&nbsp;&nbsp;the matching operator (optional)
 
#<code>/</code>&nbsp;&nbsp;&nbsp; the opening delimiter of the regular expression
 
 
#<code>^</code>&nbsp;&nbsp;&nbsp;the beginning of the line
 
#<code>\s</code>&nbsp;&nbsp;&nbsp;any whitespace character ...
+
#<code>\\s</code>&nbsp;&nbsp;&nbsp;any whitespace character ...
 
#<code>&#42;</code>&nbsp;&nbsp;&nbsp; ... repeated 0 or more times
 
 
#<code>&#35;</code>&nbsp;&nbsp;&nbsp; the hash character
 
#<code>/</code>&nbsp;&nbsp;&nbsp;the closing delimiter of the regular expression
 
  
The following example would process a file and store all lines that are not comments in an array:
 
  
<source lang="perl">
+
The following example reads a file into a vector of lines, then drops all lines that are empty and all lines that are comments. This is a straightforward idiom.
#!/usr/bin/perl
 
use strict;
 
use warnings;
 
  
my @input;
+
<pre>
while (my $line = <STDIN>) {    # while something is being read
+
 
  if ($line !~ m/^\s*#/) {      # if its not a comment ...
+
IN <- "test.txt"
      push(@input, $line);      # ... store line in array
+
patt <- "^\\s*#"
  }
+
 
}
+
myData <- readLines(IN)
print(@input,"\n");              # print whole array
+
myData <- myData[myData != ""]  # drop all elements that are the empty string
 +
myData <- myData[! grepl(patt, myData)]  # drop all elements that match the pattern
 +
 
 +
</pre>
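
To try this idiom without a data file at hand, a temporary file can stand in for <code>test.txt</code>. This is only a sketch under that assumption; the file contents below are made up for illustration.

<pre>
# Write a small, made-up input file to a temporary location
IN <- tempfile(fileext = ".txt")
writeLines(c("# a comment", "", "gene\tvalue", "   # indented comment", "MBP1\t42"),
           IN)

patt <- "^\\s*#"                         # optional whitespace, then "#"

myData <- readLines(IN)
myData <- myData[myData != ""]           # drop empty strings
myData <- myData[! grepl(patt, myData)]  # drop comment lines
myData                                   # "gene\tvalue" "MBP1\t42"

unlink(IN)                               # remove the temporary file
</pre>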
 +
 
 +
{{Vspace}}
  
exit();
+
==== Substitution - gsub() ====
</source>
 
  
====Substitution - s====
+
Think of "gsub" as "global substitution", and you'll understand that there exists another function, <code>sub()</code>, that replaces only the first occurrence of a pattern, rather than all of them as <code>gsub()</code> does. I can't imagine what the use case for that might be and I don't think I have ever used <code>sub()</code>. I get an intuitive sense that code that needs such a function should probably be reconceived. But <code>gsub()</code> is very useful.
  
The substitution operator s substitutes the expression in the first part with the expression in the second part once per line. Its syntax is
+
<pre>
  
s/&gt;Expression&lt;/&gt;Replacement&lt;/&gt;Modifier&lt;
+
(s <- "  1 MKLAACFLTL LPGFAVA... 17  ") # E-coli Alpha Amylase signal peptide
  
&gt;Expression&lt; is a regular expression.
+
# Drop everything from this string that is not an amino acid one-letter code.
&gt;Replacement&lt; is a specific pattern.
+
# We use gsub() to first identify all non-amino acid letters with a character
&gt;Modifier&lt; is one or more characters from a list of modifiers detailed below.
+
# class regular expression, then we replace each occurrence with the empty
 +
# string.
  
Example (substitutes the first instance of ugly in a line with pretty):
+
gsub("[^ACDEFGHIKLMNPQRSTVWY]", "", s)
  
<source lang="perl">
+
# or, with assignment: ...
$line =~ s/ugly/pretty/;
+
s <- gsub("[^ACDEFGHIKLMNPQRSTVWY]", "", s)
</source>
 
  
Try the following example:
+
</pre>
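
A small sketch of the difference between <code>sub()</code> and <code>gsub()</code>, using a made-up string:

<pre>
x <- "ugly duckling, ugly pond"

sub("ugly", "pretty", x)    # "pretty duckling, ugly pond"   (first match only)
gsub("ugly", "pretty", x)   # "pretty duckling, pretty pond" (all matches)
</pre>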
  
<source lang="perl">
+
{{Vspace}}
#!/usr/bin/perl
 
use strict;
 
use warnings;
 
  
print("input>");
+
====strsplit() ====
my $line = <STDIN>;
 
$line =~ s/[^0-9+*\/=^-]//g;  # substitute
 
print($line,"\n");
 
  
exit();
+
Another function that makes use of regular expressions is <code>strsplit()</code>. It takes a vector of strings, and returns a list, one element for each element of the vector, in which each string has been split up by separating it along a regular expression.
</source>
 
  
 +
<pre>
 +
x <- c("a b c", "1 2")
 +
strsplit(x, " ")
 +
# [[1]]
 +
# [1] "a" "b" "c"
 +
#
 +
# [[2]]
 +
# [1] "1" "2"
 +
</pre>
  
The key is the following command:
+
Since even a single string returns a list, you often have to extract the element you want as a vector for further use.
  
<source lang="perl">
+
<pre>
$line =~ s/[^0-9+*\/=^-]//g;
+
corvidae <- c("crow:jackdaw:jay:magpie:raven:rook")
</source>
+
strsplit(corvidae, ":")
  
The substitution is applied to the contents of the variable $line. It is of the form
+
unlist(strsplit(corvidae, ":"))
 +
strsplit(corvidae, ":")[[1]]
  
s/...//g;
+
# Consider:
 +
length(strsplit(corvidae, ":"))
 +
length(unlist(strsplit(corvidae, ":")))
 +
</pre>
  
which means substitute all occurrences ( g modifier !) of the pattern […] with nothing (because the replacement pattern is empty). This deletes all matching characters from the string.
 
  
The expression itself is a character set. It matches any character which is not a digit (0-9), a "+" or "*" character, a "/" character (which has to be preceded with an escape, as "\/", otherwise it would be parsed as the delimiter of the expression), or an "=", "^", or "-" character. Since it is itself a negation, only the characters specified thus are not deleted.
+
<code>strsplit()</code> is immensely useful to extract elements from strings with a relatively well defined structure.
  
For example the input
+
<pre>
 +
s <- "1, 1, 2, 3, 5, 8"
 +
strsplit(s, ", ")[[1]] # split on comma-space
  
aa2bb^4cc,.<>=16....
+
s <- "~`!@#$%^&*()_-=+[{]}\\|;:',<.>/?"  # note: the backslash must be escaped
 +
strsplit(s, "")[[1]]  # split on empty string
  
is changed into the output:
+
s <- "chronological lifespan:\tincreased\ncold sensitivity:\tincreased\nsporulation:\tnormal"
 +
strsplit(s, "\\t|\\n")[[1]]  # split on tab or newline
  
2^4=16
+
</pre>
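
As an illustration of extracting structured elements, two calls to <code>strsplit()</code> can turn the key/value text above into a named vector. This is a minimal sketch; the variable names <code>rows</code>, <code>fields</code> and <code>phenotypes</code> are made up for this example.

<pre>
s <- "chronological lifespan:\tincreased\ncold sensitivity:\tincreased\nsporulation:\tnormal"

# "\n" and "\t" in an R string are a literal newline and tab; the regex
# engine matches them as ordinary characters.
rows   <- strsplit(s, "\n")[[1]]         # one element per line
fields <- strsplit(rows, ":\t")          # list: c(key, value) for each line

# collect the values into a vector, named by their keys
phenotypes <- sapply(fields, function(x) x[2])
names(phenotypes) <- sapply(fields, function(x) x[1])

phenotypes["sporulation"]                # "normal"
</pre>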
  
  
====Transliteration - tr====
 
  
The transliteration operator tr substitutes a range of characters with another range of characters.
+
{{Vspace}}
  
<source lang="perl">
+
==Behaviour==
$line =~ tr/[a-z]/[A-Z]/;
 
</source>
 
  
turns the contents of $line all into uppercase.
+
{{Vspace}}
  
  
====split() ====
+
====Capturing and using  matches ====
  
Another operator that makes use of regular expressions is the split operator. You can split on a regular expression and thus remove unneeded characters from input, as in the following example:
+
Matches can be captured and used, e.g. in <code>gsub()</code>.
  
<source lang="perl">
+
<pre>
#!/usr/bin/perl -w
+
# Capture matches by placing them in parentheses. To immediately reuse them,
# refer to them with "backreferences": \\1, \\2, \\3.
use strict;
 
my $string = "A :colon:delimited: string: with:  random :spaces";
 
my ( @lines ) = split(/\s*:\s*/, $string);
 
# splits on colons surrounded by optional spaces
 
...
 
</source>
 
  
 +
# Example 1:
 +
# The beginning and ending three words of some text...
 +
s <- "I know, however, that its precarious and remote villages lie within the lowlands of the Wisla River."
 +
gsub("^((\\S+\\s+){3}).*((\\s\\S+){3})$", "\\1 ... \\3", s)
  
@lines now contains each entry in its own array element, without colons or whitespace.
+
# Note: captures \\2 and \\4 come from the inner parentheses, which are there to
 +
# group things to be found {3}-times.
  
In practice, when should you use matching, and when is split() more appropriate?
 
  
;Use matching when you know what you want to keep:
+
# Example 2:
<source lang="perl">
+
# A binomial species name has a genus, a species, and possibly a strain name.
@words = $input =~ /\w+/g; # captures all blocks of characters
+
# We use \\S (not whitespace) and \\s (whitespace) to tease this apart into
</source>
+
# three captured expressions:
 +
s <- "Saccharomyces cerevisiae S288C"
 +
gsub("^(\\S+)\\s(\\S+)\\s*(.*)$",
 +
    "genus: \\1; species: \\1 \\2; (strain: \\3)",
 +
    s)
 +
  
;Use split() when you know what you want to discard:
+
</pre>
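
Backreferences are also handy for reordering parts of a string. A minimal sketch with made-up dates:

<pre>
# made-up example dates in day/month/year order
d <- c("24/12/2019", "01/02/2020")

# capture day, month and year, then write them back in year-month-day order
gsub("^(\\d{2})/(\\d{2})/(\\d{4})$", "\\3-\\2-\\1", d)
# "2019-12-24" "2020-02-01"
</pre>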
<source lang="perl">
 
@words = split( /\s+/, $input); # splits on whitespace
 
                                # and discards it
 
</source>
 
  
Consider how punctuation marks would influence the results of these examples.
+
====Capturing and returning matches ====
  
The most frequent use of the split function is for processing structured input data, such as comma- or tab delimited text:
+
Finding and '''returning''' matches in R is a two-step process. (1) find matches with <code>regexpr()</code> (one match), <code>gregexpr()</code> (all matches), or <code>regexec()</code> (sub-expressions in parentheses). All of these return a "match object". (2) use the match object to extract the matching substrings from the original string.
 +
 
 +
 
 +
<pre>
  
<source lang="perl">
 
#!/usr/bin/perl
 
use strict;
 
use warnings;
 
my @fields;
 
while (@fields = split(/\t/, <STDIN>)) { #tab separated values
 
  # ... process fields
 
}
 
exit();
 
</source>
 
  
{{Vspace}}
+
# Extracting gene names in text.
  
{{Vspace}}
+
# Let's define a valid gene name to be a substring that is bounded by
 +
# word-boundaries, starts with an upper-case character, contains more upper-case
 +
# characters or numbers or a hyphen or underscore, with a minimal length of 3.
 +
# Here is a regex, and we put the part of the string that we want to recover, in
 +
# parentheses:
  
==Behaviour==
+
patt <- "\\b([A-Z][A-Z0-9\\-_]+[A-Z0-9])\\b"
  
===Returning values===
+
# Test: positives
It is often desirable to group terms together. This is done with various forms of parentheses. By default, grouping values with parentheses allows to capture the actual match to the special variables $1, $2, $3, etc. in the order in which the complete phrases of the groups are defined, from outermost to innermost !
+
grepl(patt, "MBP1")
 +
grepl(patt, "AAT")
 +
grepl(patt, " AI1")
 +
grepl(patt, "ASP3-1 ")
 +
grepl(patt, " AI5_ALPHA; ")
 +
grepl(patt, " (TY1B-PR3) ")
 +
# Test: negatives
 +
grepl(patt, "G1") # Too short
 +
grepl(patt, "G1-") # Hyphen at end
 +
grepl(patt, "Cell") # contains lower-case
  
Here is one example - the groupings are shown below the parentheses.
+
# Let's apply this to retrieve gene names in text
  
This is how it works:
+
s <- "Transcriptional activation of CLN1, CLN2, and a putative new G1 cyclin (HCS26) by SWI4, a positive regulator of G1-specific transcription. Cell 66(5):1015-26"
( ( ) ( ( ) ) )
 
1-------------1
 
  2-2
 
      3-----3
 
        4-4
 
  
 +
(m <- regexpr(patt, s)) # found a match in position 31
 +
regmatches(s, m)        # retrieve it
  
This is how it does not work:
+
(m <- gregexpr(patt, s)) # found all matches
( ( ) ( ( ) ) )
+
regmatches(s, m)        # retrieve them (note, this is a list)
1---1
 
  2-------2
 
      3-----3
 
        4-----4
 
  
<table border="1" cellpadding="5">
+
# The function of choice however is regexec(). It returns whatever the pattern
<tr><th>Grouping Syntax</th><th>Meaning</th><th>Where it occurs in the regex</th></tr>
+
# has defined in parentheses, the others return the entire match. The
<tr><td><code>()</code></td><td>Group what's between the brackets and remember match</td><td>Anywhere</td></tr>
+
# parentheses are quite important, because we might want to specify additional
<tr><td><code>(?: … )</code></td><td>Group what's between the brackets, but discard match</td><td>Anywhere</td></tr>
+
# context for a valid match, but we might not want the context in the match
<tr><td><code>(?= … )</code></td><td>must follow the match</td><td>End of a regex</td></tr>
+
# itself. In our example we used word boundaries - \\b - for such context; but
<tr><td><code>(?! … )</code></td><td>must not follow the match</td><td>End of a regex</td></tr>
+
# these are zero-length and don't actually match a character, so they don't
</table>
+
# contaminate the substring anyway. But in general we need to be able to
 +
# precisely retrieve only the target substring.
  
 +
(m <- regexec(patt, s)) # only the parenthesized substring
 +
regmatches(s, m)        # retrieve it
  
In terms of saved values, also note that string parts are saved to special global variables.
+
# Note that there are two elements: the first is the whole match, the second
 +
# is the substring that is in parentheses. In our example these are the same.
 +
# Here is an example where they are not:
 +
s <- "Find the last word. And tell me."
 +
(m <- regexec("\\s(\\w+)\\.", s))
 +
regmatches(s, m)        # retrieve it
  
<table border="1" cellpadding="5">
+
# Unfortunately there is no option to capture multiple matches
<tr><th>Variable</th><th>What it contains</th></tr>
+
# in older versions of base R: gregexec() was only added in R 4.1.0 ...
<tr><td><code>$`</code></td><td>Part of string before match</td></tr>
 
<tr><td><code>$&amp;</code></td><td>Part of string matched</td></tr>
 
<tr><td><code>$'</code></td><td>Part of string after match</td></tr>
 
</table>
 
  
Note the following: if these are not used anywhere in your code, Perl doesn't bother to maintain them, when your program is compiled. This makes all regexes much faster. It seems sensible to avoid them for all but quick and dirty programming work; use parentheses when you need to capture matches and never to put such special variables in modules!
+
patt <- "\\b([A-Z][A-Z0-9\\-_]+[A-Z0-9])\\b"
  
====Capturing matches directly====
+
s <- "Transcriptional activation of CLN1, CLN2, and a putative new G1 cyclin (HCS26) by SWI4, a positive regulator of G1-specific transcription. Cell 66(5):1015-26"
In addition to using parentheses and the special variables, you can capture values directly by assignment from the match operator if you use the "global" modifier.
 
  
  
<source lang="perl">
+
# Solution 1 (base R): you can use multiple matches in an sapply()
#!/usr/bin/perl
+
# statement...
use strict;
+
sapply(regmatches(s, gregexpr(patt, s))[[1]],
use warnings;
+
      function(M){regmatches(M, regexec(patt, M))})
  
my $Ubiquitin ="
 
MQIFVKTLTG KTITLEVEPS\n
 
DTIENVKAKI QDKEGIPPDQ\n
 
QRLIFAGKQL EDGRTLSDYN\n
 
IQKESTLHLV LRLRGG\n";
 
  
my @hydrophobics = $Ubiquitin =~ m/[FAMILYVW]/gs;
+
# Solution 2 (probably preferred): you can use
print @hydrophobics;
+
# str_match_all() from the very useful library "stringr" ...
 +
if (! requireNamespace("stringr", quietly=TRUE)) {
 +
  install.packages("stringr")
 +
}
 +
# Package information:
 +
#  library(help = stringr)      # basic information
 +
#  browseVignettes("stringr")    # available vignettes
 +
#  data(package = "stringr")    # available datasets
  
exit();
 
</source>
 
  
 +
stringr::str_match_all(s, patt)
 +
stringr::str_match_all(s, patt)[[1]][,2]
 +
# [1] "CLN1"  "CLN2"  "HCS26" "SWI4"
  
You can also use <code>grep()</code> and collect matching lines in an array. Here is an example that downloads a coordinate file from the PDB and extracts the <code>ATOM  </code> records.
+
# Note that str_match_all() handles the match object internally, no need for
 +
# the two-step code.
  
<source lang="perl">
+
</pre>
#!/usr/bin/perl
 
use strict;
 
use warnings;
 
  
my $PDBpref = "http://pdb.org/pdb/files/";
 
my $PDB_ID  = uc("2imm");
 
my $PDBsuff = ".pdb";
 
my $URL = $PDBpref . $PDB_ID . $PDBsuff;
 
  
my @raw = split(/\n/, `curl -s $URL`); # backtick operator captures output of system commandline function "curl"
+
An interesting new alternative/complement to the base '''R''' regex libraries is the {{R|ore|ore()|package "'''ore'''"}} that uses the {{WP|Oniguruma}} libraries and supports multiple character encodings, which you need when you work with Unicode and/or CJK character sets.
my @atoms = grep(/^ATOM  /, @raw);
 
  
print (join("\n", @atoms), "\n"); # join lines with linebreaks, add a final linebreak at the end
 
  
exit();
+
<pre>
</source>
+
if (! requireNamespace("ore", quietly = TRUE)) {
 +
    install.packages("ore")
 +
}
 +
# Package information:
 +
#  library(help = ore)      # basic information
 +
#  browseVignettes("ore")   # available vignettes
 +
#  data(package = "ore")    # available datasets
  
  
&nbsp;
+
S <- "The quick brown fox jumps over a lazy dog"
 +
 
 +
ore::ore.search(". .", S)
 +
ore::ore.search(". .", S, all=TRUE)
 +
M <- ore::ore.search(". .", S, all=TRUE)
 +
M$nMatches
 +
M$match[2:4]
 +
</pre>
 +
 
 +
According to the author John Clayden, key advantages include:
 +
* Search results focus around the matched substrings (including parenthesised groups), rather than the locations of matches. This saves
 +
extra work with <code>regmatches()</code> or similar to extract the matches themselves.
 +
* [http://rpubs.com/jonclayden/regex-performance Substantially better performance], especially when matching against long strings.
 +
* Substitutions can be functions as well as strings.
 +
* Matches can be efficiently obtained over only part of the strings.
 +
* Fewer core functions, with more consistent names.
 +
 
 +
{{Vspace}}
  
 
===Modifiers===
 
After the trailing / delimiter of the regular expression, an i makes the match case insensitive (e.g. /foo/i will match FOO too). An x causes Perl to ignore whitespace in the regex (e.g. /foo s?/x will match foo and foos, but not "foo s"); this is useful when an expression is long and may span several lines: just insert linebreaks, tabs or characters as needed.
 
  
For example the following is a valid regular expression in a Perl program that parses a Fasta file into header and sequence.
+
A number of modifiers can be applied as arguments to regular expression functions that may be useful. Here are the two most important ones.
 +
 
 +
<pre>
 +
 
 +
# Option "ignore.case" allows case-insensitive matches. This is usually
 +
# poor programming style, a more explicit (= better) way is to define your
 +
# character classes appropriately.
 +
 
 +
patt <- "\\b([A-Z][A-Z0-9\\-_]+[A-Z0-9])\\b"
 +
 
 +
s <- "The MBP1 gene encodes the Mbp1 protein."
 +
 
 +
m <- gregexpr(patt, s)
 +
regmatches(s, m)[[1]]
 +
 
 +
m <- gregexpr(patt, s, ignore.case = TRUE)
 +
regmatches(s, m)[[1]]
 +
 
  
<source lang="perl">
+
# For regex functions in the stringr package, you can compile the pattern
#!/usr/bin/perl
+
# with the regex() function, and include the option "comments = TRUE". This
use strict;
+
# allows you to insert whitespace and # characters into the pattern
use warnings;
+
# which will be ignored by the regex engine. Thus you can comment
 +
# complex regular expressions inline.
  
my $fasta ='';
 
while (my $line = <STDIN>) { $fasta .= $line; }
 
  
$fasta =~ /    # Begin regular expression
+
myRegex <- stringr::regex("\\b            # word boundary
    (?:.*)    # discard whatever precedes next match
+
                          (               # begin capture
    \s*        # there could be whitespaces
+
                          [A-Z]          # one uppercase letter
    >(.*\n)    # match the header line and collect its contents
+
                          [A-Z0-9\\-_]+  # one or more letters, numbers, hyphen or
    \s*       # there could again be whitespaces
+
                                          #   underscore
    ((.*\n)*)  # match everything else to the end
+
                          [A-Z0-9]       # one letter or number.
    /x;        # ignore whitespace in the regex
+
                          # Note: this captured subexpression has a minimum length of 3.
 +
                          )               # end capture
 +
                          \\b",          # word boundary
 +
                          comments = TRUE)
  
my $header = $1;
+
stringr::str_match_all(s, myRegex)[[1]][2]
my $sequence = $2;
 
$sequence =~ s/\s//g;  # remove all whitespace from sequence
 
  
print($header,"\n");
+
</pre>
print($sequence,"\n");
 
  
exit();
 
</source>
 
  
Here the Perl compiler first discards the comments and the "x" modifier discards all the whitespaces inside the regular expressions.
+
===Greed===
  
Contrast this to the impenetrable expression you would have had to write otherwise !
+
By default, quantitative matches except zero/one (i.e. the ? character) are "greedy", i.e. they will match the largest possible number of characters.  For example:
  
 +
<pre>
 +
s <- "abc123"
  
<source lang="perl">
+
patt <- "(\\w+)(\\d+)"  # word characters, followed by digits. This pattern ...
$fasta =~ /(?:.*)\s*>(.*\n)\s*((.*\n)*)/;
+
stringr::str_match_all(s, patt)[[1]][-1]
</source>
 
  
 +
# ...yields "abc12" and "3" . This is because \w+ is greedy and grabs as many
 +
# alphanumeric characters as it can before \d+ gets a chance to match.  A "?"
 +
# after a quantity specifier makes it non-greedy, therefore ...
  
The s modifier treats multi-line strings (with new-line characters in them) as a single line, otherwise matching ends at the first new-line (e.g. /fo\no/s will match foo split over two lines).  The g modifier is useful in loops, making consecutive attempts to match, starting at the place in the string where the previous match ended (e.g. while($foo =~ /o/g){$o_count++} will give an o_count of two if $foo contains "foo" because there are two o's in "foo").
+
patt <- "(\\w+?)(\\d+)"  # Note the question mark in (\\w+?)
 +
stringr::str_match_all(s, patt)[[1]][-1]
  
All of the modifiers can be used together. Just type them one after another after the delimiter.
+
# ... now \d+ gets a chance to match as many digits as possible
  
 +
</pre>
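
The same behaviour can be observed with base R functions alone. A sketch with a made-up string; <code>perl = TRUE</code> is passed here only to be explicit about the regex dialect:

<pre>
s <- 'say "yes" or "no"'

# greedy: .+ extends to the LAST double quote in the string
gsub('".+"', "X", s, perl = TRUE)    # "say X"

# non-greedy: .+? stops at the NEXT double quote
gsub('".+?"', "X", s, perl = TRUE)   # "say X or X"
</pre>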
  
===Greed===
+
{{Vspace}}
  
By default, quantitative matches except zero/one (i.e. the ? character) are "greedy", i.e. they will match the largest possible number of characters.  For example
+
==Regular Expressions in other languages==
/(\w+)(\d+)/
 
against "abc123" yields "abc12" and "3" for $1 and $2 respectively
 
  
This is because \w+ is greedy and grabs as many alphanumeric characters as it can before \d+ gets a chance to match.  A ? after a quantity specifier makes it non-greedy, therefore
+
{{Vspace}}
/(\w+?)(\d+)/
 
against "abc123" yields "abc" and"123" for $1 and $2 respectively.
 
  
 +
===PHP===
  
==Regular Expressions in PHP==
+
<pre>
<source lang="PHP">
 
 
<?php
 
 
$string = "The quick brown fox jumps over a lazy dog";
 
  
 
?>
 
</source>
+
</pre>
 
 
{{Vspace}}
 
  
 
{{Vspace}}
 
{{Vspace}}
  
==Regular expressions in Python==
+
===Python===
 
Python regular expressions are provided through the module <code>re</code>. See [https://docs.python.org/2/library/re.html '''here''' for documentation].
  
  
  
 
+
====Python example====
 
 
===Example===
 
  
 
Download [http://biochemistry.utoronto.ca/steipe/abc/CourseMaterials/BCB410/sample.svg '''this <code>.svg</code> file'''] to experiment.
 
  
  
<source lang="python">
+
<pre>
 
# parse_SVG_example.py
 
 
# Read an svg file line by line and process path data
 
 
OUT.close()
 
  
</source>
+
</pre>
  
 
{{Vspace}}
 
{{Vspace}}
  
{{Vspace}}
+
===Javascript===
 
 
==Regular Expressions in '''R'''==
 
 
 
The online help page is [http://stat.ethz.ch/R-manual/R-patched/library/base/html/regex.html '''here''']. Default behaviour is not standard POSIX. To be sure, pass the <code>perl=TRUE</code> parameter.
 
 
 
<!-- for updates, see code in R_Exercise -Bioinformatics "Sequence.R script -->
 
<source lang="R">
 
# R regular expressions in base R
 
 
 
string <- "The quick brown fox jumps over a lazy dog"
 
vector <- unlist(strsplit(string, "\\s"))
 
 
 
# Not all pattern searches use (and need) regular expression. Sometimes
 
# simple string-matching is enough.
 
 
 
# R has match(), the %in% operator, and grep()
 
 
 
# match test for string equality
 
match("fox", vector)  # 4, i.e. the 4th element matches the string
 
match("o", vector)    # NA matches have to be to the WHOLE element
 
 
 
# equivalent to...
 
which(vector == "fox")
 
 
 
# %in% can be used for creating intersections
 
# find whether elements from one vector are
 
# contained in another:
 
 
 
english <- unlist(strsplit(
 
"what's in a name ? that which we call a rose by any other name would smell as sweet ."
 
                          , "\\s"))
 
german <- unlist(strsplit(
 
"was ist ein name ? was uns rose heißt , wie es auch hieße , würde lieblich duften ."
 
                          , "\\s"))
 
english
 
german
 
german %in% english
 
german[german %in% english]
 
 
 
 
 
 
 
# grep() is like match(), but uses regular expressions. parts of the string
 
# may match. The result is a vector of the indices of the matching elements.
 
 
 
grep("fox", vector)
 
grep("o", vector)
 
grep("[opq]", vector)
 
english[grep("a", english)]
 
 
 
 
 
 
 
# strsplit()  Note: the regex comes *after* the string in default ordering
 
# we have seen its use to split on whitespace (\s) above.
 
# NOTE: the regular expression in the pattern is <backslash> "s". But if
 
# we write "\s" into the string, R thinks we are "escaping" the s. That's
 
# not what we want. We have to escape the backslash, then write "s". The
 
# "escaped" backslash is "\\". Thus the regex pattern as R string is "\\s".
 
 
 
# The return value of strsplit() is a list, thus we unlist() to use
 
# the result as a vector.
 
 
 
unlist(strsplit(english, "\\s"))
 
 
 
 
 
 
 
# regexpr(), regmatches()
 
#get all word characters adjacent to "o"
 
pattern <- "\\w{0,1}o\\w{0,1}" # 0-1 "\w" character left and
 
                              # right of "o"
 
regexpr(pattern, vector) # positions of matches
 
M <- regexpr(pattern, vector) # assign the result object
 
regmatches(vector, M) # use regmatches to process
 
                      # the match-object M against the
 
                      # source vector
 
 
 
 
 
# regexec()
 
# capture groups from a string. Here we don't just want to know
 
# whether a match exists, but what it is. Example: is there
 
#  a three-consonant cluster in our string?
 
pattern <- "([bcdfghjklmnpqrstvwxz]{3})"  # Note the parentheses
 
                                          # that indicate the match
 
                                          # should be "captured"
 
grep(pattern, string)
 
 
 
M <- regexec(pattern, string) #
 
regmatches(string, M)
 
regmatches(string, M)[[1]]
 
regmatches(string, M)[[1]][1]
 
 
 
 
 
# Unfortunately there is no option to capture multiple matches
 
# in older versions of base R: gregexec() was only added in R 4.1.0 ...
 
M <- regexec("(. .)", string)
 
regmatches(string, M)
 
 
 
# ... matches only the first character/blank/character pattern,
 
# not all of them.
 
 
 
 
 
# Solution 1 (base R): you can use multiple matches in an sapply()
 
# statement...
 
pattern <- "(. .)"  # the regex: capture two characters adjacent to a single blank
 
sapply(regmatches(string, gregexpr(pattern, string))[[1]],
 
      function(M){regmatches(M, regexec(pattern, M))})
 
 
 
 
 
# Solution 2 (probably preferred): you can use
 
# str_match_all() from the very useful library "stringr" ...
 
if (!require(stringr, quietly=TRUE)) {
 
    install.packages("stringr")
 
    library(stringr)
 
}
 
 
 
str_match_all(string, pattern)[[1]][,2]
 
# [1] "e q" "k b" "n f" "x j" "s o" "r a" "y d"
 
 
 
 
 
</source>
 
 
 
 
 
An interesting new alternative/complement to the base '''R''' regex libraries is the {{R|ore|ore()|package "'''ore'''"}} that uses the {{WP|Oniguruma}} libraries and supports multiple character encodings.
 
 
 
 
 
<source lang="R">
 
if (!require(ore)) {
 
    install.packages("ore")
 
    library(ore)
 
}
 
 
 
S <- "The quick brown fox jumps over a lazy dog"
 
 
 
ore.search(". .", S)
 
ore.search(". .", S, all=TRUE)
 
M <- ore.search(". .", S, all=TRUE)
 
M$nMatches
 
M$match[2:4]
 
</source>
 
 
 
According to the author John Clayden, key advantages include:
 
* Search results focus around the matched substrings (including
 
parenthesised groups), rather than the locations of matches. This saves
 
extra work with "substr" or similar to extract the matches themselves.
 
* [http://rpubs.com/jonclayden/regex-performance Substantially better performance], especially when matching against
 
long strings.
 
* Substitutions can be functions as well as strings.
 
* Matches can be efficiently obtained over only part of the strings.
 
* Fewer core functions, with more consistent names.
 
 
 
{{Vspace}}
 
 
 
{{Vspace}}
 
 
 
==Regular Expressions in Javascript==
 
  
 
Javascript is probably the most convenient choice if you want to interact directly with the DOM (Document Object Model) of a Web page. Here is example code for a "bookmarklet" that rewrites the URL of a journal-article page for access from outside the UofT network.
 
  
<source lang="javascript">
+
<pre>
 
javascript:(function(){
 
 
   var url=window.location.href;
 
 
void 0
 
  
</source>
+
</pre>
 +
 
 +
Put this into the body of an arbitrary bookmark on your browser, then click it to be redirected to our library's free access system of a paywalled journal article.
  
{{Vspace}}
 
  
 
{{Vspace}}
 
{{Vspace}}
  
==Regular Expressions in POSIX (Unix, the shell)==
+
===POSIX (Unix, the bash shell)===
 
Use in:
 
 
*<code>grep</code>
 
 
{{Vspace}}
 
{{Vspace}}
  
{{Vspace}}
+
==Practice==
 
 
==Discussion points==
 
 
 
* Revisit [http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags the stackoverflow thread on regex and HTML parsing]. What's your opinion on the OP's question?
 
  
{{Vspace}}
+
{{ABC-unit|RPR-RegEx.R}}
  
 
{{Vspace}}
 
{{Vspace}}
 
==Exercises==
 
<section begin=exercises />
 
 
<!-- Exercise template with sample data, hint and solution ...
 
=== Heading===
 
<div class="mw-collapsible mw-collapsed exercise-box" data-expandtext="Hint" data-collapsetext="Collapse">
 
Task ...
 
 
<div class="mw-collapsible mw-collapsed exercise-box" data-expandtext="Expand" data-collapsetext="Collapse" style="background-color:#EEEEF9;">
 
Sample data ...
 
<div class="mw-collapsible-content">
 
 
<source lang="text">
 
Data ...
 
</source>
 
 
</div>
 
</div>
 
 
 
<div class="mw-collapsible-content exercise-box">
 
<div class="mw-collapsible mw-collapsed" data-expandtext="Solution" data-collapsetext="Collapse">
 
Hint ...
 
 
 
<div class="mw-collapsible-content exercise-box">
 
Solution ...
 
 
 
</div>
 
</div>
 
</div>
 
</div>
 
-->
 
 
===Counting lines===
 
<div class="mw-collapsible mw-collapsed exercise-box" data-expandtext="Hint" data-collapsetext="Collapse">
 
Write a unix command that returns the number of atoms in a PDB file.
 
 
<div class="mw-collapsible mw-collapsed exercise-box" data-expandtext="Expand" data-collapsetext="Collapse" style="background-color:#EEEEF9;">
 
Sample data ...
 
<div class="mw-collapsible-content">
 
 
<source lang="text">
 
HEADER  TEST                                                0TST      0TST  1
 
REMARK  ATOM  AND HETATM RECORDS FOR COUNTING                        0TST  2
 
ATOM      1  N  GLY    1      -6.253  75.745  53.559  1.00 36.34      0TST  3
 
ATOM      2  CA  GLY    1      -5.789  75.223  52.264  1.00 44.94      0TST  4
 
ATOM      3  C  GLY    1      -5.592  73.702  52.294  1.00 32.28      0TST  5
 
ATOM      4  O  GLY    1      -5.140  73.148  53.304  1.00 19.32      0TST  6
 
TER      5      GLY    1                                              0TST  7
 
HETATM    6  O  HOH    1      -4.169  60.050  40.145  1.00  3.00      0TST  8
 
HETATM    7 CA  CA      1      -1.258 -71.579  50.253  1.00  3.00      0TST  9
 
END                                                                    0TST  10
 
</source>
 
 
</div>
 
</div>
 
 
 
<div class="mw-collapsible-content exercise-box">
 
<div class="mw-collapsible mw-collapsed" data-expandtext="Solution" data-collapsetext="Collapse">
 
Use <code>egrep</code> to select the lines that begin with "ATOM  " or "HETATM", then pipe the output through <code>wc</code>.
 
 
<div class="mw-collapsible-content exercise-box">
 
; the unix solution:
 
<source lang="bash">
 
egrep "^ATOM  |^HETATM" test.pdb | wc -l
 
</source>
 
 
 
; a Perl solution
 
<source lang="perl">
 
#!/usr/bin/perl
 
use warnings;
 
use strict;
 
 
my $numberOfAtoms = 0;
 
 
while (my $line = <STDIN>) {        # read in from STDIN
 
 
  if ($line =~ /^ATOM  |^HETATM/) { # match on "ATOM  " or
 
      $numberOfAtoms++;              # "HETATM" at the beginning
 
  }                                # of a line
 
}
 
print("Number of atoms in input file: ", $numberOfAtoms, "\n");
 
 
exit();
 
</source>
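For comparison, the same count can be sketched in JavaScript. This assumes the file contents are already available as one string; the function name <code>countAtomRecords</code> is made up for this illustration. The <code>m</code> flag makes <code>^</code> match at the beginning of every line, and the <code>g</code> flag collects all matches.

<source lang="javascript">
// Count the lines that begin with "ATOM  " or "HETATM".
function countAtomRecords(pdbText) {
  const matches = pdbText.match(/^(?:ATOM  |HETATM)/gm);
  return matches === null ? 0 : matches.length;   // match() returns null if nothing matched
}

const sample = [
  "HEADER  TEST",
  "REMARK  ATOM  AND HETATM RECORDS FOR COUNTING",
  "ATOM      1  N  GLY    1",
  "ATOM      2  CA  GLY    1",
  "TER      5      GLY    1",
  "HETATM    6  O  HOH    1",
  "HETATM    7 CA  CA      1",
  "END"
].join("\n");

console.log(countAtomRecords(sample));
</source>

The REMARK line contains both keywords but is not counted, because neither appears at the beginning of the line.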
 
 
 
</div>
 
</div>
 
</div>
 
</div>
 
 
==== ...CA atoms only====
 
<div class="mw-collapsible mw-collapsed exercise-box" data-expandtext="Hint" data-collapsetext="Collapse">
 
Change your unix command to count C-alpha atoms only. Work only with regular expressions. Don't get fooled by calcium atoms!
 
 
<div class="mw-collapsible-content exercise-box">
 
<div class="mw-collapsible mw-collapsed" data-expandtext="Solution" data-collapsetext="Collapse">
 
There are several possibilities: you might make the regex more specific, or you might grep again on the output of a previous grep.
 
 
<div class="mw-collapsible-content exercise-box">
 
Here's a version with grep-ing twice. <small>This strategy is convenient when you can't be sure about the order in which your required patterns appear in the input.</small>
 
 
<source lang="bash">
 
egrep "^ATOM" test.pdb | egrep "[0-9]  CA " | wc -l
 
</source>
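The other possibility, a single and more specific regex, might look like the following JavaScript sketch. It assumes that fields are separated by blanks as in the sample data above; real PDB files use fixed columns, so treat the pattern as an illustration and not as a format-compliant parser. The function name <code>countCAlpha</code> is made up for this example.

<source lang="javascript">
// Count ATOM records whose atom name is exactly "CA".
// Calcium ions are stored in HETATM records and are therefore skipped.
function countCAlpha(pdbText) {
  // line start, "ATOM", blanks, serial number, blanks, "CA", one blank
  const matches = pdbText.match(/^ATOM +\d+ +CA /gm);
  return matches === null ? 0 : matches.length;
}

const records = [
  "ATOM      1  N  GLY    1",
  "ATOM      2  CA  GLY    1",
  "ATOM      3  C  GLY    1",
  "HETATM    7 CA  CA      1"
].join("\n");

console.log(countCAlpha(records));
</source>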
 
 
 
</div>
 
</div>
 
</div>
 
</div>
 
 
===eMail addresses ===
 
<div class="mw-collapsible mw-collapsed exercise-box" data-expandtext="Hint" data-collapsetext="Collapse">
 
Write a program in a language of your choice that reads a file from STDIN and prints any valid e-mail address this file might contain!
 
 
 
<div class="mw-collapsible mw-collapsed  exercise-box" data-expandtext="Expand" data-collapsetext="Collapse" style="background-color:#EEEEF9; ">
 
What is a valid eMail address ...  ?
 
<div class="mw-collapsible-content">
 
 
The protocols that govern the Internet are maintained by the IETF (www.ietf.org). They are developed as so-called RFCs (Requests for Comments) and are an impressive example of voluntary, self-organized technical administration that works. E-mail address formats are specified in RFC2822. In short, section 3.4.1 specifies the following:
 
 
;A valid e-mail address (this is slightly simplified from the RFC) consists of:
 
 
:local-part "@" domain
 
::where "local-part" is either
 
 
:::;1. a string containing the following characters:
 
:::any Letter
 
:::any Digit
 
:::any of !#$%&'*+-/=?^_`{|}~
 
:::Or conversely any printable character except ()<>@,;:\".[]
 
:::... elements of which can be separated by a period, which must not occur as the first or last element ...
 
 
 
:::;2. or any quoted string (i.e. one enclosed in double-quotes).
 
 
 
::"@" is the character <code>@</code>.
 
 
::"domain" is a valid organizational domain i.e. a string:
 
::*with at least two elements,
 
::*containing only letters, digits or hyphens,
 
::*separated by periods,
 
::*where the last element is a TLD (Top Level Domain); for this exercise, assume that TLDs are either 2 or 3 characters long (many newer TLDs are longer),
 
::*where the domain(s) preceding the TLD are not longer than 63 characters.
 
 
</div>
 
</div>
 
 
 
<div class="mw-collapsible mw-collapsed exercise-box" data-expandtext="Expand" data-collapsetext="Collapse" style="background-color:#EEEEF9;">
 
Sample input data <ref>Contributed by Jennifer Tsai</ref>...
 
<div class="mw-collapsible-content" style="padding:10px;">
 
 
<source lang="text">
 
Hi,
 
blah blah blah hello joy joy giggle
 
g2g - alice@wonderland.org cheshire.d'cat@disappear.net
 
moose nibble on bark@lichens.com
 
Three valid addresses above. this.one@breaks.
 
to soooon "within the domain" and this.one@is.an.invalid+domain.com
 
Domains can h@ve.hy-phens.org but not under@scor_es.dunce.net
 
quoted strings can contain characters that are normally
 
disallowed - like this convincing sample: "Yo, :-) so kewl"@hotmail.com
 
invalid@.this.is , young padawan.
 
sh@rt.one is good but sh@rt.1 is bad
 
a.a.a.a.a.a.b.c@com.tw works, as does
 
user@mailbox.department.faculty.university.ac.uk but
 
a.@a@b@blah.tv is not valid RFC2822, please pick out the valid part
 
too.looooong@top.level.domain
 
oK@top.level.dom.ain
 
thats@it.end
 
 
</source>
 
 
</div>
 
</div>
 
 
 
 
 
<div class="mw-collapsible-content exercise-box">
 
<div class="mw-collapsible mw-collapsed" data-expandtext="Solution" data-collapsetext="Collapse" style="width:100%; ">
 
Match something at a word boundary, followed by "@", followed by something, bounded by whitespace. Group this appropriately. Then return $1, $2, $3.
 
 
 
<div class="mw-collapsible-content exercise-box">
 
The code below implements the simplified rules listed above, except that it does not check that a subdomain is no longer than 63 characters.
 
 
<source lang="perl">
 
#!/usr/bin/perl
 
use warnings;
 
use strict;
 
# Define valid character sets
 
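# "-" is placed last so that it is a literal hyphen; written as "+-/" it would define a range that also matches "," and "."
my $LocalChars = 'a-zA-Z0-9!#$%&*+/=?^_`{|}~\'-';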
 
my $DomainChars = 'a-zA-Z0-9-';
 
 
while (my $line = <STDIN>) {
 
 
  # Do a *global* match for e-mail addresses, the inner while loop repeats as long as
 
  # matches can be found. Omitting the modifier "g" at the end would report only the
 
  # first match.
 
  # Elements are parsed in several alternative groupings - only the outer ones are
 
  # stored, the others are discarded with (?: ...)
 
  while ($line =~ /                  # do while a match can be found
 
      (                            # open first grouping
 
        "[^"]+" |                  # quoted string, (quotes enclosing non-quotes) or ...
 
        \b                        # ... word boundary, followed by
 
        (?:[$LocalChars]+)        # at least one group of at least one character and ...
 
        (?:\.[$LocalChars]+)*      # ... 0 or more additional groups, separated by "."
 
      )
 
      @                            # The "@"
 
      (                            # open second grouping
 
        (?:[$DomainChars]+)        # at least one subdomain
 
        (?:\.[$DomainChars]+)*    # 0 or more repetitions
 
        (?:\.[$DomainChars]{2,3})  # Top Level Domain !
 
      )
 
      \s+                          # separated by whitespace
 
      /gx) {                      # do globally, ignore whitespace in expression
 
  print($1, "@", $2, "\n");
 
  }  # while - parse
 
}  # while - read <STDIN>
 
 
exit();
 
</source>
 
 
This is the output the program produces on the sample text:
 
 
<source lang="text">
 
alice@wonderland.org
 
cheshire.d'cat@disappear.net
 
bark@lichens.com
 
h@ve.hy-phens.org
 
"Yo, :-) so kewl"@hotmail.com
 
sh@rt.one
 
a.a.a.a.a.a.b.c@com.tw
 
user@mailbox.department.faculty.university.ac.uk
 
b@blah.tv
 
oK@top.level.dom.ain
 
thats@it.end
 
</source>
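The same simplified rules can be sketched in JavaScript. This is an illustration under the assumptions of this exercise (letters-only TLDs of 2 or 3 characters, no length check on subdomains), not a complete RFC2822 validator; the names <code>LOCAL</code>, <code>DOMAIN</code>, <code>EMAIL</code> and <code>extractEmails</code> are made up for this example. Instead of consuming the trailing whitespace, the pattern uses a lookahead, so that two addresses separated by a single blank are both found.

<source lang="javascript">
// One dot-free element of the local part, and one element of the domain.
const LOCAL = "[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+";
const DOMAIN = "[A-Za-z0-9-]+";

const EMAIL = new RegExp(
  '("[^"\\n]+"|' + LOCAL + "(?:\\." + LOCAL + ")*)" +  // quoted string, or dot-separated elements
  "@" +
  "(" + DOMAIN + "(?:\\." + DOMAIN + ")*" +            // one or more domain elements ...
  "\\.[A-Za-z]{2,3})" +                                // ... followed by a 2- or 3-letter TLD
  "(?=\\s|$)",                                         // then whitespace or end of input
  "g"
);

// Return all substrings of text that match the simplified address pattern.
function extractEmails(text) {
  return Array.from(text.matchAll(EMAIL), (m) => m[0]);
}

const text = "g2g - alice@wonderland.org cheshire.d'cat@disappear.net\n" +
             "sh@rt.one is good but sh@rt.1 is bad\n" +
             "a.@a@b@blah.tv is not valid, please pick out the valid part\n" +
             "invalid@.this.is , young padawan.";

console.log(extractEmails(text).join("\n"));
</source>

Like the Perl version, this reports only the valid tail of a malformed string such as <code>a.@a@b@blah.tv</code>.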
 
 
</div>
 
</div>
 
</div>
 
</div>
 
 
 
===Multiple sequence alignment===
 
 
<div class="mw-collapsible mw-collapsed exercise-box" data-expandtext="Hint" data-collapsetext="Collapse">
 
Write a program in a language of your choice that extracts the multi-line sequences from a CLUSTAL or MSF formatted multiple sequence alignment and concatenates them into single sequences.
 
 
<div class="mw-collapsible mw-collapsed exercise-box" data-expandtext="Expand" data-collapsetext="Collapse" style="background-color:#EEEEF9;">
 
Sample input data ...
 
<div class="mw-collapsible-content" style="padding:10px;">
 
;CLUSTAL formatted alignment:
 
CLUSTAL multiple sequence alignment by MUSCLE (3.8)
 
 
 
SOK2_SACCE      --NGISVVRRADNDMVNGTKLLN-----VTKMTRGRRDGILKAEKIR----------HVV
 
PHD1_SACCE      --NGISVVRRADNNMINGTKLLN-----VTKMTRGRRDGILRSEKVR----------EVV
 
KILA_ESCCO      -IDGEIIHLRAKDGYINATSMCRTAGKLLSDYTRLKTTQEFFDELSRDMGIPISELIQSF
 
MBP1_SACCE      IHSTGSIMKRKKDDWVNATHILK-----AANFAKAKRTRILEKEVLKETH-------EKV
 
SWI4_SACCE      ---TKIVMRRTKDDWINITQVFK-----IAQFSKTKRTKILEKESNDMQH-------EKV
 
                      :  * .:. :* * : .      :. :. .    :  *              .
 
 
SOK2_SACCE      KIGSMHLKGVWIPFERALAIAQREKI-
 
PHD1_SACCE      KIGSMHLKGVWIPFERAYILAQREQI-
 
KILA_ESCCO      KGGRPENQGTWVHPDIAINLAQ-----
 
MBP1_SACCE      QGGFGKYQGTWVPLNIAKQLAEKFSVY
 
SWI4_SACCE      QGGYGRFQGTWIPLDSAKFLVNKYEI-
 
 
 
----
 
 
;MSF formatted alignment:
 
PileUp
 
 
  MSF: 87  Type: A  Check: 0000  ..
 
 
  Name: SOK2_SACCE  Len: 87  Check:  9836  Weight: 0.160458
 
  Name: PHD1_SACCE  Len: 87  Check:  2117  Weight: 0.160458
 
  Name: KILA_ESCCO  Len: 87  Check:  6044  Weight: 0.256296
 
  Name: MBP1_SACCE  Len: 87  Check:  4979  Weight: 0.211395
 
  Name: SWI4_SACCE  Len: 87  Check:  5197  Weight: 0.211395
 
 
//
 
 
SOK2_SACCE    ..NGISVVRR ADNDMVNGTK LLN.....VT KMTRGRRDGI LKAEKIR...
 
PHD1_SACCE    ..NGISVVRR ADNNMINGTK LLN.....VT KMTRGRRDGI LRSEKVR...
 
KILA_ESCCO    .IDGEIIHLR AKDGYINATS MCRTAGKLLS DYTRLKTTQE FFDELSRDMG
 
MBP1_SACCE    IHSTGSIMKR KKDDWVNATH ILK.....AA NFAKAKRTRI LEKEVLKETH
 
SWI4_SACCE    ...TKIVMRR TKDDWINITQ VFK.....IA QFSKTKRTKI LEKESNDMQH
 
 
SOK2_SACCE    .......HVV KIGSMHLKGV WIPFERALAI AQREKI.
 
PHD1_SACCE    .......EVV KIGSMHLKGV WIPFERAYIL AQREQI.
 
KILA_ESCCO    IPISELIQSF KGGRPENQGT WVHPDIAINL AQ.....
 
MBP1_SACCE    .......EKV QGGFGKYQGT WVPLNIAKQL AEKFSVY
 
SWI4_SACCE    .......EKV QGGYGRFQGT WIPLDSAKFL VNKYEI.
 
 
</div>
 
</div>
 
 
 
<div class="mw-collapsible-content exercise-box">
 
<div class="mw-collapsible mw-collapsed" data-expandtext="Solution" data-collapsetext="Collapse">
 
Write a regex for a valid sequence line. Capture the ID part and the sequence part separately. Use the ID part as a key to a hash, and add the sequence part to the value for that key.
 
 
<div class="mw-collapsible-content exercise-box">
 
;Perl example:
 
:This code uses a regex that recognizes both CLUSTAL and MSF formats:
 
:<code>^(\w+) {2,}([A-Za-z.\- ]+)$</code>
 
:Capture a sequence of word characters, followed by at least two consecutive blank spaces, and capture a sequence of alphabetic characters, gap characters (<code>-</code> or <code>.</code>) or spaces until the end of line. Note that <code>-</code> needs to be escaped (<code>\-</code>) since it otherwise denotes a character range in the context of a character class (i.e. within square brackets). Lines that contain numerals or special characters in the sequence part fail the match, as do lines that begin with spaces. <small>Caution: this may not be fully compliant with the format specification.</small>
 
 
<source lang="perl">
 
#!/usr/bin/perl
 
use strict;
 
use warnings;
 
 
my %MSA; # Hash to store the MSA
 
while (my $line = <STDIN>) {
    if ($line =~ m/^(\w+) {2,}([A-Za-z.\- ]+)$/) {
        my $k = $1;      # save special variables so they don't get mangled before using them
        my $v = $2;
        $v =~ s/\s//g;   # remove blanks in case there are any
        $MSA{$k} .= $v;  # "." is the perl string concatenation operator
    }
}

# Done. Now do something with the sequences ...
foreach my $k (keys(%MSA)) {
    print("$k: $MSA{$k}\n");
}
 
 
exit();
 
 
</source>
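The same approach, sketched in JavaScript: a <code>Map</code> takes the role of the Perl hash and also keeps the sequences in the order in which their IDs first appear. The function name <code>concatenateAlignment</code> is made up for this illustration, and the same caution about format compliance applies.

<source lang="javascript">
// Collect the sequence chunks of each ID and join them into one string per ID.
function concatenateAlignment(text) {
  const sequences = new Map();
  for (const line of text.split(/\r?\n/)) {
    // ID, at least two blanks, then letters, gap characters or blanks up to the end of line
    const m = line.match(/^(\w+) {2,}([A-Za-z.\- ]+)$/);
    if (m === null) continue;                    // header, consensus and empty lines
    const id = m[1];
    const chunk = m[2].replace(/\s/g, "");       // remove blanks (MSF blocks of ten)
    sequences.set(id, (sequences.get(id) || "") + chunk);
  }
  return sequences;
}

const alignment = [
  "CLUSTAL multiple sequence alignment by MUSCLE (3.8)",
  "",
  "SOK2_SACCE      --NGISVVRR",
  "PHD1_SACCE      --NGISVVRR",
  "                  ******* ",
  "",
  "SOK2_SACCE      KIGSMHLKGV-",
  "PHD1_SACCE      KIGSMHLKGI-"
].join("\n");

for (const [id, seq] of concatenateAlignment(alignment)) {
  console.log(id + ": " + seq);
}
</source>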
 
 
</div>
 
</div>
 
</div>
 
</div>
 
 
===Screenscraping===
 
<div class="mw-collapsible mw-collapsed exercise-box" data-expandtext="Hint" data-collapsetext="Collapse">
 
Write a PHP program that screenscrapes resolution data for a protein structure file from the PDB.
 
 
Here is a [http://www.pdb.org/pdb/explore/explore.do?structureId=2imm '''link to a PDB record'''] to illustrate the URL format.
 
 
<div class="mw-collapsible-content exercise-box">
 
<div class="mw-collapsible mw-collapsed" data-expandtext="Solution" data-collapsetext="Collapse" style="width:90%; padding:10px; margin:5px; border:solid 1px #99999;">
 
Fetch the contents of the URL into a string. Use a regex that captures the data you want to retrieve as part of some unique pattern in the HTML source. Print.
 
 
<div class="mw-collapsible-content exercise-box">
 
;The regex:
 
:<code>/<td id="se_xrayResolution">\s*(\d+\.\d+)/</code>
 
 
*<code><td id="se_xrayResolution"></code>&nbsp;&nbsp;&nbsp;<small>identifying tag for the information we are looking for, ...</small>
 
*<code>\s*</code>&nbsp;&nbsp;&nbsp;<small>... probably followed by whitespace, ...</small>
 
*<code>(\d+\.\d+)</code>&nbsp;&nbsp;&nbsp;<small>... the "payload" of the match: one or more digits, a literal dot and one or more digits.</small>
 
 
;The code:
 
<source lang="PHP">
 
<?php
 
$URLpath = "http://www.pdb.org/pdb/explore/explore.do?structureId=";
 
$PDBid = "2imm";
 
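$source = file_get_contents($URLpath . $PDBid);

// print the resolution only if the page was fetched and the pattern matched
if ($source !== false &&
    preg_match('/<td id="se_xrayResolution">\s*(\d+\.\d+)/', $source, $resolution)) {
    print($resolution[1]);
}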
 
?>
 
</source>
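The capture step by itself can be tried without a network connection. The following JavaScript sketch applies the same regex to an HTML string; the function name <code>extractResolution</code> and the sample HTML are made up for this illustration, and the <code>se_xrayResolution</code> id is taken from the solution above (the live PDB pages may not contain it).

<source lang="javascript">
// Return the captured resolution as a string, or null if the pattern is absent.
function extractResolution(html) {
  const m = html.match(/<td id="se_xrayResolution">\s*(\d+\.\d+)/);
  return m === null ? null : m[1];     // m[1] is the first capture group
}

const page = '<tr><td>Resolution</td><td id="se_xrayResolution">\n  2.00 </td></tr>';
console.log(extractResolution(page));
</source>

Checking for <code>null</code> before using the capture group matters here, because screenscraping breaks silently whenever the page layout changes.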
 
</div>
 
</div>
 
</div>
 
</div>
 
 
===Labeling===
 
<div class="mw-collapsible mw-collapsed  exercise-box" data-expandtext="Hint" data-collapsetext="Collapse">
 
Write an '''R''' script that creates ''meaningful'' labels for data elements from metadata and shows them in a plot. Use the sample data below - or any other data you are interested in.
 
 
 
 
<div class="mw-collapsible mw-collapsed exercise-box" data-expandtext="Expand" data-collapsetext="Collapse" style="background-color:#EEEEF9;">
 
Sample input data from GEO, and task description ...
 
<div class="mw-collapsible-content">
 
These data were downloaded from the NCBI GEO database using the GEO2R tool. The dataset is a microarray expression study that compares tumor and metastasis tissue; you can access it [http://www.ncbi.nlm.nih.gov/geo/geo2r/?acc=GSE42952 '''here''']. Grouping primary PDAC (pancreatic ductal adenocarcinoma) as "tumor" and liver/peritoneal metastasis as "metastasis", an '''R''' script on the server calculates significantly differentially expressed genes using the [http://www.bioconductor.org/packages/2.12/bioc/html/limma.html Bioconductor limma package]. I have selected the top 100 genes and would now like to plot significance (adjusted P value) vs. level of differential expression (logFC). Moreover, I would like to roughly identify the function of each gene, if that is discernible from the "Gene title".
 
 
<source lang="text">
 
"ID" "adj.P.Val" "P.Value" "t" "B" "logFC" "Gene.symbol" "Gene.title"
 
"238376_at" "3.69e-19" "4.53e-23" "-49.138515" "42.43328" "-2.202043" "LOC100505564///DEXI" "uncharacterized LOC100505564///Dexi homolog (mouse)"
 
"214041_x_at" "2.36e-17" "8.74e-21" "38.089228" "37.60995" "4.541989" "RPL37A" "ribosomal protein L37a"
 
"241662_x_at" "2.36e-17" "1.03e-20" "-37.793765" "37.45851" "-2.105123" "" ""
 
"231628_s_at" "2.36e-17" "1.16e-20" "-37.574182" "37.34507" "-1.97516" "SERPINB6" "serpin peptidase inhibitor, clade B (ovalbumin), member 6"
 
"224760_at" "3.23e-17" "2.10e-20" "36.500909" "36.77932" "3.798724" "SP1" "Sp1 transcription factor"
 
"214149_s_at" "3.23e-17" "2.38e-20" "36.282193" "36.66167" "4.246787" "ATP6V0E1" "ATPase, H+ transporting, lysosomal 9kDa, V0 subunit e1"
 
"243177_at" "4.15e-17" "3.57e-20" "-35.573827" "36.275" "-1.801709" "" ""
 
"243800_at" "5.63e-17" "5.52e-20" "-34.825113" "35.85663" "-2.018088" "NR1H4" "nuclear receptor subfamily 1, group H, member 4"
 
"238398_s_at" "1.10e-16" "1.21e-19" "-33.519208" "35.10201" "-2.245806" "" ""
 
"1569856_at" "1.48e-16" "1.82e-19" "-32.860752" "34.70891" "-1.810438" "TPP2" "tripeptidyl peptidase II"
 
"1555116_s_at" "1.51e-16" "2.14e-19" "-32.598656" "34.55" "-1.990665" "SLC11A1" "solute carrier family 11 (proton-coupled divalent metal ion transporters), member 1"
 
"218733_at" "1.51e-16" "2.23e-19" "32.535823" "34.51169" "2.764663" "MSL2" "male-specific lethal 2 homolog (Drosophila)"
 
"201225_s_at" "2.72e-16" "4.33e-19" "31.497695" "33.86667" "3.447828" "SRRM1" "serine/arginine repetitive matrix 1"
 
"217052_x_at" "4.45e-16" "7.64e-19" "30.636232" "33.31345" "1.601527" "" ""
 
"1569348_at" "5.24e-16" "9.65e-19" "-30.289176" "33.08577" "-1.793925" "TPTEP1" "transmembrane phosphatase with tensin homology pseudogene 1"
 
"219492_at" "6.96e-16" "1.37e-18" "29.777415" "32.74483" "3.586919" "CHIC2" "cysteine-rich hydrophobic domain 2"
 
"215047_at" "7.51e-16" "1.58e-18" "-29.567379" "32.60307" "-2.033635" "TRIM58" "tripartite motif containing 58"
 
"232877_at" "7.51e-16" "1.66e-18" "-29.491388" "32.55151" "-1.65225" "" ""
 
"229265_at" "7.51e-16" "1.75e-18" "29.419139" "32.50236" "3.933071" "SKI" "v-ski sarcoma viral oncogene homolog (avian)"
 
"1553842_at" "8.16e-16" "2.00e-18" "-29.226409" "32.37061" "-1.832581" "BEND2" "BEN domain containing 2"
 
"220791_x_at" "1.11e-15" "2.87e-18" "-28.71601" "32.01715" "-1.969381" "SCN11A" "sodium channel, voltage-gated, type XI, alpha subunit"
 
"212911_at" "1.17e-15" "3.15e-18" "28.584094" "31.92471" "2.143175" "DNAJC16" "DnaJ (Hsp40) homolog, subfamily C, member 16"
 
"243464_at" "1.22e-15" "3.43e-18" "-28.463254" "31.83963" "-1.675747" "" ""
 
"243823_at" "1.30e-15" "3.81e-18" "-28.316669" "31.7359" "-1.499823" "" ""
 
"201533_at" "1.56e-15" "4.80e-18" "27.999089" "31.5092" "4.054743" "CTNNB1" "catenin (cadherin-associated protein), beta 1, 88kDa"
 
"210878_s_at" "1.59e-15" "5.06e-18" "27.927536" "31.45775" "2.982033" "KDM3B" "lysine (K)-specific demethylase 3B"
 
"227712_at" "3.18e-15" "1.05e-17" "26.938855" "30.73223" "2.426311" "LYRM2" "LYR motif containing 2"
 
"228520_s_at" "3.56e-15" "1.22e-17" "26.742683" "30.58495" "3.744881" "APLP2" "amyloid beta (A4) precursor-like protein 2"
 
"210242_x_at" "3.80e-15" "1.36e-17" "26.605262" "30.48111" "1.815311" "ST20" "suppressor of tumorigenicity 20"
 
"217301_x_at" "3.80e-15" "1.40e-17" "26.565414" "30.45089" "3.275566" "RBBP4" "retinoblastoma binding protein 4"
 
"1557551_at" "6.17e-15" "2.35e-17" "-25.892664" "29.93351" "-1.78824" "" ""
 
"201392_s_at" "6.17e-15" "2.42e-17" "25.856344" "29.90519" "3.283483" "IGF2R" "insulin-like growth factor 2 receptor"
 
"210371_s_at" "7.18e-15" "2.91e-17" "25.62344" "29.72255" "3.463431" "RBBP4" "retinoblastoma binding protein 4"
 
"204252_at" "9.08e-15" "3.79e-17" "25.291186" "29.45902" "2.789842" "CDK2" "cyclin-dependent kinase 2"
 
"243200_at" "1.04e-14" "4.48e-17" "-25.082134" "29.29138" "-1.539093" "" ""
 
"201140_s_at" "1.16e-14" "5.13e-17" "24.916407" "29.15746" "2.834707" "RAB5C" "RAB5C, member RAS oncogene family"
 
"1559066_at" "1.23e-14" "5.57e-17" "-24.813534" "29.07387" "-1.595061" "" ""
 
"201123_s_at" "1.27e-14" "5.91e-17" "24.741268" "29.01494" "4.870779" "EIF5A" "eukaryotic translation initiation factor 5A"
 
"218291_at" "1.41e-14" "6.83e-17" "24.565645" "28.87099" "2.605328" "LAMTOR2" "late endosomal/lysosomal adaptor, MAPK and MTOR activator 2"
 
"217704_x_at" "1.41e-14" "6.91e-17" "-24.550405" "28.85845" "-1.711476" "SUZ12P1" "suppressor of zeste 12 homolog pseudogene 1"
 
"227338_at" "1.44e-14" "7.22e-17" "-24.498114" "28.81536" "-2.927581" "LOC440983" "hypothetical gene supported by BC066916"
 
"210231_x_at" "1.64e-14" "8.47e-17" "24.305184" "28.65556" "4.548338" "SET" "SET nuclear oncogene"
 
"225289_at" "1.86e-14" "9.82e-17" "24.127523" "28.50726" "3.062123" "STAT3" "signal transducer and activator of transcription 3 (acute-phase response factor)"
 
"204658_at" "1.93e-14" "1.04e-16" "24.056703" "28.44783" "2.868797" "TRA2A" "transformer 2 alpha homolog (Drosophila)"
 
"208819_at" "2.54e-14" "1.40e-16" "23.705016" "28.15009" "2.593365" "RAB8A" "RAB8A, member RAS oncogene family"
 
"210011_s_at" "2.58e-14" "1.46e-16" "23.660126" "28.11176" "2.309763" "EWSR1" "EWS RNA-binding protein 1"
 
"202397_at" "2.58e-14" "1.48e-16" "23.638422" "28.0932" "4.332132" "NUTF2" "nuclear transport factor 2"
 
"1552628_a_at" "2.86e-14" "1.68e-16" "23.492249" "27.96778" "2.892763" "HERPUD2" "HERPUD family member 2"
 
"233757_x_at" "3.85e-14" "2.31e-16" "23.123802" "27.64812" "2.430056" "" ""
 
"201545_s_at" "5.07e-14" "3.16e-16" "22.767216" "27.33385" "2.568005" "PABPN1" "poly(A) binding protein, nuclear 1"
 
"1562463_at" "5.07e-14" "3.17e-16" "-22.763883" "27.33089" "-1.119718" "" ""
 
"219859_at" "5.41e-14" "3.45e-16" "-22.669239" "27.24664" "-1.787549" "CLEC4E" "C-type lectin domain family 4, member E"
 
"1569136_at" "6.91e-14" "4.50e-16" "-22.372385" "26.98011" "-1.95396" "MGAT4A" "mannosyl (alpha-1,3-)-glycoprotein beta-1,4-N-acetylglucosaminyltransferase, isozyme A"
 
"208601_s_at" "7.15e-14" "4.74e-16" "-22.314594" "26.92781" "-1.323653" "TUBB1" "tubulin, beta 1 class VI"
 
"226194_at" "1.11e-13" "7.47e-16" "21.813583" "26.46872" "2.331245" "CHAMP1" "chromosome alignment maintaining phosphoprotein 1"
 
"217877_s_at" "1.15e-13" "7.93e-16" "21.748093" "26.40795" "2.862688" "GPBP1L1" "GC-rich promoter binding protein 1-like 1"
 
"225371_at" "1.25e-13" "8.73e-16" "21.644444" "26.31139" "2.518013" "GLE1" "GLE1 RNA export mediator homolog (yeast)"
 
"1563431_x_at" "1.44e-13" "1.02e-15" "21.472848" "26.15053" "1.874743" "CALM3" "calmodulin 3 (phosphorylase kinase, delta)"
 
"211505_s_at" "1.45e-13" "1.06e-15" "21.437744" "26.11746" "2.642609" "STAU1" "staufen double-stranded RNA binding protein 1"
 
"201585_s_at" "1.45e-13" "1.07e-15" "21.430113" "26.11027" "2.787833" "SFPQ" "splicing factor proline/glutamine-rich"
 
"225197_at" "1.75e-13" "1.31e-15" "21.212989" "25.90451" "2.845005" "" ""
 
"220336_s_at" "1.83e-13" "1.41e-15" "-21.132294" "25.82752" "-1.848273" "GP6" "glycoprotein VI (platelet)"
 
"216515_x_at" "1.83e-13" "1.42e-15" "21.128023" "25.82343" "2.877477" "MIR1244-2///MIR1244-3///MIR1244-1///PTMAP5///PTMA" "microRNA 1244-2///microRNA 1244-3///microRNA 1244-1///prothymosin, alpha pseudogene 5///prothymosin, alpha"
 
"241773_at" "3.49e-13" "2.74e-15" "-20.441442" "25.15639" "-1.835223" "" ""
 
"1558011_at" "3.89e-13" "3.15e-15" "-20.297118" "25.01342" "-1.577874" "LOC100510697" "putative POM121-like protein 1-like"
 
"215240_at" "3.89e-13" "3.15e-15" "-20.29699" "25.01329" "-1.613308" "ITGB3" "integrin, beta 3 (platelet glycoprotein IIIa, antigen CD61)"
 
"233746_x_at" "3.95e-13" "3.25e-15" "20.265986" "24.98245" "2.364699" "HYPK///SERF2" "huntingtin interacting protein K///small EDRK-rich factor 2"
 
"1555338_s_at" "4.10e-13" "3.42e-15" "-20.214797" "24.93143" "-1.280803" "AQP10" "aquaporin 10"
 
"217714_x_at" "4.12e-13" "3.48e-15" "20.195128" "24.91179" "2.247023" "STMN1" "stathmin 1"
 
"202276_at" "4.75e-13" "4.08e-15" "20.035595" "24.75183" "2.654202" "SHFM1" "split hand/foot malformation (ectrodactyly) type 1"
 
"225414_at" "6.34e-13" "5.52e-15" "19.733786" "24.44585" "3.287225" "RNF149" "ring finger protein 149"
 
"243930_x_at" "7.43e-13" "6.64e-15" "-19.55046" "24.2578" "-1.219467" "" ""
 
"1569263_at" "7.43e-13" "6.66e-15" "-19.548534" "24.25581" "-1.662363" "" ""
 
"1554876_a_at" "8.55e-13" "7.77e-15" "-19.397142" "24.09923" "-1.388081" "S100Z" "S100 calcium binding protein Z"
 
"220001_at" "1.08e-12" "9.97e-15" "-19.15375" "23.84505" "-1.412727" "PADI4" "peptidyl arginine deiminase, type IV"
 
"228170_at" "1.12e-12" "1.05e-14" "-19.106672" "23.79554" "-1.840114" "OLIG1" "oligodendrocyte transcription factor 1"
 
"211445_x_at" "1.29e-12" "1.22e-14" "-18.959325" "23.63981" "-1.134266" "NACAP1" "nascent-polypeptide-associated complex alpha polypeptide pseudogene 1"
 
"1555311_at" "1.33e-12" "1.27e-14" "-18.91869" "23.59666" "-1.45603" "" ""
 
"201643_x_at" "1.47e-12" "1.43e-14" "18.808994" "23.47974" "1.867155" "KDM3B" "lysine (K)-specific demethylase 3B"
 
"216449_x_at" "1.51e-12" "1.48e-14" "18.773094" "23.44134" "3.178009" "HSP90B1" "heat shock protein 90kDa beta (Grp94), member 1"
 
"218680_x_at" "1.51e-12" "1.50e-14" "18.763896" "23.43149" "2.262739" "HYPK///SERF2" "huntingtin interacting protein K///small EDRK-rich factor 2"
 
"225954_s_at" "1.65e-12" "1.67e-14" "18.662853" "23.32298" "2.405388" "MIDN" "midnolin"
 
"203102_s_at" "1.65e-12" "1.68e-14" "18.658192" "23.31796" "2.476697" "MGAT2" "mannosyl (alpha-1,6-)-glycoprotein beta-1,2-N-acetylglucosaminyltransferase"
 
"1569345_at" "1.69e-12" "1.74e-14" "18.624203" "23.28133" "1.236884" "" ""
 
"214001_x_at" "1.71e-12" "1.78e-14" "18.598496" "23.25358" "2.570012" "" ""
 
"231812_x_at" "1.72e-12" "1.81e-14" "18.583236" "23.2371" "1.678685" "PHAX" "phosphorylated adaptor for RNA export"
 
"232075_at" "1.93e-12" "2.06e-14" "-18.462717" "23.10643" "-2.150701" "WDR61" "WD repeat domain 61"
 
"200669_s_at" "1.96e-12" "2.12e-14" "18.438729" "23.08033" "1.891968" "UBE2D3" "ubiquitin-conjugating enzyme E2D 3"
 
"236995_x_at" "2.04e-12" "2.23e-14" "-18.389604" "23.02677" "-1.879369" "TFEC" "transcription factor EC"
 
"218008_at" "2.24e-12" "2.48e-14" "18.291537" "22.91946" "2.445428" "TMEM248" "transmembrane protein 248"
 
"217140_s_at" "2.30e-12" "2.56e-14" "18.260017" "22.88485" "3.983721" "VDAC1" "voltage-dependent anion channel 1"
 
"210183_x_at" "2.46e-12" "2.79e-14" "18.183339" "22.80044" "1.79105" "PNN" "pinin, desmosome associated protein"
 
"216954_x_at" "2.46e-12" "2.80e-14" "-18.177967" "22.79451" "-1.090193" "ATP5O" "ATP synthase, H+ transporting, mitochondrial F1 complex, O subunit"
 
"207688_s_at" "2.53e-12" "2.92e-14" "18.141153" "22.75385" "2.492309" "INHBC" "inhibin, beta C"
 
"218020_s_at" "2.63e-12" "3.06e-14" "18.095669" "22.70351" "1.772689" "ZFAND3" "zinc finger, AN1-type domain 3"
 
"217756_x_at" "3.12e-12" "3.67e-14" "17.930201" "22.51939" "1.914366" "SERF2" "small EDRK-rich factor 2"
 
"214150_x_at" "3.42e-12" "4.07e-14" "-17.835551" "22.41336" "-1.177963" "ATP6V0E1" "ATPase, H+ transporting, lysosomal 9kDa, V0 subunit e1"
 
"208750_s_at" "3.48e-12" "4.18e-14" "17.812279" "22.38721" "2.649599" "ARF1" "ADP-ribosylation factor 1"
 
"201749_at" "3.59e-12" "4.42e-14" "17.761415" "22.32994" "1.917794" "ECE1" "endothelin converting enzyme 1"
 
</source>
 
</div>
 
</div>
 
 
 
 
 
<div class="mw-collapsible-content  exercise-box">
 
<div class="mw-collapsible mw-collapsed" data-expandtext="Solution" data-collapsetext="Collapse">
 
Read the data into '''R'''. Plot log(P) against log(FC). Define some regular expressions that identify keywords in the gene title: things like "X-ase", "Y factor", "Z gene" etc. Apply these to the gene titles using {{R|regex()||regexpr()}} and store the results by applying {{R|regmatches()}} to the text. Then use {{R|graphics|text()}} to plot the extracted strings.
 
 
 
<div class="mw-collapsible-content  exercise-box">
 
 
<source lang="R">
 
#GEO-hits.R
 
# bs - Sept. 2013
 
 
dat <- read.table("GEO-hits_100.txt", header = TRUE) # this is a file of GEO
 
                                                    # differential expression data
 
head(dat)
 
 
plot(-log(dat[,"adj.P.Val"]), dat[,"logFC"], cex=0.7, pch=16, col="#BB0000")
 
# Note that all these genes have at least one log of
 
# differential expression - up or down. As a trend,
 
# higher probabilities are found for higher levels of
 
# differential expression.
 
 
# In R versions before 4.0.0, read.table() by default
# converts all character-containing columns to _factors_.
# To process them as strings, we make sure they are
# of type character.
 
 
dat[,"Gene.title"] <- as.character(dat[,"Gene.title"])
 
 
# First, let's define some regexes for keywords to guess
 
# a function ...
 
 
# (Note the need for doubled escape characters in R!)
 
 
r <- c(  "\\b(\\w+ase)\\b")  # peptidase, kinase ...
 
r <- c(r, "\\b(?!factor)(\\w+or)") # suppressor, adaptor ...
 
r <- c(r, "\\b(\\w+)\\b\\s(factor|protein|homolog)") # the preceding word ...
 
 
 
# Now iterate over the Gene.title column and for each row try all regular
 
# expressions.
 
 
for (i in 1:nrow(dat)) {                  # for all rows ...
  dat[i, "Function.guess"] <- ""          # clear the contents of the column
  for (j in 1:length(r)) {                # for all regular expressions
    M <- regexpr(r[j], dat[i, "Gene.title"], perl = TRUE)
    if (M[1] > 0) {
      dat[i, "Function.guess"] <- regmatches(dat[i, "Gene.title"], M)
      break                               # stop regexing if something was found
    }
  }
}
 
 
dat[,"Function.guess"] # check what we found ...
 
# ... and plot the strings to the right of its point.
 
text(-log(dat[,"adj.P.Val"]), dat[,"logFC"], dat[,"Function.guess"], cex=0.4, pos=4)
 
 
# I'm not sure we are actually learning anything important from this.
 
# But the code was merely meant to illustrate how
 
# to work with regular expressions in R (and introduce you to GEO
 
# differential expression data on the side). Mission accomplished.
 
 
</source>
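The "try each pattern in turn, keep the first hit" logic of the loop does not depend on R. Here is a minimal JavaScript sketch of that step only, without reading the file or plotting; the function name <code>guessFunction</code> is made up for this illustration. In a JavaScript regex literal, backslashes are not doubled.

<source lang="javascript">
// The three patterns from the R script, in order of priority.
const patterns = [
  /\b(\w+ase)\b/,                          // peptidase, kinase ...
  /\b(?!factor)(\w+or)/,                   // suppressor, adaptor ... but not "factor"
  /\b(\w+)\b\s(factor|protein|homolog)/    // the preceding word plus the keyword
];

// Return the text matched by the first pattern that hits, or "" if none does.
function guessFunction(geneTitle) {
  for (const p of patterns) {
    const m = geneTitle.match(p);
    if (m !== null) return m[0];           // whole match, like regmatches() in R
  }
  return "";
}

console.log(guessFunction("cyclin-dependent kinase 2"));
console.log(guessFunction("Sp1 transcription factor"));
console.log(guessFunction("suppressor of tumorigenicity 20"));
</source>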
 
 
 
</div>
 
</div>
 
</div>
 
</div>
 
 
 
 
 
 
<section end=exercises />
 
 
 
 
 
&nbsp;
 
  
 
==Appendix I: Metacharacters and their meaning==
 
Line 1,485: Line 887:
 
</table>
 
</table>
  
 
+
{{Vspace}}
  
 
==Appendix II: Character classes and their meaning==
 
Line 1,508: Line 910:
 
</table>
 
</table>
  
 
+
{{Vspace}}
  
 
==Appendix III: Anchor codes and their meaning==
 
Line 1,525: Line 927:
 
</table>
 
</table>
  
 
+
{{Vspace}}
  
 
==Appendix IV: Modifiers and their meaning==
 
Line 1,538: Line 940:
 
<tr><td><code>s</code></td><td>Treat the whole string as a single line, i.e. don't treat "\n" as line separators. For example, /(&lt;table&gt;.*?&lt;/table&gt;)/s captures an entire table, including newline characters. Without the modifier nothing would match if there is even a single newline in between the tags.</td></tr>
 
 
</table>
 
</table>
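The effect of the <code>s</code> modifier can be seen in a short JavaScript sketch, where the same flag is available (there it is called "dotAll"). The function name <code>firstTable</code> and the sample HTML are made up for this illustration.

<source lang="javascript">
// Capture the first <table> ... </table> block, with or without the "s" flag.
function firstTable(html, dotAll) {
  const pattern = dotAll ? /<table>.*?<\/table>/s : /<table>.*?<\/table>/;
  const m = html.match(pattern);
  return m === null ? null : m[0];
}

const html = "<p>intro</p>\n<table>\n<tr><td>1</td></tr>\n</table>\n<p>end</p>";

console.log(firstTable(html, true));    // the table, including its newlines
console.log(firstTable(html, false));   // null: "." does not match "\n"
</source>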
 
 
 
 
====A Brief First Encounter of Regular Expressions====
 
  
 
{{Vspace}}
 
{{Vspace}}
  
;Regular expressions are a concise description language to define patterns for pattern-matching in strings.
+
== Further reading, links and resources ==
 
 
Truth be told, many programmers have a love-hate relationship with regular expressions. The syntax of regular expressions is very powerful and expressive, but also terse, not always intuitive, and sometimes hard to understand. I'll introduce a few straightforward principles here that will probably cover 99% of the cases you will encounter.
 
 
 
Here is our test-case: the sequence of Mbp1, copied from the [https://www.ncbi.nlm.nih.gov/protein/NP_010227 NCBI Protein database page for yeast Mbp1].
 
 
 
        1 msnqiysary sgvdvyefih stgsimkrkk ddwvnathil kaanfakakr trilekevlk
 
      61 ethekvqggf gkyqgtwvpl niakqlaekf svydqlkplf dftqtdgsas pppapkhhha
 
      121 skvdrkkair sastsaimet krnnkkaeen qfqsskilgn ptaaprkrgr pvgstrgsrr
 
      181 klgvnlqrsq sdmgfprpai pnssisttql psirstmgpq sptlgileee rhdsrqqqpq
 
      241 qnnsaqfkei dledglssdv epsqqlqqvf nqntgfvpqq qssliqtqqt esmatsvsss
 
      301 pslptspgdf adsnpfeerf pgggtspiis miprypvtsr pqtsdindkv nkylsklvdy
 
      361 fisnemksnk slpqvllhpp phsapyidap idpelhtafh wacsmgnlpi aealyeagts
 
      421 irstnsqgqt plmrsslfhn sytrrtfpri fqllhetvfd idsqsqtvih hivkrksttp
 
      481 savyyldvvl skikdfspqy rielllntqd kngdtalhia skngdvvffn tlvkmgaltt
 
      541 isnkegltan eimnqqyeqm miqngtnqhv nssntdlnih vntnnietkn dvnsmvimsp
 
      601 vspsdyityp sqiatnisrn ipnvvnsmkq masiyndlhe qhdneikslq ktlksisktk
 
      661 iqvslktlev lkesskdeng eaqtnddfei lsrlqeqntk klrkrliryk rlikqkleyr
 
      721 qtvllnklie detqattnnt vekdnntler lelaqeltml qlqrknklss lvkkfednak
 
      781 ihkyrriire gtemnieevd ssldvilqtl iannnknkga eqiitisnan sha
 
//
 
  
 
{{task|1=
 
 
Navigate to http://regexpal.com and paste the sequence into the '''lower''' box. This site is one of a number of online regular expression testers; their immediate, visual feedback is invaluable when you are developing regular expression patterns.
 
 
Let's try some expressions:
 
 
;Most characters are matched literally.
 
:Type "<code>a</code>" into the '''upper''' box and you will see all "<code>a</code>" characters matched. Then replace <code>a</code> with <code>q</code>.
 
: Now type "<code>aa</code>" instead. Then <code>krnnkk</code>. ''Sequences'' of characters are also matched literally.
 
 
;The pipe character {{pipe}} that symbolizes logical OR can be used to define that more than one character should match:
 
:<code>i(s{{pipe}}m{{pipe}}q)n</code> matches <code>isn</code> OR <code>imn</code> OR <code>iqn</code>. Note how we can group with parentheses, and try what would happen without them.
 
 
;We can more conveniently specify more than one character to match if we place it in square brackets.
 
:<code>[lq]</code> matches <code>l</code> OR <code>q</code>. <code>[familyvw]</code> matches hydrophobic amino acids.
 
 
;Within square brackets, we can specify "ranges".
 
:<code>[1-5]</code> matches digits from 1 to 5.
 
 
;Within square brackets, we can specify characters that should NOT be matched, with the caret, <code>^</code>.
 
:<code>[^0-9]</code> matches everything EXCEPT digits. <code>[^a-z]</code> matches everything that is not a lower-case letter. That's what we need (try it).
 
 
}}
 
 
One of the '''R''' functions that uses regular expressions is the function <code>gsub()</code>. It replaces characters that match a "regex" with other characters. That is useful for our purpose: we can
 
#match all characters that are NOT a letter, and
 
#replace them by - nothing: the empty string <code>""</code>.
 
This deletes them.
 
 
{{Vspace}}
 
 
{{task|1 =
 
* study the code in the <code>An excursion into regular expressions</code> section of the '''R''' script
 
}}
 
 
{{Vspace}}
 
 
{{Vspace}}
 
 
 
== Further reading, links and resources ==
 
 
<div class="reference-box">[https://en.wikipedia.org/wiki/Regular_expression Regular expressions (Wikipedia)]</div>
 
<div class="reference-box">[https://en.wikipedia.org/wiki/Regular_expression Regular expressions (Wikipedia)]</div>
 
<div class="reference-box">[http://stat.ethz.ch/R-manual/R-patched/library/base/html/regex.html '''R''' regular expressions]</div>
 
<div class="reference-box">[http://stat.ethz.ch/R-manual/R-patched/library/base/html/regex.html '''R''' regular expressions]</div>
 
<div class="reference-box">[http://regexpal.com/ '''RegexPal''' - a javascript regex tester]</div>
 
<div class="reference-box">[http://regexpal.com/ '''RegexPal''' - a javascript regex tester]</div>
 +
<div class="reference-box">Visit [http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags the stackoverflow thread on regex and HTML parsing]. What's your opinion on the OP's question?</div>
 
<div class="reference-box">[http://xkcd.com/208/ '''XKCD''']</div>
 
<div class="reference-box">[http://xkcd.com/208/ '''XKCD''']</div>
 
{{Vspace}}
 
 
 
== Notes ==
 
<!-- included from "../components/RPR-RegEx.components.wtxt", section: "notes" -->
 
<!-- included from "ABC-unit_components.wtxt", section: "notes" -->
 
<references />
 
 
{{Vspace}}
 
 
 
</div>
 
<div id="ABC-unit-framework">
 
== Self-evaluation ==
 
<!-- included from "../components/RPR-RegEx.components.wtxt", section: "self-evaluation" -->
 
<!--
 
=== Question 1===
 
 
Question ...
 
 
<div class="toccolours mw-collapsible mw-collapsed" style="width:800px">
 
Answer ...
 
<div class="mw-collapsible-content">
 
Answer ...
 
 
</div>
 
  </div>
 
 
  {{Vspace}}
 
 
-->
 
 
{{Vspace}}
 
 
  
  
 
{{Vspace}}
 
{{Vspace}}
  
 
<!-- included from "ABC-unit_components.wtxt", section: "ABC-unit_ask" -->
 
 
----
 
 
{{Vspace}}
 
 
<b>If in doubt, ask!</b> If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.
 
 
----
 
 
{{Vspace}}
 
  
 
<div class="about">
 
<div class="about">
Line 1,674: Line 963:
 
:2017-08-05
 
:2017-08-05
 
<b>Modified:</b><br />
 
<b>Modified:</b><br />
:2017-08-05
+
:2020-09-22
 
<b>Version:</b><br />
 
<b>Version:</b><br />
:0.1
+
:1.2
 
<b>Version history:</b><br />
 
<b>Version history:</b><br />
 +
*1.2 2020 Maintenance, added gsub() capture and backreference
 +
*1.1 Change from require() to requireNamespace() and use &lt;package&gt;::&lt;function&gt;() idiom.
 +
*1.0 First live version, translated from Perl examples in old version
 
*0.1 First stub
 
*0.1 First stub
 
</div>
 
</div>
[[Category:ABC-units]]
 
<!-- included from "ABC-unit_components.wtxt", section: "ABC-unit_footer" -->
 
  
 
{{CC-BY}}
 
{{CC-BY}}
  
 +
[[Category:ABC-units]]
 +
{{UNIT}}
 +
{{LIVE}}
 
</div>
 
</div>
 
<!-- [END] -->
 
<!-- [END] -->

Latest revision as of 09:29, 25 September 2020

Regular Expressions (regex) with R

(Regular expressions)


 


Abstract:

Using regular expressions is a key skill for all aspects of data science, and, by extension, for bioinformatics. There is a bit of a learning curve involved, but once you are even slightly comfortable with their syntax and use in code, you will appreciate their power in automating otherwise unbearably tedious tasks. This unit introduces the principles, the syntax, and provides a fair amount of practice.


Objectives:
This unit will ...

  • ... introduce regular expressions;
  • ... demonstrate their use in R functions;
  • ... teach how to apply them in common tasks.

Outcomes:
After working through this unit you ...

  • ... can express pattern-matching tasks as regular expressions and correctly use a variety of functions that use them;
  • ... are familiar with online regex testing sites that help you troubleshoot your expressions during development;
  • ... have written code that uses regular expressions for a variety of purposes.

Deliverables:

  • Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
  • Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.
  • Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.

  • Prerequisites:
    This unit builds on material covered in the following prerequisite units:


     



     



     


    Evaluation

    Evaluation: NA

    This unit is not evaluated for course marks.

    Contents

    First steps

    A Regular Expression is a specification of a pattern of characters. The typical use of a regular expression is to find occurrences of the pattern in a string. Regular expressions have a flexible syntax that allows them to handle a range of tasks - from trivial substring matches to complex nested motifs. The syntax of regular expressions is a programming language in its own right, and is a powerful way of concisely and uniquely defining a pattern.

    Regular expressions are examples of deterministic pattern matching - they either match a particular expression or not. This is in contrast to probabilistic pattern matching in which a pattern is more or less similar to a query.

    Truth be told, many programmers have a love-hate relationship with regular expressions. The syntax of regular expressions is very powerful and expressive, but also terse, not always intuitive, and sometimes hard to understand. Let's try a few simple things:

    Here is a string to play with: the sequence of Mbp1, copied from the NCBI Protein database page for yeast Mbp1.

           1 msnqiysary sgvdvyefih stgsimkrkk ddwvnathil kaanfakakr trilekevlk
          61 ethekvqggf gkyqgtwvpl niakqlaekf svydqlkplf dftqtdgsas pppapkhhha
         121 skvdrkkair sastsaimet krnnkkaeen qfqsskilgn ptaaprkrgr pvgstrgsrr
         181 klgvnlqrsq sdmgfprpai pnssisttql psirstmgpq sptlgileee rhdsrqqqpq
         241 qnnsaqfkei dledglssdv epsqqlqqvf nqntgfvpqq qssliqtqqt esmatsvsss
         301 pslptspgdf adsnpfeerf pgggtspiis miprypvtsr pqtsdindkv nkylsklvdy
         361 fisnemksnk slpqvllhpp phsapyidap idpelhtafh wacsmgnlpi aealyeagts
         421 irstnsqgqt plmrsslfhn sytrrtfpri fqllhetvfd idsqsqtvih hivkrksttp
         481 savyyldvvl skikdfspqy rielllntqd kngdtalhia skngdvvffn tlvkmgaltt
         541 isnkegltan eimnqqyeqm miqngtnqhv nssntdlnih vntnnietkn dvnsmvimsp
         601 vspsdyityp sqiatnisrn ipnvvnsmkq masiyndlhe qhdneikslq ktlksisktk
         661 iqvslktlev lkesskdeng eaqtnddfei lsrlqeqntk klrkrliryk rlikqkleyr
         721 qtvllnklie detqattnnt vekdnntler lelaqeltml qlqrknklss lvkkfednak
         781 ihkyrriire gtemnieevd ssldvilqtl iannnknkga eqiitisnan sha
    //
    


    Task:
    Navigate to http://regexpal.com and paste the sequence into the lower box. This site is one of a number of online regular expression testers; their immediate, visual feedback is invaluable when you are developing regular expression patterns.

    Let's try some expressions:

    Most characters are matched literally.
    Type "a" into the upper box and you will see all "a" characters matched. Then replace a with q.
    Now type "aa" instead. Then krnnkk. Sequences of characters are also matched literally.
    The pipe character | that symbolizes logical OR can be used to define that more than one character should match
    i(s|m|q)n matches isn OR imn OR iqn. Note how we can group with parentheses, and try what would happen without them.
    We can more conveniently specify more than one character to match if we place it in square brackets. This is a "character class". We will encounter those frequently.
    [lq] matches l OR q. [milcwyf] matches hydrophobic amino acids.
    Within square brackets, we can specify "ranges".
    [1-5] matches digits from 1 to 5.
    Within square brackets, we can specify characters that should NOT be matched, with the "caret", ^.
    [^0-9] matches everything EXCEPT digits. [^a-z] matches everything that is not a lower-case letter. That's what we would need to remove characters that do not represent amino acids. Note that outside of the square brackets the caret means "beginning of the string". When you see a caret, you need to consider its context carefully.


     

    Make frequent use of this site to develop your regular expressions step by step.
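
    The patterns from the task above can also be tried directly in R. The following is a minimal sketch, assuming one line of the Mbp1 listing (the line numbered 541) as a test string; the variable names are chosen here for illustration only.

```r
# One line of the Mbp1 listing, including its position number and blanks.
s <- "541 isnkegltan eimnqqyeqm miqngtnqhv"

# Alternatives grouped with parentheses: "isn" OR "imn" OR "iqn".
hits <- regmatches(s, gregexpr("i(s|m|q)n", s))[[1]]
print(hits)

# Negated character class: delete everything that is not a lower-case letter.
aaOnly <- gsub("[^a-z]", "", s)
print(aaOnly)
```

    Without the parentheses, "is|m|qn" would mean "is" OR "m" OR "qn", which is a different pattern altogether.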


     

    Theory

    According to the Chomsky hierarchy regular expressions are a Type-3 (regular) grammar, thus their use forms a regular language. Therefore, like all Type-3 grammatical expressions they can be decided by a finite-state machine, i.e. a "machine" that is defined by possible states, plus triggering conditions that control transitions between states. Think of such automata as (elaborate) if ... else constructs. The "regex" processor translates the search pattern into such an automaton, which is then applied to the search domain - the string in which the occurrence of the pattern is to be sought.
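
    To make the idea of a finite-state machine concrete, here is a small sketch of a hand-built automaton for the pattern "^ab*c$". The state names and the function runAutomaton() are hypothetical and exist only for this illustration; a real regex engine constructs its automaton internally and in a far more general way.

```r
# Hypothetical transition table for "^ab*c$".
# Each state maps an input character to the next state.
transitions <- list(
  start  = c(a = "seenA"),
  seenA  = c(b = "seenA", c = "accept"),
  accept = character(0)          # no transitions lead out of "accept"
)

runAutomaton <- function(s) {
  state <- "start"
  for (ch in strsplit(s, "")[[1]]) {
    nxt <- transitions[[state]][ch]
    if (is.na(nxt)) {            # no transition defined: reject
      return(FALSE)
    }
    state <- unname(nxt)
  }
  state == "accept"              # accept only if we end in the final state
}

inputs <- c("ac", "abbbc", "abca", "bc", "a")
fromAutomaton <- unname(vapply(inputs, runAutomaton, logical(1)))
fromRegex     <- grepl("^ab*c$", inputs)
print(rbind(fromAutomaton, fromRegex))
```

    Both rows of the printed comparison should agree: the automaton and the regular expression decide the same language.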


     

    What are they good for

    Regular expressions support virtually all pattern matching tasks in data clean-up, extracting information items, data mining, "screen scraping", parsing of files, subsetting large tables, etc. etc. This means, they must be part of your everyday toolkit.


     

    When should they not be used

    Since regular expressions are Type-3 grammars, they must fail when trying to parse more complex grammars - i.e. grammars that can't be expressed in a regular language. This means you can't reliably parse XML - and in particular HTML - with regular expressions. There is a long discussion on this particular topic however, e.g. see here, and many other similar threads on stackoverflow, and see here for a discussion of when regular expressions should not be used. Use a real XML parser instead.


     

    Perl and POSIX

    Two dialects of regular expressions exist; they differ in some details of syntax. One is the nearly universal "Perl" dialect (Perl is a programming language), the other one is the "POSIX" standard that nearly no one uses. Except R. Tragically, in R the POSIX standard is the default. Fortunately this often does not make a difference, and we can explicitly turn this nonsense off. But we need to type perl = TRUE much more often than we would like. Somebody, some time, made a wrong design decision and thousands of wasted man- and woman-hours later we are still stuck with the consequences. If you use regular expressions according to the POSIX standard, you have to learn the Perl standard anyway. But then you can just use the Perl standard in the first place. The Wikipedia page on Regular Expressions has a table with a side-by-side comparison of the different ways the two standards express character classes. Also see the help page on regex in R for details.
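
    One place where the dialects visibly differ is Perl's "lookahead" syntax. The sketch below assumes a toy input vector; it shows the pattern working with perl = TRUE, and wraps the default-dialect call in tryCatch() because the default engine is expected to reject the pattern rather than match it.

```r
s <- c("ab", "ac")

# Lookahead "(?=b)" is Perl syntax: match an "a" only if a "b" follows.
perlResult <- grepl("a(?=b)", s, perl = TRUE)
print(perlResult)

# Same pattern with the default dialect. If the engine raises an error,
# tryCatch() records NA instead of stopping the script.
defaultResult <- tryCatch(suppressWarnings(grepl("a(?=b)", s)),
                          error = function(e) NA)
print(defaultResult)
```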


     

    Regular Expressions in R

    Regular expressions in R can be used

    • to match patterns in strings for use in if() or while() conditions, or to retrieve specific instances of patterns with the regexpr() family of functions;
    • to substitute occurrences of patterns in strings with other strings with gsub();
    • to split strings into substrings that are delimited by the occurrence of a pattern with strsplit();

    ...and more.

    Accordingly, a basic knowledge of regular expressions is needed to read and write code, especially code that parses text.


     

    Syntax

    Regular expressions in R are strings, thus they are enclosed in quotation marks.

    "a"
    

    is a regular expression. It specifies the single, literal character a exactly.


    Specifying symbols

    The power of regular expressions lies in their flexible syntax that allows us to specify character ranges, classes of characters, unspecified characters, alternatives, and much more. This sometimes can be confusing, because the symbols that specify ranges, options, wildcards and the like are of course themselves characters. Characters that specify information about other characters are called metacharacters; these include ".", "?", "+", "*", "[" and "]", "{" and "}" and more. And the opposite is also possible: some plain characters can be turned into metacharacters to denote character classes.

    The "\" - escape character - allows us to distinguish when a character is to be taken literally and when it is to be interpreted as a metacharacter. Note that some symbols have to be escaped to be read literally, while some letters have to be escaped to be read as metacharacters.

    But there is a catch in R, relating to when the escape character is interpreted. Remember that "\n" is a linebreak in a string, "\t" is a tab, etc. Obviously if you write "\?" (a literal question mark in a regex), or "\+" (a literal plus-sign in a regex) into a regular string, the mechanism that parses the string is going to see the escape character, then it expects an "n" or a "t" or the like - but what it gets instead is something it doesn't know. So it throws an error. Try:

    "\n" # fine
    "\?" # Error: ...
    

    But then how can we write something like "\?" when we need it? That becomes obvious when you consider what happens with the string: it gets sent to the regex engine for interpretation. Thus the regex engine needs to see: character "\", then character "?". So it needs two characters. The secret is: we need to prevent "\" from attaching to the next character, and specify it as a single character in its own right. We do that by escaping "\" itself - with a backslash. Thus "\\" is a literal "\" character - and can get sent to the regex engine.

    "\\?" # ok
    cat("\\?") # that's what the regex engine sees.
    

    The consequence is: you need to double the "\\" in R when you want a single "\". That works differently from other programming languages that pass patterns to the regex engine as-is. You need to be aware of this, for example when you develop a pattern in an online regex tool, and then copy it back into your R code. You need to double all occurrences of "\" in your R string.
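
    As an illustration, suppose a pattern for decimal numbers was developed in an online tester. The sketch below shows the same pattern as an R string; the test string is made up, and the raw-string line assumes R version 4.0.0 or later.

```r
# Pattern as developed in an online tester:   \d+\.\d+
# The same pattern as an R string: every backslash is doubled.
patt <- "\\d+\\.\\d+"

nchar(patt)     # number of characters that the regex engine receives
cat(patt, "\n") # what the regex engine sees

s <- "pH 7.25 at 37.0 degrees"
decimals <- regmatches(s, gregexpr(patt, s))[[1]]
print(decimals)

# Raw strings (assumption: R >= 4.0.0) avoid the doubling altogether.
sameAsRaw <- identical(patt, r"(\d+\.\d+)")
print(sameAsRaw)
```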

    Letters whose special meaning as a metacharacter is turned on with the escape character:

    Character   Means
    w           the letter "w"
    \w          a "word" character, i.e. one of A-Z, a-z, 0-9 and "_"
    s           the letter "s"
    \s          a "space" character, i.e. one of " ", tab or newline
    b           the letter "b"
    \b          a word boundary


     

    Metacharacters whose special meaning is turned off with the escape character:

    Character   Means
    +           One or more repetitions of the preceding expression
    \+          the literal character "+"
    \           the escape character
    \\          the literal character "\"
    .           any single character except the newline (\n)
    \.          a literal period

    Note that these examples are not exhaustive. A (nearly) complete table of symbols and meanings is given in the appendix.


     

    Character Classes

    Square brackets specify when more than one specific character can match at a position.

    Expression   Means
    [acgtACGT]   Any non-degenerate nucleotide

    For example: "[AGR]AATT[CTY]" matches all occurrences of an ApoI restriction site, either specified explicitly, or through the nucleotide ambiguity codes R (purines) or Y (pyrimidines).
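
    The ApoI pattern can be tested on a few short sequences. This is a minimal sketch; the four test sequences are made up for illustration.

```r
sites <- c("GAATTC", "AAATTT", "RAATTY", "CAATTG")

# A, G, or the ambiguity code R, then AATT, then C, T, or the ambiguity code Y.
isApoI <- grepl("[AGR]AATT[CTY]", sites)
names(isApoI) <- sites
print(isApoI)
```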

    Within character sets, hyphens can specify character ranges.

    Expression      Means
    [a-z]           lowercase letters
    [0-9]           digits
    [0-9+*/=^\\-]   digits and arithmetic symbols (Note the escaped hyphen)

    If you want to match a literal hyphen, you must escape it. Within character sets, some metacharacters that otherwise have special meanings usually do not need to be escaped.


    The complement

    The caret character "^" denotes the complement of a character set; i.e. everything that is not that expression.

    Expression   Means
    [^9]         Everything but the digit "9"
    [^ACGT]      Not a nucleotide code letter

    Note that outside of square brackets, the "^" character is an "anchoring code" and means "beginning of the string". This can be confusing.

    For many metacharacters that denote character classes, the metacharacter in upper case denotes the complement. This can also be confusing!

    Character   Means
    \w          a word character
    \W          not a word character
    \s          a space character
    \S          not a space character
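
    Both ways of writing a complement can be sketched with gsub(); the input strings here are made up for illustration.

```r
# Complement of a character set: drop everything that is not A, C, G or T.
onlyACGT <- gsub("[^ACGT]", "", "ACGTNNRYACGT")
print(onlyACGT)

# Upper-case metacharacter: drop everything that is not a word character.
onlyWord <- gsub("\\W", "", "MBP1 (yeast)!")
print(onlyWord)
```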


    Specifying quantity

    Special characters in regular expressions control how often a pattern must be present in order to match:

    Expression   What it means              Example (meaning)
    ?            match zero or one times    "? (there may or may not be a quote mark)
    +            match one or more          [A-Z]+ (there's at least one uppercase letter)
    *            match any number           .* (there may be some characters)
    {min,max}    match between min and max times (assumes 0 for min, if min is omitted; assumes infinity for max, if max is omitted)    [atAT]{20,200} (a stretch of between 20 and 200 upper- or lowercase As or Ts)

    For example: "AAUAAA[ACGU]{10,30}$" defines a polyadenylation site - a AAUAAA motif, followed by 10 to 30 of any nucleotide before the end of the RNA.
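
    A quick way to convince yourself of what the quantifier does is to build test strings of known length. The sketch below assumes three artificial RNA strings in which the motif is followed by 15, 5, and 40 nucleotides respectively.

```r
patt <- "AAUAAA[ACGU]{10,30}$"

rna <- c(paste0("GGC", "AAUAAA", strrep("C", 15)),   # within 10 to 30
         paste0("GGC", "AAUAAA", strrep("C", 5)),    # fewer than 10
         paste0("GGC", "AAUAAA", strrep("C", 40)))   # more than 30

hasSite <- grepl(patt, rna)
print(hasSite)
```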


    Specifying position (anchoring)

    If a pattern must be matched at a particular location in the string, special terms denote string anchors.

    Anchoring Term   Meaning
    ^                Start of a line or string
    $                End of a line or string
    \A               Start of the string
    \Z               End of the string
    \G               Last global match end
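
    The effect of anchoring can be sketched with two made-up nucleotide strings: the same motif matches anywhere without an anchor, but only at the start or end of the string with one.

```r
orf <- c("ATGGCTTAA", "GCATGGCTTAAGC")

anywhere   <- grepl("ATG", orf)              # motif anywhere in the string
atStart    <- grepl("^ATG", orf)             # motif only at the beginning
stopAtEnd  <- grepl("(TAA|TAG|TGA)$", orf)   # stop codon only at the end
print(rbind(anywhere, atStart, stopAtEnd))
```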


     

    Of course defining a regular expression pattern does not yet do anything with it. Below are the most important R functions that use regular expressions. Write the small code samples that are provided below, play with variations, and test how the operators and regular expressions work.


     

    Functions that don't use regular expressions

    Not all pattern searches in strings use (and need) regular expressions. Sometimes simple, exact string-matching is enough. R uses string matching in character equality (==) and by extension, the set operation functions (union(), intersect() etc.), the match() function, and the %in% operator.

    
    vA <- c("the", "quick", "brown", "fox", "jumps", "over", "a", "lazy", "dog")
    
    vA[2] == "quick"  # TRUE
    vA[2] == "quack"  # FALSE
    
    vA == "fox"  # boolean vector
    
    # match tests for string equality
    match("fox", vA)   # 4, i.e. the 4th element matches the string
    match("o", vA)     # NA: matches have to be to the WHOLE element
    
    # match("fox", vA) is equivalent to...
    which(vA == "fox")
    
    # %in% can be used for creating intersections
    # find whether elements from one vector are
    # contained in another:
    
    vB <- c("Quacking", "the", "duck", "wings", "over", "my", "cozy", "cot")
    
    
    vA %in% vB
    vB %in% vA  # note that the length of the return vector is the same as the
                # length of the first argument. So read this as:
                # "Which of my vB are also in vA"
    
    # We can use this to subset the vector with elements that are present in
    # both:
    
    vB[vB %in% vA]
    
    # which is, of course, the intersection set operation.
    intersect(vA, vB)
    


     

    Functions that use regular expressions

    The general online help page is here. Remember: R's default behaviour is extended POSIX. To be sure which regex dialect is used, pass the perl = TRUE parameter.


     

    grep()

    
    # grep() is like match(), but uses regular expressions. A variant of grep() that
    # returns a boolean vector - like "==" does - is grepl(). That is useful
    # because we can & or | the vector, or invert it with ! .
    
    grep("fox", vA)
    grep("o", vA) # Aha! now we get all elements that contain an "o" -
                  # Because we get partial matches with regular expressions.
    vA[grep("o", vA)] # subset
    
    grepl("o", vA)    # logical
    ! grepl("o", vA)  # its inverse
    
    vA[! grepl("o", vA)] # subset all words without "o"
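
    Two further variations may be useful. This sketch redefines vA so that it is self-contained: it uses the value = TRUE argument of grep() to return the matching elements directly, and combines two grepl() results with &.

```r
vA <- c("the", "quick", "brown", "fox", "jumps", "over", "a", "lazy", "dog")

# value = TRUE returns the matching elements rather than their indices.
withO <- grep("o", vA, value = TRUE)
print(withO)

# Logical vectors from grepl() can be combined: words that contain an "o"
# AND begin with one of the letters a to f.
oAndEarly <- vA[grepl("o", vA) & grepl("^[a-f]", vA)]
print(oAndEarly)
```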
    
    


     

    Subsetting example

    Consider the following regular expression:

    
    patt <- "^\\s*#"
    
    


    This matches if the string it is applied to begins with a "#", which may or may not be preceded by whitespace. This is useful to identify, and then ignore, comment lines in a data file.

    The regular expression above is decomposed as follows:

    1. ^   the beginning of the line
    2. \\s   any whitespace character ...
    3. *    ... repeated 0 or more times
    4. #    the hash character


    The following example would read a file into a vector of lines, then drop all lines that are empty, and all lines that are comments. This is a straightforward idiom.

    
    IN <- "test.txt"
    patt <- "^\\s*#"
    
    myData <- readLines(IN)
    myData <- myData[myData != ""]  # drop all elements that are the empty string
    myData <- myData[! grepl(patt, myData)]  # drop all elements that match the pattern
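
    The same idiom can be tried without a file on disk. The sketch below assumes a small, made-up vector of lines in place of the result of readLines().

```r
# Stand-in for the lines of a data file (hypothetical contents).
rawLines <- c("# expression data", "", "   # indented comment",
              "gene\tvalue", "MBP1\t42")
patt <- "^\\s*#"

kept <- rawLines[rawLines != ""]     # drop empty lines
kept <- kept[! grepl(patt, kept)]    # drop comment lines
print(kept)
```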
    
    


     

    Substitution - gsub()

    Think of "gsub" as "global substitution", and you'll understand that there exists another function, sub(), that replaces only the first occurrence of a pattern, rather than all of them as gsub() does. I can't imagine what the use case for that might be and I don't think I have ever used sub(). I get an intuitive sense that code that needs such a function should probably be reconceived. But gsub() is very useful.

    
    (s <- "   1 MKLAACFLTL LPGFAVA... 17   ") # E-coli Alpha Amylase signal peptide
    
    # Drop everything from this string that is not an amino acid one-letter code.
    # We use gsub() to first identify all non-amino acid letters with a character
    # class regular expression, then we replace each occurrence with the empty
    # string.
    
    gsub("[^ACDEFGHIKLMNPQRSTVWY]", "", s)
    
    # or, with assignment: ...
    s <- gsub("[^ACDEFGHIKLMNPQRSTVWY]", "", s)
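
    The difference between the two functions is easy to see side by side. This is a minimal sketch with made-up strings; the header line is hypothetical.

```r
s <- "banana"
firstOnly <- sub("a", "A", s)    # replaces the first match only
everyOne  <- gsub("a", "A", s)   # replaces all matches
print(c(firstOnly, everyOne))

# Replacing only the first separator of a line while leaving later ones alone.
header <- "MBP1: transcription factor: cell cycle"
tabbed <- sub(": ", "\t", header)
print(tabbed)
```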
    
    


     

    strsplit()

    Another function that makes use of regular expressions is strsplit(). It takes a vector of strings, and returns a list, one element for each element of the vector, in which each string has been split up by separating it along a regular expression.

    x <- c("a b c", "1 2")
    strsplit(x, " ")
    # [[1]]
    # [1] "a" "b" "c"
    #
    # [[2]]
    # [1] "1" "2"
    

    Since even a single string returns a list, you often have to extract the element you want as a vector for further use.

    corvidae <- c("crow:jackdaw:jay:magpie:raven:rook")
    strsplit(corvidae, ":")
    
    unlist(strsplit(corvidae, ":"))
    strsplit(corvidae, ":")[[1]]
    
    # Consider:
    length(strsplit(corvidae, ":"))
    length(unlist(strsplit(corvidae, ":")))
    


    strsplit() is immensely useful to extract elements from strings with a relatively well defined structure.

    s <- "1, 1, 2, 3, 5, 8"
    strsplit(s, ", ")[[1]] # split on comma-space
    
    s <- "~`!@#$%^&*()_-=+[{]}\\|;:',<.>/?"  # note: the backslash must be doubled in an R string
    strsplit(s, "")[[1]]  # split on empty string
    
    s <- "chronological lifespan:\tincreased\ncold sensitivity:\tincreased\nsporulation:\tnormal"
    strsplit(s, "\\t|\\n")[[1]]  # split on tab or newline
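
    Splitting in two stages turns such a string into a named vector. The following sketch reuses the phenotype string from above; the variable names are chosen for illustration.

```r
s <- "chronological lifespan:\tincreased\ncold sensitivity:\tincreased\nsporulation:\tnormal"

records <- strsplit(s, "\\n")[[1]]     # one element per line
fields  <- strsplit(records, ":\\t")   # list: key and value for each line

phenotypes <- vapply(fields, function(x) x[2], character(1))
names(phenotypes) <- vapply(fields, function(x) x[1], character(1))
print(phenotypes)
```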
    
    



     

    Behaviour

     


    Capturing and using matches

    Matches can be captured and used, e.g. in gsub().

    # Capture matches by placing them in parentheses. To immediately reuse them,
    # refer to them with "backreferences": \\1, \\2, \\3.
    
    # Example 1:
    # The beginning and ending three words of some text...
    s <- "I know, however, that its precarious and remote villages lie within the lowlands of the Wisla River."
    gsub("^((\\S+\\s+){3}).*((\\s\\S+){3})$", "\\1 ... \\3", s)
    
    # Note: matches \\2 and \\4 are inside the parentheses that are there to
    # group things to be found {3}-times.
    
    
    # Example 2:
    # A binomial species name has a genus, a species, and possibly a strain name.
    # We use \\S (not whitespace) and \\s (whitespace) to tease this apart into
    # three capture expressions:
    s <- "Saccharomyces cerevisiae S288C"
    gsub("^(\\S+)\\s(\\S+)\\s*(.*)$",
         "genus: \\1; species: \\1 \\2; (strain: \\3)",
         s)
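
    Backreferences are also convenient for reordering parts of a string. A minimal sketch, assuming dates in year-month-day format that are to be rewritten as day.month.year:

```r
dates <- c("2017-08-05", "2020-09-22")

# Capture year, month and day, then emit them in reverse order.
reordered <- gsub("^(\\d{4})-(\\d{2})-(\\d{2})$", "\\3.\\2.\\1", dates)
print(reordered)
```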
    
    

    Capturing and returning matches

    Finding and returning matches in R is a two-step process. (1) find matches with regexpr() (one match), gregexpr() (all matches), or regexec() (sub-expressions in parentheses). All of these return a "match object". (2) use the match object to extract the matching substrings from the original string.


    
    
    # Extracting gene names in text.
    
    # Let's define a valid gene name to be a substring that is bounded by
    # word-boundaries, starts with an upper-case character, contains more upper-case
    # characters or numbers or a hyphen or underscore, with a minimal length of 3.
    # Here is a regex, and we put the part of the string that we want to recover, in
    # parentheses:
    
    patt <- "\\b([A-Z][A-Z0-9\\-_]+[A-Z0-9])\\b"
    
    # Test: positives
    grepl(patt, "MBP1")
    grepl(patt, "AAT")
    grepl(patt, " AI1")
    grepl(patt, "ASP3-1 ")
    grepl(patt, " AI5_ALPHA; ")
    grepl(patt, " (TY1B-PR3) ")
    # Test: negatives
    grepl(patt, "G1") # Too short
    grepl(patt, "G1-") # Hyphen at end
    grepl(patt, "Cell") # contains lower-case
    
    # Let's apply this to retrieve gene names in text
    
    s <- "Transcriptional activation of CLN1, CLN2, and a putative new G1 cyclin (HCS26) by SWI4, a positive regulator of G1-specific transcription. Cell 66(5):1015-26"
    
    (m <- regexpr(patt, s)) # found a match in position 31
    regmatches(s, m)        # retrieve it
    
    (m <- gregexpr(patt, s)) # found all matches
    regmatches(s, m)         # retrieve them (note, this is a list)
    
    # The function of choice however is regexec(). It returns whatever the pattern
    # has defined in parentheses, the others return the entire match. The
    # parentheses are quite important, because we might want to specify additional
    # context for a valid match, but we might not want the context in the match
    # itself. In our example we used word boundaries - \\b - for such context; but
    # these are zero-length and don't actually match a character, so they don't
    # contaminate the substring anyway. But in general we need to be able to
    # precisely retrieve only the target substring.
    
    (m <- regexec(patt, s)) # only the parenthesized substring
    regmatches(s, m)        # retrieve it
    
    # Note that there are two elements: the first is the whole match, the second
    # is the substring that is in parentheses. In our example these are the same.
    # Here is an example where they are not:
    s <- "Find the last word. And tell me."
    (m <- regexec("\\s(\\w+)\\.", s))
    regmatches(s, m)        # retrieve it
    
    # In versions of R before 4.1.0 there is no option to capture multiple
    # matches in base R: regexec() lacked a corresponding gregexec() ...
    
    patt <- "\\b([A-Z][A-Z0-9\\-_]+[A-Z0-9])\\b"
    
    s <- "Transcriptional activation of CLN1, CLN2, and a putative new G1 cyclin (HCS26) by SWI4, a positive regulator of G1-specific transcription. Cell 66(5):1015-26"
    
    
    # Solution 1 (base R): you can use multiple matches in an sapply()
    # statement...
    sapply(regmatches(s, gregexpr(patt, s))[[1]],
           function(M){regmatches(M, regexec(patt, M))})
    
    
    # Solution 2 (probably preferred): you can use
    # str_match_all() from the very useful library "stringr" ...
    if (! requireNamespace("stringr", quietly=TRUE)) {
      install.packages("stringr")
    }
    # Package information:
    #  library(help = stringr)       # basic information
    #  browseVignettes("stringr")    # available vignettes
    #  data(package = "stringr")     # available datasets
    
    
    stringr::str_match_all(s, patt)
    stringr::str_match_all(s, patt)[[1]][,2]
    # [1] "CLN1"  "CLN2"  "HCS26" "SWI4"
    
    # Note that str_match_all() handles the match object internally, no need for
    # the two-step code.
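
    In R version 4.1.0 and later, base R provides gregexec(), which returns the parenthesized sub-expressions for all matches. The sketch below is guarded so that it also runs on older versions. It assumes a shortened version of the example sentence and, as a deliberate variation, writes the hyphen last in the character class instead of escaping it.

```r
# Variation of the gene-name pattern: hyphen placed last in the class.
patt <- "\\b([A-Z][A-Z0-9_-]+[A-Z0-9])\\b"
s <- "Transcriptional activation of CLN1, CLN2, and a putative new G1 cyclin (HCS26) by SWI4."

if (exists("gregexec", mode = "function")) {
  # One matrix per input string: row 1 holds the whole matches,
  # row 2 holds the first parenthesized sub-expression.
  m <- regmatches(s, gregexec(patt, s))[[1]]
  genes <- m[2, ]
} else {
  genes <- NULL   # gregexec() is not available in this version of R
}
print(genes)
```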
    
    


    An interesting new alternative/complement to the base R regex libraries is the package "ore" that uses the Oniguruma libraries and supports multiple character encodings, which you need when you work with Unicode and/or CJK character sets.


    if (! requireNamespace("ore", quietly = TRUE)) {
        install.packages("ore")
    }
    # Package information:
    #  library(help = ore)       # basic information
    #  browseVignettes("ore")    # available vignettes
    #  data(package = "ore")     # available datasets
    
    
    S <- "The quick brown fox jumps over a lazy dog"
    
    ore::ore.search(". .", S)
    ore::ore.search(". .", S, all=TRUE)
    M <- ore::ore.search(". .", S, all=TRUE)
    M$nMatches
    M$match[2:4]
    

    According to the author, Jon Clayden, key advantages include:

    • Search results focus around the matched substrings (including parenthesised groups), rather than the locations of matches. This saves extra work with regmatches() or similar to extract the matches themselves.

    • Substantially better performance, especially when matching against long strings.
    • Substitutions can be functions as well as strings.
    • Matches can be efficiently obtained over only part of the strings.
    • Fewer core functions, with more consistent names.


     

    Modifiers

    A number of modifiers can be passed as arguments to regular expression functions. Here are the two most important ones.

    
    # The option "ignore.case" allows case-insensitive matches. This is usually
    # poor programming style; a more explicit (= better) way is to define your
    # character classes appropriately.
    
    patt <- "\\b([A-Z][A-Z0-9\\-_]+[A-Z0-9])\\b"
    
    s <- "The MBP1 gene encodes the Mbp1 protein."
    
    m <- gregexpr(patt, s)
    regmatches(s, m)[[1]]
    
    m <- gregexpr(patt, s, ignore.case = TRUE)
    regmatches(s, m)[[1]]
    
    
    # For regex functions in the stringr package, you can compile the pattern
    # with the regex() function, and include the option "comments = TRUE". This
    # allows you to insert whitespace and # characters into the pattern,
    # which will be ignored by the regex engine. Thus you can comment
    # complex regular expressions inline.
    
    
    myRegex <- stringr::regex("\\b            # word boundary
                              (               # begin capture
                              [A-Z]           # one uppercase letter
                              [A-Z0-9\\-_]+   # one or more letters, numbers, hyphen or
                                              #   underscore
                              [A-Z0-9]        # one letter or number.
                              # Note: this captured subexpression has a minimum length of 3.
                              )               # end capture
                              \\b",           # word boundary
                              comments = TRUE)
    
    stringr::str_match_all(s, myRegex)[[1]][2]
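
    A comparable effect can be had in base R without stringr. The sketch below assumes the PCRE engine (perl = TRUE), in which the inline flag (?x) switches on "extended" mode so that whitespace and # comments inside the pattern are ignored. The character class is written with the hyphen last, which is an unambiguous way to include a literal hyphen.

```r
# Base R sketch: an inline-commented pattern using PCRE extended mode.
patt <- "(?x)              # extended mode: ignore whitespace and comments
         \\b               # word boundary
         (                 # begin capture
           [A-Z]           # one uppercase letter
           [A-Z0-9_-]+     # one or more letters, digits, underscores, hyphens
           [A-Z0-9]        # one letter or digit
         )                 # end capture
         \\b               # word boundary"

s <- "The MBP1 gene encodes the Mbp1 protein."

# perl = TRUE is required: the default engine does not understand (?x)
hit <- regmatches(s, regexpr(patt, s, perl = TRUE))
hit   # "MBP1"
```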
    
    


    Greed

    By default, quantifiers are "greedy", i.e. they match as many characters as the rest of the pattern allows. For example:

    s <- "abc123"
    
    patt <- "(\\w+)(\\d+)"   # word characters, followed by digits. This pattern ...
    stringr::str_match_all(s, patt)[[1]][-1]
    
    # ...yields "abc12" and "3" . This is because \w+ is greedy and grabs as many
    # alphanumeric characters as it can before \d+ gets a chance to match.  A "?"
    # after a quantity specifier makes it non-greedy, therefore ...
    
    patt <- "(\\w+?)(\\d+)"   # Note the question mark in (\\w+?)
    stringr::str_match_all(s, patt)[[1]][-1]
    
    # ... now \d+ gets a chance to match as many digits as possible
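
    Greedy matching most often causes surprises with delimited text. The following base R sketch uses a made-up HTML-like string to contrast a greedy quantifier, its non-greedy counterpart, and a negated character class, which is a common alternative to non-greedy matching.

```r
s <- "<b>bold</b> and <i>italic</i>"   # made-up example string

# Greedy: .+ runs all the way to the last ">" in the string
greedy <- regmatches(s, regexpr("<.+>", s, perl = TRUE))
greedy    # the entire string

# Non-greedy: .+? stops at the first ">" it can reach
lazy <- regmatches(s, gregexpr("<.+?>", s, perl = TRUE))[[1]]
lazy      # "<b>" "</b>" "<i>" "</i>"

# Alternative: a negated character class can never run past a ">"
negated <- regmatches(s, gregexpr("<[^>]+>", s, perl = TRUE))[[1]]
negated   # "<b>" "</b>" "<i>" "</i>"
```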
    
    


     

    Regular Expressions in other languages

     

    PHP

    <?php
    $string = "The quick brown fox jumps over a lazy dog";
    
    $words = preg_split('/\s+/', $string);
    print_r($words);
    
    preg_match('/.\W./', $string, $matches);
    print_r($matches);
    
    preg_match_all('/.\W./', $string, $matches);
    print_r($matches);
    
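    # Indexed preg_replace() iterates over the array elements in order.
    # This first attempt is broken: later patterns re-match the text that
    # earlier replacements inserted, so the words are not swapped as intended.
    $pat = array();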
    $pat[0] = '/quick brown/';
    $pat[1] = '/fox/';
    $pat[2] = '/lazy/';
    $pat[3] = '/dog/';
    $rep = array();
    $rep[0] = 'lazy';
    $rep[1] = 'dog';
    $rep[2] = 'quick brown';
    $rep[3] = 'fox';
    print(preg_replace($pat, $rep, $string));
    print("\n");
    
    # Working version: swap via the intermediate placeholders 'foo' and 'bar'.
    $pat = array();
    $pat[0] = '/quick brown fox/';
    $pat[1] = '/lazy dog/';
    $pat[2] = '/foo/';
    $pat[3] = '/bar/';
    $rep = array();
    $rep[0] = 'foo';
    $rep[1] = 'bar';
    $rep[2] = 'lazy dog';
    $rep[3] = 'quick brown fox';
    print(preg_replace($pat, $rep, $string));
    print("\n");
    
    
    ?>
    


     

    Python

    Python regular expressions are provided through the module re. See the module's documentation for details.

    re functions in general operate on a string and return a match object. The match object is then further analyzed by supplied methods.

    The most frequently used functions are:

    • re.match(pattern, string) matches only at the beginning of the string.
    • re.search(pattern, string) matches anywhere in the string.
    • re.split(pattern, string) returns the split string as a list.
    • re.findall(pattern, string) returns all matches in a list.


    Python example

    Download this .svg file to experiment.


    # parse_SVG_example.py
    # Read an svg file line by line and process path data
    # to write commands separately to an output file, line by line.
    
    import re
    
    filePath = "/my/working/directory/whatever/"
    
    myIn  = filePath + "sample.svg"
    myOut = filePath + "test.svg"
    
    # "with" closes both files automatically, even if an error occurs
    with open(myIn) as IN, open(myOut, "w") as OUT:
        for line in IN:
            # re.search() returns a match object, or None if nothing matches.
            # Raw strings (r'...') keep the backslashes intact for the regex.
            path = re.search(r'\sd="(.*?)"', line)
            if path:
                # Found. Process the result with a second regex.
                # path.group(1) is the contents of the first capture group
                pathData = re.findall(r'([aAcChHlLmMqQsStTvVzZ]|-?\d*\.?\d+)',
                                      path.group(1))
                # Write it nicely formatted to output, one command per line
                OUT.write('d="')
                s = ""    # we accumulate output lines in this variable
                for token in pathData:
                    if re.match('[aAcChHlLmMqQsStTvVzZ]', token):
                        # it's a letter:
                        OUT.write("\n    " + s)   # flush s to output
                        s = token + " "           # new s
                    else:
                        s = s + token + " "       # append to s
                OUT.write("\n    " + s + "\"\n")  # flush s, close string, add \n
            else:
                OUT.write(line)
    
    


     

    JavaScript

    JavaScript is probably the most convenient choice if you want to interact directly with the DOM (Document Object Model) of a Web page. Here is example code for a "bookmarklet" that rewrites the URL of a journal-article page for access from outside the UofT network.

    javascript:(function(){
       var url=window.location.href;
       var re=/\/([\w.]+)\/(.*$)/;
       var match=url.match(re);
       var newURL="http://"+match[1]+".myaccess.library.utoronto.ca/"+match[2];
       window.location.href=newURL;
    })();
    void 0
    
    

    Put this into the body of an arbitrary bookmark in your browser, then click it while viewing a paywalled journal article to be redirected to the same article via our library's access system.


     

    POSIX (Unix, the bash shell)

    Use in:

    • grep
    grep finds patterns in files. Patterns are regular expressions and can come in basic or extended flavors. In GNU grep the two flavors offer the same functionality and differ only in which metacharacters need a backslash; you switch from basic to extended syntax with the grep -E flag, which is the same as invoking egrep.
    Example: what daemons run on your system?
    ps -ax | egrep -o "/([^A-Z]\w+d)\b" | sort -u
    

    Other uses of regular expressions in:

    • find
    • sed
    • awk
    • cut

    ... see the man pages.


     

    Practice

    Task:

     
    • Open RStudio and load the ABC-units R project. If you have loaded it before, choose File → Recent projects → ABC-Units. If you have not loaded it before, follow the instructions in the RPR-Introduction unit.
    • Choose Tools → Version Control → Pull Branches to fetch the most recent version of the project from its GitHub repository with all changes and bug fixes included.
    • Type init() if requested.
    • Open the file RPR-RegEx.R and follow the instructions.


     

    Note: take care that you understand all of the code in the script. Evaluation in this course is cumulative and you may be asked to explain any part of code.


     


     

    Appendix I: Metacharacters and their meaning

    Expression  Meaning
    \           Escape character.
    |           Alternation character. Matches either one of the specified alternatives. For example, /Asp|Glu/i matches ASP, Asp, asp, GLU, Glu or glu.
    ^           If the caret occurs at the beginning of an expression, it anchors the expression at the beginning of a line or input. For example, /^AT/ does not match the 'AT' in "HETATM" but does match it in "ATOM". If the caret occurs as the first character of a character set, as in [^a-z], it specifies the complement of the character set. Everywhere else, it simply matches the character "^".
    $           Matches end of input or line. For example, /t$/ does not match the 't' in "eater", but does match it in "eat".
    *           Matches the preceding character 0 or more times. For example, /bo*/ matches 'boooo' in "A ghost booooed" and 'b' in "A bird warbled", but nothing in "A goat grunted".
    +           Matches the preceding character 1 or more times. Equivalent to {1,} . For example, /a+/ matches the 'a' in "candy" and all the a's in "caaaaaaandy".
    ?           Matches the preceding character 0 or 1 time. For example, /e?le?/ matches the 'el' in "angel" and the 'le' in "angle".
    .           (The decimal point) matches any single character except the newline character.
    (x)         Matches 'x' and remembers the match. For example, /(foo) bar/ matches "foo bar" and stores 'foo' in the special variable $1. /(more) (joy)/ matches "more joy", then stores 'more' in $1 and 'joy' in $2.
    {n}         Where n is a positive integer. Matches exactly n occurrences of the preceding character. For example, /a{2}/ doesn't match the 'a' in "candy", but it matches all of the a's in "caandy", and the first two a's in "caaandy".
    {n,}        Where n is a positive integer. Matches at least n occurrences of the preceding character. For example, /a{2,}/ doesn't match the 'a' in "candy", but matches all of the a's in "caandy" and in "caaaaaaandy".
    {n,m}       Where n and m are positive integers. Matches at least n and at most m occurrences of the preceding character. For example, /a{1,3}/ matches nothing in "cndy", the 'a' in "candy", the first two a's in "caandy", and the first three a's in "caaaaaaandy". Notice that when matching "caaaaaaandy", the match is "aaa", even though the original string had more a's in it.
    [xyz]       A character set. Matches any one of the enclosed characters. You can specify a range of characters by using a hyphen. For example, [bcd] is the same as [b-d] . They match the 'c' in "cysteine" and both the 'c' and the 'd' in "ached".
    [^xyz]      A negated or complemented character set. That is, it matches anything that is not enclosed in the brackets. Note that the caret has to be the first character in the bracket set. For example, [^abc] is the same as [^a-c] . They initially match 'l' in "alanine" and 'y' in "cysteine".
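
    The examples in this table use the /pattern/ notation of Perl and JavaScript. As a rough sketch of how some of them translate to base R: the pattern is passed as a plain string without the enclosing slashes, and the /i modifier becomes the argument ignore.case = TRUE. firstMatch() is a hypothetical helper defined here only to keep the calls short.

```r
# Hypothetical helper: return the first match of patt in s
firstMatch <- function(patt, s) {
  regmatches(s, regexpr(patt, s))
}

anchored <- grepl("^AT", c("HETATM", "ATOM"))
anchored                                    # FALSE TRUE

alternation <- grepl("Asp|Glu", c("ASP", "glu", "Ala"), ignore.case = TRUE)
alternation                                 # TRUE TRUE FALSE

firstMatch("bo*", "A ghost booooed")        # "boooo"
firstMatch("bo*", "A bird warbled")         # "b"
firstMatch("bo*", "A goat grunted")         # character(0): no match
firstMatch("e?le?", "angle")                # "le"
firstMatch("a{1,3}", "caaaaaaandy")         # "aaa"
firstMatch("[^a-c]", "cysteine")            # "y"
```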


     

    Appendix II: Character classes and their meaning

    Expression  Meaning
    [\b]        Matches a backspace. (Not to be confused with \b .)
    \b          Matches a word boundary, such as a space. (Not to be confused with [\b].) The difference is that one is within a character set and the other is not! For example, /\bn\w/ matches the 'no' in "noonday"; /\wy\b/ matches the 'ly' in "possibly yesterday".
    \B          Matches a non-word boundary. For example, /\w\Bn/ matches 'on' in "noonday", and /y\B\w/ matches 'ye' in "possibly yesterday".
    \cX         Where X is a letter. Matches the corresponding control character in a string. For example, /\cM/ matches control-M in a string.
    \d          Matches a digit character. Equivalent to [0-9] . For example, /\d/ or /[0-9]/ matches '2' in "B2 is the suite number".
    \D          Matches any non-digit character. Equivalent to [^0-9] . For example, /\D/ or /[^0-9]/ matches 'B' in "B2 is the suite number".
    \f          Matches a form-feed.
    \n          Matches a linefeed.
    \r          Matches a carriage return.
    \s          Matches a single white space character, including space, tab, form feed, line feed. Equivalent to [ \f\n\r\t\v] . For example, /\s\w*/ matches ' bar' in "foo bar".
    \S          Matches a single character other than white space. Equivalent to [^ \f\n\r\t\v] . For example, /\S\w*/ matches 'foo' in "foo bar".
    \t          Matches a tab.
    \v          Matches a vertical tab.
    \w          Matches any alphanumeric character including the underscore. Equivalent to [A-Za-z0-9_] . For example, /\w/ matches 'a' in "apple", '5' in "$5.28", and '3' in "3D".
    \W          Matches any non-word character. Equivalent to [^A-Za-z0-9_] . For example, /\W/ or /[^A-Za-z0-9_]/ matches '%' in "50%".
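
    When these classes are used in R, every backslash has to be doubled inside the string literal, because R itself consumes one level of escaping before the regex engine sees the pattern. The sketch below replays some of the table's examples; firstMatch() is a hypothetical helper, and perl = TRUE is an assumption made here so that all of the classes are handled by the PCRE engine.

```r
# Hypothetical helper: return the first match of patt in s (PCRE engine)
firstMatch <- function(patt, s) {
  regmatches(s, regexpr(patt, s, perl = TRUE))
}

nchar("\\d")                                   # 2: one backslash and a "d"

firstMatch("\\d", "B2 is the suite number.")   # "2"
firstMatch("\\D", "B2 is the suite number.")   # "B"
firstMatch("\\s\\w*", "foo bar.")              # " bar"
firstMatch("\\S\\w*", "foo bar.")              # "foo"
firstMatch("\\W", "50%.")                      # "%"
firstMatch("\\bn\\w", "noonday")               # "no"
firstMatch("\\w\\Bn", "noonday")               # "on"
```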


     

    Appendix III: Anchor codes and their meaning

    Expression  Meaning
    ^           If the caret occurs at the beginning of an expression, it anchors the expression at the beginning of a line or input. For example, /^AT/ does not match the 'AT' in "HETATM" but does match it in "ATOM".
    $           Matches end of input or line. For example, /t$/ does not match the 't' in "eater", but does match it in "eat" as well as in "eat\n".
    \b          Matches a word boundary, such as a space. (Not to be confused with [\b].) The difference is that one is within a character set and the other is not! For example, /\bn\w/ matches the 'no' in "noonday"; /\wy\b/ matches the 'ly' in "possibly yesterday".
    \B          Matches a non-word boundary. For example, /\w\Bn/ matches 'on' in "noonday", and /y\B\w/ matches 'ye' in "possibly yesterday".
    \A          Matches at the start of a string. Like "^". For example, /\AAT/ matches "AT" in "ATOM " but not in "HETATM".
    \Z          Matches at the end of a string. Like "$". For example, /\t\Z/ matches a tab at the end of the string but not anywhere else.
    (?: … )     Groups what is between the parentheses, but does not capture the match.
    (?= … )     The preceding pattern must be followed by this one in order to match.
    (?! … )     The preceding pattern must not be followed by this one in order to match.
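
    The lookahead and non-capturing constructs need the PCRE engine in R, i.e. perl = TRUE. The following is a small sketch on made-up strings, not a complete treatment. Note the \\b anchors in the negative lookahead: without them the engine could backtrack to a shorter run of digits that is no longer followed by " USD".

```r
s <- "100 USD, 200 EUR"   # made-up example string

# (?= ... ): digits that are followed by " USD"
usd <- regmatches(s, gregexpr("\\d+(?= USD)", s, perl = TRUE))[[1]]
usd      # "100"

# (?! ... ): whole numbers that are NOT followed by " USD"
notUsd <- regmatches(s, gregexpr("\\b\\d+\\b(?! USD)", s, perl = TRUE))[[1]]
notUsd   # "200"

# (?: ... ): group the alternatives without capturing them, so that
# the only captured substring is the number
grp <- regmatches("Glu-42",
                  regexec("(?:Asp|Glu)-(\\d+)", "Glu-42", perl = TRUE))[[1]]
grp      # "Glu-42" "42"

# \A anchors at the start of the string
startsAT <- grepl("\\AAT", c("ATOM", "HETATM"), perl = TRUE)
startsAT # TRUE FALSE
```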


     

    Appendix IV: Modifiers and their meaning

    Expression  Meaning
    g           Matches globally - i.e. matches all occurrences of pattern, one after the other, do not stop at the first one.
    i           Match in a case-insensitive manner. For example, /[ACGT]/i matches any specific nucleotide in upper or lower case.
    x           Ignore whitespace in the expression.
    o           Evaluate pattern only once.
    m           Treat the whole string as multiple lines.
    s           Treat the whole string as a single line, i.e. don't treat "\n" as line separators. For example, /(<table>.*?</table>)/s captures an entire table, including newline characters. Without the modifier nothing would match if there is even a single newline in between the tags.
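
    R has no trailing modifier letters, but most of these have counterparts. As a rough sketch under the assumption that the PCRE engine is used (perl = TRUE): "g" corresponds to choosing gregexpr() or gsub() over regexpr() or sub(), while "i", "m" and "s" can be written as the inline flags (?i), (?m) and (?s) at the start of the pattern. The strings below are made up for illustration.

```r
dna <- "acgtACGTnn"

# no "g": regexpr() stops at the first match
first <- regmatches(dna, regexpr("[ACGT]+", dna, perl = TRUE))
first              # "ACGT"

# "g" and "i": gregexpr() finds all matches, (?i) ignores case
allNuc <- regmatches(dna, gregexpr("(?i)[ACGT]", dna, perl = TRUE))[[1]]
length(allNuc)     # 8

# "m": with (?m), ^ matches at the start of every line
pdb <- "ATOM\nHETATM\nATOM"
nOne  <- length(regmatches(pdb, gregexpr("^AT", pdb, perl = TRUE))[[1]])
nMany <- length(regmatches(pdb, gregexpr("(?m)^AT", pdb, perl = TRUE))[[1]])
c(nOne, nMany)     # 1 2

# "s": with (?s), the dot also matches newline characters
tbl <- "<table>\n<tr><td>1</td></tr>\n</table>"
noS   <- regmatches(tbl, regexpr("<table>.*?</table>", tbl, perl = TRUE))
withS <- regmatches(tbl, regexpr("(?s)<table>.*?</table>", tbl, perl = TRUE))
length(noS)        # 0: nothing matches across the newlines
withS == tbl       # TRUE: the entire table is captured
```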


     

    Further reading, links and resources

    Visit the Stack Overflow thread on regex and HTML parsing. What's your opinion on the OP's question?


     


    About ...
     
    Author:

    Boris Steipe <boris.steipe@utoronto.ca>

    Created:

    2017-08-05

    Modified:

    2020-09-22

    Version:

    1.2

    Version history:

    • 1.2 2020 Maintenance, added gsub() capture and backreference
    • 1.1 Change from require() to requireNamespace() and use <package>::<function>() idiom.
    • 1.0 First live version, translated from Perl examples in old version
    • 0.1 First stub

    This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License.