Difference between revisions of "RPR-Literate programming"

From "A B C"
Line 19: Line 19:
  
  
{{DEV}}
+
{{LIVE}}
  
 
{{Vspace}}
 
{{Vspace}}
Line 29: Line 29:
 
<section begin=abstract />
 
<section begin=abstract />
 
<!-- included from "../components/RPR-Literate_programming.components.wtxt", section: "abstract" -->
 
<!-- included from "../components/RPR-Literate_programming.components.wtxt", section: "abstract" -->
Documentation of results using R markdown and R notebooks. '''Example needs debugging since the OED site has changed'''.
+
Documentation of results using R markdown and R notebooks.
 
<section end=abstract />
 
<section end=abstract />
  
Line 111: Line 111:
 
'''R Studio''' will load some default text and markup into the script pane which we can edit.
 
'''R Studio''' will load some default text and markup into the script pane which we can edit.
  
Let's introduce our plan: copy/paste the following text into the document.
+
* Choose '''Help &rarr; Cheatsheets &rarr; R Markdown Cheat Sheet''' and '''R Markdown Reference Guide''' to download two PDFs via your browser. Browse the contents to get an idea of where you can clarify concepts as you go through this example.
 +
 
 +
Let's introduce our plan: copy/paste the following text into the document to replace the two sections with the headers <code>## R Markdown</code> and <code>## Including Plots</code>.
  
 
<div class{{=}}"text-box">
 
<div class{{=}}"text-box">
Line 117: Line 119:
 
We all have some, but we could always use more. How to know them all? With this code we access the Oxford English Dictionary's Website - the most authoritative source on the English language, and scrape a list of phobias. A function is supplied to retrieve a random phobia, which we can subsequently ponder on - either to delight in the fact that we don't have that fear, or to add to our daily quota of anxieties &lt;small>(like our well-founded [fear of bad programming practice](<nowiki>http://xkcd.com/292/</nowiki>))&lt;/small>.
 
We all have some, but we could always use more. How to know them all? With this code we access the Oxford English Dictionary's Website - the most authoritative source on the English language, and scrape a list of phobias. A function is supplied to retrieve a random phobia, which we can subsequently ponder on - either to delight in the fact that we don't have that fear, or to add to our daily quota of anxieties &lt;small>(like our well-founded [fear of bad programming practice](<nowiki>http://xkcd.com/292/</nowiki>))&lt;/small>.
  
To load the list, we will "screenscrape" a list of Phobias from the [OED Phobia list](<nowiki>http://www.oxforddictionaries.com/words/phobias-list</nowiki>). First, we load the XML library (or install it from CRAN, if we don't have it).
+
To load the list, we will "screenscrape" a list of Phobias from the [OED Phobia list](<nowiki>https://en.oxforddictionaries.com/explore/phobias-list</nowiki>). First, we load the XML library (or install it from CRAN, if we don't have it).
  
 
</div>
 
</div>
Line 128: Line 130:
 
::- a Web link <code>[Text...](URL)</code>added to text
 
::- a Web link <code>[Text...](URL)</code>added to text
  
* Click on the green question mark of the menu of your script pane. There is a link to an overview of RMarkdown use and to a quick reference. Load the quick reference (it will appear in the Help pane) and scan it.
+
* The filename in the script pane tab (<span style="color:#AA0000;">Untitled1</span>) is red because the file contains unsaved changes. Save the file in your project directory under the name <code>RandomPhobia</code>; note that the extension <code>.Rmd</code> is added automatically.
 
 
* The filename in the script pane tab is red, because it contains unsaved changes. Save the file in your project directory, note that the extension <code>.Rmd</code> is automatically added.
 
  
 
Time to add our first bit of '''R code'''
 
Time to add our first bit of '''R code'''
Line 136: Line 136:
 
*Copy and paste the following:
 
*Copy and paste the following:
  
<source lang="rsplus">
+
<source lang="R">
 
```{r loadLibrary}
 
```{r loadLibrary}
if (!require(XML, quiet=TRUE)) {
+
if (!require(XML, quietly=TRUE)) {
 
   install.packages("XML")
 
   install.packages("XML")
 
   library(XML)
 
   library(XML)
Line 146: Line 146:
  
  
This is what is know as a "code chunk". It is delimited by three backticks <code>```</code> and has directives and options for the chunk in the first line. It is labelled as '''R''' code, and note that after the <code>{r </code> we have added an (optional) label for the chunk. That is useful, because we can rapidly navigate between chunks (click on the navigation menu at the ''bottom'' of the script pane), and we can refer to the labels to execute chunks that are coded later in the document at an earlier stage. This is an important idea of literal programming: the flow of the document should not be determined by the requirements of the code, but by the logic of the narrative.  TLDR; label your chunks. It's useful.
+
This is what is known as a "code chunk". It is delimited by three backticks <code>```</code> and has directives and options for the chunk in the first line. It is labelled as '''R''' code, and note that after the <code>{r </code> we have added an (optional) label for the chunk. That is useful, because we can rapidly navigate between chunks (click on the navigation menu at the ''bottom'' of the script pane), and we can refer to a label to execute, at an earlier point in the document, a chunk that is defined later. This is an important idea of literate programming: the flow of the document should not be determined by the requirements of the code, but by the logic of the narrative. TL;DR: label your chunks. It's useful.
  
 
Other options can be added after a comma, for example we can suppress printing of a chunk into the document altogether, if we think it is not relevant for the document, by adding the option <code>echo{{=}}FALSE</code><ref>For a complete list of chunk options, see [http://yihui.name/knitr/options/ the documentation by knitr's author, Xie Yihui].</ref>.
 
Other options can be added after a comma, for example we can suppress printing of a chunk into the document altogether, if we think it is not relevant for the document, by adding the option <code>echo{{=}}FALSE</code><ref>For a complete list of chunk options, see [http://yihui.name/knitr/options/ the documentation by knitr's author, Xie Yihui].</ref>.
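Chunk options can also be set once for the whole document instead of chunk by chunk. The following is a minimal sketch, assuming the <code>knitr</code> package is installed; it uses <code>opts_chunk</code> to change the defaults for <code>echo</code> and <code>cache</code> and then restores the previous values.

```r
# Assumes the knitr package is installed.
library(knitr)

# Remember the current defaults so that they can be restored afterwards.
old <- opts_chunk$get(c("echo", "cache"))

# Set document-wide defaults: hide the source code, cache the results.
opts_chunk$set(echo = FALSE, cache = TRUE)
print(opts_chunk$get("echo"))
print(opts_chunk$get("cache"))

# Restore the previous defaults.
do.call(opts_chunk$set, old)
```

In an R Markdown document such a call would typically sit in a first "setup" chunk, so that every later chunk inherits the defaults unless its own header overrides them.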
Line 155: Line 155:
  
 
<div class{{=}}"text-box">
 
<div class{{=}}"text-box">
The XML package provides a function -- `readHTMLTable()` -- that makes our life very easy: it accesses an URL, looks for all HTML formatted tables, parses them and returns them as lists. Internally, by default `readHTMLTable` reads the data into a dataframe, so to avoid converting all the text into factors we set the option `stringsAsFactors{{=}}FALSE`. There may be several tables in the source page, each one is returned as a list element. Since we know (hope?) the OED page contains only one table, we use only the first list element.<br />
+
The rvest package was designed for screenscraping and has functions that make our life very easy: they access a URL, select the HTML tables that match an XPath expression, parse them, and return them as a list from which we can get data frames. There may be several tables in the source page; each one is returned as a list element. Since we know (hope?) the OED page contains only one table, we use only the first list element.<br />
<source lang="rsplus">
+
<source lang="R">
 
```{r getPageData, cache=TRUE}
 
```{r getPageData, cache=TRUE}
phobiaFrame <- readHTMLTable("http://www.oxforddictionaries.com/words/phobias-list"
+
phobias <- read_html("https://en.oxforddictionaries.com/explore/phobias-list")
                            stringsAsFactors=FALSE)[[1]]
+
phobias <- html_nodes(phobias, xpath = '//*[@id="content"]/div[1]/div[2]/div/div/div/div/div[4]/table')
 +
phobias <- html_table(phobias)[[1]]
 
```
 
```
 
</source>
 
</source>
Line 165: Line 166:
 
</div>
 
</div>
  
Two things to note here:
+
Some things to note here:
  
 
* Enclosing a piece of text in "backticks" <code>`Text...`</code> formats that text as "code" - typically in a fixed-width font.
 
* Enclosing a piece of text in "backticks" <code>`Text...`</code> formats that text as "code" - typically in a fixed-width font.
 
* For this chunk we have set the option <code>cache</code> as <code>TRUE</code>. This is a very useful and well thought out mechanism that avoids recomputing code that takes a long time or should otherwise be limited. The results of a cached chunk of code are stored locally and retrieved when the file is ''weaved''. Only if anything within the chunk is changed (or <code>cache</code> is set to <code>FALSE</code>), is the chunk evaluated again. This prevents us from excessively pounding on the OED as we develop our script, which is a question of good manners in the context of this example, but can save a lot of time as our projects become large and the calculations become complex.
 
* For this chunk we have set the option <code>cache</code> as <code>TRUE</code>. This is a very useful and well thought out mechanism that avoids recomputing code that takes a long time or should otherwise be limited. The results of a cached chunk of code are stored locally and retrieved when the file is ''weaved''. Only if anything within the chunk is changed (or <code>cache</code> is set to <code>FALSE</code>), is the chunk evaluated again. This prevents us from excessively pounding on the OED as we develop our script, which is a question of good manners in the context of this example, but can save a lot of time as our projects become large and the calculations become complex.
 +
* <code>rvest</code> needs an XPath expression to parse the document. Writing XPath expressions can be a bit gnarly - the R Bloggers article linked from the Further Reading section demonstrates a nifty way to get the expression from within a Chrome browser window. The interface has changed slightly since the article was written, but it's easy enough to figure out.
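To try the table-scraping step without contacting any Web site, the same three calls can be run against a literal HTML string. This is only a sketch: it assumes the <code>rvest</code> package is installed and loaded (the <code>loadLibrary</code> chunk above loads <code>XML</code>, not <code>rvest</code>), and the two-row table and its <code>id</code> are made up for illustration.

```r
# Assumes the rvest package is installed.
library(rvest)

# A hypothetical two-row table standing in for the real Web page.
page <- read_html(
  "<html><body>
     <table id='demo'>
       <tr><th>Phobia</th><th>Fear of</th></tr>
       <tr><td>arachnophobia</td><td>spiders</td></tr>
       <tr><td>herpetophobia</td><td>reptiles</td></tr>
     </table>
   </body></html>"
)

# Select the table with an XPath expression, then parse it.
tables <- html_nodes(page, xpath = '//table[@id="demo"]')
demo <- html_table(tables)[[1]]
print(demo)
```

A short XPath that keys on an <code>id</code> attribute tends to survive page redesigns better than a long positional path, but it only works if the page author has supplied such an attribute.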
  
 
In order to make sure everything has worked, we'll print a sample from the table to our documentation file. RMarkdown provides a shorthand notation for tables - just like Wiki markup. I never use these. HTML tables are easy enough to format and remember and they provide '''many''' more options. In the example below, we customize the row background-color for alternating rows. That is something we could not do with simple markdown.
 
In order to make sure everything has worked, we'll print a sample from the table to our documentation file. RMarkdown provides a shorthand notation for tables - just like Wiki markup. I never use these. HTML tables are easy enough to format and remember and they provide '''many''' more options. In the example below, we customize the row background-color for alternating rows. That is something we could not do with simple markdown.
Line 180: Line 182:
 
cat("<tr style=\"background-color:#CCFFF0;\"><th>Phobia</th><th>Fear of...</th></tr>\n")
 
cat("<tr style=\"background-color:#CCFFF0;\"><th>Phobia</th><th>Fear of...</th></tr>\n")
 
for (i in 1:7) {
 
for (i in 1:7) {
   r <- randRow(phobiaFrame)
+
   r <- randRow(phobias)
 
   if (i %% 2) {
 
   if (i %% 2) {
 
     cat("<tr style=\"background-color:#F9F9F9;\">")
 
     cat("<tr style=\"background-color:#F9F9F9;\">")
Line 206: Line 208:
 
</source>
 
</source>
  
This executes the code chunk with the label <code>randRow</code> without giving any output.
+
This executes the code chunk with the label <code>randRow</code> (and - you guessed it - the function will be defined in that chunk) without giving any output.
  
 
To finish off, paste the following:
 
To finish off, paste the following:
Line 216: Line 218:
  
 
<source lang="rsplus">
 
<source lang="rsplus">
```{r randRow}
+
```{r randRow}
randRow <- function(M, seed = FALSE) {
randRow <- function(M, seed=FALSE) {
+
  # Return a random row from a dataframe M.
   if (seed) set.seed(as.integer(seed))
+
   if (seed) {
   return(M[sample(1:nrow(M), 1),])
+
    set.seed(as.integer(seed))
 +
  }
 +
   return(M[sample(1:nrow(M), 1), ])
 
}
 
}
 
```
 
```
 
</source>
 
</source>
  
With this useful tool we can ponder on our favourite phobia of the day. For today, let it be **`r randRow(phobiaFrame, seed{{=}}1123581321)[2]`**, the fear of `r randRow(phobiaFrame, seed{{=}}1123581321)[1]`.
+
With this useful tool we can ponder on our favourite phobia of the day. For today, let it be **`r randRow(phobias, seed{{=}}1123581321)[2]`**, the fear of `r randRow(phobias, seed{{=}}1123581321)[1]`.
  
 
Reptiles! Awful.
 
Reptiles! Awful.
Line 231: Line 235:
 
This piece now contains the function definition for <code>randRow</code>, which it prints to the document after our comments. It also contains '''inline''' '''R''' code that is executed as the document is built.
 
This piece now contains the function definition for <code>randRow</code>, which it prints to the document after our comments. It also contains '''inline''' '''R''' code that is executed as the document is built.
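The inline code works only because calling <code>randRow()</code> twice with the same seed returns the same row both times. The sketch below checks this on a small made-up data frame (the column names and contents are illustrative and are not the OED table).

```r
# Same logic as the randRow() chunk above.
randRow <- function(M, seed = FALSE) {
  # Return a random row from a data frame M.
  if (seed) {
    set.seed(as.integer(seed))
  }
  M[sample(seq_len(nrow(M)), 1), ]
}

# A hypothetical stand-in for the scraped table.
toy <- data.frame(fear   = c("spiders", "reptiles", "heights"),
                  phobia = c("arachnophobia", "herpetophobia", "acrophobia"),
                  stringsAsFactors = FALSE)

a <- randRow(toy, seed = 112358)
b <- randRow(toy, seed = 112358)
print(identical(a, b))
```

Note that <code>set.seed()</code> changes the state of R's global random number generator, so any random sampling later in the same document is affected as well.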
  
* That should be all. You should be able to save the document and click the '''Knit HTML''' button to execute the code, build, and load a Webpage with the document we just wrote. Please get in touch if you run into problems.
+
* That should be all. You should be able to save the document and select (from the menu bar of the script pane) '''Knit &rarr; Knit to HTML''' to execute the code, build, and load a Webpage with the document we just wrote. If your code has errors in the chunks, they will be reported in the console.
  
<small>If all the pasting of bits and chunks was confusing, the final <code>.Rmd</code> file is [http://steipe.biochemistry.utoronto.ca/abc/CourseMaterials/BCB420/RandomPhobia.Rmd here].</small>
+
<small>If all the pasting of bits and chunks was confusing, the final <code>.Rmd</code> file is [http://steipe.biochemistry.utoronto.ca/abc/assets/RandomPhobia.Rmd here].</small>
  
 
}}
 
}}
Line 256: Line 260:
  
 
== Further reading, links and resources ==
 
== Further reading, links and resources ==
<!-- Formatting exqmples:
+
<div class="reference-box">[https://www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf R Markdown Reference Guide] (PDF @ RStudio)</div>
{{#pmid: 19957275}}
+
<div class="reference-box">[https://www.r-bloggers.com/using-rvest-to-scrape-an-html-table/ Using rvest to Scrape an HTML Table] (R Bloggers)</div>
<div class="reference-box">[http://www.ncbi.nlm.nih.gov]</div>
-->
 
  
 
{{Vspace}}
 
{{Vspace}}
Line 320: Line 323:
 
:2017-09-17
 
:2017-09-17
 
<b>Modified:</b><br />
 
<b>Modified:</b><br />
:2017-10-05
+
:2017-10-24
 
<b>Version:</b><br />
 
<b>Version:</b><br />
 
:1.0
 
:1.0

Revision as of 19:53, 25 October 2017

Literate Programming with R


 

Keywords:  (Draft) Literate programming principles; R Markdown; R Notebooks


 



 


 


Abstract

Documentation of results using R markdown and R notebooks.


 


This unit ...

Prerequisites


 


Objectives

This unit will ...

  • ... introduce the philosophy behind "Literate Programming";
  • ... teach the practice with an example that uses the R knitr package;
  • ... demonstrate R notebooks.


 


Outcomes

After working through this unit you ...

  • ... can produce your own "Literate Programs" with knitr or in an R notebook.


 


Deliverables

  • Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
  • Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.
  • Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.


 


Evaluation

Evaluation: NA

This unit is not evaluated for course marks.


 


Contents

Literate programming is an idea that software is best described in a natural language, focussing on the logic of the program, i.e. the why of code, not the what. The goal is to ensure that model, code, and documentation become a single unit, and that all this information is stored in one and only one location. The product should be consistent between its described goals and its implementation, seamless in capturing the process from start (data input) to end (visualization, interpretation), and reversible (between analysis, design and implementation).

In literate programming, narrative and computer code are kept in the same file. This source document is typically written in Markdown or LaTeX syntax and includes the programming code as well as text annotations, tables, formulas etc. The supporting software can weave human-readable documentation from this, or tangle executable code. Literate programming with both Markdown and LaTeX is supported by R Studio and this makes the R Studio interface a useful development environment for this paradigm. While it is easy to edit source files with a different editor and process them in base R after loading the Sweave() and Stangle() functions or the knitr package, we will use R Studio here because it conveniently integrates the functionality we need.

knitr is an R package for literate programming. It is integrated with R Studio.
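Weaving and tangling can be tried directly from the R console. The following is a minimal sketch, assuming the knitr package is installed: it writes a tiny, made-up source document to a temporary file, weaves it into Markdown with knit() and tangles it into a plain R script with purl().

```r
# Assumes the knitr package is installed.
library(knitr)

# Build the chunk delimiter (three backticks) programmatically.
fence <- strrep("`", 3)

# A tiny, hypothetical literate program written to a temporary file.
rmd <- tempfile(fileext = ".Rmd")
writeLines(c("# A tiny literate program",
             "",
             "The sum of the first ten integers:",
             "",
             paste0(fence, "{r sumTen}"),
             "sum(1:10)",
             fence),
           rmd)

# Weave: run the code and merge narrative, code and results into Markdown.
md <- knit(rmd, output = tempfile(fileext = ".md"), quiet = TRUE)

# Tangle: extract only the executable code into an R script.
rScript <- purl(rmd, output = tempfile(fileext = ".R"), quiet = TRUE)

cat(readLines(md), sep = "\n")
cat(readLines(rScript), sep = "\n")
```

Both outputs are derived from the one source file, which is the point of the approach: narrative and code cannot drift apart.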


 

RMarkdown

Markdown is an extremely simple and informal way of structuring documents that is useful if for some reason you feel html is too complicated. That's really all it does: format documents in a simple way so they can be displayed as Web pages. For Markdown documentation, see here. The concept is quite similar to Wiki markup syntax, but the syntax is (regrettably) different, and for a number of features there are (regrettably) several different ways to achieve the same results.

RMarkdown is an R package that is integrated with R Studio and allows integrating R code with Markdown documents. knitr can work with Markdown files, and this gives additional output options, such as PDF and MSWord documents.
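Outside of the R Studio buttons, the same conversion can be started from the console. The sketch below assumes the rmarkdown package is installed and that pandoc is available (R Studio bundles a copy); the one-line source document is made up for illustration.

```r
# Assumes the rmarkdown package is installed.
library(rmarkdown)

# A hypothetical one-line document with inline R code.
rmd <- tempfile(fileext = ".Rmd")
writeLines(c("---",
             "title: \"Demo\"",
             "---",
             "",
             "Two plus two is `r 2 + 2`."),
           rmd)

out <- NULL
if (pandoc_available()) {
  # Other formats include "word_document" and "pdf_document";
  # the latter additionally needs a LaTeX installation.
  out <- render(rmd, output_format = "html_document", quiet = TRUE)
  print(basename(out))
} else {
  message("pandoc was not found; skipping the render step.")
}
```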


Let's give it a try: we'll write and document an R function that will find us a random phobia to ponder on.

Task:
{{{1}}}


 

R Notebooks

R Notebooks take the concept into the RStudio editor itself, rather than constructing a Webpage. On one hand, you become dependent on the RStudio editor; on the other hand, you directly edit and comment as you are developing. This is true "Literate Programming".

Task:
Read about the concept here and follow along with the exercise.


 


 


Further reading, links and resources


 


Notes


 


Self-evaluation

 



 




 

If in doubt, ask! If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.



 

About ...
 
Author:

Boris Steipe (boris.steipe@utoronto.ca)

Created:

2017-09-17

Modified:

2017-10-24

Version:

1.0

Version history:

  • 1.0 First live version
  • 0.1 First stub

This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.