Difference between revisions of "R tutorial"

From "A B C"
Jump to navigation Jump to search
Line 1,626: Line 1,626:
 
         funds <- funds + (2 * bet)
 
         funds <- funds + (2 * bet)
 
         bet <- 1            # reset the bet to one dollar
 
         bet <- 1            # reset the bet to one dollar
     } else {                # we loose :-(
+
     } else {                # we lose :-(
         result <- "Loose."
+
         result <- "Lose."
 
         bet <- 3 * bet      # increase the bet to 3 times previous
 
         bet <- 3 * bet      # increase the bet to 3 times previous
 
     }
 
     }

Revision as of 23:11, 24 January 2017

R tutorial


This is a tutorial introduction to R for users with no previous background in the platform or the language. It's a bit of a sweeping introduction and a lot of material to digest all at once. You might want to return to it again at a later time.




 

Before you begin: Notation and Formatting

In this tutorial, I use specific notation and formatting to mean different things.

If you see footnotes[1], click on the number to read more.

This is normal text for explanations. It is written in a proportionally spaced font.

Code formatting is for code examples, file- and function names, directory paths etc. Code is written in a monospaced font[2].

Bold emphasis and underlining are to mark words as particularly important.

Examples of the right way to do something are highlighted green.

Examples of the wrong way to do something are highlighted red.


Task:
Tasks and exercises are described in boxes with a blue background. You have to do them, they are not optional. If you have problems, you must contact me and not simply continue. All material builds on previous material, therefore you can't skip.

What could possibly go wrong? ... Click to expand.


These sections have information about issues I encounter more frequently. They are required reading when you need to troubleshoot problems but also give background information that may be useful to avoid problems in the first place.

Click to collapse.

"Metasyntactic variables": When I use notation like <Year> in instructions, you type the year, the whole year and nothing but the year (e.g the four digits 2017). You never type the angle brackets! I use the angle brackets only to indicate that you should note type Year literally, but substitute the correct value. You might encounter this notation as <path>, <filename>, <firstname lastname> and similar. To repeat: if I specify

<your name>

... and your name is Elcid Barrett, You type

Elcid Barrett

... and not   your name   or   <Elcid Barret>   or similar. (Oh the troubles I've seen ...)


The sample code on this page sometimes copies text from the console, and sometimes shows the actual commands only. The > character at the beginning of the line is always just R's input prompt, it tells you that you can type something now - you never actually type > at the beginning of a line. If you read:

> getwd()

... you need to type:

getwd()


If a line starts with [1] or similar, this is R's output on the console.[3] The # character marks the following text as a comment which is not executed by R. These are lines that you do not type. They are program output, or comments, not commands.

Characters
Different characters mean different things for computers, and it is important to call them by their right name.
  • ( )  ◁ these are parentheses.
  • [ ]  ◁ these are (square) brackets.
  • < >  ◁ these are angle brackets.
  • { }  ◁ these are (curly) braces.
  •  "  ◁ this, and only this is a quotation mark or double quote. All of these are not: “”„«» . They will break your code. Especially the first two are often automatically inserted by MSWord and hard to distinguish.[4]
  •  '  ◁ this, and only this is a single quote. All of these are not: ‘’‚‹› . They will break your code. Especially the first two are often automatically inserted by MSWord and hard to distinguish.


 


 

The environment

In this section we discuss how to download and install the software, how to configure an R session and how to work in the R environment.


 

Files, directories and paths

Task:
Create a folder (directory) on your computer in which to keep materials for this course (or workshop, as the case may be). Put it into the right place, and give it the right name:

The right place is directly in the Documents folder of your account.
The right name is simply the <Coursecode> e.g. for a CBW workshop in 2016, you call the folder CBW, for a BCH441 course, the name should be BCH441.

Do not use spaces, hyphens, or any other special characters in your filename[5].


 

I will call this the course directory. (I use the words "folder" and "directory" synonymously and completely interchangeably.)

In my experience, it is better to organize file hierarchies wide, not deep. This means I aim to put more things in one folder rather than create elaborate directory structures. I need to look for stuff a lot, and looking more-or-less in the same folder keeps my files more visible. As you will find later, all the R project folders we create will have a common prefix – R_Exercise-..., so they should be easy to recognize and keep organized. So I would keep all material in one course directory, rather than creating subdirectories e.g. for R, exercises, assignments etc. etc.

A filename is a label that identifies a file. Often filenames have two parts: the actual name, and an extension. To specify a file on the computer's command line, or in R, you need to specify its full name including the extension. Now, the problem is that you can switch off viewing extensions in Windows; I'm afraid this is actually done by default. Then all hell breaks lose when you are trying to do "real" work. Files can't be found, or worse, can be inadvertently overwritten. Never allow your operating system to hide file extensions from you. You must be able to see the full name.

A path is the complete specification of where a file is located in the directory tree of your computer. Paths are simply directories strung together into a long string, separated by a forward slash "/" (on Mac or Unix) or a backslash "\" on Windows. Take note! When writing Windows paths in R, you have to use the "wrong" forward slash to specify the path. R will translate Unix-style paths into Windows-style paths automatically - but the backslash would be interpreted as an "escape" character that gives the following character a special meaning.

Folder name and path examples
  • /Users/Pierette/Documents/BCB420  ◁ Looking good on a Mac.
  • C:\Users\Pulcinella\Documents\CBW  ◁ Looking good on a Windows computer.
  • "C:/Users/Pulcinella/Documents/CBW"  ◁ Looking good inside R on a Windows computer (note the quotation marks!).


  • C:\Users\Pantalone\Documents\JTB2020 (2017)  ◁ Wrong. No special characters please.
  • /Users/Brighella/Documents/UofT Stuffz/Courses/more/Comp Sys biol. course  ◁ Wrong. Please read instructions more carefully.
  • C:\Users\Tartaglia\Documents\KUWTK\<Coursecode>  ◁ I can't even ...


 

Install R

Task:

  1. Navigate to CRAN (the Comprehensive R Archive Network)[6] and follow the link to Download R for your computer's operating system.
  2. Download a precompiled binary (or "build") of the R "framework" to your computer and follow the instructions for installing it. Make sure that the program is the correct one for your version of your operating system.
  3. Launch R.

The program should open a window–this window is called the "R console"–and greet you with its input prompt, awaiting your input:

>

Task:

Once you see that R is running correctly, you may quit the program for now.


What could possibly go wrong?...


I can't install R.
Make sure that the version you downloaded is the right one for your operating system. Also make sure that you have the necessary permissions on your computer to install new software.


 

Install RStudio

RStudio is a free IDE (Integrated Development Environment) for R. RStudio is a wrapper[7] for R and as far as basic R is concerned, all the underlying functions are the same, only the user interface is different (and there are a few additional functions that are very useful e.g. for managing projects).

Here is a small list of differences between R and RStudio.

pros (some pretty significant ones actually)
  • Integrated version control.
  • Support for "projects" that package scripts and other assets.
  • Syntax-aware code colouring.
  • A consistent interface across all supported platforms. (Base R GUIs are not all the same for e.g. Mac OS X and Windows.)
  • Code autocompletion in the script editor. (Depending on your point of view this can be a help or an annoyance. I used to hate it. After using it for a while I find it useful.)
  • The ability to set breakpoints for debugging in the script editor.
  • Support for knitr, Sweave, rmarkdown... (This supports "literate programming" and is actually a big advance in software development)
  • Support for R notebooks.
cons (all minor actually)
  • The tiled interface uses more desktop space than the windows of the R GUI.
  • There are sometimes (rarely) situations where R functions do not behave in exactly the same way in RStudio.
  • The supported R version is not always immediately the most recent release.

Task:

  • Navigate to the RStudio Website.
  • Find the right version of the RStudio Desktop installer for your computer, download it and install the software.
  • Open RStudio.
  • Focus on the bottom left pane of the window, this is the "console" pane.
  • Type getwd().

This prints out the path of the current working directory. Make a (mental) note where this is. We usually always need to change this "default directory" to a project directory.



 

"Projects"

We will make extensive use of "projects" in class. Read more about projects in RStudio here.


 

Git Version control

We will also make extensive use of version control. In fact, we will now load a project via Git version control from its free, public repository on GitHub.

Task:


 

Then do the following:

  • open RStudio
  • Select File → NewProject...
  • Click on Version Control
  • Click on Git
  • Enter https://github.com/hyginn/R_Exercise-BasicSetup as the Repository URL.
  • Type a <tab> character, the Project directory name field should then autofill to read R_Exercise-BasicSetup
  • Click on Browse... to find your project directory. (The one that you have created above). Click Open.
  • Click Create Project; the project files should be downloaded and the console should prompt you to type init() to begin.
  • Type init() into the console pane.

An R script should load.

  • Explore the script and follow its instructions.


 

What could possibly go wrong?...


I get an error message
"Git not found".
The simplest reason is that you may have had RStudio open while installing git. Just restart RStudio.
The executable for Git (the Git "program" - "git.exe" on Windows, "git" elsewhere) needs to be on your system's path, or correctly specified in RStudio's options. The correct "path" to Git will depend on your operating system, and how git was installed. To find where git is installed –
On Mac and Unix systems, open a Terminal window[8] and type which git. This will either print the path (Yay), or tell you that git is not found. The latter could have two reasons: either git has no been installed in the first place, or it has been installed in a non-standard location by whatever installation manager you have used. Ask Google to help you figure out how to solve your specific case.
On Windows you can find the location of the executable by searching "git.exe" in your "programs and files". Once it's been found, right click on it and select "Open file location" from the options. It might be in C:\Program Files\Git\cmd\git.exe but the exact location depends on your operating system.
Once you know the path to your git executable, open FilePreferences, click on the Git/SVN option, click on the Browse button, and find the correct folder. On Macs you may need to click <shift> <command> G to open the "Go to ..." dialogue, then type the top-folder of the path (e.g. /usr) and click your way down to folder where the program lives. Find the installation directory and select git.exe. Then click "ok".
Then try again to create the project and let us know what happened in case it still did not work.


 
I get an error message like "directory exists and is not empty".
A directory with the name of the project already exists in the location in which you are asking RStudio to create the project. Either delete the existing directory, or install the project into a different parent directory.


 
The git icon has disappeared.
I have seen this happen when somehow the path to git has changed.
(A) Make sure the correct path to git is set in your FilePreferences → Git/SVN.
(B) Open ToolsProject options...Git/SVN. Next to Version control system git must be selected, not (None). If it is (None), change this to git. If that's not an option, the path is not correct. Go back to (A).
(C) I think you may need to restart RStudio then and reload your project via the Files → Recent projects... menu for the git icon and the version control options to reappear.



 

Typing code or executing it?

As you have seen in the R script ...

you can type, or copy/paste code into the R or RStudio console, or if you are working with the R script in R or RStudio, just execute from the script.

In principle, commands can be copied by you from the Wiki and pasted into the console, or into a script (i.e. a text-file of R commands that can run as a program) - obviously, you don't need to copy the comments. In addition, I use syntax highlighting on R-code, to colour language keywords, numbers, strings, etc. different from other text. This improves readability but keep in mind that the colours you see on your computer may be different.

However, note the following: while it is convenient to copy/paste code, you don't learn how to write code through that. Practice has shown that it is better to actually type commands, even if you are just re-typing code from a book or online. Actively typing out the code character by character ensures you are reading and translating the code, and notice if anything is not entirely clear.[9] In computer code, every single character matters. For example, I expect that by typing out commands you will be much less likely to confuse = with <- or even ==. Also, you will sometimes mistype and create errors. That's actually good, because you quickly learn to spot errors, fix them, and resume. That way you build confidence.

However, for the project files we load from GitHub - these are too long to be retyped by you. Select the line, or parts of the line, or a larger block of code that you want to execute, then press <command> R, or <command><enter> (depending on your operating system) to execute the selected block. In this case you'll need extra effort to discipline yourself to read and understand every single character and command. The point is not to execute the scripts. The point is to understand their contents. The best way to do this is to edit the code, vary parameters, try alternatives and in general play.

However, actually working with code is another story. In this case scripts are indispensable for development. I type all steps of a typical analysis into a script - in this way I can always come back and reproduce the analysis: this makes the work reproducible. And I execute commands from the script, not the console - that's the easiest way to modify and develop. Making script and console work hand in hand is the way to work with R. There are four major advantages:

  1. The script is an accurate record of my procedure so I know exactly what I have done;
  2. I add numerous comments to record what I was thinking when I developed it;
  3. I can immediately reproduce the entire analysis from start to finish, simply by rerunning the script;
  4. I can reuse parts easily, thus making new analyses quick to develop.
  5. If I keep my script under version control, I can return to previous versions and undo errors. That was number five.


 

User interface

Both R and RStudio have a GUI[10] to lay out common tasks. For example, there are a number of menu items, many of which are similar to other programs you will have worked with ("File", "Edit", "Format", "Window", "Help" ...). All of these tasks can also be accessed through the command line in the console.

In general, GUIs are useful when you are not sure what you want to do or how to go about it; the command line is much more powerful when you have more experience and know your way around in principle. R gives you both options.

Let's look at some functions of the R console and associated windows that relate to how you work, not what you do.


 

The Help system

Task:

  • Start RStudio (if it's not already open) and as you work through the sections below, type the commands and explore what they do.


Help is available for all commands as well as for the R syntax. As well, help is available to find the names of commands when you are not sure of them. You can get help through the command line, or from a search field in the Help tab of the lower-right pane.


(help() is a function in R, arguments to a function are passed in parentheses "()")

> help(rnorm)
>


(shorthand for the same thing)

> ?rnorm
>


(what was the name of that again ... ?)

> ?binom     
No documentation for 'binom' in specified packages and libraries:
you could try '??binom'
> ??binom
>


(I see "Binomial" in the list of keywords...)

> ?Binomial
>


(Alternatively: use the apropos() function.

> ?apropos     
> apropos("med")   # all functions that contain the string "med"
> apropos("^med")  # all functions that begin with the string 
> apropos("med$")  # all functions that end with the string


If you need help on operators, place them in quotation marks. Try:

> ?"+"
> ?"~"
> ?"["
> ?"%in%"
>


That's all fine, but you will soon notice that R's help documentation is not all that helpful for newcomers (who need the most help). To illustrate, open the help window for the function var().

> ?var


Here's what you might look for.

  • The Description section describes the function in general technical terms.
  • The Usage section tells you what arguments are required (these don't have defaults), what arguments have defaults, and what the defaults are, and whether additional arguments ("...") are allowed. Often a function comes in several variants, you will find them here.
  • The Arguments section provides detailed information . You should read it, especially regarding whether the arguments are single values, vectors, or other objects, and what effect missing arguments will have.
  • The Details section might provide common usage and context information. It might also not. Often functions have crucial information buried in an innocuous note here.
  • You have to really understand the Value section. It explains the output. Importantly, it explains the type of object a function returns - it could be a list, a matrix or something else (we'll discuss these data types in detail below.). The value could also be an object that has special methods defined e.g. for plotting it. In that case, the object is formally a "list", and its named "components" can be retrieved with the usual list syntax (see below).

If you look at the bottom of the help text, you will usually find examples of the function's usage; these sometimes make matters more clear than the terse and principled help-text above.

What you often won't find:

  • Clear commented, examples that relate to the most frequent use cases.
  • Explanations why a particular function is done in a particular way (e.g. why the denominator is n-1 for sd() and var()).
  • Notes on common errors.
  • A (reasonably) exhaustive list of alternatives and related functions. There are usually some entries, but there is no guarantee that all alternatives are listed – especially if they are provided by an external package.


Therefore, my first approach for R information is usually to Google for what interests me and this is often the quickest way to find working example code. R has a very large user base and it is becoming very rare that a reasonable question will not have a reasonable answer among the top three hits of a Google search. Also, as a result of a Google search, it may turn out that something can't be done (easily) – and you won't find things that can't be done in the help system at all. You may want to include "r language" in your search terms, although Google is usually pretty good at figuring out what kind of "r" you are looking for, especially if your query includes a few terms vaguely related to statistics or programming.

  • There is an active R-help mailing list to which you can post–or at least search the archives: your question probably has been asked and answered before. A number of SIGs (Special Interest Groups) exist for more specific discussions - e.g. for mac OS, geography, ecology etc. They are listed here.
  • Most of the good responses these days are on stack overflow, discussion seems to be shifting to there from the R mailing list. Information on statistics questions can often be found or obtained from the CrossValidated forum of stackexchange.
  • Rseek is a specialized Google search on R-related sites. Try "time series analysis" for an example.


If you want a quick and constructive answer from the R mailing list or stackoverflow, you must do a bit of homework first. If you ask your question well, you will get incredibly insightful and helpful responses, but you need to help the helpers help you:

  • Use the dput() function, perhaps combined with head() to create a small, reproducible dataset with which your problem can be reproduced or your question illustrated. Keep this as small as possible. Post that.
  • Post minimal code that reproduces the problem with the data you have supplied. Together the code and data have to form an MWE – a minimal working example. People love to play with your code and get it to work, but they hate having to copy, paste, reformat or otherwise edit someone's stuff just so they can answer their question.
  • Don't waste too much time on explaining what you did (since that didn't work), but explain clearly what you want to achieve. Focus on the desired result - not on how to fix your algorithm; your algorithm may be the wrong mental model in the first place.
  • Don't post in HTML, be sure to post in plain text only.
  • Read "How to write a reproducible example" and "How to make a great R reproducible example" before you post.


 

Working directory

To locate a file in a computer, one has to specify the filename and the directory in which the file is stored; this is also called the path of the file. The "working directory" for R is either the directory in which the R-program has been installed, or some other directory, as initialized by a startup script. You can execute the command getwd() to list what the "Working Directory" is currently set to:


> getwd()
[1] "/Users/steipe/R"

In RStudio, the contents of the working directory is listed in the Files Pane.

It is convenient to put all your R-input and output files into a project specific directory and then define this to be the "Working Directory". The R working directory is the directory that R uses when you don't specify a path. Think of it as the default directory. Use the setwd() command for this. setwd() requires an argument that you type between the parentheses: a string with the directory path, or a variable containing such a string. Strings in R are delimited with " or ' characters. If the directory does not exist, an Error will be reported. Make sure you have created the directory. On Mac and Unix systems, the usual shorthand notation for relative paths can be used: ~ for the home directory, . for the current directory, .. for the parent of the current directory.

If you use a windows system, you need know that backslashes – "\" – have a special meaning for R, they work as escape characters. For example the string "\n" means newline, and "\t" means tab. Thus R gets confused when you put backslashes into string literals, such as Windows path names. R has a simple solution: you simply use forward slashes instead of backslashes when you specify paths, and R will translate them correctly when it talks to your operating system. Instead of C:\documents\projectfiles you write C:/documents/projectfiles. Also note that on Windows the ~ tilde is a shorthand for the directory in which R is installed, not the user's home directory.


My home directory...

> setwd("~") # Note: ~ is the "tilde" - the squiggly line - not the straight hyphen
> getwd()
[1] "/Users/steipe"

Relative path: home directory, up one level, then down into chen's home directory)

> setwd("~/../chen")  
> getwd()
[1] "/Users/chen"

Absolute path: specify the entire string)

> setwd("/Users/steipe/abc/R_samples")  
> getwd()
[1] "Users/steipe/abc/R_samples"


In RStudio you can use the Session → Set Working Directory menu. This includes the useful option to set the current project directory as the working directory.


Task:
Since you have gone through the script of the BasicSetup project, your working directory should be set to this project directory (I have configured the project to do this automatically.)

  1. Figure out the path to its parent directory - i.e. the course- or workshop directory you created at the beginning.
  2. Use setwd("<your/path/and/directory/name>") to set the working directory to the course directory.
  3. Confirm that this has worked by typing getwd() and list.files().

The Working Directory functions can also be accessed through the Menu, under Misc.


 

.Rprofile - startup commands

Often, when working on a project, you would like to start off in your working directory right away when you start up R, instead of typing the setwd() command. This is easily done in a special R-script that is executed automatically on startup[11]. The name of the script is .Rprofile and R expects to find it in the user's home directory. You can edit these files with a simple text editor like Textedit (Mac), Notepad (windows) or Gedit (Linux) - or, of course, by opening it in R itself[12].

Besides setting the working directory, other items that might go into such a file could be

  • libraries that you often use
  • constants that are not automatically defined
  • functions that you would like to preload.


For more details, use R's help function:

> ?Startup


 

The "Workspace"

During an R session, you might define a large number of variables, data structures, load packages and scripts etc. All of this information is stored in the so-called "Workspace". When you quit R you have the option to save the Workspace; it will then be restored in your next session. However, restoring the Workspace from a previous state is potentially a bad idea: if you load data or variables in a startup script, they may be overwritten with a corrupted version that you happened to save in the workspace when you last quit. This is very hard to troubleshoot. Essentially, when you save and reload your Workspace habitually, you have overlapping and potentially conflicting behaviour of startup script and Workplace restore.

What I prefer and recommend instead is the following:

  • Never save the Workspace.
  • Always work from scripts.
  • Write your scripts so that you can easily recreate all objects you need to continue your analysis.
  • If some objects are expensive to compute, you can always save() and later load() them explicitly. In fact, restoring the Workspace does the same thing, but you have less control regarding whether the version of your objects are correct, and what temporary variables may be loaded as well.
  • In this way, you work with explicit instructions, not implicit behaviour.
  • Explicit beats implicit.


List the current workspace contents: initially it is empty. (R reports an object of type "character" with a length of 0.)

> ls() 
character(0)
>

Initialize three variables

> a <- 3

Save one item in an .RData file.

{{{2}}}

Remove one item from the Workspace. (Note: the argument for rm() is not the string "a", but the variable name a. No quotation marks!)

> rm(a) 
> ls()
[1] "b"   "c"
>

Load what you previously saved.

> load("tmp.RData")
> ls() 
[1] "a"   "b"   "c"


Note: you can save() more than one item in an .RData file. When you then load() the file, all of the objects it contains are loaded. You don't assign the objects - they are being restored.


We can use the output of ls() as input to rm() to remove all items from the workspace. (cf. ?rm for details)

rm(list = ls()) 
> ls() 
character(0)
>


The contents of the workspace is displayed in RStudio's Environment Pane (top-right). You can see a little "brush" icon at the top that you can click to remove all items from the workspace.


 

Packages

R has many powerful functions built in, but one of it's greatest features is that it is easily extensible. Extensions have been written by legions of scientists for many years, most commonly in the R programming language itself, and made available through CRAN–The Comprehensive R Archive Network or through the Bioconductor project.

A package is a collection of code, documentation and (often) sample data. To use packages, you need to install them (once), and add them to your current session (for every new session). You can get an overview of installed and loaded packages by opening the Package Manager window from the Packages & Data Menu item. It gives a list of available packages you currently have installed, and identifies those that have been loaded at startup, or interactively.


 

Task:

library() opens a window of installed packages in the library; search() shows which one are currently loaded.

> library()
> search() 
 [1] ".GlobalEnv"        "tools:RGUI"        "package:stats"     "package:graphics" 
 [5] "package:grDevices" "package:utils"     "package:datasets"  "package:methods"  
 [9] "Autoloads"         "package:base"


  • In the Packages tab of the lower-right pane in RStudio, confirm that seqinr is not yet installed.
  • Follow the link to seqinr to see what standard information is available with a package. Then follow the link to the Reference manual to access the documentation pdf. This is also sometimes referred to as a "vignette" and contains usage hints and sample code.

Read the help for vignette. Note that there is a command to extract R sample code from a vignette, to experiment with it.

> ?vignette

Install seqinr from the closest CRAN mirror and load it for this session. Explore some functions.

> ??install
> ?install.packages
> install.packages("seqinr")   # Note: quoted string!
also installing the dependency ‘ade4’

trying URL 'https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.2/ade4_1.7-2.tgz'
Content type 'application/x-gzip' length 3365088 bytes (3.2 MB)
==================================================
downloaded 3.2 MB

trying URL 'https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.2/seqinr_3.1-3.tgz'
Content type 'application/x-gzip' length 2462893 bytes (2.3 MB)
==================================================
downloaded 2.3 MB


The downloaded binary packages are in
	/var/folders/mx/ld0hdst54jjf11hpcjh8snfr0000gn/T//Rtmpsy5GMx/downloaded_packages

> library(seqinr)     # This refers to an installed page. No quotes here...
> library(help="seqinr")
> ls("package:seqinr")
  [1] "a"                       "aaa"                     "AAstat"                 
  [4] "acnucclose"              "acnucopen"               "al2bp"                  
     [...]
[205] "where.is.this.acc"       "words"                   "words.pos"              
[208] "write.fasta"             "zscore"                 
> ?a
> a("Tyr")
[1] "Y"
> choosebank()
 [1] "genbank"       "embl"          "emblwgs"       "swissprot"     "ensembl"      
    [...]
 [31] "refseqViruses"



What could possibly go wrong?...


The installation fails.
You might see an error message such as this:
Warning message:
package ‘XYZ’ is not available (for R version 3.2.2)
This can mean several things:
  • The package is not available on CRAN. Try Bioconductor instead or Google to find it.
  • The package requires a newer version of R than the one you have. Upgrade, or see if a legacy version exists.
  • A comprehensive set of reasons and their resolution is here on stackoverflow.




We have seen the following on Windows systems when typing library(help="seqinr")
Error in formatDL(nm, txt, indent = max(nchar(nm, "w")) + 3) :
incorrect values of 'indent' and 'width'

Anecdotally this was due to a previous installation problem with a mixup of 32-bit and 64-bit R versions, although another student told us that the problem simply went away when trying the command again. Whatever: Make sure you have the right 'R version installed for your operating system. Uninstall and reinstall when in doubt. Conflicting libraries can be the source of strange misbehaviour.



 

Task:


  • The fact that these methods work, shows that the package has been downloaded, installed, the library has been loaded and its functions and data are now available in the current environment. Just like many other packages, seqinr comes with a number of data files. Try:
?data
data(package="seqinr")   # list the available data
data(aaindex)            # load ''aaindex''
?aaindex                 # what is this?
aaindex$FASG890101       # two of the indices ...
aaindex$PONJ960101

# Lets use the data: plot amino acid codes by hydrophobicity and volume
 
plot(aaindex$FASG890101$I,
     aaindex$PONJ960101$I, 
     xlab="hydrophobicity", ylab="volume", type="n")
text(aaindex$FASG890101$I, 
     aaindex$PONJ960101$I, 
     labels=a(names(aaindex$FASG890101$I)))


  • Now, just for fun, let's use these functions to download a sequence and calculate some statistics (however, not to digress too far, without further explanation at this point). Copy the code below and paste it into the R-console
choosebank("swissprot")
mySeq <- query("mySeq", "N=MBP1_YEAST")
mbp1 <- getSequence(mySeq)
closebank()
x <- AAstat(mbp1[[1]])
barplot(sort(x$Compo))

The function require() is similar to library(), but it does not produce an error when it fails because the package has not been installed. It simply returns TRUE if successful or FALSE if not. If the library has already been loaded, it does nothing. Therefore I usually use the following code paradigm in my R scripts to avoid downloading the package every time I need to run a script:

if (!require(seqinr, quietly=TRUE)) {
    install.packages("seqinr")
    library(seqinr)
}

Note that install.packages() takes a (quoted) string as its argument, but library() takes a variable name (without quotes). New users usually get this wrong :-)

One of the challenges of working with R is the overabundance of options. To find the right package that contains a particular function you might be looking for could be tricky, but there is a package to help you do that. Try this:

if (!require(sos, quietly=TRUE)) {
    install.packages("sos")
    library(sos)
}

findFn("moving average")


A good way to find packages in CRAN is also a keyword search on the Metacran site. Try this link:

http://www.r-pkg.org/search.html?q=regex


Note that the Bioconductor project has its own installation system, the bioclite() function. It is explained here.


 


 

Simple commands

The R command line evaluates expressions. Expressions can contain constants, variables, operators and functions of the various datatypes that R recognizes.


 

Operators

The common arithmetic operators are recognized in the usual way.

Task:
Try the following operators on numbers:

5
5 + 3
5 + 1 / 2 # Think first is this 3 or 5.5
3 * 2 + 1
3 * (2 + 1)
2^3 # Exponentiation
8 ^ (1/3) # Third root via exponentiation
7 %% 2  # Modulo operation (remainder of integer division)
7 %/% 2 # Integer division

# Logical operators return TRUE or FALSE
#    Unary:
! TRUE
! FALSE

#    Binary

1 == 2
1 != 2
1 < 2
1 > 2

1 > 1
1 >= 1
1 < 1
1 <= 1

#    & AND
TRUE & TRUE
TRUE & FALSE
FALSE & FALSE

#    | OR
TRUE | TRUE
TRUE | FALSE
FALSE | FALSE

# Predict what this will return
!(FALSE | (! FALSE))


 

Functions

R is considered an (impure) functional programming language and thus the focus of R programs is on functions. The key advantage is that this encourages programming without side-effects and this makes it easier to reason about the correctness of programs. Function parameters[13] are instantiated for use inside functions as the function's arguments, and a single result is returned[14]. The return values can either be assigned to a variable, or used directly as the argument of another function - intermediate assignment is not required.

Functions are either built-in (i.e. available in the basic R installation), loaded via specific packages (see above), or they can be easily defined by you (see below). In general a function is invoked through a name, followed by one or more arguments in parentheses, separated by commas. Whenever I refer to a function, I write the parentheses to identify it as such and not a constant or other keyword eg. log(). Here are some examples for you to try and play with:

cos(pi) #"pi" is a predefined constant.
sin(pi) # Note the rounding error. This number is not really different from zero.
sin(30 * pi/180) # Trigonometric functions use radians as their argument - this conversion calculates sin(30 degrees)
exp(1) # "e" is not predefined, but easy to calculate.
log(exp(1)) # functions can be arguments to functions - they are evaluated from the inside out.
log(10000) / log(10) # log() calculates natural logarithms; convert to any base by dividing by the log of the base. Here: log to base 10.
exp(complex(r=0, i=pi)) #Euler's identity

There are several ways to populate the argument list for a function and R makes a reasonable guess what you want to do. Arguments can either be used in their predefined order, or assigned via an argument name. Let's look at the complex() function to illustrate this. Consider the specification of a complex number in Euler's identity above. The function complex() can work with a number of arguments that are given in the documentation (see: ?complex). These include length.out, real, imaginary, and some more. The length.out argument creates a vector with one or more complex numbers. If nothing else is specified, this will be a vector of complex zero(s). If there are two, or three arguments, they will be placed in the respective slots. However, since the arguments are named, we can also define which slot of the argument list they should populate. Consider the following to illustrate this:

complex(1)
complex(4)
complex(1, 2) # imaginary part missing: if it's missing it defaults to zero
complex(1, 2, 3) # one complex number
complex(4, 2, 3) # four complex numbers
complex(real = 0, imaginary = pi) # defining values via named parameters
complex(imaginary = pi, real = 0) # same thing - if names are used, order is not important
complex(re = 0, im = pi) # names can be abbreviated ...
complex(r = 0, i = pi)   # ... to the shortest string that is unique among the named parameters.
                         # A strongly advise against this to keep your code readable for others.
complex(i = pi, 1, 0) # Think: what have I done here? Why does this work?
exp(complex(i = pi, 1, 0)) # (The complex number above is the same as in Euler's identity.)

Task:
A frequently used function is seq().

  • Read the help page about seq()
  • Use seq() to generate a sequence of integers from -5 to 3. Pass arguments in default order, don't use argument names.
  • Use seq() to generate a sequence of numbers from -2 to 2 in intervals of 1/3. This time, use argument names.
  • Use seq() to generate a sequence of 30 numbers between 1 and 100. Pass the arguments in the following order: length.out, to, from.


 

Variables

In order to store the results of expressions and computations, you can freely assign them to variables. Variables are created by R whenever you first use them (i.e. space in memory is allocated to the variable and a value is stored in that space.) Variable names distinguish between upper case and lower case letters. There are a small number of reserved names that you are not allowed to redefine, and a very small number of predefined constants, such as pi. However these constants can be overwritten - be careful: R will allow you to define pi <- 3 but casually redefining the foundations of mathematics may lead to unintended consequences. Read more about variable names at:

?make.names
?reserved

To assign a value to a constant, use the assignment operator <-. This is the default way of assigning values in R. You could also use the = sign, but there are subtle differences. (See: ?"<-"). There is a variant of the assignment operator <<- which is sometimes used inside functions. It assigns to a global context. This is possible, but not preferred since it generates a side effect of a function.

a <- 5
a
a + 3
b <- 8
b
a + b
a == b # not assignment: equality test
a != b # not equal
a < b  # less than

Note that all of R's data types (as well as functions and other objects) can be assigned to variables.

There are very few syntactic restrictions on variable names (discussed eg. here) but this does not mean esoteric names are good. For the sake of your sanity, use names that express the meaning of the variable, and that are unique. Many R developers use dotted.variable.names, some people use the pothole_style, my personal preference is to write camelCaseNames. And while the single letters c f n s Q are syntactically valid variable names, they coincide with commands for the debugger browser and will execute debugger commands, rather than displaying variable values when you are debugging. Finally, try not to use variable names that coincide with parameter names in functions. Alas, you see this often in code, but such code can be hard to read because the semantics of the actual argument versus the parameter name becomes obscured. It's just common sense really: don't call different things by the same name.

# I don't like...
col <- c("red", "grey")
hist(rnorm(200), col=col)

# I prefer instead something like...
rgStripes <- c("red", "grey")
hist(rnorm(200), col=rgStripes)


 


 

Data items

R objects can be composed from different kinds of data according to the type and number of "atomic" values they contain:

  • Scalar data are single values;
  • Vectors are ordered sequences of scalars, they must all have the same "data type" (e.g. numeric, logical, character ...);
  • Matrices are vectors for which one or more "dimension(s)" have been defined;
  • Data frames are spreadsheet-like objects, columns are like vectors and all columns must have the same length, but within one data frame, columns can have different data types;
  • Lists are the most general collection of data items, the can contain items of any type and kind, including matrices and lists.


 

Scalar data

Scalars are single numbers, the "atomic" parts of more complex datatypes. Of course we can work with single numbers in R, but under the hood they are actually vectors of length 1. (More on vectors in the next section). We have encountered scalars above, e.g. with the use of constants and their assignment to variables. To round this off, here are some remarks on the types of scalars R uses, and on coercion between types, i.e. casting one datatype into another. The following scalar types are supported:

  • Boolean constants: TRUE and FALSE. This type has the "mode" logical";
  • Integers, floats (floating point numbers) and complex numbers. These types have the mode numeric;
  • Strings. These have the mode character.

Other modes exist, such as list, function and expression, all of which can be combined into complex objects.

The function mode() returns the mode of an object and typeof() returns its type. Also class() tells you what class it belongs to.

typeof(TRUE)
class(3L)
mode(print)

I have combined these information functions into a single function, objectInfo() which gets loaded and defined with the BasicSetup script so you can experiment with it. We can use objectInfo() to explore how R objects are made up, by handing various expressions as arguments to the function. Many of these you may not yet recognize ... bear with it though:


#Let's have a brief look at the function itself: typing a function name without its parentheses returns the source code for the function:
objectInfo

# Various objects:

#Scalars:
objectInfo( 3.0 )    # Double precision floating point number
objectInfo( 3.0e0 )  # Same value, exponential notation

objectInfo( 3 )   # Note: integers are double precision floats by default.
objectInfo( 3L )  # If we really want an integer, we must use R's 
                  # special integer notation ...
objectInfo( as.integer(3) )  # or explicitly "coerce" to type integer...

# Coercions: For each of these, first think what result you would expect:
objectInfo( as.character(3) )  # Forcing the number to be interpreted as a character.
objectInfo( as.numeric("3") )   # character as numeric
objectInfo( as.numeric("3.141592653") )  # string as numeric. Where do the
                                         # non-zero digits at the end come from?
objectInfo( as.numeric(pi) )    # not a string, but a predefined constant
objectInfo( as.numeric("pi") )  # another string as numeric ... Ooops -
                                # why the warning?
objectInfo( as.complex(1) )  

objectInfo( as.logical(0) )  
objectInfo( as.logical(1) )  
objectInfo( as.logical(-1) )  
objectInfo( as.logical(pi) )      # any non-zero number is TRUE ...
objectInfo( as.logical("pie") )   # ... but not non-numeric types.
                                  # NA means "Not Available".
objectInfo( as.character(pi) )    # Interesting: the conversion eats digits.

objectInfo( Inf )                # Larger than the largest representable number
objectInfo( -Inf )               # ... or smaller
objectInfo( NaN )                # "Not a Number" is numeric
objectInfo( NA )                 # "Not Available" - i.e. missing value is
                                 # logical

# NULL
objectInfo( NULL )     # NULL is nothing. Not 0, not NaN,
                       # not FALSE - nothing. NULL is the value that is 
                       # returned by expressions or
                       # functions when the result is undefined. 

objectInfo( as.factor("M") )     # factor
objectInfo( Sys.time() )         # time
objectInfo( letters )            # inbuilt
objectInfo( 1:4 )                # numeric vector
objectInfo( matrix(1:4, nrow=2)) # numeric matrix
objectInfo( data.frame(arabic = 1:3,                           # dataframe
                       roman = c("I", "II", "III"), 
                       stringsAsFactors = FALSE))
objectInfo( list(arabic = 1:7, roman = c("I", "II", "III")))   # list

# Expressions:
objectInfo( 3 > 5 ) # Note: any combination of variables via the logical
                    # operators ! == != > < >= <= | || & and && is a 
                    # logical expression, with values TRUE or FALSE.
objectInfo( 3 < 5 ) 
objectInfo( 1:6 > 4 ) 

objectInfo( a ~ b )              # a formula
objectInfo( objectInfo )         # this function itself


 

Sometimes (but rarely) you may run into a distinction that R makes regarding integers and floating point numbers. By default, if you specify e.g. the number 2 in your code, it is stored as a floating point number. But if the numbers are generated e.g. from a range operator as in 1:2 they are integers! This can give rise to confusion as in the following example:


a <- 7
b <- 6:7
str(a)             # num 7
str(b)             # int [1:2] 6 7
a == b[2]          # TRUE
identical(b[2], a) # FALSE ! Not identical! Why?
                   # (see the str() results above.)

# If you need to be sure that a number is an
# integer, write it with an "L" after the number:
c <- 7L
str(c)             # int 7
identical(b[2], c) # TRUE


 

Vectors

Since we (almost) never do statistics on scalars, R obviously needs ways to handle collections of data items. In its simplest form such a collection is a vector: an ordered list of items of the same type. Vectors are created from scratch with the c() function which concatenates individual items into a list, or with various sequencing functions. Vectors have properties, such as length; individual items in vectors can be combined in useful ways. All elements of a vector must be of the same type. If they are not, they are coerced silently to the most general type (which is often character). (The actual hierarchy for coercion is raw < logical < integer < double < complex < character < list ).

# The c() function concatenates elements into a vector
c(2, 4, 6)


#Create a vector and list its contents and length:
f <- c(1, 1, 3, 5, 8, 13, 21)
f
length(f)

# Often, for teaching code, I want to demonstrate the contents of an object after 
# assigning it. I can simply wrap the assignment into parentheses to achieve that.
# Parentheses return the value of whatever they enclose. So ...
a <- 17
# ... assigns 17 to the variable "a". But this happens silently. However ...
( a <- 17 )
# ... returns the result of the assignment. I will use this idiom often.

( f <- c(1, 1, 3, 5, 8, 13, 21, 34, 55, 89) )


# Coercion: 
# all elements of vectors must be of the same mode
c(1, 2.0, "3", TRUE)  # trying to get a vector with mixed modes ...
[1] "1"    "2"    "3"    "TRUE" 

# ... shows that all elements are silently being coerced
# to character mode. The emphasis is on _silently_. This might
# be unexpected, for example if you are reading numeric data
# from a text-file but someone has entered a " " for a missing
# value... 



# Various ways to retrieve values from the vector:

# Extracting by index ...
f[1] # "1" is first element. 
f[length(f)] # length() is the index of the last element.

# With a vector of indices ...
1:4 # This is the range operator
f[1:4] # using the range operator (it generates a sequence and returns it in a vector)
f[4:1] # same thing, backwards
seq(from=2, to=6, by=2) # The seq() function is a flexible, generic way to generate sequences
seq(2, 6, 2) # Same thing: arguments in default order
f[seq(2, 6, 2)]

# since a scalar is a vector of length 1, does this work?
5[1]


# ...using an index vector with positive indices
a <- c(1, 3, 4, 1) # the elements of index vectors must be 
                   # valid indices of the target vector. 
                   # The index vector can be of any length.
f[a] # In this case, four elements are retrieved from f[]

# Negative indices omit elements ...
# ...using an index vector with negative indices

# If elements of index vectors are negative integers,
# the corresponding elements are excluded.
( a <- -(1:4) ) # Note that this is NOT the same as -1:4

f[a] # Here, the first four elements are omitted from f[]
f[-((length(f)-3):length(f))] # Here, the last four elements are omitted from f[]

# Extracting with a logical vector...
f > 4 # A logical expression operating on the target vector
      # returns a vector of logical elements. It has the
      # same length as the target vector.
f[f > 4]; # We can use this logical vector to extract only
          # elements for which the logical expression evaluates as TRUE.
          # This is sometimes called "filtering".
# Note: the logical vector is aligned with the elements of the original
# vector. You can't retrieve elements more than once, as you could
# with index vectors. If the logical vector is shorter than its target
# it is "recycled" to the full length.



# Example: extending the Fibonacci series for three steps. 
# Think: How does this work? What numbers are we adding here and why does the result end up in the vector?
( f <- c(f, f[length(f)-1] + f[length(f)]) )
( f <- c(f, f[length(f)-1] + f[length(f)]) )
( f <- c(f, f[length(f)-1] + f[length(f)]) )


# Some more thoughts about "["
# "[" is not just a special character, it is an operator. It
# operates on whatever it is attached to on the left. We have attached it
# to vectors above, but we can also attach it directly to function
# expressions, if the function returns a vector. For example, the 
# summary() function returns some basic statistics on a vector:
summary(f)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    5.00   21.00   75.69   89.00  377.00 

# This is a vector of six numbers:
length(summary(f))

# We can extract e.g. the median like so:
summary(f)[3]

# ... or the boundaries of the interquartile range:
summary(f)[c(2, 5)]

# Note that the elements that summary() returns are "named".
# "Names" are attributes.
objectInfo(summary(f))

# The names() function can retrieve (or set) names:
names(summary(f))

# ... which brings us to yet another way to extract elements from Vectors:

# Extracting named elements...
# If the vector has named elements, vectors of names can be used exactly like
# index vectors:

summary(f)["Median"]
summary(f)[c("Max", "Min")]  # Oooops - I mistyped. But you can fix the expression, right?

Many operations on scalars can be simply extended to vectors and R computes them very efficiently by iterating over the elements in the vector.

f
f+1
f*2

# computing with two vectors of same length
f  # the Fibonacci numbers you have defined above
( a <- f[-1] )  # like f, but omitting the first element
( b <- f[1:(length(f)-1)] ) # like f, but shortened by the least element
c <- a / b # the "golden ratio", phi (~1.61803 or (1+sqrt(5))/2 ), 
           # an irrational number, is approximated by the ratio of
           # two consecutive Fibonacci numbers.
c
abs(c - ((1+sqrt(5))/2)) # Calculating the error of the approximation, element by element


What could possibly go wrong?...


When a number is not a single number ...
One of the "warts" of R is that some functions substitute a range when they receive a vector of length one. Most everyone agrees this is pretty bad. This behaviour was introduced when someone sometime long ago thought it would be nifty to save two keystrokes. This has caused countless errors, hours of frustration and probably hundreds of undiscovered bugs instead. Today we wouldn't write code like that anymore (I hope), but the community believes that since it's been around for so long, it would probably break more things if it's changed. Two functions to watch out for are sample() and seq(); other functions include diag() and runif().
Consider:
x <- 8; sample(6:x)
x <- 7; sample(6:x)
x <- 6; sample(6:x)  # Oi!

# also consider
x <- 6:8; seq(x)
x <- 6:7; seq(x)
x <- 6:6; seq(x)    # Oi vay!
Wherever this misbehaviour is a possibility - i.e. when the number of elements to sample from is variable and could be just one, for example in some simulation code - you can write a replacement function like so...
safeSample <- function(x, size, ...) {
	# Replace the sample() function to ensure sampling from a single
	# value gives that value with probability p == 1.
        # Respect additional arguments if present.
    if (length(x) == 1 && is.numeric(x) && x > 0) {
    	if (missing(size)) size <- 1
        return(rep(x, size))
    } else {
        return(sample(x, size, ...))
    }
}
Don't be discouraged though: such warts are rare in R.

 



 

Matrices

If we need to operate with several vectors, or multi-dimensional data, we make use of matrices or more generally k-dimensional arrays R. Matrix operations are very similar to vector operations, in fact a matrix actually is a vector for which the number of rows and columns have been defined. Thus matrices inherit the basic limitation of vectors: all elements have to be of the same type.

The most basic way to define matrix rows and columns is to use the dim() function and specify the size of each dimension. Consider:

( a <- 1:12 )
dim(a) <- c(2,6)
a
dim(a) <- c(2,2,3)
a

dim() also allows you to retrieve the number of rows resp. columns a matrix has. For example:

dim(a)    # returns a vector
dim(a)[3]  # only the third value of the vector

If you have a two-dimensional matrix, the function nrow() and ncol() will also give you the number of rows and columns, respectively. Obviously, dim(a)[1] is the same as nrow(a).

As an alternative to dim(), matrices can be defined using the matrix() or array() functions (see there), or "glued" together from vectors by rows or columns, using the rbind() or cbind() functions respectively:

( a  <- 1:4 )
( b  <- 5:8 )
( m1 <- rbind(a, b) )
( m2 <- cbind(a, b) )
( m  <- cbind(m2, c = 9:12) )  # naming a column :c" while cbind()'ing it

Addressing (retrieving) individual elements or slices from matrices is simply done by specifying the appropriate indices, where a missing index indicates that the entire row or column is to be retrieved. This is called "subsetting" or "subscripting" and is one of the most important and powerful aspects of working with R.

Explore how you extract rows or columns from a matrix by specifying them. Within the square brackets the order is [rows, columns]

m[1,] # first row
m[, 2] # second column
m[3, 2] # element at row == 3, column == 2
m[3:4, 1:2] # submatrix: rows 3 to 4 and columns 1 to 2

More on subsetting below.

Note that R has numerous functions to compute with matrices, such as transposition, multiplication, inversion, calculating eigenvalues and eigenvectors and more.


 

Data frames

Data frames are one of the most important types of data object in bioinformatics because they emulate our mental model of data in a spreadsheet and can be used to implement datamodels.

Usually the result of reading external data from an input file is a data frame. The file below is included with the BasicSetup project files - it is called plasmidData.tsv, and you can click on it in the Files Pane to open and inspect it.

Name	Size	Marker	Ori	Sites
pUC19	2686	Amp	ColE1	EcoRI, SacI, SmaI, BamHI, XbaI, PstI, HindIII
pBR322	4361	Amp, Tet	ColE1	EcoRI, ClaI, HindIII
pACYC184	4245	Tet, Cam	p15A	ClaI, HindIII

This data set uses tabs as column separators and it has a header line. Similar files can be exported from Excel or other spreadsheet programs. Read this as a data frame as follows:

plasmidData <- read.table("plasmidData.tsv", sep="\t", header=TRUE, stringsAsFactors = FALSE)
plasmidData   # show what the data frame contains

Note the argument stringsAsFactors = FALSE. If this is TRUE instead, R will convert all strings in the input to factors and this may lead to problems. Make it a habit to turn this behaviour off, you can always turn a column of strings into factors when you actually mean to have factors.

You can view the data frame contents by clicking on the spreadsheet icon behind its name in the Environment Pane.


 

Lists

The elements of matrices and arrays all have to be of the same type. Data frames allow us to store elements of different types in columns but all columns have to have the same length. But lists are more generally ordered collections of components. These components can have different type, AND different size.

Lists are created with the list() function, which works similar to the c() function. Components are accessed through their index in double square brackets, or through their name, using the "$" operator, if the name has been defined. Here is an example:

pUC19 <- list(size=2686, marker="ampicillin", ori="ColE1", accession="L01397", BanI=c(235, 408, 550, 1647) )
pUC19[[1]]
pUC19[[2]]
pUC19$ori
pUC19$BanI[2]

Note that in the data frame, multiple restriction enzymes were stored in one string, separated by commas. While we can take data like this apart with the strsplit() function, it is still just one element in the data frame's column.


 

Subsetting

We have encountered "subsetting" before, but we really need to discuss this in more detail. It is one of the most important and powerful topics of R since it is indispensable to select, transform, and otherwise modify data to prepare it for analysis. You have seen that we use square brackets to indicate individual elements in vectors and matrices. These square brackets are actually "operators", and you can find more information about them in the help pages:

> ?"["     # Note that you need quotation marks around the operator for this.

Note especially:

  • [ ] "extracts" one or more elements defined within the brackets;
  • [[ ]] "extracts" a single element defined within the brackets;
  • ? "extracts" a single named element that.

"Elements" are not necessarily scalars, but can apply to a row, column, or more complex data structure. But a "single element" can't be a range, or collection.


Here are some examples of subsetting data from the plasmidData data frame we constructed above. For the most part, this is review of what we already did above:

plasmidData[1, ]
plasmidData[2, ]

# we can extract more than one row by specifying
# the rows we want in a vector ...
plasmidData[c(1, 2), ]   
                    
# ... this works in any order ...
plasmidData[c(3, 1), ]   

# ... and for any number of rows ...
plasmidData[c(1, 2, 1, 2, 1, 2), ]   


# Same for columns
plasmidData[ , 2 ]

# We can select rows and columns by name if a name has been defined...
plasmidData[, "Name"]
plasmidData$Name      # different syntax, same thing. This is the syntax I use most frequently.


# Watch this!
plasmidData$Name[plasmidData$Ori != "ColE1"]
# What happened here?
# plasmidData$Ori != "ColE1" is a logical expression, it gives a vector of TRUE/FALSE values
plasmidData$Ori != "ColE1"

# We insert this vector into the square brackets. R then returns all rows for
# which the vector is TRUE.

# In this way we can "filter" for values
plasmidData$Size > 3000
plasmidData$Name[plasmidData$Size > 3000]

# This principle is what we use when we want to "sort" an object
# by some value. The function order() is used to return values
# that are sorted. Remember this: not sort() but order().
order(plasmidData$Size)
plasmidData[order(plasmidData$Size), ]

# grep() matches substrings in strings and returns a vector of indices
grep("Tet", plasmidData$Marker)
plasmidData[grep("Tet", plasmidData$Marker), ]
plasmidData[grep("Tet", plasmidData$Marker), "Ori"]

Elements that can be extracted from an object also can be replaced. Simply assign the new value to the element.

( x <- sample(1:10) )
x[4] <- 99
x
( x <- x[order(x)] )

Try your own subsetting ideas. Play with this. I find that even seasoned investigators have problems with subsetting their data and if you become comfortable with the many ways of subsetting, you will be ahaed of the game right away.


 


 

Control structures

"Control structures" are parts of the syntax that execute code according to whether some condition is met. Let's look at this with some simple examples:


 

if and else

# Code pattern

if (<conditional expression evaluates to TRUE>) {
    <execute this block>
} else if (<expression evaluates to TRUE>) {
    <execute this block>
} else {
    <execute this block>
}

# conditional expressions
# anything that evaluates to TRUE or FALSE, or can be 
# coerced to a logical.
# Obviously the operators ! == < > <= >= can be used
# but there are also a number of in-built functions that
# are useful for this purpose:

?all
?any
?exists
?is.character
?is.factor
?is.integer
?is.null
?is.numeric
?is.unsorted
?is.vector



# Simple "if" statement:
# Rolling a die. If you get a "six", you get to roll again.

x <- sample(1:6, 1)
if (x == 6) {
    x <- c(x, sample(1:6, 1))
}
print(x)

# "if", "else if", and "else"
# Here is a popular dice game called high-low.

a <-  sample(1:6, 1)
b <-  sample(1:6, 1)
if (a + b > 7) {
    print("high")
} else if (a + b < 7) {
    print("low")
} else {
    print("seven")
}


 

We need to talk about conditional expressions that test for more than one condition for a moment, things like: "if this is true OR that is true OR my birthday is a Sunday this year ...". To join logical expressions, R has two distinct sets of operators: | and ||, and & and &&. | is for "or" and & is for "and". But what about && and ||? The single operators are "vectorized" whereas the doubled operators short-circuit. This means if you apply the single operators to a vector, you get the a vector of results:


x <- c(1, 3, 5, 7, 11, 13, 17)
x > 3 &  x < 17 # FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE: all comparisons
x [x > 3 &  x < 17]  #  5  7 11 13

x > 3 && x < 17 # FALSE: stop at the first FALSE

The vectorized version is usually what you want, e.g. for subsetting, as above. But it is usually not the right way in control structures: there, you usually want never to evaluate unnecessary expressions. Chained "OR" expressions can be aborted after the first TRUE is encountered, and chained "AND" expressions can be aborted after the first FALSE. Which is what the double operators do.

x <- numeric()

if (length(x) == 0 |  is.na(x)) { print("zero") }  # throws an error, because is.na() is
                                                   # evaluated even though x has length zero.

if (length(x) == 0 || is.na(x)) { print("zero") }  # no error: length test is TRUE so is.na()
                                                   # never gets evaluated.

Bottom line: always use || and && in control structures.


 

for

"for" loops are the workhorse of innumerable R scripts. They are controlled by a sequence, and a variable. The body of the loop is executed once for each element in the sequence. Most often, the sequence is a sequence of integers, created with the colon - the range operator. The template is:

for (<name> in <vector>) {
   <expressions ...>
}


# simple for loop
for (i in 1:10) {
	print(c(i, i^2, i^3))
}


# Let's stay with the high-low game for a moment:
# What are the odds of winning?
# Let's simulate some runs with a "for" loop.

N <- 25000
outcomes <- character(N)  # initialize an empty vector
for (i in 1:N) {          # repeat, assign each element of 1:N to
                          # the variable "i" in turn
    a <-  sample(1:6, 1)
    b <-  sample(1:6, 1)
    if (a + b > 7) {
        outcomes[i] <- "high"
    } else if  (a + b < 7) {
        outcomes[i] <- "low"
    } else {
        outcomes[i] <- "seven"
    }
}
head(outcomes, 36)
table(outcomes)  # the table() function tabulates the elements
                 # of a vector
                 
round((36 * table(outcomes))/N) # Can you explain this expression?

Note that there is nothing special about the expression for (i in 1:N) { ... . Any expression that generates a sequence of items will do; I write a lot of code like for (fileName in dir()) { ... or for (gene in data$name) {... , or for (column in colnames(expressionTable)) {... etc.

For-loops in R can be slow. That's usually not an issue if you only need to iterate over a thousand items or so. But you should know that there is a function apply() with a whole family of siblings that is usually about a hundred times faster than a for-loop.

# Compare excution times: one million square roots from a vector ...
n <- 1000000
x <- 1:n
y <- sqrt(x)

# ... or done explicitly in a for-loop
for (i in 1:n) {
  y[i] <- sqrt (x[i])
}

If you can achieve your result with an R vector expression, it will be faster than using a loop. But sometimes you need to do things explicitly, for example if you need to access intermediate results.

Here is an example to play some more with loops: a password generator. Passwords are a pain. We need them everywhere, they are frustrating to type, to remember and since the cracking programs are getting smarter they become more and more likely to be broken. Here is a simple password generator that creates random strings with consonant/vowel alterations. These are melodic and easy to memorize, but actually as strong as an 8-character, fully random password that uses all characters of the keyboard such as )He.{2jJ or #h$bB2X^ (which is pretty much unmemorizable). The former is taken from 207 * 77 1015 possibilities, the latter is from 948 ~ 6*1015 possibilities. High-end GPU supported password crackers can test about 109 passwords a second, the passwords generated by this little algorithm would thus take on the order of 106 seconds or eleven days to crack[15]. This is probably good enough to deter a casual attack.

Task:
Copy, study and run ...

# Suggest memorizable passwords
# Below we use the functions: 
?nchar
?sample
?substr
?paste
?print

#define a string of  consonants ...
con <- "bcdfghjklmnpqrstvwxz"
# ... and a string of of vowels
vow <- "aeiouy"

for (i in 1:10) {  # ten sample passwords to choose from ...
    pass = rep("", 14)  # make an empty character vector
    for (j in 1:7) {    # seven consonant/vowel pairs to be created ...
        k   <- sample(1:nchar(con), 1)  # pick a random index for consonants ...
        ch  <- substr(con,k,k)          #  ... get the corresponding character ...
        idx <- (2*j)-1                  # ... compute the position (index) of where to put the consonant ...
        pass[idx] <- ch                 # ...  and put it in the right spot

        # same thing for the vowel, but coded with fewer intermediate assignments
        # of results to variables
        k <- sample(1:nchar(vow), 1)
        pass[(2*j)] <- substr(vow,k,k) 
    }
    print( paste(pass, collapse="") )  # collapse the vector in to a string and print
}


 

while

Whereas a for-loop runs for a fixed number of times, a "while" loop runs as long as a condition is true, possibly forever. Here is an example, again our high-low game: this time we simulate what happens when we play it more than once with a strategy that compensates us for losing.

# Let's assume we are playing high-low in a casino. You can bet
# high or low. You get two dollars for one if you win, nothing
# if you lose. If you bet "high", you lose if we roll "low"
# or "seven". Thus your chances of winning are 15/36 = 42%. You play
# the following strategy: start with 33 dollars. Bet one dollar.
# If you win, good. If you loose, triple your bet. Stop the game
# when your funds are gone (bad), or if you have more than 100
# dollars (good) - i.e. you have tripled the funds you risked.
# Also stop if you've played more than 100 rounds and start
# getting bored.


set.seed(1234567)
funds <- 33
bet <- 1         # our first bet

nPlays <- 0      # this counts how often we've played
MAXPLAYS <- 100 

while (funds > 0 && funds < 100 && nPlays < MAXPLAYS) {
	
    bet <- min(bet, funds)  # can't bet more than we have.
    funds <- funds - bet    # place the bet
    a <-  sample(1:6, 1)    # roll the dice
    b <-  sample(1:6, 1)

    # we always play "high"
    if (a + b > 7) {        # we win :-)     
        result <- "Win!  "
        funds <- funds + (2 * bet)
        bet <- 1            # reset the bet to one dollar
    } else {                # we lose :-(
        result <- "Lose."
        bet <- 3 * bet      # increase the bet to 3 times previous
    }
    print(paste("Round", nPlays, result, 
                "Funds now:", funds,
                "Next bet:", bet))
    nPlays <- nPlays + 1
}

# Now before you get carried away - try this with different seeds
# and you'll quickly figure out that the odds of beating the game
# are not all that great...


 


 

Writing your own functions

R is a "functional programming language" and most if not all serious work will involve writing your own functions. This is easy and gives you access to flexible, powerful and reusable solutions. You have to understand the "anatomy" of an R function however.

  • Functions are assigned to function names. They are treated like any other R object and you can have vectors of functions, and functions that return functions etc.
  • Data gets into the function via the function's parameters.
  • Data is returned from a function via the return() statement[16]. One and only one object is returned. However the object can be a list, and thus contain values of arbitrary complexity. This is called the "value" of the function. Well-written functions have no side-effects like changing global variables.


#defining the function:
myFunction <- function(<myParameters>) { 
	result <- <do something with my parameters>
	return(result)
}




 

The scope of functions is local: this means all variables within a function are lost upon return, and global variables are not overwritten by a definition within a function. However variables that are defined outside the function are also available inside.

Here is a simple example: a function that takes a binomial species name as input and creates a five-letter code as output:

biCode <- function(s) { 
	substr(s, 4, 5) <- substr(strsplit(s,"\\s+")[[1]][2], 1, 2)
	return (toupper(substr(s, 1, 5)))
}

biCode("Homo sapiens")              # HOMSA
biCode("saccharomyces cerevisiae")  # SACCE

We can use loops and control structures inside functions. For example the following creates a vector containing n Fibonacci numbers.

fibSeq <- function(n) { 
   if (n < 1) { return( 0 ) }
   else if (n == 1) { return( 1 ) }
   else if (n == 2) { return( c(1, 1) ) }
   else {
      v <- numeric(n)
      v[1] <- 1
      v[2] <- 1
      for ( i in 3:n ) {
         v[n] <- v[n-2] + v[n-1]
      }
      return( v )
   }
}


 


The function template looks like:

<name> <- function (<parameters>) {
   <statements>
}

In this statement, the function is assigned to the name - any valid name in R. Once it is assigned, it the function can be invoked with name(). The parameter list (the values we write into the parentheses followin the function name) can be empty, or hold a list of variable names. If variable names are present, you need to enter the corresponding parameters when you execute the function. These assigned variables are available inside the function, and can be used for computations. This is called "passing the variable into the function".

You have encountered a function to choose YFO names. In this function, your Student ID was the parameter. Here is another example to play with: a function that calculates how old you are. In days. This is neat - you can celebrate your 10,000 birthday - or so.

Task:
Copy, explore and run ...

Define the function ...
# A lifedays calculator function

myLifeDays <- function(date = NULL) { # give "date" a default value so we can test whether it has been set
    if (is.null(date)) {
        print ("Enter your birthday as a string in \"YYYY-MM-DD\" format.")
        return()
    }
    x <- strptime(date, "%Y-%m-%d") # convert string to time
    y <- format(Sys.time(), "%Y-%m-%d") # convert "now" to time
    diff <- round(as.numeric(difftime(y, x, unit="days")))
    print(paste("This date was ", diff, " days ago."))
}
Use the function (example)
   myLifeDays("1932-09-25")  # Glenn Gould's birthday

Here is a good opportunity to play and practice programming: modify this function to accept a second argument. When a second argument is present (e.g. 10000) the function should print the calendar date on which the input date will be that number of days ago. Then you could use it to know when to celebrate your 10,000th lifeDay, or your 777th anniversary day or whatever.


 

Coding style

 

Debugging

When something goes wrong in your code, you need to look at intermediate values, as the code executes. Almost always sprinkling print() statements all over your code to retrieve such intermediate values is the least efficient way to isolate problems. But what is worse, you are temporarily modifying your code, and there is a significant risk that that this will create problems later.

Right from the beginning of your programming trajectory, you should make yourself familiar with R's debug functions.

  • At first, you may need to pin down approximately where an error occurs. Read the error message carefully, or perhaps do print out some intermediate values from a loop.
  • Debugging is done by entering a "browser" mode that allows you to step through a function.
  • To enter this browser mode ...
    • Call debug(function). When function() is next executed, R will enter the browser mode. Call undebug(function) to clear the debugging mode. (Or use the debugonce(function). )
    • Alternatively insert browser() into your function code to enter the browser mode. This sets a breakpoint into your function; use if (condition) browser() to insert a conditional breakpoint (or watchpoint). This is especially useful if the problem occurs only rarely, in a particular context.
  • It should go without saying that you need to discover that there are problems in the first place: test, test, and test again.


 

Here is an example: let's write a rollDice-function, i.e. a function that creates a vector of n integers between 1 and MAX - the number of faces on your die.

rollDice <- function(len=1, MIN=1, MAX=6) {
    v <- rep(0, len)
    for (i in 1:len) {
        x <- runif(1, min=MIN, max=MAX)
        x <- as.integer(x)
        v[i] <- x
    }
    return(v)
}

Lets try running this...

rollDice()
table(rollDice(1000))

Problem: we see only values from 1 to 5. Why? Lets flag the function for debugging...

debug(rollDice)
rollDice(10)
debugging in: rollDice(10)
debug at #1: {
    v <- rep(0, len)
    for (i in 1:len) {
        x <- runif(1, min = MIN, max = MAX)
        x <- as.integer(x)
    	v[i] <- x
    }
    return(v)
}
Browse[2]> 
debug at #2: v <- rep(0, len)
Browse[2]> 
debug at #3: for (i in 1:len) {
    x <- runif(1, min = MIN, max = MAX)
    x <- as.integer(x)
    v[i] <- x
}
Browse[2]> 
debug at #4: x <- runif(1, min = MIN, max = MAX)
Browse[2]> 
debug at #5: x <- as.integer(x)
Browse[2]> x   # Here we examine the current value of x
[1] 4.506351
Browse[2]> 
debug at #6: v[i] <- x
Browse[2]> 
debug at #4: x <- runif(1, min = MIN, max = MAX)
Browse[2]> v
[1] 4      # Aha: as.integer() truncates, but doesn't round!
Browse[2]> Q
undebug(rollDice)


We need to change the range of the random input values...

rollDice <- function(len=1, MIN=1, MAX=6) {
    v <- rep(0, len)
    for (i in 1:len) {
    	x <- runif(1, min=MIN, max=MAX+1)
    	x <- as.integer(x)
    	v[i] <- x
    }
    return(v)
}
table(rollDice(1000))


Now the output looks correct.

# Disclaimer 1: this function would be better
# written as ...

rollDice <- function(len=1, MIN=1, MAX=6) {
	return(as.integer(runif(len, min=MIN, max=MAX+1)))
}

# Check the output:
table(rollDice(1000))

# This works, since runif() can return a vector of deviates,
# but if we write the function this way we can't check the value of
# individual trials.


# Disclaimer 2: the function relies on a side-effect of as.integer(), which is
# to drop the digits after the comma when it converts. More explicit and
# therefore clearer would be to use the function floor() instead. Here, the
# truncation is not a side effect, but the desired behaviour. This is
# actually important: there is no guarantee how as.integer() constructs an
# integer from a float, it could e.g. round, instead of truncating. But rounding
# would give a wrong distribution! An error that may be hard to spot. (You
# can easily try using the round() function and think about how the result is wrong.)

# A better alternative is thus to write:

rollDice <- function(len=1, MIN=1, MAX=6) {
	return(floor(runif(len, min=MIN, max=MAX+1)))
}



# Disclaimer 3
# A base R function exists that already rolls dice in the required way: sample()

table(sample(1:6, 1000, replace=TRUE))


For the helpful debugging interface that comes with with RStudio, see here and here.

For a deeper excursion into R debugging, see this overview by Duncan Murdoch at UWO, and Roger Peng's introduction to R debugging tools.


 


 

Finishing up

This concludes our introduction to R. We haven't actually done anything significant with the language yet, but we have developed the basic skills to begin. The next steps should be:

  • read data;
  • organize it;
  • explore it with descriptive statistics;
  • explore patterns, structures and relationships within the data graphically (graphics, graphics and more graphics);
  • formulate hypotheses;
  • test the hypotheses.


 


 

Notes

  1. ... and when you click on the arrow to the left, this will take you back to where you came from.
  2. Proportional fonts are for elegant document layout. Monospaced fonts are needed to properly align characters in columns. For code and sequences, we alway use monospaced font. Code editors always use monospaced fonts, but since I need to eMail a lot of code and sequences, I have also set my eMail client to use monospaced font by default (Courier, or Monaco). I highly encourage you to do the same.
  3. [1] means: the following is the first (often only) element of a vector.
  4. Never, ever edit code in MS Word. Use R or RStudio. Actually, don't use notepad or TextEdit either.
  5. After the course, you can rename / move the directory to whatever, wherever you want, but during the course, I need your files in a predictable location to be able to troubleshoot problems.
  6. You can also use one of the mirror sites, if CRAN is down - for example the mirror site at the University of Toronto. A choice of mirror sites is listed on the R-project homepage.
  7. A "wrapper" program uses another program's functionality in its own context. RStudio is a wrapper for R since it does not duplicate R's functions, it runs the actual R in the background.
  8. The Terminal app is in the Utilities sub-folder of your Applications folder.
  9. I think we are using a predictive mental model when we type - something like an inbuilt autocorrect-suggestion mechanism; thus if you type something unfamiliar or surprising (e.g. a subtle detail of syntax), you will notice and be able to figure out the issue. Pasting code by contrast is merely mechanical.
  10. A GUI is a Graphical User Interface, it has windows and menu items, as opposed to a "command line interface".
  11. Actually, the first script that runs is Rprofile.site which is found on Linux and Windows machines in the C:\Program Files\R\R-{version}\etc directory. But not on Macs.
  12. Operating systems commonly hide files whose name starts with a period "." from normal directory listings. All files however are displayed in RStudio's File pane. Nevertheless, it is useful to know how to view such files by default. On Macs, you can configure the Finder to show you such "hidden files" by default. To do this: (i) Open a terminal window; (ii) Type: $defaults write com.apple.Finder AppleShowAllFiles YES (iii) Restart the Finder by accessing Force quit (under the Apple menu), selecting the Finder and clicking Relaunch. (iV) If you ever want to revert this, just do the same thing but set the default to NO instead.
  13. The terms parameter and argument have similar but distinct meanings. A parameter is an item that appears in the function definition, an argument is the actual value that is passed into the function.
  14. However a function may have side-effects, such as writing something to console, plotting graphics, saving data to a file, or changing the value of variables outside the function scope. Avoid the latter, it is fragile and poor practice.
  15. That's assuming the worst case in that the attacker needs to know the pattern with which the password is formed, i.e. the number of characters and the alphabet that we chose from. But note that there is an even worse case: if the attacker had access to our code and the seed to our random number generator. When the random number generator starts up, a new seed is generated from system time, thus the possible space of seeds can be devastatingly small. But even if a seed is set explicitly with the set.seed() function, that seed is a 32-bit integer and thus can take only a bit more than 4*109 values, six orders of magnitude less than the 1015 password complexity we thought we had. It turns out that the code may be a much greater vulnerability than the password itself. Keep that in mind. Keep it secret. Keep it safe.
  16. Actually the return() statement is optional, if missing, the result of the last expression is returned. I consider it poor practice to omit return(), this gives rise to error-prone code.


 


 

Further reading and resources

   R on Wikipedia

User-contributed documents about R at CRAN – including for example E. Paradis' R for Beginners and J. Lemon's Kickstarting R.
The "Task-views" section of CRAN: thematically organized collections of R-packages.

Of course running R and RStudio is just part of configuring your computer as a bioinformatics tool. Here is a link to a good guide to settting up a Mac under El Capitan for biocomputing. I personally have had good experiences with homebrew. I don't know about a simmilar, good, current guide for setting up Windows computers for biocomputing however.