RPR-Installation
Installing R and RStudio
Keywords: Notation; installing R and RStudio; project directory
Contents
This unit is under development. There is some content here, but it is incomplete and/or may change significantly: links may lead nowhere, the content is likely to be rearranged, and objectives, deliverables etc. may be incomplete or missing. Do not work with this material until it is updated to "live" status.
Abstract
...
This unit ...
Prerequisites
You need to complete the following units before beginning this one:
Objectives
...
Outcomes
...
Deliverables
- Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
- Journal: Document your progress in your course journal.
- Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.
Evaluation
Evaluation: NA
- This unit is not evaluated for course marks.
Contents
Before you begin: Notation and Formatting
In this tutorial, I use specific notation and formatting to mean different things.
If you see footnotes[1], click on the number to read more.
This is normal text for explanations. It is written in a proportionally spaced font.
Code formatting is for code examples, file- and function names, directory paths etc.
Code is written in a monospaced font[2].
Bold emphasis and underlining are to mark words as particularly important.
Examples of the right way to do something are highlighted green.
Examples of the wrong way to do something are highlighted red.
Task:
Tasks and exercises are described in boxes with a blue background. You have to do them; they are not optional. If you have problems, you must contact me and not simply continue. All material builds on previous material, so you can't skip anything.
What could possibly go wrong? ... Click to expand.→
These sections have information about issues I encounter more frequently. They are required reading when you need to troubleshoot problems but also give background information that may be useful to avoid problems in the first place.
Click to collapse.↗
"Metasyntactic variables": When I use notation like <Year>
in instructions, you type the year, the whole year and nothing but the year (e.g. the four digits 2017). You never type the angle brackets! I use the angle brackets only to indicate that you should not type Year literally, but substitute the correct value. You might encounter this notation as <path>
, <filename>
, <firstname lastname>
and similar. To repeat: if I specify
<your name>
... and your name is Elcid Barrett, you type
Elcid Barrett
... and not your name or <Elcid Barrett> or similar. (Oh the troubles I've seen ...)
The sample code on this page sometimes copies text from the console, and sometimes shows the actual commands only. The >
character at the beginning of the line is always just R's input prompt. It tells you that you can type something now; you never actually type >
at the beginning of a line. If you read:
> getwd()
... you need to type:
getwd()
If a line starts with [1]
or similar, this is R's output on the console.[3] The #
character marks the following text as a comment which is not executed by R. These are lines that you do not type. They are program output, or comments, not commands.
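To illustrate how commands, comments and output differ, here is a minimal sketch. The variable names x and n are made up for this example; they are not part of the course material. Type (or paste) the commands into the console and compare what R prints with the comments.

```r
# A whole-line comment: R ignores everything after the # character.
x <- 101:130     # a vector of 30 integers (a trailing comment is ignored too)
print(x)         # output wraps; each output line starts with the index of its first element
n <- length(x)   # the number of elements in x
print(n)         # prints: [1] 30
```

The bracketed numbers at the start of each output line are produced by R. You never type them.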
- Characters
- Different characters mean different things for computers, and it is important to call them by their right name.
( ) ◁ these are parentheses.
[ ] ◁ these are (square) brackets.
< > ◁ these are angle brackets.
{ } ◁ these are (curly) braces.
" ◁ this, and only this is a quotation mark or double quote. All of these are not: “”„«» . They will break your code. Especially the first two are often automatically inserted by MSWord and hard to distinguish.[4]
' ◁ this, and only this is a single quote. All of these are not: ‘’‚‹› . They will break your code. Especially the first two are often automatically inserted by MSWord and hard to distinguish.
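If text has already picked up typographic quotes, one possible repair is sketched below. This is an illustration under my own assumptions, not a course requirement: the variable names pasted and cleaned are hypothetical, and the Unicode escapes \u201c and \u201d stand for the left and right curly double quotes.

```r
# A string that contains curly double quotes, as a word processor might produce
pasted <- "\u201cACGT\u201d"

# Replace either curly quote with a plain, straight double quote
cleaned <- gsub("\u201c|\u201d", "\"", pasted)

cat(cleaned, "\n")   # show the repaired text
```

It is much better to avoid the problem in the first place by writing code only in a code editor.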
The environment
In this section we discuss how to download and install the software, how to configure an R session and how to work in the R environment.
Files, directories and paths
Task:
Create a folder (directory) on your computer in which to keep materials for this course (or workshop, as the case may be). Put it into the right place, and give it the right name:
- The right place is directly in the Documents folder of your account.
- The right name is simply the <Coursecode>, e.g. for a CBW workshop in 2016, you call the folder CBW; for a BCH441 course, the name should be BCH441.
Do not use spaces, hyphens, or any other special characters in your filename[5].
I will call this the course directory. (I use the words "folder" and "directory" synonymously and completely interchangeably.)
In my experience, it is better to organize file hierarchies wide, not deep. This means I aim to put more things in one folder rather than create elaborate directory structures. I need to look for stuff a lot, and looking more-or-less in the same folder keeps my files more visible. As you will find later, all the R project folders we create will have a common prefix – R_Exercise-...
, so they should be easy to recognize and keep organized. So I would keep all material in one course directory, rather than creating subdirectories e.g. for R, exercises, assignments etc. etc.
A filename is a label that identifies a file. Often filenames have two parts: the actual name, and an extension. To specify a file on the computer's command line, or in R, you need to specify its full name including the extension. Now, the problem is that you can switch off viewing extensions in Windows; I'm afraid this is actually done by default. Then all hell breaks loose when you are trying to do "real" work. Files can't be found, or worse, can be inadvertently overwritten. Never allow your operating system to hide file extensions from you. You must be able to see the full name.
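As a small sketch of how R itself sees the parts of a filename: the file myScript.R below is hypothetical and does not need to exist, since these functions only take the string apart.

```r
# A hypothetical file in a course directory (nothing is read from disk)
myFile <- "/Users/Pierette/Documents/BCB420/myScript.R"

fileName  <- basename(myFile)         # the full filename: "myScript.R"
extension <- tools::file_ext(myFile)  # only the extension: "R"
folder    <- dirname(myFile)          # the directory that contains the file

print(fileName)
print(extension)
print(folder)
```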
A path is the complete specification of where a file is located in the directory tree of your computer. Paths are simply directories strung together into a long string, separated by a forward slash "/" (on Mac or Unix) or a backslash "\" on Windows. Take note! When writing Windows paths in R, you have to use the "wrong" forward slash to specify the path. R will translate Unix-style paths into Windows-style paths automatically - but the backslash would be interpreted as an "escape" character that gives the following character a special meaning.
- Folder name and path examples
- /Users/Pierette/Documents/BCB420 ◁ Looking good on a Mac.
- C:\Users\Pulcinella\Documents\CBW ◁ Looking good on a Windows computer.
- "C:/Users/Pulcinella/Documents/CBW" ◁ Looking good inside R on a Windows computer (note the quotation marks!).
- C:\Users\Pantalone\Documents\JTB2020 (2017) ◁ Wrong. No special characters please.
- /Users/Brighella/Documents/UofT Stuffz/Courses/more/Comp Sys biol. course ◁ Wrong. Please read instructions more carefully.
- C:\Users\Tartaglia\Documents\KUWTK\<Coursecode> ◁ I can't even ...
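The following sketch shows how such paths can be written as R strings. It only builds and compares strings, so the directories need not exist on your computer; the variable names are made up for this example.

```r
# Forward slashes work in R, on Windows too
winPath <- "C:/Users/Pulcinella/Documents/CBW"

# If you insist on backslashes, each one must be doubled,
# because a single backslash is R's escape character
escaped <- "C:\\Users\\Pulcinella\\Documents\\CBW"
cat(escaped, "\n")   # shows the path with single backslashes

# Converting the backslashes to forward slashes gives the same path as winPath
converted <- gsub("\\\\", "/", escaped)

# file.path() joins directory names with forward slashes for you
coursePath <- file.path("~", "Documents", "BCH441")
print(coursePath)
```

In practice I would let file.path() build the path, rather than typing separators by hand.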
Install R
Task:
- Navigate to CRAN (the Comprehensive R Archive Network)[6] and follow the link to Download R for your computer's operating system.
- Download a precompiled binary (or "build") of the R "framework" to your computer and follow the instructions for installing it. Make sure that the program is the correct one for your version of your operating system.
- Launch R.
The program should open a window–this window is called the "R console"–and greet you with its input prompt, awaiting your input:
>
Task:
Once you see that R is running correctly, you may quit the program for now.
What could possibly go wrong?...
- I can't install R.
- Make sure that the version you downloaded is the right one for your operating system. Also make sure that you have the necessary permissions on your computer to install new software.
Install RStudio
RStudio is a free IDE (Integrated Development Environment) for R. RStudio is a wrapper[7] for R and as far as basic R is concerned, all the underlying functions are the same, only the user interface is different (and there are a few additional functions that are very useful e.g. for managing projects).
Here is a small list of differences between R and RStudio.
- pros (some pretty significant ones actually)
- Integrated version control.
- Support for "projects" that package scripts and other assets.
- Syntax-aware code colouring.
- A consistent interface across all supported platforms. (Base R GUIs are not all the same for e.g. Mac OS X and Windows.)
- Code autocompletion in the script editor. (Depending on your point of view this can be a help or an annoyance. I used to hate it. After using it for a while I find it useful.)
- The ability to set breakpoints for debugging in the script editor.
- Support for knitr, Sweave, rmarkdown... (This supports "literate programming" and is actually a big advance in software development)
- Support for R notebooks.
- cons (all minor actually)
- The tiled interface uses more desktop space than the windows of the R GUI.
- There are sometimes (rarely) situations where R functions do not behave in exactly the same way in RStudio.
- The supported R version is not always immediately the most recent release.
Task:
- Navigate to the RStudio Website.
- Find the right version of the RStudio Desktop installer for your computer, download it and install the software.
- Open RStudio.
- Focus on the bottom left pane of the window; this is the "console" pane.
- Type getwd().
This prints out the path of the current working directory. Make a (mental) note where this is. We usually need to change this "default directory" to a project directory.
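Changing the working directory can be sketched as follows. To keep the example safe to run anywhere, it uses a folder inside R's temporary directory as a stand-in for your course directory; that stand-in and the variable names are assumptions of this example, and in real work you would use the path to your own course directory instead.

```r
oldDir <- getwd()                             # remember where we started

# Stand-in for the course directory (an assumption for this sketch only)
courseDir <- file.path(tempdir(), "BCH441")
dir.create(courseDir, showWarnings = FALSE)   # create it if it does not exist

setwd(courseDir)                              # change the working directory
print(basename(getwd()))                      # confirm: the last part of the path

setwd(oldDir)                                 # change back again
```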
Packages
R has many powerful functions built in, but one of its greatest features is that it is easily extensible. Extensions have been written by legions of scientists for many years, most commonly in the R programming language itself, and made available through CRAN–The Comprehensive R Archive Network or through the Bioconductor project.
A package is a collection of code, documentation and (often) sample data. To use packages, you need to install them (once), and add them to your current session (for every new session). You can get an overview of installed and loaded packages by opening the Package Manager window from the Packages & Data Menu item. It gives a list of available packages you currently have installed, and identifies those that have been loaded at startup, or interactively.
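The difference between "installed" and "loaded" can also be checked from the console. The sketch below uses the stats package, which ships with R and is normally attached at startup; the variable names are made up for this example.

```r
# Is the package installed in one of R's libraries?
isInstalled <- "stats" %in% rownames(installed.packages())

# Is the package attached to the current session?
isLoaded <- "package:stats" %in% search()

print(isInstalled)
print(isLoaded)

# Other packages that ship with R, such as parallel, are installed
# but are typically not attached in a fresh session
print("parallel" %in% rownames(installed.packages()))
```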
Task:
- Navigate to http://cran.r-project.org/web/packages/ and read the page.
- Navigate to http://cran.r-project.org/web/views/ (the curated CRAN task-views).
- Follow the link to Genetics and read the synopsis of available packages. The package seqinr sounds useful, but check first whether it is already installed. library() opens a window of installed packages in the library; search() shows which ones are currently loaded.
> library()
> search()
[1] ".GlobalEnv" "tools:RGUI" "package:stats" "package:graphics"
[5] "package:grDevices" "package:utils" "package:datasets" "package:methods"
[9] "Autoloads" "package:base"
- In the Packages tab of the lower-right pane in RStudio, confirm that seqinr is not yet installed.
- Follow the link to seqinr to see what standard information is available with a package. Then follow the link to the Reference manual to access the documentation PDF. Many packages also provide one or more "vignettes": longer documents that contain usage hints and sample code.
- Read the help for vignette. Note that there is a command to extract R sample code from a vignette, to experiment with it.
> ?vignette
- Install seqinr from the closest CRAN mirror and load it for this session. Explore some functions.
> ??install
> ?install.packages
> install.packages("seqinr") # Note: quoted string!
also installing the dependency ‘ade4’
trying URL 'https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.2/ade4_1.7-2.tgz'
Content type 'application/x-gzip' length 3365088 bytes (3.2 MB)
==================================================
downloaded 3.2 MB
trying URL 'https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.2/seqinr_3.1-3.tgz'
Content type 'application/x-gzip' length 2462893 bytes (2.3 MB)
==================================================
downloaded 2.3 MB
The downloaded binary packages are in
/var/folders/mx/ld0hdst54jjf11hpcjh8snfr0000gn/T//Rtmpsy5GMx/downloaded_packages
> library(seqinr) # This refers to an installed package. No quotes here...
> library(help="seqinr")
> ls("package:seqinr")
[1] "a" "aaa" "AAstat"
[4] "acnucclose" "acnucopen" "al2bp"
[...]
[205] "where.is.this.acc" "words" "words.pos"
[208] "write.fasta" "zscore"
> ?a
> a("Tyr")
[1] "Y"
> choosebank()
[1] "genbank" "embl" "emblwgs" "swissprot" "ensembl"
[...]
[31] "refseqViruses"
What could possibly go wrong?...
- The installation fails.
- You might see an error message such as this:
Warning message:
package ‘XYZ’ is not available (for R version 3.2.2)
- This can mean several things:
- The package is not available on CRAN. Try Bioconductor instead or Google to find it.
- The package requires a newer version of R than the one you have. Upgrade, or see if a legacy version exists.
- A comprehensive set of reasons and their resolution is here on stackoverflow.
- We have seen the following on Windows systems when typing
library(help="seqinr")
Error in formatDL(nm, txt, indent = max(nchar(nm, "w")) + 3) :
incorrect values of 'indent' and 'width'
Anecdotally this was due to a previous installation problem with a mixup of 32-bit and 64-bit R versions, although another student told us that the problem simply went away when trying the command again. Whatever: Make sure you have the right R version installed for your operating system. Uninstall and reinstall when in doubt. Conflicting libraries can be the source of strange misbehaviour.
Task:
- The fact that these methods work shows that the package has been downloaded and installed, the library has been loaded, and its functions and data are now available in the current environment. Just like many other packages, seqinr comes with a number of data files. Try:
?data
data(package="seqinr") # list the available data
data(aaindex) # load aaindex
?aaindex # what is this?
aaindex$FASG890101 # two of the indices ...
aaindex$PONJ960101
# Let's use the data: plot amino acid codes by hydrophobicity and volume
plot(aaindex$FASG890101$I,
     aaindex$PONJ960101$I,
     xlab="hydrophobicity", ylab="volume", type="n")
text(aaindex$FASG890101$I,
     aaindex$PONJ960101$I,
     labels=a(names(aaindex$FASG890101$I)))
- Now, just for fun, let's use these functions to download a sequence and calculate some statistics (without further explanation at this point, so as not to digress too far). Copy the code below and paste it into the R console:
choosebank("swissprot")
mySeq <- query("mySeq", "N=MBP1_YEAST")
mbp1 <- getSequence(mySeq)
closebank()
x <- AAstat(mbp1[[1]])
barplot(sort(x$Compo))
The function require() is similar to library(), but it does not produce an error when it fails because the package has not been installed. It simply returns TRUE if successful or FALSE if not. If the library has already been loaded, it does nothing. Therefore I usually use the following code paradigm in my R scripts to avoid downloading the package every time I need to run a script:
if (!require(seqinr, quietly=TRUE)) {
  install.packages("seqinr")
  library(seqinr)
}
Note that install.packages() takes a (quoted) string as its argument, whereas library() is normally called with the bare package name (without quotes). New users usually get this wrong :-)
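If you use this paradigm for many packages, it could be wrapped into a small function. The sketch below is one possible way to do that, not a standard R function: the name loadOrInstall is hypothetical. It relies on requireNamespace(), which tests whether a package is available without attaching it, and on the character.only argument of library(), which allows the package name to be passed as a string.

```r
# Hypothetical helper (not part of base R): attach a package,
# installing it first only if it is missing
loadOrInstall <- function(pkg) {
  # requireNamespace() returns TRUE or FALSE and does not attach the package
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # character.only = TRUE tells library() that pkg holds the name as a string
  library(pkg, character.only = TRUE)
  # report (invisibly) whether the package is now attached
  invisible(paste0("package:", pkg) %in% search())
}

# Example with "tools", which ships with R, so nothing is downloaded
loaded <- loadOrInstall("tools")
print(loaded)
```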
One of the challenges of working with R is the overabundance of options. Finding the package that contains a particular function you are looking for can be tricky, but there is a package to help you do that. Try this:
if (!require(sos, quietly=TRUE)) {
  install.packages("sos")
  library(sos)
}
findFn("moving average")
A good way to find packages in CRAN is also a keyword search on the Metacran site. Try this link:
Note that the Bioconductor project has its own installation system, the biocLite() function; since Bioconductor 3.8 it has been replaced by BiocManager::install(). It is explained here.
Further reading, links and resources
Notes
- ↑ ... and when you click on the arrow to the left, this will take you back to where you came from.
- ↑ Proportional fonts are for elegant document layout. Monospaced fonts are needed to properly align characters in columns. For code and sequences, we always use a monospaced font. Code editors always use monospaced fonts, but since I need to eMail a lot of code and sequences, I have also set my eMail client to use a monospaced font by default (Courier, or Monaco). I highly encourage you to do the same.
- ↑ [1] means: the following is the first (often only) element of a vector.
- ↑ Never, ever edit code in MS Word. Use R or RStudio. Actually, don't use notepad or TextEdit either.
- ↑ After the course, you can rename / move the directory to whatever, wherever you want, but during the course, I need your files in a predictable location to be able to troubleshoot problems.
- ↑ You can also use one of the mirror sites, if CRAN is down - for example the mirror site at the University of Toronto. A choice of mirror sites is listed on the R-project homepage.
- ↑ A "wrapper" program uses another program's functionality in its own context. RStudio is a wrapper for R since it does not duplicate R's functions, it runs the actual R in the background.
Self-evaluation
If in doubt, ask! If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.
About ...
Author:
- Boris Steipe <boris.steipe@utoronto.ca>
Created:
- 2017-08-05
Modified:
- 2017-08-05
Version:
- 0.1
Version history:
- 0.1 First stub
This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.