Difference between revisions of "R tutorial"

From "A B C"
 
(81 intermediate revisions by the same user not shown)
  
  
This is a tutorial introduction to '''R''' for users with no previous background in the platform or the language.  
+
This is a hub for a first introduction to '''R''', for students of one of my workshops or courses. I have subdivided the material into (somewhat) independent learning units that you can work through at your own pace, but in sequence.
  
 +
The units have ''Deliverables'' and ''Prerequisites'' - please ignore these sections, they are for use in a more formal course setting.
  
__TOC__
+
You need to work through these units '''before''' you come to the workshop. There are two reasons:
  
 +
* (i) installation of software is very specific to your computer and we can't walk you through this in a room full of people. It would take so much time that we wouldn't get anything else done.
 +
* (ii) When you are working with '''R''' - like with any computer language or natural language, the key is repetition, repetition, repetition. The more you prime yourself with this material, the more you will profit when we actually meet in class. I hope to see everyone radiant and elated, and not lost before we even begin. Let's do this!
  
 
 
==The environment==
 
In this section we discuss how to download and install the software, how to configure an '''R''' session and how to work in the '''R''' environment. Sometimes I use footnotes<ref> ... and when you click on the arrow to the left, this will take you back to where you came from.</ref>; click on the number to read more.
 
  
===Install '''R'''===
 
  
{{task|
 
# Navigate to [https://cran.r-project.org/ '''CRAN''' (the Comprehensive R Archive Network)]<ref>You can also use one of the mirror sites, if CRAN is down - for example the [http://probability.ca/cran/ mirror site at the University of Toronto]. A choice of mirror sites is listed on the [https://r-project.org '''R'''-project homepage].</ref> and follow the link to '''Download R''' for your computer's operating system.
 
# Download a precompiled binary (or "build") of the R "framework" to your computer and follow the instructions for installing it. Make sure that the program is the correct one for your '''version''' of your operating system.
 
# Launch '''R'''.
 
}}
 
  
The program should open a window&ndash;this window is called the "R console"&ndash;and greet you with its ''input prompt'', awaiting your input:
 
>
 
  
{{task|
 
Once you see that '''R''' is running correctly, you may quit the program for now.
 
}}
 
  
 +
==The Units==
  
 +
{{Smallvspace}}
  
<div class="mw-collapsible mw-collapsed FAQ-box" data-expandtext="Notes for troubleshooting..." data-collapsetext="Collapse">
+
; Start with this:
What could possibly go wrong?...
+
* [[FND-Biocomputing_setup| Set up your computer for biocomputing work]]
<div class="mw-collapsible-content" style="padding:10px;">
 
  
----
+
{{Smallvspace}}
  
;I can't install '''R'''.
+
; Install R and make sure everything works:
:Make sure that the version you downloaded is the right one for your operating system. Also make sure that you have the necessary permissions on your computer to install new software.
+
* [[RPR-Installation| Installing R and RStudio]]
 +
* [[RPR-Setup| Setup]]
 +
* [[RPR-Console| The "Console"]]
 +
* [[RPR-Help| Getting Help]]
  
</div>
+
{{Smallvspace}}
</div>
 
  
===Install R Studio===
+
; Explore how to get '''R''' to work with data:
 +
* [[RPR-Syntax_basics| R Syntax]]
 +
* [[RPR-Objects-Vectors| Vectors]]
 +
* [[RPR-Objects-Data_frames| Data frames]]
 +
* [[RPR-Objects-Lists| Lists]]
  
[https://www.rstudio.com/ '''R Studio'''] is a free IDE (Integrated Development Environment) for '''R'''.<ref>The Mac OS X GUI for '''R''' has almost all the same functionality as '''R Studio''', and thus there is little advantage in using '''R Studio''' on the Mac. If you are working in a Linux or Windows environment however, '''R Studio''' does offer tangible benefits, perhaps most importantly syntax-aware code colouring which the base '''R''' version does not have on these platforms.</ref> '''R Studio''' is a wrapper for '''R''' and thus all the underlying functions are the same; only the user interface is different.
+
{{Smallvspace}}
  
Here is a small list of differences between '''R''' and '''R Studio'''. If you can contribute pros and cons from your personal experience, please let me know so we can update the list.
+
; The one unit that will save your ***, over and over again:
 +
* [[RPR-Subsetting| Subsetting and Filtering]]
  
;pros
+
{{Smallvspace}}
* A consistent interface across all supported platforms; base '''R''' GUIs are not all the same for e.g. Mac OS X and Windows.
 
* Syntax-aware code colouring.
 
* Better handling of the '''Stop Execution''' button, which sometimes does not recover a stuck process in base '''R'''.
 
* Code autocompletion in the script editor. (Depending on your point of view this can be a help or an annoyance.)
 
* The ability to set breakpoints in the script editor.
 
* Support for [http://yihui.name/knitr/ knitr], [http://www.statistik.lmu.de/~leisch/Sweave/ Sweave], [http://rmarkdown.rstudio.com/ rmarkdown]...
 
* Better support for switching between work on concurrent [https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects projects].
 
  
;cons
+
; First steps towards programming:
* The ''tiled'' interface uses more desktop space than the windows of the '''R''' GUI.
+
* [[RPR-Subsetting| Subsetting and Filtering]]
* There are sometimes (rarely) situations where '''R''' functions do not behave in exactly the same way in '''R Studio'''.
+
* [[RPR-Control_structures| Control structures]]
* The supported '''R''' version is not always immediately the most recent release.
+
* [[RPR-Functions| Functions]]
  
{{task|
+
{{Smallvspace}}
* Navigate to the [http://www.rstudio.com/ '''R Studio''' Website].
 
* Find the right version of the '''R Studio Desktop''' installer for your computer, download it and install the software.
 
* Open R Studio.
 
* Focus on the bottom left pane of the window, this is the "console" pane.
 
}}
 
  
 +
; Maybe optional? Meh, just work through this anyway, as time permits. It'll be on the exam.
 +
* [[RPR-Subsetting| Subsetting and Filtering]]
 +
* [[RPR-Plotting| First Plots]]
 +
* [[RPR-Coding_style| Coding Style]]
  
  
===Notation===
+
{{Vspace}}
 
 
The sample code on this page sometimes copies text from the console, and sometimes shows the actual commands only. The <code>&gt;</code> character at the beginning of the line is always just '''R''''s ''input prompt'', it tells you that you can type something now - you never actually type > at the beginning of a line. If a line starts with <code>[1]</code> or similar, this is '''R''''s ''output'' on the console. A <code>#</code>-character marks the following text as a comment which is not executed by '''R'''. In principle, commands can be copied by you and pasted into the console, or into a script (i.e. a text-file of '''R''' commands that can run as a program)  - obviously, you don't need to copy the comments. In addition, I use [http://www.mediawiki.org/wiki/Extension:SyntaxHighlight_GeSHi syntax highlighting] on '''R'''-code, to colour language keywords, numbers, strings, etc. different from other text. This improves readability but keep in mind that the colours you see on your computer will be different.
 
 
 
However, note the following: '''while it is convenient to copy/paste code, you don't learn how to write code through that'''. Practice has shown that it is much better to actually type commands, even if you are just re-typing code from a book or online. Actively typing out the code character by character ensures that you are reading and translating the code, and that you notice if anything is not entirely clear.<ref>I think we are using a predictive mental model when we type - something like an inbuilt autocorrect-suggestion mechanism; thus if you type something unfamiliar or surprising (e.g. a subtle detail of syntax), you will notice and be able to figure out the issue. ''Pasting'' code by contrast is merely mechanical.</ref> In computer code, every single character matters. For example, I expect that by typing out commands you will be much less likely to confuse <code>=</code> with <code><-</code> or even <code>==</code>. Also, you will sometimes mistype and create errors. That's actually good, because you quickly learn to spot errors, fix them, and resume. That way you build confidence.
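To make the difference between those three operators concrete, here is a minimal sketch you can type in (the variable names are just placeholders I made up for illustration):

<source lang="rsplus">
x <- 5                  # assignment: x now holds the value 5
y = 10                  # "=" also assigns at the top level, but "<-" is the usual convention
x == 5                  # comparison: is x equal to 5? Returns TRUE
x == y                  # returns FALSE
mean(x = c(1, 2, 3))    # inside a function call "=" names an argument; returns 2, and x is still 5
</source>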
 
 
 
Some useful notes on the console:
 
*use your keyboard's ''up-arrow'' keys to retrieve previous commands;
 
*"enter" a line of commands with ''left-arrow'' to edit it;
 
*hit ''enter'' to execute the modified line.
 
*In R Studio, click the '''History''' tab in the right upper pane to view the entire history; double-click a line to load its contents into the console.
 
 
 
 
 
 
 
 
 
===User interface===
 
 
 
'''R''' comes with a GUI<ref>A '''GUI''' is a Graphical User Interface, it has windows and menu items, as opposed to a "command line interface".</ref> to lay out common tasks. For example, there are a number of menu items, many of which are similar to other programs you will have worked with ("File", "Edit", "Format", "Window", "Help"  ...). All of these tasks can also be accessed through the command line.
 
 
 
'''R Studio''' has its GUI organized in different ''panes'' of one ''window''; you can resize the panes as you need them.
 
 
 
In general, GUIs are useful when you are not sure what you want to do or how to go about it; the command line is much more powerful when you have more experience and know your way around in principle. '''R''' gives you both options.
 
 
 
 
 
Let's look at some functions of the '''R''' console and associated windows that refer to '''how''' you work, not '''what''' you do.
 
 
 
====The Help system====
 
 
 
 
 
&nbsp;
 
 
 
{{task|
 
* Start '''RStudio''', and as you work through the sections below, type the commands and explore what they do.
 
}}
 
 
 
 
 
 
 
Help is available for all commands as well as for the '''R''' syntax, and it can also help you find the names of commands when you are not sure of them. You can get help through the command line, or from the search field in the '''Help''' tab of the lower-right pane.
 
 
 
 
 
<small>("help" is a function, arguments to a function are passed in parentheses "()")</small>
 
<source lang="rsplus">
 
> help(rnorm)
 
>
 
</source>
 
 
 
 
 
<small>(shorthand for the same thing)</small>
 
<source lang="rsplus">
 
> ?rnorm
 
>
 
</source>
 
 
 
 
 
<small>(what was the name of that again ... ?)</small>
 
<source lang="rsplus">
 
> ?binom   
 
No documentation for 'binom' in specified packages and libraries:
 
you could try '??binom'
 
> ??binom
 
>
 
</source>
 
 
 
 
 
<small>(I see "Binomial" in the list of keywords...)</small>
 
<source lang="rsplus">
 
> ?Binomial
 
>
 
</source>
 
 
 
 
 
<small>(Alternatively: use the apropos() function.)</small>
 
<source lang="rsplus">
 
> ?apropos   
 
> apropos("med")  # all functions that contain the string "med"
 
> apropos("^med")  # all functions that begin with the string
 
> apropos("med$")  # all functions that end with the string
 
</source>
 
 
 
 
 
 
 
If you need help on '''operators''', place them in quotation marks. Try:
 
<source lang="rsplus">
 
> ?"+"
 
> ?"~"
 
> ?"["
 
> ?"%in%"
 
>
 
</source>
 
 
 
 
 
That's all fine, but you will soon notice that '''R''''s help documentation is not all that helpful for newcomers (who need the most help). To illustrate, open the help window for the function {{c|var()}}.
 
<source lang="rsplus">
 
> ?var
 
</source>
 
 
 
Here's what you might look for.
 
* The '''Description''' section describes the function in general technical terms.
 
* The '''Usage''' section tells you what arguments are required (these don't have defaults), what arguments have defaults, and what the defaults are, and whether additional arguments ("...") are allowed. Often a function comes in several variants, you will find them here.
 
* The '''Arguments''' section provides detailed information about each argument. You should read it, especially regarding whether the arguments are single values, vectors, or other objects, and what effect missing arguments will have.
 
* The '''Details''' section might provide common usage and context information. It might also not. Often functions have crucial information buried in an innocuous note here.
 
* You have to really understand the '''Value''' section. It explains the output. Importantly, it explains the type of object a function returns - it could be a list, a matrix or something else (we'll discuss these data types in detail below.). The value could also be an object that has special methods defined e.g. for plotting it. In that case, the object is formally a "list", and its named "components" can be retrieved with the usual list syntax (see below).
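As a small sketch of what the '''Value''' section is talking about, here is a function that returns a list-like object. I am using {{c|t.test()}} on made-up random numbers purely as an illustration; any function that returns a structured result would do.

<source lang="rsplus">
set.seed(112358)                     # arbitrary seed, only to make the sketch repeatable
result <- t.test(rnorm(20), rnorm(20, mean = 1))

class(result)            # "htest": an object with its own print method ...
is.list(result)          # ... but formally a list: TRUE
names(result)            # the named components described under "Value"
result$p.value           # retrieve one component with the usual list syntax
result[["statistic"]]    # the equivalent double-bracket syntax
</source>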
 
 
 
If you look at the bottom of the help function, you will usually find examples of the function's usage; these often make matters more clear than the terse and principled help-text above.
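You don't have to copy those examples by hand. As a minimal sketch, these two standard functions show the argument list of a function and run the examples from its help page:

<source lang="rsplus">
args(var)       # print the argument list, as shown in the Usage section
example(var)    # run the examples from the bottom of the help page for var()
</source>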
 
 
 
What you often won't find:
 
 
 
* Clear, commented examples that relate to the most frequent use cases.

* Explanations of '''why''' a particular function works in a particular way (e.g. why the denominator is ''n-1'' for {{c|sd()}} and  {{c|var()}}).
 
* Notes on common errors.
 
* An exhaustive list of alternatives and related functions. There are usually some entries, but there is no guarantee that all alternatives are listed - especially if they are provided by an external ''package''.
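When the documentation leaves such a question open, a quick experiment often helps. The sketch below does not explain ''why'' the denominator is ''n-1'', but it confirms ''that'' it is, using a small made-up vector:

<source lang="rsplus">
x <- c(2, 4, 4, 4, 5, 5, 7, 9)      # made-up sample values; mean is 5
n <- length(x)

sum((x - mean(x))^2) / (n - 1)      # variance by hand, with the n-1 denominator
var(x)                              # same value
sqrt(var(x))                        # standard deviation by hand ...
sd(x)                               # ... same value
sum((x - mean(x))^2) / n            # dividing by n instead gives 4, a smaller number
</source>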
 
 
 
 
 
Therefore, my first approach for '''R''' information is usually to Google for what interests me and this is often the quickest way to find working example code. '''R''' has a very large user base and it is becoming very rare that a reasonable question will not have a reasonable answer among the top three hits of a Google search. Also, as a result of a Google search, it may turn out that something ''can't'' be done (easily)&ndash;and you won't find things that can't be done in the help system at all. You may want to include {{c|"r language"}} in your search terms, although Google is usually pretty good at figuring out what kind of "r" you are looking for, especially if your query includes a few terms vaguely related to statistics or programming.
 
 
 
* There is an active [https://stat.ethz.ch/mailman/listinfo/r-help '''R-help mailing list'''] to which you can post&ndash;or at least search the archives: your question probably has been asked and answered before. A number of SIGs (Special Interest Groups) exist for more specific discussions - e.g. for mac OS, geography, ecology etc. They are [https://stat.ethz.ch/mailman/listinfo listed here].
 
 
 
* Most of the good responses these days are on ''Stack Overflow''; discussion seems to be shifting there from the R mailing list. Information on statistics questions can often be found or obtained from the ''CrossValidated'' forum of Stack Exchange.
 
** try this [http://stackoverflow.com/search?q=R+sort+dataframe sample search on ''stackOverflow'']...
 
** try this [http://stats.stackexchange.com/search?q=R+bootstrapping+jackknifing+cross-validation sample search on ''CrossValidated'']...
 
 
 
* [http://rseek.org '''Rseek'''] is a specialized Google search on '''R'''-related sites. Try "time series analysis" for an example.
 
 
 
* The '''bioconductor''' project has its own [https://support.bioconductor.org/ support site on the Web].
 
 
 
 
 
If you need to ask for help on the '''R''' mailing list or on stackoverflow, you need to do a bit of homework first. If you ask your question well, you will get incredibly insightful and helpful responses, but you need to help the helpers help you:
 
 
 
* Use the {{c|dput()}} function, perhaps combined with {{c|head()}} to create a '''small, reproducible dataset''' with which your problem can be reproduced or your question illustrated. Keep this as small as possible. Post that.
 
* Post minimal '''code''' that reproduces the problem with the data you have supplied. Together the code and data have to form an '''MWE''' - a minimal working example. People love to play with your code and get it to work, but they '''hate''' having to copy, paste, reformat or otherwise edit someone's stuff just so they can answer a question.
 
* In your question, don't waste too much time on explaining what you did (since that didn't work), but explain clearly '''what you want to achieve'''. Focus on the desired result - not on how to fix your algorithm, your algorithm may be the wrong mental model in the first place.
 
* Don't post in HTML, be sure to post in '''plain text''' only. In particular, somehow many people post to the '''R''' help mailing list via the Nabble forum. That's the wrong way.
 
* Read [http://adv-r.had.co.nz/Reproducibility.html "How to write a reproducible example"] and [http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example "How to make a great R reproducible example"] before you post.
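To illustrate the first two points, here is a minimal sketch of how {{c|dput()}} and {{c|head()}} work together. The built-in {{c|iris}} dataset stands in for your own data, and the names {{c|myData}} and {{c|rebuilt}} are just placeholders.

<source lang="rsplus">
myData <- head(iris, 3)       # keep only the first few rows
dput(myData)                  # prints R code that recreates the object: post this output

# Whoever wants to help you can paste that output to rebuild the data.
# Here we simulate this by round-tripping through text:
txt <- capture.output(dput(myData))
rebuilt <- eval(parse(text = txt))
all.equal(myData, rebuilt)    # check that the copy matches the original
</source>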
 
 
 
 
 
&nbsp;
 
 
 
====Working directory====
 
To locate a file in a computer, one has to specify the ''filename'' and the directory in which the file is stored; this is sometimes called the ''path'' of the file. The "working directory" for '''R''' is either the directory in which the '''R'''-program has been installed, or some other directory, as initialized by a startup script. You can execute the command <code>getwd()</code> to list what the "Working Directory" is currently set to:
 
 
 
 
 
<source lang="rsplus">
 
> getwd()
 
[1] "/Users/steipe/R"
 
</source>
 
 
 
 
 
It is convenient to put all your '''R'''-input and output files into a project specific directory and then define this to be the "Working Directory". Use the {{c|setwd()}} command for this. {{c|setwd()}} requires an ''argument'' that you type between the parentheses: a string with the directory path, or a variable containing such a string. Strings in R are delimited with {{c|"}} or <code>'</code> characters. If the directory does not exist, an <span style="color:#EE0000;">Error</span> will be reported. Make sure you have created the directory. On Mac and Unix systems, the usual shorthand notation for relative paths can be used: <code>~</code> for the home directory, <code>.</code> for the current directory, <code>..</code> for the parent of the current directory.
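Here is a minimal sketch of that sequence. To keep it harmless I create the directory inside '''R''''s temporary folder; the name {{c|R_samples}} is only a placeholder, use your own project path instead.

<source lang="rsplus">
projectDir <- file.path(tempdir(), "R_samples")   # hypothetical project directory

if (!dir.exists(projectDir)) {
    dir.create(projectDir, recursive = TRUE)      # create it first: setwd() fails on a missing directory
}

oldDir <- setwd(projectDir)   # setwd() invisibly returns the previous working directory
getwd()                       # now reports the project directory
setwd(oldDir)                 # go back to where we started
</source>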
 
 
 
On '''Windows''' systems, you need to know that backslashes &ndash; "\" &ndash; have a special meaning for '''R''': they work as ''escape characters''. Thus '''R''' gets confused when you put them into string literals, such as Windows path names. '''R''' has a simple solution: replace all backslashes with forward slashes, and '''R''' will translate them back when it talks to your operating system. Instead of <code>C:\documents\projectfiles</code> you write <code>C:/documents/projectfiles</code>. Also note that on Windows the <code>~</code> tilde does not necessarily stand for the user's home directory; it usually expands to the user's ''Documents'' folder. Type <code>path.expand("~")</code> to see what it means on your system.
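The following sketch shows the options for writing such a path; it runs on any platform, since it only builds strings and does not access the directories. The path itself is the made-up example from above.

<source lang="rsplus">
p1 <- "C:/documents/projectfiles"      # forward slashes: the simple solution
p2 <- "C:\\documents\\projectfiles"    # doubled backslashes: each "\\" is one literal backslash

nchar(p1) == nchar(p2)                 # TRUE: both strings have the same number of characters
cat(p2, "\n")                          # prints C:\documents\projectfiles
file.path("C:", "documents", "projectfiles")   # builds "C:/documents/projectfiles" for you
path.expand("~")                       # shows what "~" expands to on your system
</source>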
 
 
 
 
 
{{console|My home directory...
 
|> setwd("~") # Note: ~ is the "tilde" - the squiggly line - not the straight hyphen
 
> getwd()
 
[1] "/Users/steipe"
 
}}
 
 
 
{{console|Relative path: home directory, up one level, then down into chen's home directory
 
|> setwd("~/../chen") 
 
> getwd()
 
[1] "/Users/chen"
 
}}
 
 
 
{{console|Absolute path: specify the entire string
 
|> setwd("/Users/steipe/abc/R_samples") 
 
> getwd()
 
[1] "/Users/steipe/abc/R_samples"
 
}}
 
 
 
 
 
In R Studio you can use the '''Session &rarr; Set Working Directory''' menu.
 
 
 
 
 
 
 
{{task|
 
# Create a directory for your sample files and use {{c|setwd("''your-directory-name''")}} to set the working directory.
 
# Confirm that this has worked by typing {{c|getwd()}}.
 
}}
 
 
 
The ''Working Directory'' functions can also be accessed through the Menu, under '''Misc'''.
 
 
 
A nice shortcut on the Mac is that you can drag/drop a folder or file icon into the '''R''' console or a script window to get the full filename/path. <small>If you know of equivalent functionality in Linux or Windows, let me know.</small>
 
 
 
====.Rprofile - startup commands====
 
Often, when working on a project, you would like to start off in your working directory right away when you start up '''R''', instead of typing the <code>setwd()</code> command. This is easily done in a special '''R'''-script that is executed automatically on startup<ref>Actually, the first script to run is '''Rprofile.site''', which is found in the <code>etc</code> directory of the '''R''' installation, e.g. <code>C:\Program Files\R\R-{version}\etc</code> on Windows machines. But not on Macs.</ref>. The name of the script is <code>.Rprofile</code> and '''R''' expects to find it in the user's home directory. You can edit this file with a simple text editor like TextEdit (Mac), Notepad (Windows) or Gedit (Linux) - or, of course, by opening it in '''R''' itself.
 
 
 
Besides setting the working directory, other items that might go into such a file could be
 
* libraries that you often use
 
* constants that are not automatically defined
 
* functions that you would like to preload.
 
 
 
 
 
For more details, use '''R''''s help function:
 
<source lang="rsplus">
 
> ?Startup
 
 
 
</source>
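To give you an idea, here is a sketch of what such a file might contain. Everything in it is a placeholder of my own choosing (the directory name, the constant, the little function); adapt it or leave parts out as you see fit.

<source lang="rsplus">
# Sketch of a possible ~/.Rprofile (all names and paths are placeholders)

# start in the project directory, but only if it exists
projectDir <- "~/R-Tutorial"
if (dir.exists(projectDir)) {
    setwd(projectDir)
}

# a constant that is not automatically defined
AVOGADRO <- 6.02214076e23

# a small function to preload: standard error of the mean
se <- function(x) {
    sd(x) / sqrt(length(x))
}

# a message to confirm that the file was read
message("Loaded .Rprofile; working directory: ", getwd())
</source>

Keep such a file short: a script that silently depends on things defined in your {{c|.Rprofile}} will not run on someone else's computer.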
 
 
 
=====... unix systems=====
 
*Navigate to your home directory (<code>cd ~</code>).
 
*Open a textfile
 
*Type in: <code>setwd("/path/to/your/project")</code>
 
*Save the file with a filename of <code>.Rprofile</code>. (Note the dot prefix!)
 
 
 
=====... Mac OS X systems=====
 
On Macs, filenames that begin with a dot are not normally shown in the Finder. Either you can open a terminal window and use <code>nano</code> to edit, instead of Textedit. Or, you can configure the Finder to show you such so-called "hidden files" by default. To do this:
 
# Open a terminal window;
 
# Type: <code>defaults write com.apple.Finder AppleShowAllFiles YES</code>
 
# Restart the Finder by accessing '''Force quit''' (under the Apple menu), selecting the Finder and clicking '''Relaunch'''.
 
# If you ever want to revert this, just do the same thing but set the default to <code>NO</code> instead.
 
 
 
In any case: the procedure is the same as for unix systems. A text editor you can use is <code>nano</code> in a Terminal window, or just open the file in '''R''': <code>file.edit("~/.Rprofile")</code>
 
 
 
=====...Windows systems=====
 
Not sure. You'll need to google or ask on the course mailing list.
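One thing you can try on any system, as a sketch rather than a tested Windows recipe, is to let '''R''' itself tell you where it considers "home" to be, and then open the file from within '''R''':

<source lang="rsplus">
path.expand("~")                            # the directory that R treats as "home"
file.path(path.expand("~"), ".Rprofile")    # where a user-level .Rprofile would be expected
file.exists("~/.Rprofile")                  # TRUE if such a file already exists
# file.edit("~/.Rprofile")                  # uncomment to create or edit the file from within R
</source>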
 
 
 
====The "Workspace"====
 
During an '''R''' session, you might define a large number of variables, data structures, load packages and scripts etc. All of this information is stored in the so-called "Workspace". When you quit '''R''' you have the option to save the Workspace; it will then be reloaded in your next session.
 
 
 
 
 
{{console|List the current workspace contents: initially it is empty. (R reports an object of type "character" with a length of 0.)
 
|> ls()
 
character(0)
 
>
 
}}
 
 
 
{{console|Initialize three variables (multiple commands on one line can be separated with a semicolon ";")

|> a <- 1; b <- 2; eps <- 0.0001
 
> ls()
 
[1] "a"  "b"  "eps"
 
>
 
}}
 
 
 
{{console|Remove one item. (Note: the argument for <code>rm()</code> is not the string "''a''", but variable name ''a''.)
 
|> rm(a)
 
> ls()
 
[1] "b"  "eps"
 
>
 
}}
 
 
 
<small>We can use the output of {{c|ls()}} as input to {{c|rm()}} to remove everything and clear the Workspace. (cf. {{c|?rm}} for details)</small>
 
<source lang="rsplus">
 
> rm(list = ls())
 
> ls()
 
character(0)
 
>
 
</source>
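Saving and reloading does not have to wait until you quit. As a minimal sketch, the following writes two objects to a file and reads them back; the file lives in '''R''''s temporary folder and all names are made up for the example. (The function {{c|save.image()}} does the same for the entire Workspace.)

<source lang="rsplus">
myCount <- 1
myEps   <- 0.0001
wsFile  <- file.path(tempdir(), "myObjects.RData")   # hypothetical file name

save(myCount, myEps, file = wsFile)   # write the two objects to the file
rm(myCount, myEps)                    # remove them from the Workspace
exists("myCount")                     # FALSE

load(wsFile)                          # read them back
exists("myCount")                     # TRUE
</source>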
 
 
 
 
 
The '''R''' GUI has a ''Workspace Browser'' as a menu item.
 
 
 
 
 
&nbsp;
 
 
 
===Packages===
 
 
 
'''R''' has many powerful functions built in, but one of its greatest features is that it is easily extensible. Extensions have been written by legions of scientists for many years, most commonly in the '''R''' programming language itself, and made available through [http://cran.r-project.org/ '''CRAN'''&ndash;The Comprehensive R Archive Network] or through the [http://www.bioconductor.org '''Bioconductor project'''].
 
 
 
A package is a collection of code, documentation and (often) sample data. To use packages, you need to install them (once), and add them to your current session (for every new session). You can get an overview of installed and loaded packages by opening the '''Package Manager''' window from the '''Packages & Data''' Menu item. It gives a list of available packages you currently have ''installed'', and identifies those that have been ''loaded'' at startup, or interactively.
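The difference between ''installed'' and ''loaded'' can also be checked from the command line. This is a minimal sketch using the {{c|stats}} package, which ships with '''R'''; substitute any package name you are curious about.

<source lang="rsplus">
"stats" %in% rownames(installed.packages())   # is the package installed?
requireNamespace("stats", quietly = TRUE)     # an alternative check that does not attach the package
"package:stats" %in% search()                 # is it attached to the current session?
</source>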
 
 
 
{{task|
 
* Navigate to http://cran.r-project.org/web/packages/ and read the page.
 
* Navigate to http://cran.r-project.org/web/views/ (the '''curated''' CRAN task-views).
 
* Follow the link to [http://cran.r-project.org/web/views/Genetics.html '''Genetics'''] and read the synopsis of available packages. The library {{c|seqinr}} sounds useful, but check first whether it is already installed.
 
{{console
 
|{{c|library()}} opens a window of installed packages in the library; {{c|search()}} shows which ones are currently loaded.
 
|> library()
 
> search()
 
[1] ".GlobalEnv"        "tools:RGUI"        "package:stats"    "package:graphics"
 
[5] "package:grDevices" "package:utils"    "package:datasets"  "package:methods" 
 
[9] "Autoloads"        "package:base"   
 
}}
 
 
 
 
 
* In the '''R packages available''' window, confirm that  {{c|seqinr}} is not yet installed.
 
* Follow the link to [http://cran.r-project.org/web/packages/seqinr/index.html seqinr] to see what standard information is available with a package. Then follow the link to the [http://cran.r-project.org/web/packages/seqinr/seqinr.pdf '''Reference manual'''] to access the documentation pdf. This is also sometimes referred to as a "vignette" and contains usage hints and sample code.
 
{{console
 
| Read the help for  {{c|vignette}}. Note that there is a command to extract '''R''' sample code from a vignette, to experiment with it.
 
|> ?vignette
 
}}
 
 
 
{{console
 
| Install {{c|seqinr}} from the closest CRAN mirror and load it for this session. Explore some functions.
 
|> ??install
 
> ?install.packages
 
> install.packages("seqinr")  # Note: quoted string!
 
also installing the dependency ‘ade4’
 
 
 
trying URL 'https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.2/ade4_1.7-2.tgz'
 
Content type 'application/x-gzip' length 3365088 bytes (3.2 MB)
 
==================================================
 
downloaded 3.2 MB
 
 
 
trying URL 'https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.2/seqinr_3.1-3.tgz'
 
Content type 'application/x-gzip' length 2462893 bytes (2.3 MB)
 
==================================================
 
downloaded 2.3 MB
 
 
 
 
 
The downloaded binary packages are in
 
/var/folders/mx/ld0hdst54jjf11hpcjh8snfr0000gn/T//Rtmpsy5GMx/downloaded_packages
 
 
 
> library(seqinr)    # This refers to an installed package. No quotes here...
 
> library(help{{=}}"seqinr")
 
> ls("package:seqinr")
 
  [1] "a"                      "aaa"                    "AAstat"               
 
  [4] "acnucclose"              "acnucopen"              "al2bp"                 
 
    [...]
 
[205] "where.is.this.acc"      "words"                  "words.pos"             
 
[208] "write.fasta"            "zscore"               
 
> ?a
 
> a("Tyr")
 
[1] "Y"
 
> choosebank()
 
[1] "genbank"      "embl"          "emblwgs"      "swissprot"    "ensembl"     
 
    [...]
 
[31] "refseqViruses"
 
 
 
}}
 
 
 
}}
 
 
 
  
 +
==Notes==
 +
<references />
  
<div class="mw-collapsible mw-collapsed FAQ-box" data-expandtext="Notes for troubleshooting..." data-collapsetext="Collapse">
 
What could possibly go wrong?...
 
<div class="mw-collapsible-content" style="padding:10px;">
 
 
----
 
  
;The installation fails.
+
{{Vspace}}
:You might see an error message such as this:
 
::<span style="color:#EE0000;"><code>Warning message:</code></span>
 
::<span style="color:#EE0000;"><code>package ‘XYZ’ is not available (for R version 3.2.2)</code></span>
 
:This can mean several things:
 
*The package is not available on CRAN. Try Bioconductor instead or Google to find it.
 
*The package requires a newer version of '''R''' than the one you have. Upgrade, or see if a legacy version exists.
 
*A comprehensive set of reasons and their resolution is [http://stackoverflow.com/questions/25721884/how-should-i-deal-with-package-xxx-is-not-available-warning '''here''' on stackoverflow].
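To find out whether the '''R''' version is the issue, you can ask '''R''' which version is running. A minimal sketch (the version number in the comparison is just the one from the message above):

<source lang="rsplus">
R.version.string            # the version of R that is running
getRversion() >= "3.2.2"    # TRUE if your R is at least version 3.2.2
packageVersion("stats")     # the version of an installed package
</source>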
 
  
  
 
----
 
----
  
 +
{{Vspace}}
  
;We have seen the following on Windows systems when typing <code>library(help="seqinr")</code>:
 
:<span style="color:#EE0000;"><code>Error in formatDL(nm, txt, indent = max(nchar(nm, "w")) + 3) :</code></span>
 
:<span style="color:#EE0000;"><code>incorrect values of 'indent' and 'width'</code></span>
 
 
Anecdotally this was due to a previous installation problem with a mixup of 32-bit and 64-bit '''R''' versions, although another student told us that the problem simply went away when trying the command again. Whatever: Make sure you have the right '''R''' version installed for your operating system. Uninstall and reinstall when in doubt. Conflicting libraries '''can''' be the source of strange misbehaviour.
 
 
 
</div>
 
</div>
 
 
 
 
 
{{task|
 
 
* The fact that these methods work, shows that the package has been downloaded, installed, the library has been loaded and its functions and data are now available in the current environment. Just like many other packages, {{c|seqinr}} comes with a number of data files. Try:
 
 
<source lang="rsplus">
 
?data
 
data(package="seqinr")  # list the available data
 
data(aaindex)            # load ''aaindex''
 
?aaindex                # what is this?
 
aaindex$FASG890101      # two of the indices ...
 
aaindex$PONJ960101
 
 
# Let's use the data: plot amino acid codes by hydrophobicity and volume
 
 
plot(aaindex$FASG890101$I,
 
    aaindex$PONJ960101$I,
 
    xlab="hydrophobicity", ylab="volume", type="n")
 
text(aaindex$FASG890101$I,
 
    aaindex$PONJ960101$I,
 
    labels=a(names(aaindex$FASG890101$I)))
 
 
</source>
 
 
 
* Now, just for fun, let's use these functions to download a sequence and calculate some statistics (however, not to digress too far, without further explanation at this point). Copy the code below and paste it into the '''R'''-console.
 
 
<source lang="rsplus">
 
choosebank("swissprot")
 
mySeq <- query("mySeq", "N=MBP1_YEAST")
 
mbp1 <- getSequence(mySeq)
 
closebank()
 
x <- AAstat(mbp1[[1]])
 
barplot(sort(x$Compo))
 
</source>
 
}}
 
 
The function {{c|require()}} is similar to {{c|library()}}, but it does not produce an error when it fails because the package has not been installed. It simply returns {{c|TRUE}} if successful or {{c|FALSE}} if not. If the library has already been loaded, it does nothing. Therefore I usually use the following code paradigm in my '''R''' scripts to avoid downloading the package every time I need to run a script:
 
 
<source lang="rsplus">
 
if (!require(seqinr, quietly=TRUE)) {
 
    install.packages("seqinr")
 
    library(seqinr)
 
}
 
</source>
 
 
Note that {{c|install.packages()}} takes a (quoted) string as its argument, but {{c|library()}} takes a variable name (without quotes). New users usually get this wrong :-)
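If you need the same paradigm for several packages, you could wrap it in a function. The following is only a sketch: {{c|ensurePackage()}} is a name I made up, not a function that exists in '''R'''. Since the package name arrives as a string, {{c|library()}} has to be told so with <code>character.only = TRUE</code>.

<source lang="rsplus">
ensurePackage <- function(pkg) {
    # hypothetical helper: install pkg only if it is missing, then attach it
    if (!requireNamespace(pkg, quietly = TRUE)) {
        install.packages(pkg)
    }
    library(pkg, character.only = TRUE)   # pkg holds the package name as a string
}

ensurePackage("tools")    # ships with R, so nothing is downloaded here
</source>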
 
 
One of the challenges of working with '''R''' is the overabundance of options. To find the right package that contains a particular function you might be looking for could be tricky, but there is a package to help you do that. Try this:
 
<source lang="rsplus">
 
if (!require(sos, quietly=TRUE)) {
 
    install.packages("sos")
 
    library(sos)
 
}
 
 
findFn("moving average")
 
 
 
</source>
 
 
 
A good way to find packages in CRAN is also a keyword search on the '''Metacran''' site. Try this link:
 
:http://www.r-pkg.org/search.html?q=regex
 
 
 
 
Note that the '''Bioconductor''' project has its own installation system, the {{c|biocLite()}} function (replaced by {{c|BiocManager::install()}} in current Bioconductor releases). It is explained [http://www.bioconductor.org/install/ '''here'''].
 
 
 
&nbsp;
 
 
===Scripts===
 
 
My preferred way of working with '''R''' is not to type commands into the console. '''R''' has an excellent script editor which I use by opening a new file - a script - and entering my '''R''' commands into the editor window. Then I execute the commands directly from the script. I may try things in the console, experiment, change the values of function arguments ''etc.'' - but ultimately everything I do goes into the file. This has four major advantages:
 
 
* The script is an accurate record of my procedure so I know exactly what I have done;
 
* I add numerous comments to record what I was thinking when I developed it;
 
* I can immediately reproduce the entire analysis from start to finish, simply by rerunning the script;
 
* I can reuse parts easily, thus making new analyses quick to develop.
 
 
 
[[:Media:ScriptTemplate.R|'''Click here''']] for a sample script template that you can download to your computer and edit for every new project.
 
 
 
{{task|
 
* Use the ''File'' menu to open a ''New Document'' (on Mac) or ''New Script'' (on Windows).
 
* Copy the contents of the [[:Media:ScriptTemplate.R|scriptTemplate.R]] file and paste it into the document.
 
* Create a subdirectory in your '''R''' installation directory, or elsewhere on your computer. This will hold general purpose scripts that you write and collect in the future. Call it {{c|R-scripts}} or similar.
 
* Save the script there. Call it {{c|scriptTemplate.R}} . Don't edit this version, but keep it in a state that is useful for you as a ''template''.
 
* Create a new project directory. Call it {{c|R-Tutorial}} or similar. You can use this for all course related files.
 
* Use the {{c|save-as...}} menu item to save a second version of the script there. Call it {{c|intro.R}} . This is the version you can play with now and experiment.
 
* Adapt it as needed and save. Delete or "comment-out" the parts that you don't need now<ref>Try selecting lines of code and using the menu item '''Code''' &rarr; '''Comment/Uncomment Lines''' in RStudio</ref>. But make sure you set the path of the {{c|setwd()}} function to your current project directory.
 
* Enter the following code (copy from here and paste, or better: type it as you read it here):
 
 
<source lang="rsplus">
 
# sample script:
 
# define a vector
 
a <- c(1, 1, 2, 3, 5, 8, 13)
 
# list its contents
 
a
 
# calculate the mean of its values
 
mean(a)
 
 
</source>
 
 
When you place the cursor in a line of your script, and press <code>command-return</code> (on the Mac) or <code>ctrl-r</code> (on Windows) '''R''' will execute that line and you see the result on the console. You can also select more than one line and execute the selected block with this shortcut. You can even select less than one line, e.g. a single variable or function, and execute just that. Alternatively, you can run the entire file. In the console type:
 
 
<source lang="rsplus">
 
source("intro.R")
 
</source>
 
 
However: this will run the file, but not print output to the console. When you run a script and want to have text output printed to the console, you need to explicitly <code>print()</code> it.
 
 
* Wrap the two statements that produce output with the {{c|print()}} function, save your script and <code>source()</code> it.
 
<source lang="rsplus">
 
print(a)
 
 
print(mean(a))
 
</source>
 
 
* Confirm that the <code>print(a)</code> command also works when you execute that line directly in the script.
 
 
}}
 
 
N.b. if you want to save your output, you can divert it to a file with the <code>sink()</code> command. You can read about the command by typing:
 
 
<source lang="rsplus">
 
?sink
 
</source>
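
A minimal sketch of how {{c|sink()}} might be used. The variable {{c|outFile}} is made up for this illustration and points to a temporary file, so nothing in your project directory is touched.

<source lang="rsplus">
outFile <- tempfile(fileext = ".txt")   # hypothetical output file
sink(outFile)                           # from now on, output goes to the file
print(mean(c(1, 1, 2, 3, 5, 8, 13)))
sink()                                  # restore output to the console
readLines(outFile)                      # show what was written to the file
</source>

Don't forget the second, empty {{c|sink()}} call: without it, all further output keeps going to the file and the console appears to be silent.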
 
 
Note that {{c|print()}} adds the index of the element it prints by default, and also adds a line break. If you don't want that, use {{c|cat()}} instead. Try:
 
 
<source lang="rsplus">
 
cat("vector: ", a, " (mean:", mean(a), ")\n")
 
</source>
 
 
 
 
Some useful shortcuts:
 
* Typing an opening parenthesis <code>(</code> automatically types the closing parenthesis.
 
* Selecting some text and typing a quotation mark quotes the text. This also works with single quotation marks, parentheses, square brackets and curly braces.
 
* Typing a newline character automatically indents the following line.
 
 
 
 
 
&nbsp;
 
 
==Simple commands==
 
 
The '''R''' command line evaluates expressions. Expressions can contain constants, variables, operators and functions of the various datatypes '''R''' recognizes.
 
 
 
 
&nbsp;
 
 
===Operators===
 
 
The common arithmetic operators are recognized in the usual way. Try the following operators on numbers:
 
 
<source lang="rsplus">
 
5
 
5 + 3
 
5 + 1 / 2
 
3 * 2 + 1
 
3 * (2 + 1)
 
2^3 # Exponentiation
 
8 ^ (1/3) # Cube root via exponentiation
 
7 %% 2  # Modulo operation (remainder of integer division)
 
7 %/% 2 # Integer division
 
</source>
 
 
 
 
&nbsp;
 
 
===Functions===
 
 
R is considered an (impure) {{WP|Functional_programming|functional programming language}} and thus the focus of '''R''' programs is on functions. The key advantage is that this encourages programming without side-effects and this makes it easier to reason about the correctness of programs. Arguments are passed into functions as parameters, and a single result is returned<ref>However a function may have ''side-effects'', such as writing something to console, plotting graphics, saving data to a file, or changing the value of variables outside the function ''scope''. Try to avoid the latter, it is fragile and poor practice.</ref>. The return values can either be assigned to a variable, or used directly as the parameter in another function.
 
 
Functions are either ''built-in'' (''i.e.'' available in the basic '''R''' installation), loaded via specific packages (see above), or they can be easily defined by you (see below). In general a function is invoked through a name, followed by one or more arguments (also ''parameters'') in parentheses, separated by commas. Whenever I refer to a function, I write the parentheses to identify it as such and not a constant or other keyword, e.g. <code>log()</code>. Here are some examples for you to try and play with:
 
 
<source lang="rsplus">
 
cos(pi) #"pi" is a predefined constant.
 
sin(pi) # Note the rounding error. This number is not really different from zero.
 
sin(30 * pi/180) # Trigonometric functions use radians as their argument - this conversion calculates sin(30 degrees)
 
exp(1) # "e" is not predefined, but easy to calculate.
 
log(exp(1)) # functions can be arguments to functions - they are evaluated from the inside out.
 
log(10000) / log(10) # log() calculates natural logarithms; convert to any base by dividing by the log of the base. Here: log to base 10.
 
exp(complex(r=0, i=pi)) #Euler's identity
 
</source>
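
As a small sketch of the two ways to use a return value, assigning it to a variable or passing it straight into another function (the variable name {{c|myRoot}} is made up here):

<source lang="rsplus">
myRoot <- sqrt(16)   # assign the return value to a variable ...
log2(myRoot)         # ... and use the variable later

log2(sqrt(16))       # or pass the return value directly to another function

# log() also takes the base as a second argument, and log10() and log2()
# exist as shortcuts:
log(10000, base = 10)
log10(10000)
</source>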
 
 
There are several ways to populate the argument list for a function and '''R''' makes a reasonable guess what you want to do. Arguments can either be used in their predefined order, or assigned via an argument ''name''. Let's look at the <code>complex()</code> function to illustrate this. Consider the specification of a complex number in Euler's identity above. The function {{c|complex()}} can work with a number of arguments that are given in the documentation (see: <code>?complex</code>). These include <code>length.out</code>, <code>real</code>, <code>imaginary</code>, and some more. The <code>length.out</code> argument creates a vector with one or more complex numbers. If nothing else is specified, this will be a vector of complex zero(s). If there are two, or three arguments, they will be placed in the respective slots. However, since the arguments are '''named''', we can also define which slot of the argument list they should populate. Consider the following to illustrate this:
 
 
<source lang="rsplus">
 
complex(1)
 
complex(4)
 
complex(1, 2) # imaginary part missing: if it's missing it defaults to zero
 
complex(1, 2, 3) # one complex number
 
complex(4, 2, 3) # four complex numbers
 
complex(real = 0, imaginary = pi) # defining values via named parameters
 
complex(imaginary = pi, real = 0) # same thing - if names are used, order is not important
 
complex(re = 0, im = pi) # names can be abbreviated ...
 
complex(r = 0, i = pi)  # ... to the shortest string that is unique among the named parameters.
 
                        # Use this feature with discretion to keep your code readable.
 
complex(i = pi, 1, 0) # Think: what have I done here? Why does this work?
 
exp(complex(i = pi, 1, 0)) # (The complex number above is the same as in Euler's identity.)
 
</source>
 
 
 
 
&nbsp;
 
 
===Variables===
 
In order to store the results of evaluations, you can freely assign them to variables. Variables are created internally whenever you first assign a value to them (''i.e.'' they are allocated and instantiated). Variable names are case sensitive. There are a small number of reserved strings, and a very small number of predefined constants, such as <code>pi</code>. However these constants can be overwritten - be careful. Read more at:
 
 
<source lang="rsplus">
 
?make.names
 
?reserved
 
</source>
 
 
To assign a value to a variable, use the assignment operator {{c|&lt;-}}. This is the default way of assigning values in '''R'''. You could also use the <code>=</code> sign, but there are subtle differences. (See: {{c|?"&lt;-"}}). There is a variant of the assignment operator, {{c|&lt;&lt;-}}, which is sometimes used inside functions. It assigns to a variable outside the function, typically in the global environment. This is possible, but not preferred since it generates a side effect of a function.
 
 
<source lang="rsplus">
 
a <- 5
 
a
 
a + 3
 
b <- 8
 
b
 
a + b
 
a == b # not assignment: equality test
 
a != b # not equal
 
a < b  # less than
 
 
</source>
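
The following sketch contrasts the two assignment operators inside a function. The names {{c|counter}}, {{c|localCount()}} and {{c|globalCount()}} are invented for this illustration.

<source lang="rsplus">
counter <- 0

localCount <- function() {
    counter <- counter + 1    # creates a new, local variable
    return(counter)
}

globalCount <- function() {
    counter <<- counter + 1   # side effect: changes the variable outside
    return(counter)
}

localCount()    # returns 1 ...
counter         # ... but the counter outside the function is still 0
globalCount()   # returns 1 ...
counter         # ... and now the counter outside has changed too
</source>

The second function gives a different result every time it is called with the same (empty) argument list. This is exactly the kind of hidden state that makes code hard to reason about.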
 
 
Note that '''all''' of '''R''''s data types (as well as functions and other objects) can be assigned to variables.
 
 
There are very few syntactic restrictions on variable names ([http://stackoverflow.com/questions/9195718/variable-name-restrictions-in-r discussed e.g. here]) but this does not mean esoteric names are good. For the sake of your sanity, use names that express the meaning of the variable, and that are unique. Many '''R''' developers use {{c|dotted.variable.names}}, my personal preference is to write {{c|camelCaseNames}}. And while the single letters {{c|c f n s Q}} are syntactically valid variable names, they coincide with commands for the debugger browser and will execute debugger commands, rather than displaying variable values when you are debugging. Finally, try not to use variable names that coincide with parameter labels in functions. Alas, you see this often in code, but in my opinion this is intellectually lazy and leads to code that is hard to read because it obscures the semantics of '''your''' instance of the argument.
 
 
<source lang="rsplus">
 
# I don't like...
 
col <- c("red", "grey")
 
hist(rnorm(200), col=col)
 
 
# I prefer, for example...
 
rgStripes <- c("red", "grey")
 
barplot(1:10, col=rgStripes)
 
 
</source>
 
 
 
 
 
&nbsp;
 
 
==Scalar data==
 
 
''Scalars'' are single numbers, the "atomic" parts of more complex datatypes. Of course we can work with single numbers in '''R''', but under the hood they are actually vectors of length 1. (More on vectors in the next section). We have encountered scalars above, e.g. with the use of constants and their assignment to variables. To round this off, here are some remarks on the types of scalars '''R''' uses, and on ''coercion'' between types, i.e. casting one datatype into another. The following scalar types are supported:
 
 
* Boolean constants: <code>TRUE</code> and <code>FALSE</code>. This type has the "mode" ''logical'';
 
* Integers and floats (floating point numbers). These types have the mode ''numeric''. Complex numbers have their own mode, ''complex'';
 
* Strings. These have the mode ''character''.
 
 
Other modes exist, such as <code>list</code>, <code>function</code> and <code>expression</code>, all of which can be combined into complex objects.
 
 
The function <code>mode()</code> returns the mode of an object and <code>typeof()</code> returns its type. Also {{c|class()}} tells you what class it belongs to. Let's define a small function to combine most of the information available about an object:
 
 
<source lang="rsplus">
 
info <- function(x) {
 
    print(x) 
 
    cat("str:    ")               
 
    str(x) 
 
    cat("mode:  ", mode(x), "\n")
 
    cat("typeof: ", typeof(x), "\n")
 
    cat("class:  ", class(x), "\n")
 
    # if there are attributes, print them too
 
    if (! is.null(attributes(x))) {
 
        cat("attributes:\n")
 
        print(attributes(x))
 
    }
 
}
 
</source>
 
 
Now we can use {{c|info()}} to explore how R objects are made up,  by handing various expressions as arguments to the function. Many of these you may not yet recognize ... bear with it though:
 
<source lang="rsplus">
 
info( 3 > 5 ) # Note: 3 > 5 is a logical expression, its value is FALSE.
 
info( 3 < 5 )
 
 
info( 3.0 )  # Double precision floating point number
 
info( 3.0e0 )  # Same value, exponential notation
 
 
info( 3 )  # Note: numbers are double precision floats by default.
 
info( as.integer(3) )  # If we really want an integer, we must coerce to type integer.
 
 
info( as.character(3) )  # Forcing the number to be interpreted as a character.
 
 
# More coercions. For each of these, first think what result you would expect:
 
info( as.numeric("3") )  # character as numeric
 
info( as.numeric("3.141592653") )  # string as numeric
 
info( as.numeric(pi) )  # not a string, but a predefined constant
 
info( as.numeric("pi") )  # another string as numeric. Ooops - what went wrong?
 
 
info( as.complex(1) ) 
 
info( as.logical(0) ) 
 
info( as.logical(1) ) 
 
info( as.logical(-1) ) 
 
info( as.logical(pi) )      # any non-zero number is TRUE ...
 
info( as.logical("pie") )  # ... but not non-numeric types. NA means "Not Available".
 
 
info( as.character(pi) )
 
 
info( Inf )
 
info( NaN )
 
info( NA )
 
info( NULL )
 
info( as.factor("M") )    # factor
 
info( Sys.time() )        # time
 
info( letters )            # inbuilt
 
info( 1:4 )                # numeric vector
 
info( matrix(1:4, nrow=2)) # numeric matrix
 
info( list(arabic = 1:3, roman = c("I", "II", "III")))
 
info( data.frame(arabic = 1:3, roman = c("I", "II", "III"), stringsAsFactors=FALSE))
 
info( a ~ b )              # a formula
 
info( info )              # the function itself
 
 
</source>
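
A few further one-liners, as a sketch, that make coercion between the numeric types visible. The {{c|L}} suffix is the notation for an integer literal in '''R'''.

<source lang="rsplus">
typeof(3)           # "double"
typeof(3L)          # "integer": the L suffix requests an integer
typeof(3L + 1L)     # "integer": integer arithmetic stays integer
typeof(3L + 1.0)    # "double": the integer is coerced to double
typeof(3L / 2L)     # "double": division always returns a double

TRUE + TRUE                 # logicals are coerced to numbers: TRUE is 1, FALSE is 0
sum(c(TRUE, FALSE, TRUE))   # ... so sum() counts the TRUE values

as.integer(3.9)     # truncates towards zero, does not round
</source>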
 
 
 
&nbsp;
 
 
==Vectors==
 
 
Since we (almost) never do statistics on scalars, '''R''' obviously needs ways to handle collections of data items. In its simplest form such a collection is a '''vector''': an ordered sequence of items of the same type. Vectors are created from scratch with the <code>c()</code> function, which '''c'''oncatenates individual items into a vector, or with various sequencing functions. Vectors have properties, such as length; individual items in vectors can be combined in useful ways. All elements of a vector must be of the same type. If they are not, they are coerced silently to the most general type (which is often {{c|character}}). (The actual hierarchy for coercion is raw < logical < integer < double < complex < character < list).
 
 
<source lang="rsplus">
 
#Create a vector and list its contents and length:
 
f <- c(1, 1, 2, 3, 5, 8, 13, 21)
 
f
 
length(f)
 
 
# Various ways to retrieve values from the vector.
 
f[1] # By index: "1" is first element.
 
f[length(f)] # length() is the index of the last element.
 
1:4 # This is the range operator
 
f[1:4] # using the range operator (it generates a sequence and returns it in a vector)
 
f[4:1] # same thing, backwards
 
seq(from=2, to=6, by=2) # The seq() function is a flexible, generic way to generate sequences
 
seq(2, 6, 2) # Same thing: arguments in default order
 
f[seq(2, 6, 2)]
 
 
# since a scalar is a vector of length 1, does this work?
 
5[1]
 
 
 
# ...using an index vector with positive indices
 
a <- c(1, 3, 4, 1) # the elements of index vectors must be
 
                  # valid indices of the target vector.
 
                  # The index vector can be of any length.
 
f[a] # In this case, four elements are retrieved from f[]
 
 
# ...using an index vector with negative indices
 
a <- -(1:4) # If elements of index vectors are negative integers,
 
            # the corresponding elements are excluded.
 
f[a] # Here, the first four elements are omitted from f[]
 
f[-((length(f)-3):length(f))] # Here, the last four elements are omitted from f[]
 
 
# ...using a logical vector
 
f>4 # A logical expression operating on the target vector
 
    # returns a vector of logical elements. It has the
 
    # same length as the target vector.
 
f[f>4]  # We can use this logical vector to extract only
 
        # elements for which the logical expression evaluates as TRUE.
 
        # This is sometimes called "filtering".
 
 
 
# Example: extending the Fibonacci series for three steps.
 
# Think: How does this work? What numbers are we adding here and why does the result end up in the vector?
 
f <- c(f, f[length(f)-1] + f[length(f)]); f
 
f <- c(f, f[length(f)-1] + f[length(f)]); f
 
f <- c(f, f[length(f)-1] + f[length(f)]); f
 
 
 
# coercion: all elements of vectors must be of the same mode
 
c(1, 2.0, "3", TRUE)
 
# [1] "1"    "2"    "3"    "TRUE"
 
 
 
 
</source>
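
The coercion hierarchy can be checked step by step. In this sketch, each call to {{c|c()}} combines two neighbouring types of the hierarchy, and {{c|typeof()}} reports the type of the result:

<source lang="rsplus">
typeof(c(TRUE, 2L))        # "integer":   logical   -> integer
typeof(c(2L, 3.5))         # "double":    integer   -> double
typeof(c(3.5, 1i))         # "complex":   double    -> complex
typeof(c(1i, "a"))         # "character": complex   -> character
typeof(c("a", list(1)))    # "list":      character -> list
</source>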
 
 
Many operations on scalars can be simply extended to vectors and '''R''' computes them '''very''' efficiently by iterating over the elements in the vector.
 
 
<source lang="rsplus">
 
f
 
f+1
 
f*2
 
 
# computing with two vectors of same length
 
a <- f[-1]; a # like f[], but omitting the first element
 
b <- f[1:(length(f)-1)]; b # like f[], but shortened by the last element
 
c <- a / b # the "golden ratio", phi (~1.61803 or (1+sqrt(5))/2 ),
 
          # an irrational number, is approximated by the ratio of
 
          # two consecutive Fibonacci numbers.
 
c
 
abs(c - ((1+sqrt(5))/2)) # Calculating the error of the approximation, element by element
 
</source>
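
To see what this saves you, here is a sketch that computes the same result twice: once with an explicit loop and once with a single vectorized expression. The variable names {{c|myValues}} and {{c|doubled}} are made up for this example.

<source lang="rsplus">
myValues <- c(1, 1, 2, 3, 5, 8, 13, 21)

# explicit loop: one element at a time
doubled <- numeric(length(myValues))
for (i in seq_along(myValues)) {
    doubled[i] <- myValues[i] * 2
}
doubled

# vectorized: the whole vector at once
myValues * 2

identical(doubled, myValues * 2)   # compare the two results
</source>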
 
 
 
<div class="mw-collapsible mw-collapsed FAQ-box" data-expandtext="Notes for troubleshooting..." data-collapsetext="Collapse">
 
What could possibly go wrong?...
 
<div class="mw-collapsible-content" style="padding:10px;">
 
 
----
 
 
;When a number is not a single number ...
 
:One of the "warts" of '''R''' is that some functions substitute a '''range''' when they receive a vector of length one. Most everyone agrees this is pretty bad. This behaviour was introduced when someone sometime long ago thought it would be nifty to save two keystrokes. This has caused countless errors, hours of frustration and probably hundreds of undiscovered bugs instead. Today we wouldn't write code like that anymore (I hope), but the community believes that since it's been around for so long, it would probably break more things if it's changed. Two functions to watch out for are {{c|sample()}} and {{c|seq()}}; other functions include {{c|diag()}} and {{c|runif()}}.
 
 
:Consider:
 
<source lang="rsplus">
 
x <- 8; sample(6:x)
 
x <- 7; sample(6:x)
 
x <- 6; sample(6:x)  # Oi!
 
 
# also consider
 
x <- 6:8; seq(x)
 
x <- 6:7; seq(x)
 
x <- 6:6; seq(x)    # Oi vay!
 
</source>
 
 
:Wherever this misbehaviour is a possibility - i.e. when the number of elements to sample from is variable and could be just one, for example in some simulation code - you can write a replacement function like so...
 
 
<source lang="rsplus">
 
safeSample <- function(x, size, ...) {
    # Replace the sample() function to ensure sampling from a single
    # value gives that value with probability p == 1.
    # Respect additional arguments if present.
    if (length(x) == 1 && is.numeric(x) && x > 0) {
        if (missing(size)) { size <- 1 }
        return(rep(x, size))
    } else {
        return(sample(x, size, ...))
    }
}
 
 
 
</source>
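
:An alternative sketch, following an idiom suggested on the {{c|?sample}} help page, is to sample ''indices'' rather than values. The function name {{c|resample()}} is simply the name used for this illustration.

<source lang="rsplus">
resample <- function(x, ...) {
    # sample.int() always samples from 1:length(x), so a vector of
    # length one is never expanded into a range.
    return(x[sample.int(length(x), ...)])
}

resample(6:8)     # a permutation of 6, 7, 8
resample(6)       # always 6
resample(6, 1)    # always 6
</source>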
 
 
:Don't be discouraged though: such warts are rare in '''R'''.
 
&nbsp;
 
 
 
</div>
 
</div>
 
 
 
&nbsp;
 
 
==Matrices==
 
 
If we need to operate with several vectors, or multi-dimensional data, we make use of ''matrices'' or more generally ''k''-dimensional ''arrays'' in '''R'''. Matrix operations are very similar to vector operations; in fact a matrix actually is a vector for which the number of rows and columns have been defined.
 
 
The most basic way to define these dimensions is the <code>dim()</code> function. Consider:
 
<source lang="rsplus">
 
a <- 1:12; a
 
dim(a) <- c(2,6); a
 
dim(a) <- c(2,2,3); a
 
</source>
 
 
<code>dim()</code> also allows you to retrieve the number of rows and columns. For example:
 
<source lang="rsplus">
 
dim(a)    # returns a vector
 
dim(a)[3]  # only the third value of the vector
 
</source>
 
 
If you have a two-dimensional matrix, the functions <code>nrow()</code> and <code>ncol()</code> will also give you the number of rows and columns, respectively. Obviously, <code>dim(a)[1]</code> is the same as <code>nrow(a)</code>.
 
 
As an alternative to <code>dim()</code>, matrices can be defined using the <code>matrix()</code> or <code>array()</code> functions (see there), or "glued" together from vectors by rows or columns, using the  <code>rbind()</code> or  <code>cbind()</code> functions respectively:
 
 
<source lang="rsplus">
 
a <- 1:4
 
b <- 5:8
 
m1 <- rbind(a, b); m1  # difference between rbind() and cbind()
 
m2 <- cbind(a, b); m2
 
m <- cbind(m2, 9:12); m
 
</source>
 
 
Addressing (retrieving) individual elements or slices from matrices is simply done by specifying the appropriate indices, where a missing index indicates that the entire row or column is to be retrieved. '''This is called "subsetting" or "subscripting" and is one of the most important and powerful aspects of working with R.'''
 
 
Explore how you extract rows or columns from a matrix by specifying them. Within the square brackets the order is '''[rows, columns]''':
 
<source lang="rsplus">
 
m[1,] # first row
 
m[,2] # second column
 
m[3,2] # element at row == 3, column == 2
 
m[3:4, 1:2] # submatrix: rows 3 to 4 and columns 1 to 2
 
</source>
 
 
More on subsetting below.
 
 
Note that '''R''' has numerous functions to compute with matrices, such as transposition, multiplication, inversion, calculating eigenvalues and eigenvectors and more.
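
As a brief sketch of these functions, using a small symmetric matrix {{c|A}} that is made up for this example:

<source lang="rsplus">
A <- matrix(c(2, 1, 1, 3), nrow = 2)   # a small example matrix
A

t(A)              # transpose
A %*% A           # matrix multiplication (A * A would multiply element by element)
solve(A)          # inverse
A %*% solve(A)    # ... gives the identity matrix, up to rounding error
eigen(A)$values   # eigenvalues
eigen(A)$vectors  # eigenvectors, one per column
</source>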
 
 
&nbsp;
 
 
==Lists==
 
 
While the elements of matrices and arrays all have to be of the same type, lists are more generally ordered collections of ''components''. Lists are created with the <code>list()</code> function, which works similarly to the <code>c()</code> function. Components are accessed through their index in double square brackets, or through their name, if the name has been defined. Here is an example:
 
 
<source lang="rsplus">
 
 
pUC19 <- list(size=2686, marker="ampicillin", ori="ColE1", accession="L01397", BanI=c(235, 408, 550, 1647) )
 
pUC19[[1]]
 
pUC19[[2]]
 
pUC19$ori
 
pUC19$BanI[2]
 
 
</source>
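
Single and double square brackets behave differently on lists, which is a frequent source of confusion. A short sketch, using a small list {{c|plasmid}} that is made up for this example:

<source lang="rsplus">
plasmid <- list(size = 2686, ori = "ColE1")

plasmid[["size"]]           # double brackets: the component itself (a number)
plasmid["size"]             # single brackets: a list of length one
class(plasmid[["size"]])    # "numeric"
class(plasmid["size"])      # "list"

plasmid$marker <- "ampicillin"   # assigning to a new name adds a component
names(plasmid)
length(plasmid)
</source>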
 
 
 
&nbsp;
 
 
==Data frames==
 
 
Data frames combine features of lists and matrices. They are one of the most important data objects in '''R''', because the result of reading an input file is usually a data frame. Let's create a little data file and save it in the current working directory. You can use the "New document" command from the menu and save the following data as e.g. <code>plasmidData.tsv</code> (".tsv" for "tab separated values").
 
 
Name Size Marker Ori Sites
 
pUC19 2686 Amp ColE1 EcoRI, SacI, SmaI, BamHI, XbaI, PstI, HindIII
 
pBR322 4361 Amp, Tet ColE1 EcoRI, ClaI, HindIII
 
pACYC184 4245 Tet, Cam p15A ClaI, HindIII
 
 
This data set uses tabs as column separators and it has a header line. Similar files can be exported from Excel or other spreadsheet programs. Read this as a data frame as follows:
 
 
<source lang="rsplus">
 
plasmidData <- read.table("plasmidData.tsv", sep="\t", header=TRUE, stringsAsFactors = FALSE)
 
plasmidData  # show what the data frame contains
 
</source>
 
 
Note the argument {{c|stringsAsFactors {{=}} FALSE}}. If this is {{c|TRUE}} instead, '''R''' will convert all strings in the input to factors and this may lead to problems. ({{c|TRUE}} was the default before '''R''' 4.0.0.) Make it a habit to turn this behaviour off; you can always turn a column of strings into factors when you actually mean to have factors.
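
If the tab characters get lost when you copy the data from this page, a sketch like the following writes the same file from within '''R''' and reads it back. The escape sequence {{c|\t}} stands for a tab character and the variable name {{c|myLines}} is made up for this example. Note that this overwrites any existing {{c|plasmidData.tsv}} in the working directory.

<source lang="rsplus">
myLines <- c(
    "Name\tSize\tMarker\tOri\tSites",
    "pUC19\t2686\tAmp\tColE1\tEcoRI, SacI, SmaI, BamHI, XbaI, PstI, HindIII",
    "pBR322\t4361\tAmp, Tet\tColE1\tEcoRI, ClaI, HindIII",
    "pACYC184\t4245\tTet, Cam\tp15A\tClaI, HindIII"
)
writeLines(myLines, "plasmidData.tsv")

plasmidData <- read.table("plasmidData.tsv", sep = "\t", header = TRUE,
                          stringsAsFactors = FALSE)
str(plasmidData)   # shows the type of every column
</source>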
 
 
You can edit the data through a spreadsheet-like interface with the {{c|edit()}} function.
 
<source lang="rsplus">
 
pD2 <- edit(plasmidData)
 
</source>
 
 
==Subsetting==
 
 
We have encountered subsetting before, but we really need to discuss this in more detail. It is one of the most important and powerful topics of '''R''' since it is indispensable to select, transform, and otherwise modify data to prepare it for analysis. You have seen that we use square brackets to indicate individual elements in vectors and matrices. These square brackets are actually "operators", and you can find more information about them in the help  pages:
 
 
<source lang="rsplus">
 
?"["    # Note that you need quotation marks around the operator for this.
 
</source>
 
 
 
Here is a collection of examples of subsetting data from the plasmidData data frame we constructed above:
 
 
<source lang="rsplus">
 
plasmidData[1, ]
 
plasmidData[2, ]
 
 
# we can extract more than one row by specifying
 
# the rows we want in a vector ...
 
plasmidData[c(1, 2), ] 
 
                   
 
# ... this works in any order ...
 
plasmidData[c(3, 1), ] 
 
 
# ... and for any number of rows ...
 
plasmidData[c(1, 2, 1, 2, 1, 2), ] 
 
 
 
# Same for columns
 
plasmidData[ ,2 ]
 
 
# We can select rows and columns by name if a name has been defined...
 
plasmidData[,"Name"]
 
plasmidData$Name      # different syntax, same thing
 
 
 
# Watch this!
 
plasmidData$Name[plasmidData$Ori != "ColE1"]
 
# What happened here?
 
# plasmidData$Ori != "ColE1" is a logical expression, it gives a vector of TRUE/FALSE values
 
plasmidData$Ori != "ColE1"
 
 
# We insert this vector into the square brackets. R then returns all elements
# for which the vector is TRUE.
 
 
# In this way we can "filter" for values
 
plasmidData$Size > 3000
 
plasmidData$Name[plasmidData$Size > 3000]
 
 
# This principle is what we use when we want to "sort" an object
# by some value. The function order() returns the indices of the
# elements in sorted order, and we use these indices to rearrange
# the rows. Remember this: not sort() but order().
 
order(plasmidData$Size)
 
plasmidData[order(plasmidData$Size), ]
 
 
# grep() matches substrings in strings
 
grep("Tet", plasmidData$Marker)
 
plasmidData[grep("Tet", plasmidData$Marker), ]
 
plasmidData[grep("Tet", plasmidData$Marker), "Ori"]
 
 
</source>
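
Logical vectors can also be combined, which allows you to filter for several conditions at once. The following sketch builds a small stand-in data frame so that it runs on its own; the names {{c|myPlasmids}}, {{c|isLarge}} and {{c|hasTet}} are made up for this example.

<source lang="rsplus">
myPlasmids <- data.frame(Name   = c("pUC19", "pBR322", "pACYC184"),
                         Size   = c(2686, 4361, 4245),
                         Marker = c("Amp", "Amp, Tet", "Tet, Cam"),
                         stringsAsFactors = FALSE)

# grepl() returns a logical vector, grep() returns indices
grepl("Tet", myPlasmids$Marker)
grep("Tet", myPlasmids$Marker)

# combine two conditions with & (and) or | (or)
isLarge <- myPlasmids$Size > 4300
hasTet  <- grepl("Tet", myPlasmids$Marker)
myPlasmids$Name[isLarge & hasTet]    # only pBR322 meets both conditions
myPlasmids$Name[isLarge | hasTet]    # pBR322 and pACYC184 meet at least one

# which() converts a logical vector into the indices that are TRUE
which(hasTet)
</source>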
 
 
Elements that can be extracted from an object also can be replaced. Simply assign the new value to the element.
 
 
<source lang="rsplus">
 
x <- sample(1:10)
 
x
 
x[4] <- 99
 
x
 
x <- x[order(x)]
 
x
 
</source>
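
Replacement works with every kind of subsetting, not only with a single index. A short sketch with a made-up vector {{c|myNumbers}}:

<source lang="rsplus">
myNumbers <- c(3, 9, 1, 7, 5)

myNumbers[myNumbers > 5] <- 0      # replace every element greater than 5
myNumbers                          # 3 0 1 0 5

myNumbers[c(1, 3)] <- c(30, 10)    # replace several elements by index
myNumbers                          # 30 0 10 0 5
</source>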
 
 
 
 
&nbsp;
 
 
==Writing your own functions==
 
 
Writing your own functions in '''R''' is easy and gives you access to flexible, powerful and reusable solutions. You have to understand the "anatomy" of an '''R''' function however.
 
 
* Functions are assigned to function names and invoking the function returns some value, vector or other data object.
 
* Data gets '''into''' the function via the function's parameters.
 
* Data is '''returned''' from a function in the {{c|return()}} statement. One and only one object is returned from a function. However the object can be a list, and thus contain values of arbitrary complexity.
 
 
 
<source lang="rsplus">
 
#defining the function:
 
myFunction <- function(myParameters) {
    result <- doSomethingWith(myParameters)
    return(result)
}
 
 
# using the function:
 
myEpiphany <- myFunction(Arguments)
 
</source>
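
The template above is pseudocode. As a concrete sketch of the same anatomy, here is a function that returns more than one value by wrapping the values in a list. The names {{c|summarizeValues()}} and {{c|mySummary}} are made up for this example.

<source lang="rsplus">
# defining the function:
summarizeValues <- function(x) {
    # x: a numeric vector
    # returns: a list with the components "mean" and "range"
    result <- list(mean = mean(x), range = max(x) - min(x))
    return(result)
}

# using the function:
mySummary <- summarizeValues(c(1, 1, 2, 3, 5, 8, 13))
mySummary$mean
mySummary$range    # 12
</source>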
 
 
 
 
The '''scope''' of functions is local: all variables defined within a function are lost upon return, and global variables are not overwritten by a definition within a function. However, variables that are defined outside the function are also available inside.
 
 
Here is a simple example: a function that takes a binomial species name as input and creates a five-letter code as output:
 
 
<source lang="rsplus">
 
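biCode <- function(s) {
    # s: a binomial species name, e.g. "Homo sapiens"
    # returns: the first three letters of the genus and the first two
    #          letters of the species, in upper case
    substr(s, 4, 5) <- substr(strsplit(s, "\\s+")[[1]][2], 1, 2)
    return(toupper(substr(s, 1, 5)))
}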
 
 
biCode("Homo sapiens")              # HOMSA
 
biCode("saccharomyces cerevisiae")  # SACCE
 
</source>
 
 
We can use loops and control structures inside functions. For example the following creates a series of ''n'' Fibonacci numbers.
 
 
<source lang="rsplus">
 
fibSeq <- function(n) {
 
  if (n < 1) { return( c(0) ) }
 
  else if (n == 1) { return( c(1) ) }
 
  else if (n == 2) { return( c(1, 1) ) }
 
  else {
 
      v <- c(1, 1)
 
      for ( i in 3:n ) {
 
        v <- c(v, v[length(v)-1] + v[length(v)])
 
      }
 
      return( v )
 
  }
 
}
 
</source>
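
The function above grows the vector by one element in every iteration. As a sketch of an alternative, the following variant (the name {{c|fibSeqPrealloc()}} is made up here) creates the whole vector first and then fills it in. For ''n'' smaller than 1 it returns an empty vector, which is a design choice of this sketch and differs from {{c|fibSeq()}}.

<source lang="rsplus">
fibSeqPrealloc <- function(n) {
    if (n < 1) { return(numeric(0)) }
    v <- numeric(n)               # allocate the whole vector at once
    v[seq_len(min(n, 2))] <- 1    # the first one or two elements are 1
    if (n > 2) {
        for (i in 3:n) {
            v[i] <- v[i - 2] + v[i - 1]
        }
    }
    return(v)
}

fibSeqPrealloc(1)
fibSeqPrealloc(10)   # 1 1 2 3 5 8 13 21 34 55
</source>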
 
 
 
===Coding style===
 
Code is read much more often than it is written and it should always be your goal to write as clearly as possible. '''R''' has many complex idioms, and as a functional language that can generally insert functions anywhere into expressions it is possible to write very terse, expressive code. Don't do it. Pace yourself, and make sure your reader can follow your flow of thought. More often than not the poor soul who will be confused by a particularly witty use of the language will be you, yourself, half a year later. There is an astute observation by Brian Kernighan that applies completely:
 
 
:"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it."
 
 
 
<div class="mw-collapsible mw-collapsed" data-expandtext="Expand" data-collapsetext="Collapse" style="border:#000000 solid 1px; padding: 10px;">A number of guides for '''R''' coding style are available ... expand here to see what I recommend (or require).<div  class="mw-collapsible-content">
 
 
 
<div class="colmask doublepage">
 
  <div class="colleft">
 
    <div class="col1">
 
      <!-- Column 1 start -->
 
* Use informative and specific '''filenames''' for code sources; give them the extension {{c|.R}}
 
* Give your '''sources''' headers stating purpose, author, date and version information, and note bugs and issues.
 
* Give your '''functions''' headers that describe purpose, arguments (including required datatypes), and return values. Callers should be able to work with the function without having to read the code.
 
* Use lots of '''comments'''. Never describe what the code does, but explain '''why'''.
 
* Use '''separators''' ({{c|# --- SECTION -----------------}}) to structure your code.
 
* '''Indent''' comment hashes to align with the expressions in a block.
 
* Use only {{c|<-}} for assignment, not {{c|{{=}}}}
 
* ...but do use {{c|{{=}}}} when defining parameters of functions.
 
* Don't use {{c|<<-}} (global assignment) except in very unusual cases.
 
* Use the concise {{c|camelCaseStyle}} for variable names, never use the {{c|confusing.dot.style}} or the rambling {{c|separating_with_underscores_style}}.
 
* Define parameters at the beginning of the code, use all caps variable names ({{c|MAXWIDTH}}). Never have "magic numbers" appear in your code.
 
* In mathematical expressions, always use '''parentheses''' to define priority. Never rely on convention. {{c|(( 1 + 2 ) / 3 ) * 4}}
 
* Always separate operators and arguments with spaces.<ref>Separating operators with spaces is especially important for the assignment operator {{c|<-}}. Consider this: {{c| myPreciousData < -2}} returns a vector of {{c|TRUE}} and {{c|FALSE}}, depending on whether the values in {{c|myPreciousData}} are less than -2. But  {{c| myPreciousData<-2}} overwrites the entire vector with the single number {{c|2}}!</ref>
 
* Never separate function names and the brackets that enclose argument lists.
 
* Don't abbreviate argument names.
 
 
      <!-- Column 1 end -->
 
    </div>
 
    <div class="col2">
 
      <!-- Column 2 start -->
 
* Try to limit yourself to ~80 characters per line.
 
* Always use braces {{c|{}}}, even if you write single-line {{c|if}} statements and loops.
 
* Always use a '''space''' after a comma, and never before a comma.
 
* Always '''explicitly return''' values from functions, never rely on returning the last expression.
 
* Use spaces to '''align''' repeating parts of code, so errors become easier to spot.
 
* '''Don't repeat''' code. Use functions instead.
 
* Write out crucial arguments to functions, even if you think you know that this is redundant with the default.
 
* Never reassign reserved words.
 
* Don't use {{c|c}} as a variable name since {{c|c()}}  is a function.
 
* Don't call your data frames {{c|df}} since {{c|df()}} is a function.
 
* Don't use semicolons, and don't write more than one command on a line.
 
* Don't use {{c|attach()}}.
 
* Use {{c|for (i in seq_along(x)) {...}}} not {{c|for (i in 1:length(x)) {...}}}: if {{c|x}} is empty (e.g. {{c|NULL}}), {{c|1:length(x)}} evaluates to {{c|c(1, 0)}} and the loop body runs twice instead of not at all.
 
* If possible, do not grow data structures. Preallocate the whole structure at its final size (e.g. filled with {{c|NA}} values), then fill in the elements. This is '''much''' faster.
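
The last two points can be checked with a few lines of code. This is a sketch under simple assumptions: the counters {{c|nBad}} and {{c|nGood}} and the vector {{c|squares}} are illustrative names only.

<source lang="rsplus">
x <- NULL

# 1:length(x) counts down from 1 to 0, so the loop body runs twice
# even though x is empty.
nBad <- 0
for (i in 1:length(x)) {
    nBad <- nBad + 1
}

# seq_along(x) is empty for an empty x, so the loop body is skipped.
nGood <- 0
for (i in seq_along(x)) {
    nGood <- nGood + 1
}

print(c(nBad, nGood))

# Preallocate the result at its final size, then fill it in,
# rather than growing it with c() in every iteration.
N <- 1000
squares <- numeric(N)
for (i in seq_len(N)) {
    squares[i] <- i^2
}
</source>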
 
 
 
;Specific naming conventions I like:
 
:{{c|isValid}}, {{c|hasNeighbour}}  ... Boolean variables
 
:{{c|findRange()}}, {{c|getLimits()}} ... simple function names (verbs!)
 
:{{c|initializeTable()}} ... not {{c|initTab()}}
 
:{{c|node}} ... for one element; {{c|nodes}} ... for more elements
 
:{{c|nPoints}} ... for number-of
 
:{{c|isError}} ... not {{c|isNotError}}: avoid double negation
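
Put together, a function header and these naming conventions might look like the following sketch. The function {{c|countValid()}} is a made-up example, and the header layout is only one possible way to state purpose, arguments and return value.

<source lang="rsplus">
# countValid()
# Purpose:   count the elements of a vector that are not NA.
# Arguments:
#     values   numeric   the data to check
# Value:     integer     the number of non-missing elements
countValid <- function(values) {
    isValid <- !is.na(values)    # Boolean vector: "is..." prefix
    nValid <- sum(isValid)       # number-of: "n..." prefix
    return(nValid)               # explicit return
}

countValid(c(1.5, NA, 3.0))
</source>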
 
 
<hr />
 
 
Consider using the [http://cran.r-project.org/web/packages/formatR/index.html '''formatR'''] package for consistent code.
 
 
 
And seriously - [http://xkcd.com/1513/ XKCD: Code Quality]
 
 
 
 
      <!-- Column 2 end -->
 
    </div>
 
  </div>
 
</div>
 
 
 
 
 
</div>
 
</div>
 
 
 
&nbsp;
 
 
===Debugging===
 
 
Don't even ''think'' of sprinkling {{c|print()}} statements into your code to help you find out where something went wrong, when it goes wrong. From the beginning of your programming work, make yourself familiar with the debug functions. There are three simple concepts to remember:
 
* Debugging is done by entering a "browser" mode that allows you to step through a function.
 
* Call {{c|debug(''function'')}} to invoke the mode when the function is next executed,  {{c|undebug(''function'')}} to clear the debugging mode.
 
* Insert {{c|browser()}} into your function code to enter the browser mode. This sets a ''breakpoint'' in your function; use {{c|if (condition) browser()}} to set a ''conditional breakpoint'' that is triggered only when the condition is true.
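
As a rough sketch of the three concepts: {{c|safeLog()}} below is a made-up function with a conditional breakpoint. The {{c|interactive()}} check is an addition of this sketch, so that the breakpoint is skipped when the code runs as a batch script.

<source lang="rsplus">
safeLog <- function(x) {
    # Stop and inspect only when the input is suspicious. The interactive()
    # check keeps scripts that run in batch mode from pausing here.
    if (interactive() && any(x <= 0)) {
        browser()
    }
    return(log(x))
}

safeLog(c(1, 10, 100))    # condition is FALSE: runs straight through

debug(safeLog)            # the next call to safeLog() would open the browser
isdebugged(safeLog)       # TRUE while the flag is set
undebug(safeLog)          # clear the flag again
</source>

Inside the browser, {{c|n}} executes the next statement, {{c|c}} continues to the end of the function, and {{c|Q}} quits the browser and returns to the console.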
 
 
Here is an example: let's write a {{c|rollDice()}} function, i.e. a function that creates a vector of {{c|len}} integers between {{c|MIN}} and {{c|MAX}}, where {{c|MAX}} is the number of faces on your die.
 
 
<source lang="rsplus">
 
rollDice <- function(len=1, MIN=1, MAX=6) {
 
    v <- rep(0, len)
 
    for (i in 1:len) {
 
        x <- runif(1, min=MIN, max=MAX)
 
        x <- as.integer(x)
 
        v[i] <- x
 
    }
 
    return(v)
 
}
 
</source>
 
 
Let's try running this...
 
<source lang="rsplus">
 
rollDice()
 
table(rollDice(1000))
 
</source>
 
 
Problem: we see only values from 1 to 5. Why? Let's flag the function for debugging...
 
<source lang="rsplus">
 
debug(rollDice)
 
rollDice(10)
 
debugging in: rollDice(10)
 
debug at #1: {
 
    v <- rep(0, len)
 
    for (i in 1:len) {
 
        x <- runif(1, min = MIN, max = MAX)
 
        x <- as.integer(x)
 
        v[i] <- x
 
    }
 
    return(v)
 
}
 
Browse[2]>
 
debug at #2: v <- rep(0, len)
 
Browse[2]>
 
debug at #3: for (i in 1:len) {
 
    x <- runif(1, min = MIN, max = MAX)
 
    x <- as.integer(x)
 
    v[i] <- x
 
}
 
Browse[2]>
 
debug at #4: x <- runif(1, min = MIN, max = MAX)
 
Browse[2]>
 
debug at #5: x <- as.integer(x)
 
Browse[2]> x  # Here we examine the current value of x
 
[1] 4.506351
 
Browse[2]>
 
debug at #6: v[i] <- x
 
Browse[2]>
 
debug at #4: x <- runif(1, min = MIN, max = MAX)
 
Browse[2]> v
 
[1] 4 0 0 0 0 0 0 0 0 0    # Aha: as.integer() truncates, but doesn't round!
 
Browse[2]> Q
 
undebug(rollDice)
 
</source>
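
The same truncation can be confirmed directly at the console, without the debugger. A minimal sketch; the test values are arbitrary.

<source lang="rsplus">
x <- c(1.2, 4.506351, 5.999)
as.integer(x)    # drops the fractional part
floor(x)         # same values for positive numbers

# runif(n, min = 1, max = 6) stays below 6, so the truncated
# value can never be 6.
rolls <- as.integer(runif(1000, min = 1, max = 6))
range(rolls)
</source>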
 
 
 
We need to change the range of the random input values...
 
<source lang="rsplus">
 
rollDice <- function(len=1, MIN=1, MAX=6) {
 
    v <- rep(0, len)
 
    for (i in 1:len) {
 
        x <- runif(1, min=MIN, max=MAX+1)

        x <- as.integer(x)

        v[i] <- x
 
    }
 
    return(v)
 
}
 
table(rollDice(1000))
 
</source>
 
 
 
Now the output looks correct.
 
<source lang="rsplus">
 
# Disclaimer 1: this function would be better
 
# written as ...
 
 
rollDice <- function(len=1, MIN=1, MAX=6) {
 
    return(as.integer(runif(len, min=MIN, max=MAX+1)))
 
}
 
 
# Check the output:
 
table(rollDice(1000))
 
 
# This works, since runif() can return a vector of deviates,
 
# but if we write the function this way we can't check the value of
 
# individual trials.
 
 
 
# Disclaimer 2: the function relies on as.integer() dropping the digits after

# the decimal point when it converts. R documents that as.integer() truncates

# non-integral values towards zero, so the result is correct, but the

# truncation is implicit. More explicit and therefore clearer would be to use

# floor(): there, rounding down is not a side effect but the stated purpose.

# The distinction matters: if the conversion rounded to the nearest integer

# instead, we would get a wrong distribution! An error that may be hard to

# spot. (You can easily try the round() function and think about how the

# result is wrong.)
 
 
# A better alternative is thus to write:
 
 
rollDice <- function(len=1, MIN=1, MAX=6) {
 
    return(floor(runif(len, min=MIN, max=MAX+1)))
 
}
 
 
 
 
# Disclaimer 3
 
# A base R function exists that already rolls dice in the required way: sample()
 
 
table(sample(1:6, 1000, replace=TRUE))
 
</source>
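
The effect of rounding mentioned in Disclaimer 2 can be made visible with a short simulation. This is only a sketch: the seed and the number of rolls are arbitrary choices and not part of the example above.

<source lang="rsplus">
set.seed(112358)    # arbitrary seed, only to make the run repeatable
N <- 60000

wrongRolls <- round(runif(N, min = 1, max = 6))
rightRolls <- floor(runif(N, min = 1, max = 7))

# With round(), the faces 1 and 6 each collect values from an interval of
# width 0.5, while the faces 2 to 5 collect from intervals of width 1.
table(wrongRolls)
table(rightRolls)
</source>

In the first table the faces 1 and 6 should come up only about half as often as the others; in the second table all six faces should be roughly equally frequent.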
 
 
 
 
For visual debugging with '''RStudio''', see [http://www.r-bloggers.com/visual-debugging-with-rstudio/ this article on R-bloggers].
 
 
For a deeper excursion into '''R''' debugging, see [http://www.stats.uwo.ca/faculty/murdoch/software/debuggingR/debug.shtml this overview by Duncan Murdoch at UWO], and {{PDFlink|[http://www.biostat.jhsph.edu/~rpeng/docs/R-debug-tools.pdf Roger Peng's introduction to R debugging tools]}}.
 
 
 
&nbsp;
 
 
===Finishing===
 
 
This concludes our introduction to '''R'''. We haven't actually done anything significant with the language yet, but we have developed the basic skills to begin. The next steps should be:
 
* read data;
 
* organize it;
 
* explore it with descriptive statistics;
 
* explore patterns, structures and relationships within the data graphically (graphics, graphics and more graphics);
 
* formulate hypotheses;
 
* test the hypotheses.
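
As a rough sketch of what these steps can look like in practice, the following uses {{c|ToothGrowth}}, a dataset that ships with '''R'''. The round trip through a temporary CSV file only stands in for reading your own data, and the t-test is just one possible choice of test.

<source lang="rsplus">
# Write the built-in data to a temporary file so there is something to read.
dataFile <- tempfile(fileext = ".csv")
write.csv(ToothGrowth, dataFile, row.names = FALSE)

# Read the data.
toothData <- read.csv(dataFile, stringsAsFactors = TRUE)

# Organize it: treat the three dose levels as categories.
toothData$dose <- factor(toothData$dose)

# Explore it with descriptive statistics.
summary(toothData)

# Explore it graphically.
boxplot(len ~ supp, data = toothData,
        xlab = "supplement", ylab = "tooth length")

# Hypothesis: mean tooth length differs between the two supplements.
tTest <- t.test(len ~ supp, data = toothData)
print(tTest$p.value)
</source>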
 
 
 
 
&nbsp;
 
 
==Notes==
 
<references />
 
 
 
 
&nbsp;
 
==Further reading and resources==
 
&nbsp;&nbsp;&nbsp;{{WP|R_(programming_language)|'''R''' on Wikipedia}}
 
<div class="reference-box">[http://cran.r-project.org/doc/manuals/R-intro.html Introduction to '''R''' at CRAN]</div>
 
<div class="reference-box">[http://cran.r-project.org/other-docs.html User-contributed documents about '''R''' at CRAN]  &ndash; including for example E. Paradis' ''R for Beginners'' and J. Lemon's ''Kickstarting R''.</div>
 
<div class="reference-box">[http://cran.r-project.org/web/views/ The "Task-views" section of CRAN]: thematically organized collections of '''R'''-packages.</div>
 
<div class="reference-box">[http://www.bioconductor.org/packages/release/BiocViews.html The '''"Views"''' section of Bioconductor], and <br />[http://www.bioconductor.org/help/workflows/ Bioconductor annotated '''workflows'''.]</div>
 
<div class="reference-box">[http://www.statmethods.net/index.html '''Quick-R''' how-to's ]</div>
 
<div class="reference-box">[http://stackoverflow.com/tags/r/info '''R''' tagged questions on ''stackoverflow''.]</div>
 
<div class="reference-box">[http://stats.stackexchange.com/ '''Cross Validated''' statistics questions on ''stackexchange''.]</div>
 
 
 
&nbsp;
 
 
[[Category:Applied_Bioinformatics]]
 
[[Category:R]]
 
</div>
 
</div>

Latest revision as of 15:52, 8 May 2018

R tutorial






The Units

 
Start with this


 
Install R and make sure everything works


 
Explore how to get R to work with data


 
The one unit that will save your ***, over and over again


 
First steps towards programming


 
Maybe optional? Meh, just work through this anyway, as time permits. It'll be on the exam.


 

Notes