Difference between revisions of "R tutorial"

From "A B C"
  
  
{{dev}}
This is a hub for a first introduction to '''R''', for students of one of my workshops or courses. I have subdivided the material into (somewhat) independent learning units that you can work through at your own pace, but in sequence.

The units have ''Deliverables'' and ''Prerequisites'' - please ignore these sections, they are for use in a more formal course setting.
  
This is a tutorial introduction to '''R''' for users with no previous background in the platform or the language.  
You need to work through these units '''before''' you come to the workshop. There are two reasons:

* (i) Installation of software is very specific to your computer and we can't walk you through this in a room full of people. It would take so much time that we wouldn't get anything else done.
* (ii) When you are working with '''R''', as with any computer language or natural language, the key is repetition, repetition, repetition. The more you prime yourself with this material, the more you will profit when we actually meet in class. I hope to see everyone radiant and elated, and not lost before we even begin. Let's do this!
  
__TOC__
 
  
  
 
 
==The environment==
 
In this section we discuss how to download and install the software, how to configure an '''R''' session and what work with the '''R''' environment includes.
 
  
===Installation===
 
  
# Navigate to http://probability.ca/cran/ <ref>This is the CRAN mirror site at the University of Toronto, any other mirror site will do. You may access a choice of mirror sites from the [http://r-project.org '''R'''-project homepage].</ref> and follow the link to your computer's operating system.
 
# Download a precompiled binary (or "build") of the R "framework" to your computer and follow the instructions for installing it. You don't need tools, or GUI versions for now, but do make sure that the program is the correct one for your '''version''' of your operating system.
 
# Launch '''R'''.
 
  
The program should open a window&ndash;the "R console"&ndash;and greet you with its ''input prompt'', awaiting your input:
+
==The Units==
>
 
  
 +
{{Smallvspace}}
  
The sample code on this page sometimes copies input/output from the console, and sometimes shows only the actual commands. The <code>&gt;</code> character at the beginning of a line is always just '''R''''s ''input prompt''; it is shown here only to illustrate the interactive use of the program and you do not need to type it. If a line starts with <code>[1]</code> or similar, this is '''R''''s ''output'' on the console. A <code>#</code> character marks the following text as a comment, which is not executed by '''R'''. In principle, you can copy commands and paste them into the console, or into a script; obviously, you don't need to copy the comments. In addition, I use [http://www.mediawiki.org/wiki/Extension:SyntaxHighlight_GeSHi syntax highlighting] on '''R''' scripts, to color language keywords, numbers, strings, etc. differently from other text. This improves readability, but keep in mind that the colours you see on your computer will be different. One more thing about the console: use your keyboard's ''up-arrow'' key to retrieve previous commands, then move into the line with ''left-arrow'' to edit it; hit ''enter'' to execute the modified line.
; Start with this:
* [[FND-Biocomputing_setup| Set up your computer for biocomputing work]]
  
===User interface===
+
{{Smallvspace}}
  
R comes with a GUI<ref>Graphical User Interface</ref> to lay out common tasks. For example, there are a number of menu items, many of which are similar to other programs you will have worked with ("File", "Edit", "Format", "Window", "Help"  ...). All of these tasks can also be accessed through the command line. In general, GUIs are useful when you are not  sure what you want to do or how to go about it; the command line is much more powerful when you have more experience and know your way around in principle. '''R''' gives you both options.
; Install R and make sure everything works:
* [[RPR-Installation| Installing R and RStudio]]
* [[RPR-Setup| Setup]]
* [[RPR-Console| The "Console"]]
* [[RPR-Help| Getting Help]]
  
In addition to the ''Console'', there are a number of other windows that you can open (or that open automatically). They all can be brought to the foreground with the '''Windows''' menu and include help, plotting, package browser and other windows.
+
{{Smallvspace}}
  
Let's begin with a glossary of some terms that '''R''' uses and how they relate to your work:
; Explore how to get '''R''' to work with data:
* [[RPR-Syntax_basics| R Syntax]]
* [[RPR-Objects-Vectors| Vectors]]
* [[RPR-Objects-Data_frames| Data frames]]
* [[RPR-Objects-Lists| Lists]]
  
;Help
+
{{Smallvspace}}
:Help is available for all commands and for the R command line syntax. As well, help is available to find the names of commands when you are not sure of them.
 
  
; The one unit that will save your ***, over and over again:
* [[RPR-Subsetting| Subsetting and Filtering]]
  
<small>("help" is a function, arguments to a function are passed in parentheses "()")</small>
+
{{Smallvspace}}
<source lang="rsplus">
 
> help(rnorm)
 
>
 
</source>
 
  
; First steps towards programming:
* [[RPR-Subsetting| Subsetting and Filtering]]
* [[RPR-Control_structures| Control structures]]
* [[RPR-Functions| Functions]]
  
<small>(shorthand for the same thing)</small>
+
{{Smallvspace}}
<source lang="rsplus">
 
> ?rnorm
 
>
 
</source>
 
  
; Maybe optional? Meh, just work through this anyway, as time permits. It'll be on the exam.
* [[RPR-Subsetting| Subsetting and Filtering]]
* [[RPR-Plotting| First Plots]]
* [[RPR-Coding_style| Coding Style]]
  
<small>(what was the name of that again ... ?)</small>
 
<source lang="rsplus">
 
> ?binom   
 
No documentation for 'binom' in specified packages and libraries:
 
you could try '??binom'
 
> ??binom
 
>
 
</source>
 
  
 
+
{{Vspace}}
<small>(found "Binomial" in the list of keywords)</small>
 
<source lang="rsplus">
 
> ?Binomial
 
>
 
</source>
 
 
 
 
 
That's all fine, but you will soon notice that '''R''''s help documentation is not all that helpful for newcomers (who need the most help). If you look at the bottom of a help page, you will usually find examples of command usage; these often make matters clearer than the terse and principled help text above. Or you can just Google for what interests you; this is often the quickest way to find working example code. A Google search may also reveal that something ''can't'' be done (easily)&ndash;and you won't find things that can't be done at all in the help system. You may want to include {{c|"r language"}} in your search terms, although Google is usually pretty good at figuring out what kind of "r" you are looking for, if your query includes a few terms vaguely related to statistics.
 
 
 
There is also an active [https://stat.ethz.ch/mailman/listinfo/r-help '''R-help mailing list'''] which you can post to&ndash;or at least search the archives: your question probably has been asked and answered before.
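
If you only remember part of a function name, the following minimal sketch may help. It assumes nothing beyond a standard '''R''' installation; {{c|apropos()}} and {{c|example()}} are part of the default packages, and the variable name <code>hits</code> is just an illustration.

<source lang="rsplus">
# Find all visible objects whose name contains "binom"
hits <- apropos("binom")
hits

# Run the examples that are listed at the bottom of a help page
example(mean)
</source>

The names that {{c|apropos()}} returns can then be passed to {{c|help()}} or <code>?</code> as shown above.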
 
 
 
 
 
;Working directory
 
To locate a file on a computer, one has to specify the ''filename'' and the directory in which the file is stored; this is sometimes called the ''path'' of the file. The "working directory" for '''R''' is either the directory in which the '''R'''-program has been installed, or some other directory, as initialized by a startup script. You can execute the command <code>getwd()</code> to list what the "Working Directory" is currently set to:
 
 
 
 
 
<source lang="rsplus">
 
> getwd()
 
[1] "/Users/steipe/R"
 
</source>
 
 
 
 
 
It is convenient to put all your '''R'''-input and output files into a project specific directory and then define this to be the "Working Directory". Use the {{c|setwd()}} command for this. {{c|setwd()}} requires a parameter in its parentheses: a string with the directory path. Strings in R are delimited with {{c|"}} or <code>'</code> characters. If the directory does not exist, an Error will be reported. Make sure you have created the directory. On Mac and Unix systems, the usual shorthand notation for relative paths can be used: <code>~</code> for the home directory, <code>.</code> for the current directory, <code>..</code> for the parent of the current directory.
 
 
 
 
 
{{console|My home directory...
 
|> setwd("~")
 
> getwd()
 
[1] "/Users/steipe"
 
}}
 
 
 
{{console|Relative path: home directory, up one level, then down into chen's home directory
 
|> setwd("~/../chen") 
 
> getwd()
 
[1] "/Users/chen"
 
}}
 
 
 
{{console|Absolute path: specify the entire string
 
|> setwd("/Users/steipe/abc/R_samples") 
 
> getwd()
 
[1] "/Users/steipe/abc/R_samples"
 
}}
 
 
 
{{task|
 
# Create a directory for your sample files and use {{c|setwd("''your-directory-name''")}} to set the working directory.
 
# Confirm that this has worked by typing {{c|getwd()}}.
 
}}
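
The task above can also be done entirely from within '''R'''. This is only a sketch: the directory name <code>R_samples</code> and its location inside <code>tempdir()</code> are assumptions for illustration, and you would normally use a directory of your own choice.

<source lang="rsplus">
# Remember where we started, so we can go back
oldDir <- getwd()

# Build a path and create the directory (example location only)
projDir <- file.path(tempdir(), "R_samples")
dir.create(projDir, showWarnings = FALSE)

# Make it the working directory and confirm
setwd(projDir)
basename(getwd())

# Return to the original working directory
setwd(oldDir)
</source>

Using <code>file.path()</code> rather than pasting strings together keeps the path separators correct on all operating systems.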
 
 
 
The ''Working Directory'' functions can also be accessed through the Menu, under '''Misc'''.
 
 
 
 
 
;Workspace
 
During an '''R''' session, you might define a large number of variables, datastructures, load packages and scripts etc. All of this information is stored in the so-called "Workspace". When you quit '''R''' you have the option to save the Workspace; it will then be reloaded in your next session.
 
 
 
 
 
{{console|List the current workspace contents: initially it is empty. (R reports an object of type "character" with a length of 0.)
 
|> ls()
 
character(0)
 
>
 
}}
 
 
 
{{console|Initialize three variables (multiple commands on one line can be separated with a semicolon ";")

|> a <- 1; b <- 2; eps <- 0.0001
 
> ls()
 
[1] "a"  "b"  "eps"
 
>
 
}}
 
 
 
{{console|Remove one item. (Note: the parameter is not the string "''a''", but the variable name ''a''.)
 
|> rm(a)
 
> ls()
 
[1] "b"  "eps"
 
>
 
}}
 
 
 
<small>We can use the output of {{c|ls()}} as input to {{c|rm()}} to remove everything and clear the Workspace. (cf. {{c|?rm}} for details)</small>
 
<source lang="rsplus">
 
> rm(list = ls())
 
> ls()
 
character(0)
 
>
 
</source>
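
Besides saving the whole Workspace when you quit, you can save and restore selected objects yourself. The following is a minimal sketch; the object names <code>myVal</code> and <code>myEps</code> and the file name <code>demo.RData</code> are made up for this example.

<source lang="rsplus">
myVal <- 1
myEps <- 0.0001

# Save two objects to a file (here in a temporary directory)
wsFile <- file.path(tempdir(), "demo.RData")
save(myVal, myEps, file = wsFile)

# Remove them from the Workspace ...
rm(myVal, myEps)
exists("myVal", inherits = FALSE)   # FALSE: the object is gone

# ... and load them back from the file
load(wsFile)
myVal
myEps
</source>

{{c|save.image()}} does the same for the entire Workspace (cf. {{c|?save.image}}).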
 
 
 
 
 
&nbsp;
 
 
 
===Packages===
 
 
 
'''R''' has many powerful functions built in, but one of its greatest features is that it is easily extensible. Extensions have been written by legions of scientists for many years, most commonly in the '''R''' programming language itself, and made available through [http://cran.r-project.org/ '''CRAN'''&ndash;The Comprehensive R Archive Network]. A package is a collection of code, documentation and sample data files. To use packages, you need to install them (once), and add them to your current session (for every new session). You can get an overview of installed and loaded packages by opening the '''Package Manager''' window from the '''Packages & Data''' Menu item. It gives a list of the packages you currently have ''installed'', and identifies those that have been ''loaded'' at startup, or interactively.
 
 
 
{{task|
 
* Navigate to http://cran.r-project.org/web/packages/ and read the page.
 
* Navigate to http://cran.r-project.org/web/views/ (the CRAN task-views).

* Follow the link to '''Genetics''' and read the synopsis of available packages. The library {{c|seqinr}} sounds useful, but check first whether it is already installed.
 
{{console
 
|{{c|library()}} opens a window of installed packages in the library; {{c|search()}} shows which ones are currently loaded.
 
|> library()
 
> search()
 
[1] ".GlobalEnv"        "tools:RGUI"        "package:stats"    "package:graphics"
 
[5] "package:grDevices" "package:utils"    "package:datasets"  "package:methods" 
 
[9] "Autoloads"        "package:base"   
 
}}
 
 
 
 
 
* In the '''R packages available''' window, confirm that  {{c|seqinr}} is not yet installed.
 
* Follow the link to  {{c|seqinr}} to see what standard information is available with a package. Then follow the link to '''Reference manual''' to access the documentation pdf. Many packages also provide a "vignette": a longer document that contains usage hints and sample code.
 
{{console
 
| Read the help for  {{c|vignette}}. Note that there is a command to extract '''R''' sample code from a vignette, to experiment with it.
 
|> ?vignette
 
>
 
}}
 
 
 
 
 
{{console
 
| Install {{c|seqinr}} from the closest CRAN mirror and load it for this session. Explore some functions.
 
|> ??install
 
> ?install.packages
 
> install.packages("seqinr")
 
--- Please select a CRAN mirror for use in this session ---
 
trying URL 'http://probability.ca/cran/bin/macosx/contrib/2.13/seqinr_3.0-5.tgz'
 
Content type 'application/x-gzip' length 4528528 bytes (4.3 Mb)
 
opened URL
 
==================================================
 
downloaded 4.3 Mb
 
 
 
 
 
The downloaded packages are in
 
/var/folders/dq/dqPEEPbF0ApRU/-Tmp-//RtmpBlw/downloaded_packages
 
>
 
> library("seqinr")
 
> ls("package:seqinr")
 
  [1] "a"                      "aaa"                    "AAstat"               
 
  [4] "acnucclose"              "acnucopen"              "al2bp"                 
 
    [...]
 
[205] "where.is.this.acc"      "words"                  "words.pos"             
 
[208] "write.fasta"            "zscore"               
 
> ?a
 
> a("Tyr")
 
[1] "Y"
 
> choosebank()
 
[1] "genbank"      "embl"          "emblwgs"      "swissprot"    "ensembl"     
 
    [...]
 
[31] "refseqViruses"
 
}}
 
 
 
* The fact that these functions work shows that the library has been downloaded, installed and loaded, and its functions are now available. Just for fun and demonstration, let's use these functions to download a sequence and calculate some statistics (however, not to digress too far, without further explanation at this point). Copy the code below and paste it into the '''R'''-console.
 
 
 
<source lang="rsplus">
 
choosebank("swissprot")
 
query("seq", "N=MBP1_YEAST")
 
mbp1 <- getSequence(seq)
 
closebank()
 
x <- AAstat(mbp1[[1]])
 
barplot(sort(x$Compo))
 
</source>
 
 
 
}}
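
Whether a package is already installed can also be checked from code rather than from the package window. This is a sketch, not a required step of the task: {{c|requireNamespace()}} is a base '''R''' function, and the variable names <code>pkg</code> and <code>isInstalled</code> are only illustrations.

<source lang="rsplus">
pkg <- "seqinr"

# TRUE if the package can be found in your library, FALSE otherwise
isInstalled <- requireNamespace(pkg, quietly = TRUE)

if (isInstalled) {
  message(pkg, " is installed; load it with library(", pkg, ")")
} else {
  message(pkg, " is not installed; install it once with install.packages(\"", pkg, "\")")
}
</source>

This pattern is useful at the top of a script, so the script does not try to download a package every time it runs.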
 
 
 
===Scripts===
 
 
 
My preferred way of running '''R''' is not strictly through the console. I open a new file - a script - and enter my '''R''' commands into the file. Then I execute the commands directly from the script. I may try things in the console, experiment, change parameters etc. - but ultimately everything I do goes into the file. This has four major advantages:
 
 
 
* The script is an accurate record of my procedure so I know exactly what I have done;
 
* I add numerous comments to record what I was thinking when I developed it;
 
* I can immediately reproduce the entire analysis from start to finish, simply by rerunning the script;
 
* I can reuse parts easily, thus making new analyses quick to develop.
 
 
 
{{task|
 
* Use the ''File'' menu to open a ''New Document''.
 
* Enter the following code (copy from here and paste):
 
<source lang="rsplus">
 
# sample script:
 
# define a vector
 
a <- c(1, 1, 2, 3, 5, 8, 13)
 
# list its contents
 
a
 
# calculate the mean of its values
 
mean(a)
 
 
 
</source>
 
* save the file in your working directory (e.g. with the name <code>sample.R</code>).
 
 
 
Placing the cursor in a line and pressing <code>command-return</code> (on the Mac, <code>ctrl-r</code> on Windows) will execute that line and you see the result on the console. You can also select more than one line and execute the selected block with this shortcut. Alternatively, you can run the entire file. In the console type:
 
 
 
<source lang="rsplus">
 
source("sample.R")
 
</source>
 
 
 
However: this will not print output to the console. When you run a script, if you want to see text output you need to explicitly <code>print()</code> it.
 
 
 
*Change your script to the following, save it and <code>source()</code> it.
 
<source lang="rsplus">
 
# sample script:
 
# define a vector
 
a <- c(1, 1, 2, 3, 5, 8, 13)
 
# list its contents
 
print(a)
 
# calculate the mean of its values
 
print(mean(a))
 
 
 
</source>
 
 
 
* Confirm that the <code>print(a)</code> command also works when you execute the line directly from the script.
 
 
 
}}
 
 
 
Nb. if you want to save your output to file, you can divert it to a file with the <code>sink()</code> command. You can read about the command by typing:
 
 
 
<source lang="rsplus">
 
?sink
 
</source>
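
As a minimal sketch of how {{c|sink()}} might be used (the temporary file and the variable name <code>outFile</code> are assumptions for this example): output is diverted to a file, and a second call to {{c|sink()}} without arguments returns output to the console.

<source lang="rsplus">
# Choose a file for the output (a temporary file in this example)
outFile <- tempfile(fileext = ".txt")

sink(outFile)                          # from here on, output goes to the file
print(mean(c(1, 1, 2, 3, 5, 8, 13)))
sink()                                 # output goes to the console again

# Check what was written
readLines(outFile)
</source>

Do not forget the closing {{c|sink()}}, otherwise you will not see any further output on the console.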
 
 
 
 
 
&nbsp;
 
 
 
==Simple commands==
 
 
 
The '''R''' command line evaluates expressions. Expressions can contain constants, variables, operators and functions of the various datatypes '''R''' recognizes.
 
 
 
===Operators===
 
 
 
The common arithmetic operators are recognized in the usual way. Try the following operators on numbers:
 
 
 
<source lang="rsplus">
 
5
 
5 + 3
 
5 + 1 / 2
 
3 * 2 + 1
 
3 * (2 + 1)
 
2^3 # Exponentiation
 
8 ^ (1/3) # Cube root via exponentiation
 
7 %% 2  # Modulo operation (remainder of integer division)
 
7 %/% 2 # Integer division
 
</source>
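
Two details of the operators above are easy to get wrong, so here is a short sketch to experiment with. It uses only base '''R''' operators and illustrates how integer division and modulo fit together, and that exponentiation binds more tightly than a leading minus sign.

<source lang="rsplus">
7 %/% 2                     # integer division: 3
7 %% 2                      # remainder: 1
(7 %/% 2) * 2 + (7 %% 2)    # quotient times divisor plus remainder gives 7 again

-2^2                        # evaluated as -(2^2), i.e. -4
(-2)^2                      # parentheses change the order: 4
</source>

When in doubt about operator precedence, add parentheses (cf. {{c|?Syntax}}).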
 
 
 
===Functions===
 
 
 
Most of '''R''''s functionality is expressed through functions. These are either defined by default (''built-in''), loaded in specific packages (see above), or they can be easily defined by you (see below). In general a function is invoked through a name, followed by one or more arguments (also ''parameters'') in parentheses, separated by commas. Whenever I refer to a function, I write the parentheses to identify it as such and not a constant or other keyword, e.g. <code>log()</code>. Here are some examples for you to try and play with:
 
 
 
<source lang="rsplus">
 
cos(pi) #"pi" is a predefined constant.
 
sin(pi) # Note the rounding error. This number is not really different from  zero.
 
sin(30 * pi/180) # Trigonometric functions use radians as their argument - this conversion calculates sin(30 degrees)
 
exp(1) # "e" is not predefined, but easy to calculate.
 
log(exp(1)) # functions can be arguments to functions - they are evaluated from the inside out.
 
log(10000) / log(10) # log() calculates natural logarithms; convert to any base by dividing by the log of the base. Here: log to base 10.
 
exp(complex(r=0, i=pi)) #Euler's identity
 
</source>
 
 
 
There are several ways to populate the argument list for a function and '''R''' makes a reasonable guess what you want to do. Consider the specification of a complex number in Euler's identity above. The function <code>complex()</code> can work with a number of arguments that are given in the documentation (see: <code>?complex</code>). These include <code>length.out</code>, <code>real</code>, <code>imaginary</code>, and some more. The <code>length.out</code> argument creates a vector with one or more complex numbers. If nothing else is specified, this will be a vector of complex zero(s). If there are two, or three arguments, they will be placed in the respective slots. However, since the arguments are '''named''', we can also define which slot of the argument list they should populate. Consider the following to illustrate this:
 
 
 
<source lang="rsplus">
 
complex(1)
 
complex(4)
 
complex(1, 2) # imaginary part missing - defaults to zero
 
complex(1, 2, 3) # one complex number
 
complex(4, 2, 3) # four complex numbers
 
complex(real = 0, imaginary = pi) # defining via named parameters
 
complex(imaginary = pi, real = 0) # same thing - if names are used, order is not important
 
complex(re = 0, im = pi) # names can be abbreviated ...
 
complex(r = 0, i = pi) # ... to the shortest string that is unique among the named parameters. Use this with discretion to keep your code readable.
 
complex(i = pi, 1, 0) # Think: what have I done here? Why does this work?
 
</source>
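
The same matching rules apply to any function, not just <code>complex()</code>. The sketch below uses a small function of our own to show positional, named and abbreviated arguments side by side; the function <code>area()</code> and its parameters <code>width</code> and <code>height</code> are made up for this illustration (defining functions is covered further below).

<source lang="rsplus">
# A hypothetical function with two named parameters; height has a default value
area <- function(width, height = 1) {
  width * height
}

area(2)                       # height is missing and defaults to 1
area(2, 3)                    # positional: width = 2, height = 3
area(height = 3, width = 2)   # named: order does not matter
area(h = 3, w = 2)            # abbreviated names, unique among the parameters
</source>

The last three calls all return the same value. In scripts that other people will read, I would spell out the full parameter names.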
 
 
 
===Variables===
 
In order to store the results of evaluations, you can freely assign them to variables. Variables are created internally whenever you first use them (''i.e.'' they are allocated and instantiated). Variable names are case sensitive. There are a small number of reserved strings, and a very small number of predefined constants, such as <code>pi</code>. However these constants can be overwritten - be careful. Read more at:
 
 
 
<source lang="rsplus">
 
?make.names
 
?reserved
 
</source>
 
 
 
To assign a value to a variable, use the assignment operator <code>&lt;-</code>. You could also use the <code>=</code> sign, but this is too easily confused with the equality test <code>==</code>, and such errors are hard to debug. Try:
 
 
 
<source lang="rsplus">
 
a <- 5
 
a
 
a + 3
 
b <- 8
 
b
 
a + b
 
a == b # equality test
 
a != b # not equal
 
a < b  # less than
 
 
 
</source>
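
To see what "overwriting a predefined constant" means in practice, consider the following small sketch. It relies only on base '''R''' behaviour: an assignment creates a variable of your own that hides the constant, and removing that variable makes the original visible again.

<source lang="rsplus">
pi          # the predefined constant
pi <- 3     # this creates a variable "pi" in your Workspace that hides the constant
pi
rm(pi)      # remove your variable ...
pi          # ... and the predefined constant is visible again
</source>

This is harmless here, but in a long script an accidental assignment like this can produce wrong results without any error message.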
 
 
 
Note that '''all''' of '''R''''s data types can be assigned to variables (as well as functions and other objects).
 
 
 
 
 
 
 
&nbsp;
 
 
 
==Scalar data==
 
 
 
''Scalars'' are single numbers, the "atomic" parts of more complex datatypes. We have covered many of the properties of scalars above, e.g. the use of constants and their assignment to variables. To round this off, here are some remarks on the types of scalars '''R''' uses and on ''coercion'', or casting one datatype into another. The following scalar types are supported:
 
 
 
* Boolean constants: <code>TRUE</code> and <code>FALSE</code>. This type has the "mode" ''logical'';

* Integers and floats (floating point numbers). These types have the mode ''numeric''. Complex numbers have the mode ''complex'';
 
* Strings. These have the mode ''character''.
 
 
 
Other modes exist, such as <code>list</code>, <code>function</code> and <code>expression</code>, all of which can be combined into complex objects.
 
 
 
The function <code>mode()</code> returns the mode of an object and <code>typeof()</code> returns its type. Consider:
 
 
 
<source lang="rsplus">
 
a <- 3 > 5; a; mode(a); typeof(a) # Note: 3 > 5 is a logical expression, its value is FALSE.
 
a <- 3 < 5; a; mode(a); typeof(a)
 
 
 
a <- 3.0;  a;  mode(a); typeof(a) # Double precision floating point number
 
a <- 3.0e0; a;  mode(a); typeof(a) # Same value, exponential notation
 
 
 
a <- 3;    a;  mode(a); typeof(a) # Note: numbers are double precision floats by default.
 
a <- as.integer(3);  a;  mode(a); typeof(a) # If we really want an integer, we must coerce to type integer.
 
 
 
a <- "3"; a;  mode(a); typeof(a) # Forcing the number to be interpreted as a character.
 
 
 
# More coercions. For each of these, first think what result you would expect:
 
as.numeric("3") # character as numeric
 
as.numeric("3.141592653") # string as numeric
 
as.numeric("pi") # another string as numeric
 
as.numeric(pi) # not a string, but a predefined constant
 
 
 
as.logical(0)
 
as.logical(1)
 
as.logical(-1)
 
as.logical(pi) # any non-zero number is TRUE ...
 
as.logical("pi") # ... but not non-numeric types. NA is "Not Available".
 
</source>
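
Coercion also happens implicitly, without any <code>as.</code>''...''<code>()</code> function, whenever values of different types are combined. The following sketch uses <code>c()</code>, which is introduced in the next section; the variable names are only illustrations.

<source lang="rsplus">
x1 <- c(1, TRUE, FALSE)     # logicals are coerced to numbers: 1 1 0
x1
typeof(x1)                  # "double"

x2 <- c(1, "a", TRUE)       # everything is coerced to character
x2
typeof(x2)                  # "character"

# A failed explicit coercion gives NA and a warning
x3 <- suppressWarnings(as.numeric("pi"))
is.na(x3)
</source>

The general direction of implicit coercion is logical, then numeric, then character: the result takes the most general type that is present.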
 
 
 
 
 
&nbsp;
 
 
 
==Vectors==
 
 
 
Since we (almost) never do statistics on scalar quantities, '''R''' obviously needs ways to handle collections of data items. In its simplest form such a collection is a '''vector''': an ordered sequence of items of the same type. Vectors are created from scratch with the <code>c()</code> function which '''c'''oncatenates individual items into a vector. Vectors have properties, such as length; individual items in vectors can be combined in useful ways.
 
 
 
<source lang="rsplus">
 
#Create a vector and list its contents and length:
 
f <- c(1, 1, 2, 3, 5, 8, 13, 21)
 
f
 
length(f)
 
 
 
# Various ways to retrieve values from the vector.
 
f[1] # By index: "1" is first element.
 
f[length(f)] # length() is the index of the last element.
 
1:4 # This is the range operator
 
f[1:4] # using the range operator (it generates a sequence and returns it in a vector)
 
f[4:1] # same thing, backwards
 
seq(from=2, to=6, by=2) # The seq() function is a flexible, generic way to generate sequences
 
seq(2, 6, 2) # Same thing: arguments in default order
 
f[seq(2, 6, 2)]
 
 
 
# ...using an index vector with positive indices
 
a <- c(1, 3, 4, 1) # the elements of index vectors must be valid indices of the target vector. The index vector can be of any length.
 
f[a] # Here, four elements are retrieved from f[]
 
 
 
# ...using an index vector with negative indices
 
a <- -(1:4) # If elements of index vectors are negative integers, the corresponding elements are excluded.
 
f[a] # Here, the first four elements are omitted from f[]
 
f[-((length(f)-3):length(f))] # Here, the last four elements are omitted from f[]. Note the parentheses: the range operator ":" binds more tightly than "-".
 
 
 
# ...using a logical vector
 
f>4 # A logical expression operating on the target vector returns a vector of logical elements. It has the same length as the target vector.
 
f[f>4]; # We can use this logical vector to extract only elements for which the logical expression evaluates as TRUE
 
 
 
# Example: extending the Fibonacci series for three steps.
 
# Think: How does this work? What numbers are we adding here and why does the result end up in the vector?
 
f <- c(f, f[length(f)-1] + f[length(f)]); f
 
f <- c(f, f[length(f)-1] + f[length(f)]); f
 
f <- c(f, f[length(f)-1] + f[length(f)]); f
 
 
 
</source>
 
 
 
Many operations on scalars can be simply extended to vectors and '''R''' computes them '''very''' efficiently by iterating over the elements in the vector.
 
 
 
<source lang="rsplus">
 
f
 
f+1
 
f*2
 
 
 
# computing with two vectors of same length
 
a <- f[-1]; a # like f[], but omitting the first element
 
b <- f[1:(length(f)-1)]; b # like f[], but shortened by the last element
 
c <- a / b # the "golden ratio", phi (~1.61803 or (1+sqrt(5))/2 ), an irrational number, is approximated by the ratio of two consecutive Fibonacci numbers.
 
c
 
abs(c - ((1+sqrt(5))/2)) # Calculating the error of the approximation, element by element
 
</source>
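
When two vectors have different lengths, '''R''' "recycles" the shorter one. This is a minimal sketch of that behaviour with made-up values; it is worth trying because recycling happens silently when the longer length is a multiple of the shorter one.

<source lang="rsplus">
x <- c(1, 2, 3, 4)
y <- c(10, 20)

x + y     # y is reused as 10, 20, 10, 20
x * 2     # a single number is a vector of length one and is recycled too
</source>

If the longer length is not a multiple of the shorter one, '''R''' still computes a result but issues a warning.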
 
 
 
 
 
&nbsp;
 
 
 
==Matrices==
 
 
 
If we need to operate with several vectors, or multi-dimensional data, we make use of ''matrices'' or more generally ''k''-dimensional ''arrays'' in '''R'''. Matrix operations are very similar to vector operations; in fact a matrix is a vector for which the number of rows and columns has been defined.

The most basic way to define these dimensions is the <code>dim()</code> function. Consider:
 
<source lang="rsplus">
 
a <- 1:12; a
 
dim(a) <- c(2,6); a
 
dim(a) <- c(2,2,3); a
 
</source>
 
 
 
Alternatively, matrices can be defined using the <code>matrix()</code> or <code>array()</code> functions (see there), or "glued" together from vectors by rows or columns, using the  <code>rbind()</code> or  <code>cbind()</code> functions respectively:
 
 
 
<source lang="rsplus">
 
a <- 1:4
 
b <- 5:8
 
c <- rbind(a, b); c
 
d <- cbind(a, b); d
 
e <- cbind(d, 9:12); e
 
</source>
 
 
 
Addressing (retrieving) individual elements or slices from matrices is simply done by specifying the appropriate indices, where a missing index indicates that the entire row or column is to be retrieved:
 
<source lang="rsplus">
 
e[1,] # first row
 
e[,2] # second column
 
e[3,2] # element at index row=3, column = 2
 
e[3:4, 1:2] # submatrix
 
</source>
 
 
 
Note that '''R''' has numerous functions to compute with matrices, such as transposition, multiplication, inversion, calculating eigenvalues and eigenvectors and more.
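
A few of these functions can be tried on a small example. This sketch uses only base '''R''' (<code>matrix()</code>, <code>t()</code>, <code>solve()</code> and the matrix product operator <code>%*%</code>); the matrix values are arbitrary.

<source lang="rsplus">
m <- matrix(c(2, 0, 1, 3), nrow = 2)   # values are filled in column by column
m

t(m)                  # transposition
mInv <- solve(m)      # inversion
m %*% mInv            # matrix multiplication: gives the identity matrix (up to rounding)
m * mInv              # careful: "*" multiplies element by element
</source>

See {{c|?eigen}} for eigenvalues and eigenvectors.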
 
 
 
&nbsp;
 
 
 
==Lists==
 
 
 
While the elements of matrices and arrays all have to be of the same type, lists are more generally ordered collections of ''components''. Lists are created with the <code>list()</code> function, which works similarly to the <code>c()</code> function. Components are accessed through their index in double square brackets, or through their name, if the name has been defined. Here is an example:
 
 
 
<source lang="rsplus">
 
 
 
pUC19 <- list(size=2686, marker="ampicillin", ori="ColE1", accession="L01397", BanI=c(235, 408, 550, 1647) )
 
pUC19[[1]]
 
pUC19[[2]]
 
pUC19$ori
 
pUC19$BanI[2]
 
 
 
</source>
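
Lists can be extended after they have been created, and single and double square brackets behave differently. The following sketch uses a shortened, made-up list <code>plasmid</code> to show both points.

<source lang="rsplus">
plasmid <- list(size = 2686, marker = "ampicillin")

plasmid$ori <- "ColE1"    # assigning to a new name adds a component
names(plasmid)
length(plasmid)

plasmid[["size"]]         # double brackets return the component itself
plasmid["size"]           # single brackets return a list with one component
is.list(plasmid["size"])
</source>

As a rule of thumb: use double brackets or <code>$</code> when you want to compute with the contents.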
 
 
 
 
 
&nbsp;
 
 
 
==Data frames==
 
 
 
Data frames combine features of lists and matrices. They are one of the most important data objects in '''R''', because the result of reading an input file is usually a data frame. Let's create a little data file and save it in the current working directory. You can use the "New document" command from the menu and save the following data as e.g. <code>vectors.tsv</code> (for "tab separated values").
 
 
 
Name	Size	Marker	Ori	Sites
pUC19	2686	Amp	ColE1	EcoRI, SacI, SmaI, BamHI, XbaI, PstI, HindIII
pBR322	4361	Amp, Tet	ColE1	EcoRI, ClaI, HindIII
pACYC184	4245	Tet, Cam	p15A	ClaI, HindIII
 
 
 
This data set uses tabs as column separators and it has a header line. Similar files can be exported from Excel or other spreadsheet programs. Read this as a data frame as follows:
 
 
 
<source lang="rsplus">
 
Vectors <- read.table("vectors.tsv", sep="\t", header=TRUE)
 
Vectors
 
</source>
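
If copying the data from this page loses the tab characters, the same data frame can be built directly in '''R'''. This is a sketch under that assumption: the variable <code>tsv</code> is made up, the tabs are written explicitly as <code>\t</code>, and the <code>text</code> argument of <code>read.table()</code> replaces the file name. I have added <code>stringsAsFactors = FALSE</code> so that text columns are read as plain character strings regardless of your version of '''R'''.

<source lang="rsplus">
# The contents of vectors.tsv as one string: "\t" is a tab, "\n" ends a line
tsv <- paste(
  "Name\tSize\tMarker\tOri\tSites",
  "pUC19\t2686\tAmp\tColE1\tEcoRI, SacI, SmaI, BamHI, XbaI, PstI, HindIII",
  "pBR322\t4361\tAmp, Tet\tColE1\tEcoRI, ClaI, HindIII",
  "pACYC184\t4245\tTet, Cam\tp15A\tClaI, HindIII",
  sep = "\n")

Vectors <- read.table(text = tsv, sep = "\t", header = TRUE, stringsAsFactors = FALSE)

str(Vectors)    # a compact summary: column names, types and first values
dim(Vectors)    # number of rows and columns
</source>

<code>str()</code> is a quick way to confirm that each column has been read with the type you expect.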
 
 
 
You can edit the data through a spreadsheet-like interface with the <code>edit()</code> function.
 
<source lang="rsplus">
 
V2 <- edit(Vectors)
 
</source>
 
 
 
 
 
Here is a collection of examples of subsetting data from this frame:
 
 
 
<source lang="rsplus">
 
Vectors[1, ]
 
Vectors[2, ]
 
Vectors[ ,2 ]
 
 
 
Vectors$Name
 
 
 
Vectors$Size > 3000
 
Vectors$Name[Vectors$Size > 3000]
 
Vectors$Name[Vectors$Ori != "ColE1"]
 
 
 
Vectors[order(Vectors$Size), ]
 
 
 
grep("Tet", Vectors$Marker)
 
Vectors[grep("Tet", Vectors$Marker), ]
 
Vectors[grep("Tet", Vectors$Marker), "Ori"]
 
as.vector(Vectors[grep("Tet", Vectors$Marker), "Ori"])
 
 
 
</source>
 
 
 
 
 
 
 
&nbsp;
 
 
 
 
 
==Writing your own functions==
 
 
 
Writing your own functions in '''R''' is easy and gives you access to flexible, powerful and reusable solutions. Functions are assigned to function names, and invoking the function returns some value, vector or other data object.
 
 
 
<source lang="rsplus">
 
lg <- function(x) { log(x) / log(10) }
 
lg(10000) # should be 4
 
</source>
 
 
 
We can use loops and control structures inside functions. For example the following creates a series of Fibonacci numbers.
 
 
 
<source lang="rsplus">
 
fib <- function(n) {
 
  if (n < 1) { return( c(0) ) }
 
  else if (n == 1) { return( c(1) ) }
 
  else if (n == 2) { return( c(1, 1) ) }
 
  else {
 
      v <- c(1, 1)
 
      for ( i in 3:n ) {
 
        v <- c(v, v[length(v)-1] + v[length(v)])
 
      }
 
      return( v )
 
  }
 
}
 
</source>
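
The function above grows its result vector by one element in every iteration. As an alternative sketch (not the version used in the course), the vector can be created at its final length first and then filled in. The name <code>fibPre()</code> is made up, and unlike <code>fib()</code> this variant returns an empty vector for <code>n < 1</code>.

<source lang="rsplus">
fibPre <- function(n) {
  if (n < 1) { return( numeric(0) ) }   # nothing to compute
  v <- numeric(n)                       # a vector of n zeros
  v[seq_len(min(n, 2))] <- 1            # the first one or two elements are 1
  if (n > 2) {
    for ( i in 3:n ) {
      v[i] <- v[i-2] + v[i-1]           # each element is the sum of the previous two
    }
  }
  return( v )
}

fibPre(1)
fibPre(10)
</source>

For short series the difference does not matter, but for long vectors preallocating is usually faster than extending a vector inside a loop.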
 
 
 
This concludes our introduction to '''R'''.
 
 
 
&nbsp;
 
  
 
==Notes==
 
<references />
  
 +
 +
{{Vspace}}
  
  
&nbsp;
+
----
==Further reading and resources==
 
{{WP|R_(programming_language)|'''R''' on Wikipedia}}
 
<div class="reference-box">[http://cran.r-project.org/doc/manuals/R-intro.html Introduction to '''R''' at CRAN]</div>
 
  
 +
{{Vspace}}
  
&nbsp;
 
 
[[Category:Applied_Bioinformatics]]
 
 +
[[Category:R]]
 
</div>
 
</div>

Latest revision as of 15:52, 8 May 2018

R tutorial


This is a hub for a first introduction to R, for students of one of my workshops or courses. I have subdivided the material into (somewhat) independent learning units that you can work through at your own pace, but in sequence.

The units have Deliverables and Prerequisites - please ignore these sections, they are for use in a more formal course setting.

You need to work through these units before you come to the workshop. There are two reasons:

  • (i) installation of software is very specific to your computer and we can't walk you through this in a room full of people. It would take so much time that we won't get anything else done.
  • (ii) When you are working with R - like with any computer language or natural language, the key is repetition, repetition, repetition. The more you prime yourself with this material, the more you will profit when we actually meet in class. I hope to see everyone radiant and elated, and not lost before we even begin. Let's do this!




The Units

 
Start with this


 
Install R and make sure everything works


 
Explore how to get R to work with data


 
The one unit that will save your ***, over and over again


 
First steps towards programming


 
Maybe optional? Meh, just work through this anyway, as time permits. It'll be on the exam.


 

Notes