Expected Preparations:
Keywords: Notation; installing R and RStudio; packages; first experiments
Objectives:
This unit will …
Outcomes:
After working through this unit you …
Deliverables:
Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don’t overlook these.
Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.
Evaluation: NA: This unit is not evaluated for course marks.
This unit works through the installation of R and RStudio and introduces R’s packages of additional functions.
The R statistics environment and programming language is an exceptionally well engineered, free (as in free speech) and free (as in free beer) platform for data manipulation and analysis. The number of functions that are included by default is large, and there is a very large number of additional, community-generated analysis modules that can be simply imported from dedicated sites (e.g. the Bioconductor project for molecular biology data), or via the CRAN network. Whatever function is not available can be easily programmed. The ability to filter and manipulate data to prepare it for analysis is an absolute requirement in research-centric fields such as ours, where the strategies for analysis are constantly shifting and prepackaged solutions become obsolete almost faster than they can be developed. Besides numerical analysis, R has very powerful and flexible functions for plotting graphical output.
One can’t learn a language in a single day.
The key to success is constant effort, every day. Such continuous engagement will quickly bring the principles into active memory. As you progress, make sure you understand every single step. What counts is not your submissions, but your learning. As long as there is even one line of code that you do not fully understand, your work is not completed. Taking shortcuts will just make matters worse later on.
In this tutorial, I use specific notation and formatting to mean different things.
If you see footnote numbers¹, there is relevant text in the side bar.
This is normal text for explanations. It is written in a proportionally spaced font.
Code formatting is for code examples, file- and function names, directory paths etc.
Code is written in a monospaced font². What makes a good monospaced font? Readability! Above all, the number one, the lowercase l, and the uppercase I have to be clearly distinguishable, and also the number zero and the uppercase O. Compare these characters in serif/proportional: “1lI-0O”, sans-serif/proportional “1lI-0O”, and monospace 1lI-0O.
Bold emphasis and underlining are to mark words as particularly important.
Sometimes I underline examples of the right way to do something in green.
… and examples of the wrong way to do something may be underlined red.
Task…
Tasks and exercises are described in boxes formatted like this. If you want to profit from the material, the tasks are not optional³. If you have problems, never hesitate to contact me, or discuss the issue on the mailing list. Don’t simply continue. All material builds on previous material.
“Syntactic variables”: When I use notation like <Year> in instructions, you need to type the year, the whole year and nothing but the year (i.e. the four digits 2021). You never type the angle brackets! I use the angle brackets only to indicate that you should not type “Year” literally, but substitute the correct value. You might encounter this notation as <path>, <filename>, <firstname lastname> and similar. To repeat: if the instructions say <your name> … and your name is Elcid Barrett, you type Elcid Barrett – and not your name or <Elcid Barrett>. (Oh the troubles I’ve seen … !)
The sample code on this page sometimes copies text from the console, and sometimes shows the actual commands only. The > character at the beginning of the line is always just R’s input prompt. It tells you that you can type something now, but you never actually type > at the beginning of a line. If you read:
> getwd()
… you actually need to type:
getwd()
If a line starts with [1] or similar, this is R’s output on the console.⁴ The # character marks the following text as a comment, which is not executed by R. These are lines that you do not type. They are program output, or comments, not commands.
Character | Notes |
---|---|
/ | This is a forward-slash. It leans forward in the reading direction. You often find it as a file-path separator, to close a tag in HTML, or as the division operator in math. |
\ | This is a backslash. It leans backward in the reading direction. It is often also called an “escape character”, since it lets some characters escape being interpreted according to some special meaning, so that their literal meaning is used instead (or vice versa, it depends). You will find them inside strings and (Horrors!) regular expressions. |
( ) | These are parentheses. They identify functions and collect their parameters. |
[ ] | These are (square) brackets. They are used for arrays. |
< > | These are angle brackets. Individually, they just mean less-than, and greater-than. Together, they enclose a “tag” in HTML. |
{ } | These are (curly) braces. In code, they delineate blocks of code that are executed together, for example in if conditional expressions or for loops. |
" | This, and only this is a quotation mark or double quote. All of these are not: “ ” „ « » . Those will break your code. Especially the first two are often automatically inserted by MSWord and hard to distinguish.⁵ Double quotes are ubiquitous in code, where they delineate strings (“text”). |
' | This, and only this is a single quote. All of these are not: ‘ ’ ‚ ‹ › . Those will break your code. Especially the first two are often automatically inserted by MSWord and hard to distinguish. Single quotes are also used to delineate strings in code. |
MSWord is not useful as a code editor.
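A short sketch may help to see how quotes and the backslash behave in R strings. The string contents and the Windows-style path are made-up examples, not values used elsewhere in the course.

```r
s1 <- "Nestor notabilis"        # straight double quotes delineate a string
s2 <- 'Nestor notabilis'        # straight single quotes do the same
identical(s1, s2)               # both notations create the same string

winPath <- "C:\\Users\\elcid"   # inside a string, \\ stands for ONE literal backslash
cat(winPath, "\n")              # cat() writes the string as it is actually stored
nchar(winPath)                  # counts the stored characters, not the typed ones

nchar("a\tb")                   # \t is a single character: a tab
```

If you paste a line containing curly quotes (“ ”) instead of straight ones, R will stop with a syntax error, which is a useful first thing to check when copied code does not run.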
In this section we discuss how to download and install the software, how to configure an R session and how to work in the R environment.
Task…
The program should open a window – this window is called the “R console” – and greet you with its input prompt, awaiting your input:
>
Once you see that R is running correctly, you may quit the program for now.
RStudio is a free IDE (Integrated Development Environment) for R. RStudio is a wrapper⁷ for R: as far as basic R is concerned, all the underlying functions are the same and only the user interface is different (and there are a few additional functions that are very useful, e.g. for managing projects).
Here is a small list of differences between R and RStudio.
pros (quite significant):
cons (all very minor):
Task…
getwd()
This prints out the path of the current working directory. Make a (mental) note where this is. When working on a project, we always need to make sure this default directory is changed to the right project directory.
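As a minimal sketch of how the working directory can be inspected and changed from the console, the following lines move into R's temporary directory and back again. Using tempdir() as the target is only an assumption for the sake of a runnable example; in a real project you would substitute the path of your project directory.

```r
startDir <- getwd()    # remember the current working directory
print(startDir)

setwd(tempdir())       # change to a scratch directory that exists in every R session
getwd()                # confirm that the working directory has changed

setwd(startDir)        # change back to where we started
getwd()
```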
R has many powerful functions built in, but one of its greatest features is that it is easily extensible. Extensions have been written by legions of scientists for many years, most commonly in the R programming language itself, and made available through CRAN (the Comprehensive R Archive Network) or through the Bioconductor project.
A package is a collection of code, documentation and (often) sample data. To use packages, you need to install the package (once). Installing a package downloads its code and assets from a repository and stores it in an appropriate location on your computer. You can then use all of the package’s functions in one of two ways:
By prefixing the function name with the package name and two colons (e.g. stringr::str_trim(" Nestor notabilis ")). That is the preferred way since your code then explicitly⁸ shows you which package a function comes from.
By first loading the package with the library() command, e.g. library(stringr), and then using all of its functions without a prefix (e.g. str_trim(" Nestor notabilis ")). That’s less typing, and it is definitely the way you will find code written all over the internet. The problem is that this is less explicit and much harder to understand, troubleshoot and maintain, and it may actually be the source of insidious bugs that depend on the loading order of packages.
In the teaching code for this course, I use the package::function() idiom wherever possible.⁹
To repeat:
install.packages("<package-name>") downloads the package files from CRAN and places them in the appropriate location on your computer.
packagename::function() is the preferred idiom to use functions from a package.
Alternatively, load the package with library(packagename) (note: no quotation marks in this case). Then you can use the functions simply by typing function().
You can get an overview of installed packages on your computer, and which ones have been loaded, by opening the Packages tab in the Files pane (lower right) of RStudio.
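The two ways of calling a package function can be compared in a small sketch. It uses the tools package only because tools is distributed with R and needs no installation; the file name "mbp1.fasta" is a hypothetical example. The last three lines are console counterparts to the Packages tab.

```r
# Way 1: explicit package prefix, nothing needs to be loaded first
ext1 <- tools::file_ext("mbp1.fasta")

# Way 2: load the package, then call the function without a prefix
library(tools)
ext2 <- file_ext("mbp1.fasta")

identical(ext1, ext2)                  # both calls run the same function

head(rownames(installed.packages()))   # names of the first few installed packages
search()                               # packages currently attached via library()
loadedNamespaces()                     # packages loaded, attached or not
```

Note that a package used only through the :: prefix shows up in loadedNamespaces() but not in search(), which is one reason the prefix idiom avoids name clashes between packages.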
Task…
To explore packages on CRAN:
Some basic functions that deal with biological sequences are included in the seqinr package, although it is getting dated now.
Read the help page for vignette. Note that there is a command to extract R sample code from a vignette, to experiment with it.
> ?vignette
Now download and install seqinr from the closest CRAN mirror and load it for this session. Then explore some functions.
> ??install
> ?install.packages
> install.packages("seqinr") # Note: the parameter is a quoted string!
also installing the dependency ‘ade4’
trying URL 'https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.2/ade4_1.7-2.tgz'
Content type 'application/x-gzip' length 3365088 bytes (3.2 MB)
==================================================
downloaded 3.2 MB
trying URL 'https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.2/seqinr_3.1-3.tgz'
Content type 'application/x-gzip' length 2462893 bytes (2.3 MB)
==================================================
downloaded 2.3 MB
The downloaded binary packages are in
/var/folders/mx/ld0hdst54jjf11hpcjh8snfr0000gn/T//Rtmpsy5GMx/downloaded_packages
> library(help="seqinr") # This prints information on the package
> library(seqinr) # This loads the installed package into memory
> ls("package:seqinr") # This lists the functions and other objects in the package
[1] "a" "aaa" "AAstat"
[4] "acnucclose" "acnucopen" "al2bp"
[...]
[205] "where.is.this.acc" "words" "words.pos"
[208] "write.fasta" "zscore"
> ?seqinr::a
> seqinr::a("Tyr")
[1] "Y"
> seqinr::words(3, c("A", "G", "C", "U"))
[1] "AAA" "AAG" "AAC" "AAU" "AGA" "AGG" "AGC" "AGU" "ACA" "ACG" "ACC" "ACU" "AUA" "AUG"
[15] "AUC" "AUU" "GAA" "GAG" "GAC" "GAU" "GGA" "GGG" "GGC" "GGU" "GCA" "GCG" "GCC" "GCU"
[29] "GUA" "GUG" "GUC" "GUU" "CAA" "CAG" "CAC" "CAU" "CGA" "CGG" "CGC" "CGU" "CCA" "CCG"
[43] "CCC" "CCU" "CUA" "CUG" "CUC" "CUU" "UAA" "UAG" "UAC" "UAU" "UGA" "UGG" "UGC" "UGU"
[57] "UCA" "UCG" "UCC" "UCU" "UUA" "UUG" "UUC" "UUU"
Task…
The fact that these methods work shows that the package has been downloaded and installed, that its functions are now available with the package name prefix, and that any datasets it contains can be loaded. Just like many other packages, seqinr comes with a number of data files.
Try:
?data
data(package="seqinr")            # list the available data
data(aaindex, package="seqinr")   # load "aaindex"
?aaindex                          # what is this?
aaindex$FASG890101                # two of the indices ...
aaindex$PONJ960101
# Let's use the data: plot amino acid single-letter codes by hydrophobicity
# and volume. The values come from the dataset. Copy and paste the commands,
# we'll discuss them in detail later.
plot(aaindex$FASG890101$I,
     aaindex$PONJ960101$I,
     xlab="hydrophobicity", ylab="volume", type="n")
text(aaindex$FASG890101$I,
     aaindex$PONJ960101$I,
     labels=seqinr::a(names(aaindex$FASG890101$I)))
# assign the sequence for the Mbp1 transcription factor from
# https://www.uniprot.org/uniprotkb/P39678/entry to a variable
mbp1 <- "
MSNQIYSARY SGVDVYEFIH STGSIMKRKK DDWVNATHIL KAANFAKAKR TRILEKEVLK
ETHEKVQGGF GKYQGTWVPL NIAKQLAEKF SVYDQLKPLF DFTQTDGSAS PPPAPKHHHA
SKVDRKKAIR SASTSAIMET KRNNKKAEEN QFQSSKILGN PTAAPRKRGR PVGSTRGSRR
KLGVNLQRSQ SDMGFPRPAI PNSSISTTQL PSIRSTMGPQ SPTLGILEEE RHDSRQQQPQ
QNNSAQFKEI DLEDGLSSDV EPSQQLQQVF NQNTGFVPQQ QSSLIQTQQT ESMATSVSSS
PSLPTSPGDF ADSNPFEERF PGGGTSPIIS MIPRYPVTSR PQTSDINDKV NKYLSKLVDY
FISNEMKSNK SLPQVLLHPP PHSAPYIDAP IDPELHTAFH WACSMGNLPI AEALYEAGTS
IRSTNSQGQT PLMRSSLFHN SYTRRTFPRI FQLLHETVFD IDSQSQTVIH HIVKRKSTTP
SAVYYLDVVL SKIKDFSPQY RIELLLNTQD KNGDTALHIA SKNGDVVFFN TLVKMGALTT
ISNKEGLTAN EIMNQQYEQM MIQNGTNQHV NSSNTDLNIH VNTNNIETKN DVNSMVIMSP
VSPSDYITYP SQIATNISRN IPNVVNSMKQ MASIYNDLHE QHDNEIKSLQ KTLKSISKTK
IQVSLKTLEV LKESSKDENG EAQTNDDFEI LSRLQEQNTK KLRKRLIRYK RLIKQKLEYR
QTVLLNKLIE DETQATTNNT VEKDNNTLER LELAQELTML QLQRKNKLSS LVKKFEDNAK
IHKYRRIIRE GTEMNIEEVD SSLDVILQTL IANNNKNKGA EQIITISNAN SHA"
mbp1 <- gsub("\\s", "", mbp1) # remove all whitespace
mbp1 <- unlist(strsplit(mbp1, "")) # split into a vector: one letter per element
x <- seqinr::AAstat(mbp1) # compute frequency statistics
barplot(sort(x$Compo),
        cex.names = 0.6)            # plot number of occurrences
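To see what the two preparation steps do before a sequence is analysed, here is a sketch on a much shorter string: the first 30 residues of the sequence above. It deliberately swaps in base R's table() for the counting step instead of seqinr::AAstat(), so that it runs without any package installed. The variable names s, v and counts are arbitrary.

```r
s <- "
MSNQIYSARY SGVDVYEFIH
STGSIMKRKK"

s <- gsub("\\s", "", s)        # remove all whitespace: blanks and line breaks
nchar(s)                       # only the residue letters remain

v <- unlist(strsplit(s, ""))   # split into a vector: one letter per element
head(v)

counts <- sort(table(v))       # count each letter, sort by frequency
counts
```

Passing counts to barplot() gives a plot analogous to the one above, only for this short fragment.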
The function requireNamespace() is useful because it does not produce an error when a package has not been installed. It simply returns TRUE if successful or FALSE if not. Therefore one can use the following code idiom in R scripts to make sure the package exists, but without having to download the package every time the script is called. You will find this code idiom quite often in our course scripts.
if (! requireNamespace("seqinr", quietly=TRUE)) {
  install.packages("seqinr")
}
Here the single exclamation mark is a logical NOT operator.
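The two ingredients of this idiom can be tried separately. In the sketch below, stats is used because it is distributed with R, and "noSuchPackage12345" is a made-up name that is assumed not to exist on your computer.

```r
!TRUE     # logical NOT turns TRUE into FALSE
!FALSE    # ... and FALSE into TRUE

# requireNamespace() reports whether a package can be loaded
hasStats <- requireNamespace("stats", quietly = TRUE)
hasBogus <- requireNamespace("noSuchPackage12345", quietly = TRUE)

hasStats     # TRUE: the package is available
hasBogus     # FALSE: not installed, but no error is raised
!hasBogus    # TRUE: this is the condition under which install.packages() would run
```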
Note that the Bioconductor project has its own installation system, the BiocManager::install() function. It is explained here and we will encounter it later in the course.
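As a rough sketch of how the same idiom might carry over to Bioconductor: BiocManager is itself a CRAN package, so it can be installed with install.packages() and then used to install Bioconductor packages. The choice of Biostrings here is only an assumed example of a Bioconductor package; the course material may use different packages.

```r
# Install the BiocManager package from CRAN if it is not yet available
if (! requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

# Then use it to install a Bioconductor package if that is not yet available
if (! requireNamespace("Biostrings", quietly = TRUE)) {
  BiocManager::install("Biostrings")
}
```

Both steps need an internet connection the first time they run, and do nothing on later runs.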
Note, just to mention it at this point: to install packages that are not on CRAN or Bioconductor, you need the devtools package. But this is not a good idea unless you really, really know that you can trust the source.
One of the challenges of working with R is the overabundance of options. CRAN has over 18,000 packages and Bioconductor has over 2,000 more. How can you find ones that are useful to your work? There’s actually a package to help you do that, the sos package on CRAN. Try this:
if (! requireNamespace("sos", quietly=TRUE)) {
  install.packages("sos")
}
library(help = sos) # basic information
browseVignettes("sos") # available vignettes
sos::findFn("moving average")
Alternatively …
Question 1
What is the purpose of this code?
if (! requireNamespace("seqinr", quietly = TRUE)) {
  install.packages("seqinr")
}
Why not just write instead:
install.packages("seqinr")
Carey, Maureen A and Jason A Papin. (2018). “Ten simple rules for biologists learning to program”. PLOS Computational Biology 14(1):e1005871. [PMID: 29300745] [DOI: 10.1371/journal.pcbi.1005871]
If in doubt, ask! If anything about this content is not clear to you, do not proceed but ask for clarification. If you have ideas about how to make this material better, let’s hear them. We are aiming to compile a list of FAQs for all learning units, and your contributions will count towards your participation marks.
Improve this page! If you have questions or comments, please post them on the Quercus Discussion board with a subject line that includes the name of the unit.
[END]
¹ … when the page-width is too narrow, footnotes will be hidden. They will appear when you click on the number.
² Proportional fonts are for elegant document layout. Monospaced fonts are needed to properly align characters in columns. For code and sequences, we always use monospaced font. Code editors always use monospaced fonts, but since I need to eMail a lot of code and sequences, I have also set my eMail client to use monospaced font by default. I highly encourage you to do the same.
³ But if you ever feel the tasks are irrelevant, “make-work”, or outdated, do let me know so we can address this.
⁴ [1] means: the following is the first element of a vector - and this is often the only element.
⁵ Never, ever edit code in MS Word. Use R or RStudio. Actually, don’t use notepad or TextEdit either.
⁶ You can also use one of the mirror sites, if CRAN is down - for example the mirror site at the University of Toronto. A choice of mirror sites is listed on the R-project homepage.
⁷ A “wrapper” program uses another program’s functionality in its own context. RStudio is a wrapper for R since it does not duplicate R’s functions; instead, it runs the actual R in the background.
⁸ Writing code as explicitly as possible is a mantra of this course. This is so important for maintenance and troubleshooting.
⁹ Unfortunately, it is not always possible to avoid using the library() function, but even after you have had to load a package, you can always still use the package::function() idiom in your code.