Expected Preparations:
Keywords: R coding style; software development
Objectives:
This unit will …
Outcomes:
After working through this unit you …
Deliverables:
Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don’t overlook these.
Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.
Evaluation: NA: This unit is not evaluated for course marks.
Now that you have encountered some concepts of R programming, how do you write good R code?
What do we even mean by “good” R code? … This unit is one of those that you will need to come back to from time to time. It won’t make a lot of sense to you until you have actually encountered code, read it and written some yourself. So don’t try to memorize these principles, but review them every four weeks or so.
Proceed with caution:
Coding style is a volatile topic. Friendships have been renounced, eternal vows of marriage have been dissolved, stock-options have been lost, all over a disagreement about the One True Brace Style, or whether fetchSequenceFromPDB() is a good function name or whether it must be fetch.sequence.from.PDB() instead.
I am laying out coding rules below that reflect a few years of
experience. They work for me, they may not work for you. However:
Well written code helps the reader to understand the intent.
One of the goals of the coding style expressed below is that the code should be easy to read for people for whom R is not their first language, or even their language of choice. There are many things that R purists might do differently; however, those idioms are probably not well suited to a research collaboration in which people speak Python, C++, JavaScript and R all at the same time.
It should always be your goal to code as clearly and explicitly as possible. R has many complex idioms, and since it is a functional language that can generally insert functions anywhere into expressions, it is possible to write very terse, expressive code. Use this with discretion. Pace yourself, and make sure your reader can follow your flow of thought. You should aim for a generic coding style that can easily be translated to other languages if necessary, and easily understood by others whose background is in another language. And resist being crafty: more often than not the poor soul who will be confused by a particularly witty use of the language will be you, yourself, half a year later. There is an astute observation by Brian Kernighan that applies completely:
“Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.”
Paraphrased from The Elements of Programming Style, Chapter 2.
Use <- for assignment, not =.
Use = only when passing values into the arguments of functions.
Give constants names in all caps (e.g. MAXWIDTH). Never have “magic numbers” appear in your code.
Use for (i in seq_along(x)) {…} rather than for (i in 1:length(x)) {…}, because if x is NULL, 1:length(x) evaluates to c(1, 0) and the loop is executed twice, with undefined elements.
Don’t use attach().
Don’t use library(). Rather use the :: syntax to make it clear which function you mean [2]. That’s what package namespaces are for in the first place.
Bad:
library(igraph)
...
...
clu <- components(g)
Good:
clu <- igraph::components(g)
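The masking problem that :: avoids can be reproduced without installing any package. This is only a minimal sketch, assuming nothing beyond base R and the stats package that ships with it; the locally defined median() is a hypothetical stand-in for a same-named function from some other package.

```r
# A hypothetical function that happens to share its name with stats::median().
# Defining it here masks the original, just as loading a package that
# contains a same-named function would.
median <- function(x) {
  return("not the median you expected")
}

x <- c(1, 5, 3)

maskedResult   <- median(x)          # whichever definition is found first
explicitResult <- stats::median(x)   # always the function from package stats

print(maskedResult)
print(explicitResult)

rm(median)                           # clean up: remove the masking definition
```

With the package prefix, the result no longer depends on what else has been defined or loaded.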
We do understand why our functions should not have side-effects (other than the explicit intended effects of printing, plotting, or writing files). But there are subtle ways to change the global state that we need to remember - and avoid. Here’s an obvious one:
Don’t use <<- (global assignment) except in very unusual cases. Actually never.
Less obvious is:
Don’t use set.seed() in functions. set.seed() changes the state of the Random Number Generator (RNG), which is part of R’s global state. If this state is changed inside a function, it might result in a vastly smaller space of random numbers than you would expect. Even resetting the RNG is not a good idea: a repeatable script might require the RNG to be in a defined state, and if your function calls set.seed(NULL), your enclosing script is no longer repeatable. But of course, we need to be able to compute simulations repeatably. The only acceptable idiom is something like:
mySim <- function(N) {
  # do something random N times
  # (placeholder body: N uniform random numbers)
  result <- runif(N)
  return(result)
}

N <- 10                # example value
set.seed(112358)       # set RNG seed for repeatable randomness
x <- mySim(N)
set.seed(NULL)         # reset the RNG
Then you can comment out the lines, or change them to a different seed, or reset the RNG with set.seed(NULL) - everything is explicit.
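Why seeding inside a function shrinks the space of random numbers can be seen in a few lines. This is a sketch under simple assumptions: drawBad() and drawGood() are hypothetical names, and a single call to runif() stands in for a real simulation.

```r
# Bad: the function sets the seed on every call, so every call returns the
# same "random" number.
drawBad <- function() {
  set.seed(112358)
  return(runif(1))
}

# Good: the function leaves the RNG state alone.
drawGood <- function() {
  return(runif(1))
}

badDraws   <- replicate(5, drawBad())
nUniqueBad <- length(unique(badDraws))   # five draws, but only one value

set.seed(112358)                         # the calling script sets the seed
firstRun <- replicate(5, drawGood())
set.seed(112358)                         # same seed again ...
secondRun <- replicate(5, drawGood())    # ... same sequence again
set.seed(NULL)                           # reset the RNG

isRepeatable <- identical(firstRun, secondRun)
```

The second variant is still repeatable, but the seed is under the explicit control of the calling script.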
Don’t use save() and load()
This is a corollary to the principle not to change the global state. save() / load() saves one or more R objects and restores them to the same object name(s) they originally had. But there is no good way to know in advance what that object name is. If you already have an object by that name in your global workspace, it gets overwritten!
The sane alternative is to use saveRDS() / readRDS(). saveRDS() serializes, compresses and saves a single R object, and readRDS() restores the object and returns it as a return value, thus that value can be assigned:
saveRDS(aThing, file = "savedAthing.rds")
myNewThing <- readRDS("savedAthing.rds")
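The difference between the two approaches can be tried out safely with temporary files. The following is a sketch only; the object and file names are made up for illustration, and tempfile() is used so that nothing is written to the working directory.

```r
aThing <- c(1, 2, 3)

# save() / load(): the object comes back under its original name ...
rdataFile <- tempfile(fileext = ".RData")
save(aThing, file = rdataFile)
aThing <- "something else entirely"     # the name gets reused later on
load(rdataFile)                         # ... and silently overwrites the new value
wasOverwritten <- identical(aThing, c(1, 2, 3))

# saveRDS() / readRDS(): the caller decides which name to assign to
rdsFile <- tempfile(fileext = ".rds")
saveRDS(aThing, file = rdsFile)
myNewThing <- readRDS(rdsFile)
isSameObject <- identical(myNewThing, aThing)

unlink(c(rdataFile, rdsFile))           # remove the temporary files
```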
One of the general principles of writing clear, maintainable code is collocation. This means that information items that can affect each other should be viewable on the same screen. Spolsky makes a great argument for this, together with a few excellent examples; he also makes a case for a special kind of prefix notation for variable and function names that has a lot of merit.
Use section separator comments (e.g. # --- SECTION ---------) to structure your code.
(( 1 + 2 ) / 3 ) * 4
Always use braces { }, even if you write single-line if statements and loops where they are not legally required.
if and for are language keywords, not functions. Separate the following parenthesis from the keyword with a space.
Good:
if (silent) { ...
Bad:
if(silent) { ...
Good:
print(1 / 3, digits = 10)
if (! id %in% IDs) { ...
expressionProfiles[ , 1:3]
Bad:
print (1 / 3 ,digits=10) # space before the comma
if (!id %in% IDs) { ... # "!" hides against the parenthesis
expressionProfiles[, 1:3] # "," hides against the bracket
There are only two hard things in Computer Science: cache invalidation and naming things.
Phil Karlton [5]
.R
Sys.time()
and other system calls.)
Use camelCaseStyle for variable names; don’t use the confusing.dot.style or the rambling pothole_style [6]. But never hesitate to make exceptions if this makes your code more legible.
Don’t use c as a variable name since c() is a function.
Don’t use df as a variable name since df() is a function [8].
Use names in all caps (e.g. MAXWIDTH) for constants if they are defined at the top of a code module.
Specific naming conventions I like:
isValid, hasNeighbour … Boolean variables
findRange(), getLimits() … simple function names (verbs!)
initializeTable() … not initTab()
node … for one element; nodes … for more elements - you can then write code like: for (node in nodes) { ...
nPoints … for number-of
iPoints … for indices-of-points
isError … don’t use isNotError: avoid double negation
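Put together, these conventions let short pieces of code read almost like prose. The snippet below is only an illustration; the node labels and the edgeCounts data are hypothetical example values.

```r
nodes      <- c("A", "B", "C", "D")            # plural: more than one element
edgeCounts <- c(A = 2, B = 0, C = 1, D = 3)    # hypothetical example data

nNodes    <- length(nodes)                     # n...: number-of
iIsolated <- which(edgeCounts == 0)            # i...: indices-of

nConnected <- 0
nIsolated  <- 0
for (node in nodes) {                          # singular: one element
  hasNeighbour <- edgeCounts[node] > 0         # has...: Boolean variable
  if (hasNeighbour) {
    nConnected <- nConnected + 1
  } else {
    nIsolated <- nIsolated + 1
  }
}
```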
This may be controversial. The code block in an if (<condition>) {…} statement is evaluated if <condition> is TRUE. But what if we use a Boolean variable in the condition? Should we write:
if (<boolean variable>) { ...
or
if (<boolean variable> == TRUE) { ...
It depends. Remember - the goal is to make your code as explicit and readable as possible. If our variable is e.g. a, then …
if (a) { ...
… is not good. Better write …
if (a == TRUE) { ...
… and treat this as any other condition that needs to be evaluated. However - if you have given this a meaningful variable name in the first place, something like …
if (recordIsValid) { ...
… is great, whereas …
if (recordIsValid == TRUE) { ...
… is something that feels oddly self-contradictory. So best practice here depends on context. Myself, I more often than not end up writing if (something-something-that-is-boolean == TRUE) …, (and that’s not because I don’t understand how conditionals work).
Make the FALSE behaviour explicit. Always use an else at the end of a conditional to define what your code does if the condition is not TRUE. Otherwise your reader will wonder whether your code covers all cases. What if your code should do nothing in the FALSE case? Make that explicit:
if (a > b) {
  tmp <- b
  b <- a
  a <- tmp
} else {
  ; # does nothing
}
No need for much discussion. Follow the One True Bracing Style and we will all be happy. That includes you yourself. If you don’t immediately see why: read about indentation style on Wikipedia [10]. The rules: (i) opening brace on the same line as the function or control declaration; (ii) closing brace aligned with the declaration; (iii) braces mandatory, even if there is only one statement to execute. Sample:
if (length(x) > 1) {
  perm <- sample(x)
} else if (length(x) == 1) {
  perm <- x
} else {
  perm <- NULL
}
Pre-allocate your result objects to have the correct size if at all possible. Growing objects dynamically with c(), cbind(), or rbind() is much, much slower.
Use seq_along(), not length(), to compute the range of index variables. If the object you are iterating over has length zero (e.g. it is NULL, or it is the empty vector that grep() returns if the pattern was not found) then using …
for (idx in 1:length(myVector)) { ...
… will result in an iteration range of 1:0, since the length is zero, and the loop will be executed twice (with idx set to 1, then to 0) even though it should not have been executed at all. The correct and safe way to iterate is …
for (idx in seq_along(myVector)) { ...
… which will not execute, since seq_along() of a zero-length object is itself an empty vector, integer(0).
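The difference is easy to verify by counting iterations. A minimal sketch, in which the search pattern and the example strings are arbitrary choices for illustration:

```r
# grep() returns an empty integer vector if the pattern is not found
myVector <- grep("x", c("a", "b", "c"))

nUnsafe <- 0
for (idx in 1:length(myVector)) {     # 1:0 is the vector c(1, 0)
  nUnsafe <- nUnsafe + 1
}

nSafe <- 0
for (idx in seq_along(myVector)) {    # integer(0): nothing to iterate over
  nSafe <- nSafe + 1
}

print(nUnsafe)   # 2
print(nSafe)     # 0
```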
sort(x)
… sorts in increasing order, smallest first. But even though …
sort(x, decreasing = FALSE)
… does the same thing, the expression explicitly tells the reader what it is going to do. And that’s good.
Always explicitly return values from functions, never rely on the implicit behaviour that returns the last expression. This is not superfluous, it is explicit.
If there is nothing to return because the function is invoked for its side effects of writing a file or plotting a graph, write it into your code that nothing will be returned. This prevents you from accidentally returning the result of the last expression anyway (as the language does by default), and it keeps the reader from thinking you forgot something. The idiom is:
return(invisible(NULL))
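In context, the idiom might look like the following sketch. The function writeReport() and its arguments are hypothetical, chosen only to show a function that is called for its side effect of writing a file.

```r
writeReport <- function(x, fileName) {
  # Purpose: write the values of x to a text file, one value per line.
  # Value:   NULL, invisibly. The function is called for its side effect.
  writeLines(as.character(x), con = fileName)
  return(invisible(NULL))
}

reportFile   <- tempfile(fileext = ".txt")    # temporary file for the demo
returned     <- writeReport(c(1, 2, 3), reportFile)
linesWritten <- readLines(reportFile)
unlink(reportFile)                            # remove the temporary file
```

Because the return value is invisible, calling writeReport() at the console prints nothing, yet the value can still be assigned and checked.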
If possible, do not grow data structures dynamically, but create the whole structure with “empty” values, then assign values to its elements. This is much faster.
# This is really bad:
system.time({
  N <- 100000
  v <- numeric()
  for (i in 1:N) {
    v <- c(v, sqrt(i))
  }
})
#   user  system elapsed
# 16.718  11.258  27.988

# Even only writing directly to new elements is much, much better:
system.time({
  N <- 100000
  v <- numeric()
  for (i in 1:N) {
    v[i] <- sqrt(i)
  }
})
#   user  system elapsed
#  0.025   0.003   0.027

# That's about as fast as doing the same thing with a vapply() function.
# The fastest way is to preallocate memory, it actually comes close to the
# vectorized operation:
system.time({
  N <- 100000
  v <- numeric(N)
  for (i in seq_along(v)) {
    v[i] <- sqrt(i)
  }
})
#   user  system elapsed
#  0.008   0.000   0.007

# Using a vectorized operation is the fastest approach overall and the
# method of choice:
system.time({ v <- sqrt(1:100000) })
#   user  system elapsed
#  0.001   0.001   0.002
Don’t buy into the “apply is good, for-loop is bad” nonsense that you might encounter on the Web. Especially not if you need speed: a well-written for-loop will outperform an apply() statement, which internally uses a for-loop anyway. The reason we often use apply() is because we are following a functional programming idiom, not because there is something magical and exalted about the apply() function. It’s usually a bit subtle which idiom is “better” at any given time. But apply() is NOT trivial for a Python or C programmer, whereas anyone can read a for-loop. Moreover, you can explicitly assign and monitor intermediate statements, which is important when developing, validating, and debugging.
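To make the comparison concrete, here is the same small computation in both idioms. This is a sketch only: it uses vapply() as a representative of the apply family, and the input values are arbitrary.

```r
x <- c(4, 9, 16, 25)

# for-loop idiom: preallocate, then fill. Every intermediate value can be
# inspected or printed while developing and debugging.
rootsLoop <- numeric(length(x))
for (i in seq_along(x)) {
  rootsLoop[i] <- sqrt(x[i])
}

# functional idiom: the same computation expressed with vapply()
rootsApply <- vapply(x, sqrt, FUN.VALUE = numeric(1))

isSame <- identical(rootsLoop, rootsApply)
```

Both versions produce the same vector; which one reads better depends on who will be reading it.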
End every code file with an explicit
# [END]
comment. This way you can be sure it was copied or saved completely and nothing has been inadvertently omitted. This is important in teamwork. However, if even ONE team member does not adhere to this, it is useless for EVERYONE.
If in doubt, ask! If anything about this content is not clear to you, do not proceed but ask for clarification. If you have ideas about how to make this material better, let’s hear them. We are aiming to compile a list of FAQs for all learning units, and your contributions will count towards your participation marks.
[END]
[1] I’m serious: I have reformatted major pieces of code more than once after learning of a better approach, and if that creates better code it is very satisfying.
[2] It is happening more and more frequently that functions in different packages we load have the same name. Then our code’s behaviour will depend on the order in which the libraries were loaded. Evil.
[3] Separating operators with spaces is especially important for the assignment operator <-. Consider this: myPreciousData < -2 returns a vector of TRUE and FALSE, depending on whether the values in myPreciousData are less than -2. But myPreciousData<-2 replaces the entire object with the single number 2! I’m not even making this up - happened to a student in a workshop I taught.
[4] The = sign is a bit of a special case. When I write e.g. a plot statement, or construct a dataframe, I prefer not to use spaces if the expression ends up all on one line, but to use spaces when the arguments are on separate lines.
[5] Philip Lewis Karlton was an engineer with Netscape who was instrumental in shaping the early Internet. He sadly died in 1997. As is unusual for Internet quotes, it is corroborated that he actually said what is quoted here - though there are many misattributions, often to Donald Knuth. For a complementary perspective, see here.
[6] This is not a random opinion: it is based on the observation that it’s easier to keep within the 80-character line limit. Also see the linked articles.
[7] In my opinion, base R uses far too many function names that would be useful for variables. But we’re not going to change that. So I often just prefix my variable names with my or this, e.g. myDf, thisLength etc.
[8] Here are more names that may seem attractive as variable names but that are in fact functions in the base R package and thus may cause confusion: all(), args(), attr(), beta(), body(), col(), date(), det(), diag(), diff(), dim(), dir(), dump(), eigen(), file(), files(), gamma(), kappa(), length(), list(), load(), log(), max(), mean(), min(), open(), q(), raw(), row(), sample(), seq(), sub(), summary(), table(), type(), url(), vector(), and version(). I’m sure you get the idea - composite names of the type proposed above in camelCase are usually safe.
[9] “constants” - i.e. variables that are not supposed to change; R has no “constants” in the actual sense.
[10] If you noticed that this is the third time I am mentioning this, you have been paying attention. Redundant? Emphasis!