Difference between revisions of "RPR-Coding style"

Revision as of 01:33, 27 December 2018

R Coding Style

(R coding style; software development)

Abstract:

Now that you have encountered some concepts of R programming, how do you write good R code?

Objectives:
This unit will ...

... introduce tried and proven principles of writing expressive and maintainable R code.

Outcomes:
After working through this unit you ...

... can identify poor practice in formatting R code;
... know better;
... begin incorporating these principles into your own practice.

Deliverables:

Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.
Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.

Prerequisites:
This unit builds on material covered in the following prerequisite units:

RPR-Plotting (Introduction to R Plots)

1 Contents
2 General
3 Layout
- 3.1 Design and granularity
4 Headers
5 Sections
- 5.1 Parentheses and Braces
- 5.2 Spaces
6 Names
7 Conditionals
8 Indent Style
- 8.1 Indentation of long function declarations
9 Loops
10 Functions
11 Efficiency
12 # [END]
13 Self-evaluation
14 Notes
15 Further reading, links and resources

Warning: Coding style is a volatile topic. Friendships have been renounced, eternal vows of marriage have been dissolved, stock options have been lost, all over a disagreement about the One True Brace Style, or whether fetchSequenceFromPDB() is a good function name or not. I am laying out coding rules below that reflect a few years of experience. They work for me, they may not work for you.

However:

If you are taking one of my workshops, I recommend that you follow these rules: I write this way, and we will find it easier to communicate if you do too.
If you are collaborating on a software project, these rules embody the standard across the project, and I will not check-in code that deviates. Here, consistency is key; but if you think you have a better approach, you only need to convince me and we will change the rule and apply it throughout the codebase^[1].
If you are taking one of my courses, you may lose marks if you do not adhere to these standards. Of course, following rules must not be done blindly – we are training future collaborators, not parrots – but you need to write in the spirit of the one rule we all agree on:

Well written code helps the reader to understand the intent.

General

It should always be your goal to code as clearly and explicitly as possible. R has many complex idioms, and since it is a functional language that can generally insert functions anywhere into expressions, it is possible to write very terse, expressive code. Use this with discretion. Pace yourself, and make sure your reader can follow your flow of thought. You should aim for a generic coding style that can easily be translated to other languages if necessary, and easily understood by others whose background is in another language. And resist being crafty: more often than not the poor soul who will be confused by a particularly witty use of the language will be you, yourself, half a year later. There is an astute observation by Brian Kernighan that applies completely:

"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it."

Never sacrifice being explicit for saving on keystrokes. Code is read much more often than it is written!

Use lots of comments. Don't describe what the code does, but explain why.
Indent comment "#"-characters to align with the expressions in a block.
Use only <- for assignment, not =
...but do use = when passing values into the arguments of functions.
Don't use <<- (global assignment) except in very unusual cases. Actually never.
Define global variables at the beginning of the code, use all caps variable names (MAXWIDTH) for such parameters. Never have "magic numbers" appear in your code.
If such variables are meant to be truly global use options() to set them.

Don't use attach().
Always use for (i in seq_along(x)) {...} rather than for (i in 1:length(x)) {...}: if x is NULL, 1:length(x) evaluates to 1:0 and the loop is executed twice, with invalid index values.
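A minimal sketch of the rules above on assignment and named constants. The names MAXWIDTH, myProjectMaxWidth and truncateLabel() are hypothetical and chosen only for illustration.

```r
# Named constant instead of a "magic number"; defined once, at the top.
MAXWIDTH <- 20

# A truly global setting can be stored with options() and read back with
# getOption(). The option name is an assumption made for this sketch.
options(myProjectMaxWidth = MAXWIDTH)

truncateLabel <- function(label, width = getOption("myProjectMaxWidth")) {
    # Why: labels longer than the agreed width would overlap in a plot.
    shortLabel <- substr(label, start = 1, stop = width)
    return(shortLabel)
}

truncateLabel("a very long label that needs to be shortened")
```

Note that <- is used for every assignment, while = appears only where values are passed into function arguments.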

Layout

Limit yourself to 80 characters per line.
Don't use semicolons to write more than one expression on a line.

Design and granularity

Don't repeat code. Use functions instead.
Don't repeat code. If you feel the urge to type code more than once, that's how you know you should break up the code into functions.
Don't repeat code. I'm repeating this for emphasis.
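As an illustration of this principle, here is a small sketch in which a repeated expression is factored into a function. The function name rescaleToUnit() and the example data are made up for this purpose.

```r
# Repetitive version (avoid): the same expression is typed once per vector.
#   heightsScaled <- (heights - min(heights)) / (max(heights) - min(heights))
#   weightsScaled <- (weights - min(weights)) / (max(weights) - min(weights))

# Factored version: the logic lives in one place and has a name.
rescaleToUnit <- function(x) {
    # Linearly map the values of x onto the interval [0, 1].
    xRange <- range(x)
    scaled <- (x - xRange[1]) / (xRange[2] - xRange[1])
    return(scaled)
}

heights <- c(150, 160, 170)
weights <- c(50, 75, 100)
heightsScaled <- rescaleToUnit(heights)
weightsScaled <- rescaleToUnit(weights)
```

If the scaling ever needs to change, it now changes in exactly one place.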

One of the general principles of writing clear, maintainable code is collocation. This means that information items that can affect each other should be viewable on the same screen. Spolsky makes a great argument for this, together with a few excellent examples; he also makes a case for a special kind of prefix notation for variable and function names that has a lot of merit.

If the code for a function does not fit on approximately one printed page, you should probably break it up further.

If your loops or conditionals are nested more than three levels deep, you should rethink the logic.

Headers

Give your script files headers that state purpose, author, date and version information, and note bugs and issues.
Give your functions headers that describe purpose, parameters (including required datatypes), and return values. Callers should be able to work with the function without having to read the code.
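One possible layout for such headers is sketched below. The field labels, the file name tempConversion.R and the function celsiusToFahrenheit() are assumptions for illustration, not a prescribed template.

```r
# tempConversion.R
#
# Purpose:  Convert temperatures between scales.
# Version:  0.1
# Date:     <date>
# Author:   <author>
#
# Issues:   Only Celsius to Fahrenheit is implemented.

celsiusToFahrenheit <- function(celsius) {
    # Purpose:
    #     Convert temperatures from degrees Celsius to degrees Fahrenheit.
    # Parameters:
    #     celsius: numeric vector of temperatures in degrees Celsius.
    # Value:
    #     numeric vector of the same length, in degrees Fahrenheit.
    fahrenheit <- (celsius * (9 / 5)) + 32
    return(fahrenheit)
}

celsiusToFahrenheit(c(0, 100))
```

A caller can use the function from the header alone, without reading its body.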

Sections

Use separators (# --- SECTION -----------------) to structure your code.

Parentheses and Braces

In mathematical expressions, always use parentheses to define priority explicitly. Never rely on implicit operator priority. Example: ((1 + 2) / 3) * 4
Always use braces {}, even if you write single-line if statements and loops.
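A short sketch of why implicit operator priority is risky in R: the colon operator binds more tightly than binary minus. The variable names are illustrative only.

```r
n <- 5

# Deliberately written without parentheses: this is (1:n) - 1,
# not 1:(n - 1).
implicitRange <- 1:n-1

# With explicit parentheses the intent is unambiguous.
explicitRange <- 1:(n - 1)

# Braces even around a single-line body.
if (length(implicitRange) != length(explicitRange)) {
    message("The two ranges differ.")
}
```

Here implicitRange holds the five values 0 to 4, whereas explicitRange holds the four values 1 to 4.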

Spaces

if and for are language keywords, not functions. Separate the following parenthesis from the keyword with a space.

Good:

if (silent) { ...

Bad:

if(silent) { ...

Always separate operators and arguments with spaces.^[2]^[3]
Never separate function names and their following parentheses with spaces.
Always use a space after a comma, and never before a comma.

Good:

print(1 / 3, digits = 10)
if (! id %in% IDs) { ...

Bad:

print (1 / 3 ,digits=10)
if (!id %in% IDs) { ...
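The following sketch shows how a missing space can change the meaning of an expression that involves the characters < and -. The variable names are hypothetical.

```r
myPreciousData <- c(-5, 0, 5)

# With spaces: a comparison against -2; the data is unchanged.
isBelow <- myPreciousData < -2

# Without spaces the parser reads "<-" as assignment: the object is replaced.
myCopy <- myPreciousData
myCopy<-2
```

After running this, isBelow is a logical vector with one value per element, while myCopy no longer holds the original data but only the single number 2.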

Names

There are only two hard things in Computer Science: cache invalidation and naming things.

- Phil Karlton^[4]

Use informative and specific filenames for code sources; give them the extension .R
Periods have a syntactic meaning in object-oriented classes. I consider their use in normal variable names wrong, even though this is not a syntax error and many R library functions have such names (e.g. Sys.time() and other system calls).
Create names so that related variables or functions are alphabetically sorted together; code autocompletion will then be more useful.
Use the concise camelCaseStyle for variable names, don't use the confusing.dot.style or the rambling pothole_style^[5].
Don't abbreviate argument names when calling functions. You can, but you shouldn't.

Never reassign reserved words^[6].
Don't use c as a variable name since c() is a function.
Don't call your data frames df since df() is a function.^[7]

Name length should be commensurate with the scope of a variable. Short names for local scope. More explicit names for global scope. I often write global parameters in ALL CAPS: MAXWIDTH if they are defined at the top of a code module.

Specific naming conventions I like:

isValid, hasNeighbour ... Boolean variables

findRange(), getLimits() ... simple function names (verbs!)

initializeTable() ... not initTab()

Use plurals to good advantage: node ... for one element; nodes ... for more elements. You can then write code like:

for (node in nodes) { ...

nPoints ... for number-of-points

iPoints ... for indices-of-points

isError ... don't use isNotError: avoid double negation
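A brief sketch that puts these conventions to work in one place. The data and all names are invented for illustration.

```r
nodes <- c("A", "B", "C")          # plural: several elements
nNodes <- length(nodes)            # n...: number of
iNodes <- seq_along(nodes)         # i...: indices of
isVisited <- logical(nNodes)       # is...: Boolean, initially all FALSE

for (iNode in iNodes) {
    node <- nodes[iNode]           # singular: one element
    isVisited[iNode] <- (node != "B")
}
```

The prefixes make it clear at a glance whether a variable holds a count, an index, a flag, or the elements themselves.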

Conditionals

This may be controversial. The code block in an if (<condition>) {...} statement is evaluated if <condition> is TRUE. But what if we use a boolean variable in the condition? Should we write:

if (<boolean variable>) { ...

or

if (<boolean variable> == TRUE) { ...

It depends. Remember - the goal is to make your code as explicit and readable as possible. If our variable is e.g. a, then ...

if (a) { ...

... is not good. Better write ...

if (a == TRUE) { ...

... and treat this as any other condition that needs to be evaluated. However - if you have given this a meaningful variable name in the first place, something like ...

if (recordIsValid) { ...

... is much better. I often write an explicit comparison to TRUE, and it's not because I don't understand how conditionals work.
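Both variants can be seen side by side in the following sketch. The threshold MINLENGTH and the other names are hypothetical.

```r
sequenceLength <- 120
MINLENGTH <- 50                    # hypothetical threshold

# An uninformative name benefits from the explicit comparison ...
a <- (sequenceLength >= MINLENGTH)
if (a == TRUE) {
    status <- "accepted"
}

# ... whereas a descriptive name reads like a sentence on its own.
recordIsValid <- (sequenceLength >= MINLENGTH)
if (recordIsValid) {
    status <- "accepted"
} else {
    status <- "rejected"
}
```

Both conditions evaluate identically; the difference lies only in how much the reader has to look up.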

Indent Style

No need for much discussion. Follow the One True Brace Style and we will both be happy. If you don't immediately see why: read about indent style here.

Indentation of long function declarations

Use spaces to align repeating parts of code, so errors become easier to spot.
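One way this might look for a function with several arguments is sketched below. The function makeRecord(), its arguments and the example values are made up.

```r
# Aligned arguments: an odd one out is easy to spot.
makeRecord <- function(id        = NA_character_,
                       residues  = NA_character_,
                       nResidues = NA_integer_,
                       isValid   = FALSE) {
    record <- list(id        = id,
                   residues  = residues,
                   nResidues = nResidues,
                   isValid   = isValid)
    return(record)
}

myRecord <- makeRecord(id        = "P12345",
                       residues  = "MKV",
                       nResidues = 3L,
                       isValid   = TRUE)
```

Because the = signs line up, a missing or misspelled argument stands out visually.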

Loops

Pre-allocate your result objects to have the correct size if at all possible. Growing objects dynamically with c(), cbind(), or rbind() is much, much slower.

Use seq_along(), not length(), to compute the range of index variables. If the object you are iterating over has length zero (e.g. it is NULL, or it is the empty vector that grep() returns if the pattern was not found) then using ...

for (idx in 1:length(myVector)) { ...

... will result in an iteration range of 1:0 since the length is zero. 1:0 is the vector c(1, 0), so the loop will be executed twice even though it should not have been executed at all. The correct and safe way to iterate is ...

for (idx in seq_along(myVector)) { ...

... which will not execute since seq_along(NULL) is an empty vector (integer(0)).
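A minimal sketch that counts how often each loop body runs for an empty object. The counter names are illustrative.

```r
myVector <- NULL

nUnsafe <- 0
for (idx in 1:length(myVector)) {      # 1:0 is c(1, 0)
    nUnsafe <- nUnsafe + 1
}

nSafe <- 0
for (idx in seq_along(myVector)) {     # zero-length range
    nSafe <- nSafe + 1
}

c(unsafe = nUnsafe, safe = nSafe)
```

The first loop body runs twice, the second not at all.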

Functions

Always explicitly return values from functions, never rely on the implicit behaviour that returns the last expression. This is not superfluous, it is explicit.

In general, return only from the end of the function, not from multiple places.
Explicitly assign values to crucial function arguments, even if you think you know that that value is the language default.
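These three rules could be applied as in the following sketch. The function smallestValues() is a hypothetical example, not part of any package.

```r
smallestValues <- function(x, nValues = 3) {
    # decreasing = FALSE is the default of sort(), but it is crucial to
    # the result here, so it is stated explicitly.
    sorted <- sort(x, decreasing = FALSE)
    if (length(sorted) > nValues) {
        result <- sorted[seq_len(nValues)]
    } else {
        result <- sorted
    }
    # One explicit return, at the end of the function.
    return(result)
}

smallestValues(c(5, 3, 9, 1, 7))
```

Both branches assign to result, so the function has a single exit point.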

Efficiency

If possible, do not grow data structures dynamically, but create the whole structure with "empty" values, then assign values to its elements. This is much faster.

 # This is really bad:
 system.time({
   N <- 100000
   v <- numeric()
   for (i in 1:N) {
       v <- c(v, sqrt(i))
   }
 })
    user  system elapsed
 16.718  11.258  27.988

 # Even only writing directly to new elements is much, much better:
 system.time({
   N <- 100000
   v <- numeric()
   for (i in 1:N) {
       v[i] <- sqrt(i)
   }
 })
   user  system elapsed
  0.025   0.003   0.027

 # The fastest loop preallocates memory; it actually comes close to the
 # vectorized operation (which is the fastest approach overall):
 # system.time({ v <- sqrt(1:100000) })

 system.time({
   N <- 100000
   v <- numeric(N)
   for (i in seq_along(v)) {
       v[i] <- sqrt(i)
   }
 })
   user  system elapsed
  0.008   0.000   0.007

`# [END]`

Always end your code with an # [END] comment. This way you can be sure it was copied or saved completely and nothing has been inadvertently omitted. This is important in teamwork. If even ONE team member does not adhere to this, it invalidates the efforts of EVERYONE.

Self-evaluation

Notes

[1] I'm serious: I have reformatted major pieces of code more than once after learning of a better approach, and if that creates better code it is very satisfying.
[2] Separating operators with spaces is especially important for the assignment operator <-. Consider this: myPreciousData < -2 returns a vector of TRUE and FALSE, depending on whether the values in myPreciousData are less than -2. But myPreciousData<-2 replaces the entire object with the single number 2!
[3] The = sign is a bit of a special case. When I write e.g. a plot statement, or construct a dataframe, I prefer not to use spaces if the expression ends up all on one line, but to use spaces when the arguments are on separate lines.
[4] For a complementary perspective, see here.
[5] But never hesitate to make exceptions if this makes your code more legible.
[6] In my opinion, base R uses far too many function names that would be useful for variables. But we're not going to change that. So I often just prefix my variable names with my- or this-, e.g. myDf, thisLength etc.
[7] Here are more names that may seem attractive as variable names but that are in fact functions in the base R package and thus may cause confusion: all(), args(), attr(), beta(), body(), col(), date(), det(), diag(), diff(), dim(), dir(), dump(), eigen(), file(), files(), gamma(), kappa(), length(), list(), load(), log(), max(), mean(), min(), open(), q(), raw(), row(), sample(), seq(), sub(), summary(), table(), type(), url(), vector(), and version(). I'm sure you get the idea - composite names of the type proposed above in camelCase are usually safe.
