R CodingStyle

From "A B C"
Revision as of 02:12, 14 April 2017 by Boris (talk | contribs) (→‎Spaces)

R Coding Style


 


 



 

Layout

 

Granularity

 

Headers

 

Sections

 

Spaces

if and for are language keywords, not functions. Separate the following parenthesis from the keyword with a space.

Good:

if (isValid) { ...

Bad:

if(isValid) { ...


 

Names

There are only two hard things in Computer Science: cache invalidation and naming things.

- Phil Karlton[1]


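Periods have a syntactic meaning in R's object system: S3 method dispatch looks for functions named generic.class (e.g. print.data.frame). Using periods in ordinary variable names is therefore misleading and should be avoided.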


Choose names so that related items sort together alphabetically; code autocompletion will then be more useful.
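
A minimal sketch of both points. The class name gene and all variable names are hypothetical, chosen only for illustration: the period in summary.gene is what makes R treat the function as a method, and the shared prefix keeps related variables next to each other in an autocomplete list.

```r
# S3 dispatch: a generic applied to an object of class "gene" looks for
# a function whose name is <generic>.<class>.
summary.gene <- function(object, ...) {
    return(paste("gene of length", nchar(unclass(object))))
}

myGene <- structure("ATGGCC", class = "gene")   # "gene" is a hypothetical class
geneSummary <- summary(myGene)                  # dispatches to summary.gene()

# Related names that share a prefix sort together in autocomplete:
geneLength <- nchar(unclass(myGene))
geneName   <- "hypothetical1"
```

A variable called summary.gene would look exactly like this method, which is why periods are best kept out of ordinary names.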


 

Conditionals

 

Indent Style

No need for much discussion. Follow the One True Bracing Style and we will both be happy. If you don't immediately see why, read about indent style here.
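
As a sketch of what this style looks like in R (the function classifyValue is a hypothetical example, not part of any package): the opening brace ends the line, the closing brace sits on its own line, and else follows the closing brace on the same line.

```r
# One True Brace Style: opening brace at the end of the line, closing brace
# on its own line, "else" on the same line as the preceding closing brace.
classifyValue <- function(x) {       # hypothetical example function
    if (x < 0) {
        label <- "negative"
    } else if (x == 0) {
        label <- "zero"
    } else {
        label <- "positive"
    }
    return(label)
}
```

In R this is more than a matter of taste: at the top level of a script, an else that starts a new line after the closing brace is a syntax error, because the if statement is already complete when the line ends.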


Indentation of long function declarations

 

Loops

 

Functions

 

# [END]

It should always be your goal to code as clearly and explicitly as possible. R has many complex idioms, and because it is a functional language in which functions can be inserted almost anywhere into expressions, it is possible to write very terse, expressive code. Don't do it. Pace yourself, and make sure your reader can follow your flow of thought. More often than not, the poor soul who will be confused by a particularly witty use of the language will be you, yourself, half a year later. There is an astute observation by Brian Kernighan that applies completely:

"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it."


  • Never sacrifice being explicit for saving on keystrokes. Code is read much more often than it is written!
  • Use informative and specific filenames for code sources; give them the extension .R
  • Give your sources headers stating purpose, author, date and version information, and note bugs and issues.
  • Give your functions headers that describe purpose, arguments (including required datatypes), and return values. Callers should be able to work with the function without having to read the code.
  • Use lots of comments. Don't describe what the code does, but explain why.
  • Use separators (# --- SECTION -----------------) to structure your code.
  • Indent comment hashes to align with the expressions in a block.
  • Use only <- for assignment, not =
  • ...but do use = when passing values into the arguments of functions.
  • Don't use <<- (global assignment) except in very unusual cases. Actually never.
  • Use the concise camelCaseStyle for variable names, don't use the confusing.dot.style or the rambling pothole_style.
  • Define parameters at the beginning of the code, use all caps variable names (MAXWIDTH) for such parameters. Never have "magic numbers" appear in your code.
  • In mathematical expressions, always use parentheses to define priority explicitly. Never rely on implicit operator priority. (( 1 + 2 ) / 3 ) * 4
  • Always separate operators and arguments with spaces.[2][3]
  • Never separate function names and the brackets that enclose argument lists.
  • Don't abbreviate argument names. You can, but you shouldn't.
  • Try to limit yourself to ~80 characters per line.
  • Always use braces {}, even if you write single-line if statements and loops.
  • Always use a space after a comma, and never before a comma.
  • Always explicitly return values from functions, never rely on the implicit behaviour that returns the last expression.
  • Use spaces to align repeating parts of code, so errors become easier to spot.
  • Don't repeat code. Use functions instead.
  • Don't repeat code. If you feel the urge to type code more than once, that's how you know you should break up the code into functions.
  • Don't repeat code. I'm repeating this for emphasis.
  • Explicitly assign values to crucial function arguments, even if you think you know that that value is the default.
  • Never reassign reserved words.
  • Don't use c as a variable name since c() is a function.
  • Don't call your data frames df since df() is a function.[4]
  • Don't use semicolons to write more than one expression on a line.
  • Don't use attach().
  • It's safer to use for (i in seq_along(x)) {...} rather than for (i in 1:length(x)) {...}: if x is NULL or has length zero, 1:length(x) evaluates to c(1, 0), so the loop body is executed twice with invalid indices instead of not at all.
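
Several of the rules above can be seen together in one short sketch. The parameter MAXWIDTH is taken from the list; the function countLongLines and the other names are hypothetical. The loop uses seq_along(x), the base R equivalent of seq(along=x) with nothing abbreviated.

```r
# --- PARAMETERS ------------------------------------------------------
MAXWIDTH <- 80    # maximum line width; defined once, no magic numbers below

# --- FUNCTIONS -------------------------------------------------------
countLongLines <- function(lines, maxWidth = MAXWIDTH) {
    # Purpose:   count the elements of lines longer than maxWidth characters
    # Arguments: lines    - character vector
    #            maxWidth - numeric, length 1
    # Value:     integer, the number of over-long elements
    nLong <- 0L
    for (i in seq_along(lines)) {    # zero iterations if lines is empty
        if (nchar(lines[i]) > maxWidth) {
            nLong <- nLong + 1L
        }
    }
    return(nLong)                    # explicit return
}

# --- WHY seq_along() -------------------------------------------------
emptyInput  <- character(0)
badIndices  <- 1:length(emptyInput)     # counts down: c(1, 0)
goodIndices <- seq_along(emptyInput)    # integer(0), the loop is skipped
```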


Specific naming conventions I like
isValid, hasNeighbour ... Boolean variables
findRange(), getLimits() ... simple function names (verbs!)
initializeTable() ... not initTab()
node ... for one element; nodes ... for more elements
nPoints ... for number-of
isError ... not isNotError: avoid double negation
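
A short sketch of these conventions in use. The graph data and the two-column edges matrix are hypothetical, made up only to give the names something to refer to.

```r
nodes   <- c("A", "B", "C")         # more than one element: plural
nNodes  <- length(nodes)            # number-of: prefix n
isValid <- nNodes > 0               # Boolean: prefix is

getNeighbours <- function(node, edges) {   # function name starts with a verb
    # edges: two-column character matrix (from, to); hypothetical structure
    return(edges[edges[, 1] == node, 2])
}

edges <- matrix(c("A", "B",
                  "A", "C"), ncol = 2, byrow = TRUE)
hasNeighbour <- length(getNeighbours("A", edges)) > 0
```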


Consider using the formatR package for consistent code.


If possible, do not grow data structures dynamically, but create the whole structure with "empty" values, then assign values to its elements. This is much faster.

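 # This is bad: the vector is copied and extended in every iteration.
 system.time({
     v <- numeric()
     for (i in 1:100000) {
         v <- c(v, sqrt(i))
     }
 })
 #    user  system elapsed 
 #  20.192   2.182  22.540 
 
 # This is slightly better: the vector still grows element by element.
 system.time({
     v <- numeric()
     for (i in 1:100000) {
         v[i] <- sqrt(i)
     }
 })
 #    user  system elapsed 
 #  14.185   2.036  16.230 

 # This is much, much better (200 times faster): the vector is created
 # at its full length, then filled.
 N <- 100000
 system.time({
     v <- numeric(N)
     for (i in 1:N) {
         v[i] <- sqrt(i)
     }
 })
 #    user  system elapsed 
 #   0.101   0.008   0.108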
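
To try this on your own machine, the two extremes can be wrapped in functions and compared. This is only a sketch: growVector and fillVector are hypothetical helper names, N is kept small so the slow version finishes quickly, and the timings you get will differ from the ones shown above.

```r
N <- 1000    # kept small so the slow version finishes quickly

growVector <- function(n) {          # hypothetical helper: grows v with c()
    v <- numeric()
    for (i in seq_len(n)) {
        v <- c(v, sqrt(i))
    }
    return(v)
}

fillVector <- function(n) {          # hypothetical helper: preallocates v
    v <- numeric(n)
    for (i in seq_len(n)) {
        v[i] <- sqrt(i)
    }
    return(v)
}

# Both approaches must give the same result; only the cost differs.
isSameResult <- identical(growVector(N), fillVector(N))

timeGrow <- system.time(growVector(N))["elapsed"]
timeFill <- system.time(fillVector(N))["elapsed"]
```

Increase N step by step to see how the gap between timeGrow and timeFill widens.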

One of the general principles of writing clear, maintainable code is collocation. This means that information items that can affect each other should be viewable on the same screen. Spolsky makes a great argument for this, together with a few excellent examples; he also makes a case for a special kind of prefix notation for variable and function names that has a lot of merit.
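
One way to read the prefix idea in R terms is to let the prefix state what kind of value a name holds, so that a mismatch is visible on the line where it happens. The sketch below is an assumption about how this could look, not a convention from the source: the prefixes raw (unescaped user input) and esc (HTML-escaped text) and the function escFromRaw are hypothetical.

```r
# Hypothetical prefixes: "raw" = unescaped user input, "esc" = HTML-escaped.
escFromRaw <- function(rawText) {
    escText <- gsub("&", "&amp;", rawText, fixed = TRUE)
    escText <- gsub("<", "&lt;",  escText, fixed = TRUE)
    escText <- gsub(">", "&gt;",  escText, fixed = TRUE)
    return(escText)
}

rawName <- "<b>Ada</b>"
escName <- escFromRaw(rawName)    # esc... <- escFromRaw(raw...): prefixes match

# A line such as  escName <- rawName  would look wrong at a glance,
# without having to scroll to where either variable was defined.
```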


Sources and Notes

  1. For a complementary perspective, see here.
  2. Separating operators with spaces is especially important for the assignment operator <-. Consider this: myPreciousData < -2 returns a vector of TRUE and FALSE, depending on whether the values in myPreciousData are less than -2. But myPreciousData<-2 overwrites the entire object with the single number 2!
  3. The = sign is a bit of a special case. When I write e.g. a plot statement, or construct a dataframe, I prefer not to use spaces if the expression ends up all on one line, but to use spaces when the arguments are on separate lines.
  4. Here are more names that may seem attractive as variable names but that are in fact functions in the base R package and thus may cause confusion: all(), args(), attr(), beta(), body(), col(), date(), det(), diag(), diff(), dim(), dir(), dump(), eigen(), file(), files(), gamma(), kappa(), length(), list(), load(), log(), max(), mean(), min(), open(), q(), raw(), row(), sample(), seq(), sub(), summary(), table(), type(), url(), vector(), and version(). I'm sure you get the idea: composite names of the type proposed above in camelCase are usually safe.
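
The difference described in note 2 can be checked directly. The values below are made up for illustration, and myCopy is a hypothetical name used so that the original vector stays intact.

```r
myPreciousData <- c(-5, 0, 3)

# With spaces around "<": a comparison against -2.
isBelow <- myPreciousData < -2     # c(TRUE, FALSE, FALSE)

# Without spaces: an assignment. The whole object is replaced by 2.
myCopy <- myPreciousData
myCopy<-2
```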