Difference between revisions of "R CodingStyle"


Latest revision as of 15:33, 17 April 2017

R Coding Style


 

Warning: Coding style is a volatile topic. Friendships have been renounced, eternal vows of marriage have been dissolved, stock options have been lost, all over a disagreement about the One True Brace Style, or whether fetchSequenceFromPDB() is a good function name or not. I am laying out coding rules below that reflect a few years of experience. They work for me; they may not work for you.

However:

  • If you are taking one of my workshops, I recommend that you follow these rules: I write this way, and we will find it easier to communicate if you do too.
  • If you are collaborating on a software project, these rules embody the standard across the project, and I will not check in code that deviates. Here, consistency is key; but if you think you have a better approach, you only need to convince me and we will change the rule and apply it throughout the codebase[1].
  • If you are taking one of my courses, you may lose marks if you do not adhere to these standards. Of course, rules must not be followed blindly – we are training future collaborators, not parrots – but you need to write in the spirit of the one rule we all agree on:

Well-written code helps the reader to understand the intent.


 



 

General

It should always be your goal to code as clearly and explicitly as possible. R has many complex idioms, and because it is a functional language in which function calls can be inserted almost anywhere into expressions, it is possible to write very terse, expressive code. Don't do it. Pace yourself, and make sure your reader can follow your flow of thought. More often than not the poor soul who will be confused by a particularly witty use of the language will be you, yourself, half a year later. There is an astute observation by Brian Kernighan that applies completely:

"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it."


  • Never sacrifice being explicit for saving on keystrokes. Code is read much more often than it is written!
  • Use lots of comments. Don't describe what the code does, but explain why.
  • Indent comment hashes to align with the expressions in a block.
  • Use only <- for assignment, not =
  • ...but do use = when passing values into the arguments of functions.
  • Don't use <<- (global assignment) except in very unusual cases. Actually never.
  • Define global variables at the beginning of the code, and use all-caps variable names (MAXWIDTH) for such parameters. Never have "magic numbers" appear in your code.
  • If such variables are meant to be truly global, use options() to set them.
  • Don't use attach().
  • Always use for (i in seq_along(x)) {...} (or the equivalent seq(along.with = x)) rather than for (i in 1:length(x)) {...}: if x is NULL or otherwise has length zero, 1:length(x) evaluates to c(1, 0), and the loop body is executed twice with indices that do not exist instead of not at all.
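
The loop rule is easy to check for yourself. The following is a minimal sketch; the helper countIterations() is a name I made up for this illustration. For an empty x, 1:length(x) is 1:0, which counts down from 1 to 0, whereas seq_along(x) is an empty integer vector.

```r
# Count how often a loop body runs for a given vector of indices.
# (countIterations() is a hypothetical helper, not a base R function.)
countIterations <- function(indices) {
    nIterations <- 0
    for (i in indices) {
        nIterations <- nIterations + 1
    }
    return(nIterations)
}

x <- NULL

# 1:length(x) is 1:0, i.e. c(1, 0): the body runs for i = 1 and i = 0.
nUnsafe <- countIterations(1:length(x))

# seq_along(x) is integer(0): the body is skipped entirely.
nSafe <- countIterations(seq_along(x))

print(c(unsafe = nUnsafe, safe = nSafe))
```

If the loop body indexes into x, the unsafe version silently works with elements that are not there, which is usually much harder to spot than a loop that simply does nothing.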


 

Layout

  • Limit yourself to 80 characters per line.
  • Don't use semicolons to write more than one expression on a line.


 

Design and granularity

  • Don't repeat code. Use functions instead.
  • Don't repeat code. If you feel the urge to type code more than once, that's how you know you should break up the code into functions.
  • Don't repeat code. I'm repeating this for emphasis.


One of the general principles of writing clear, maintainable code is collocation. This means that information items that can affect each other should be viewable on the same screen. Joel Spolsky makes a great argument for this (https://www.joelonsoftware.com/2005/05/11/making-wrong-code-look-wrong/), together with a few excellent examples; he also makes a case for a special kind of prefix notation for variable and function names that has a lot of merit.


  • If the code for a function does not fit on approximately one printed page, you should probably break it up further.
  • If your loops or conditionals are nested more than three levels deep, you should rethink the logic.
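
To illustrate the "don't repeat" rule, here is a small sketch under my own assumptions: the data values and the function name rescaleToUnitRange() are invented for the example. The repeated expression appears only as a comment; the function replaces it.

```r
# Repetitive version (avoid): the same rescaling is typed out twice.
#     heightsScaled <- (heights - min(heights)) / (max(heights) - min(heights))
#     weightsScaled <- (weights - min(weights)) / (max(weights) - min(weights))

rescaleToUnitRange <- function(x) {
    # Purpose:
    #     Linearly rescale a numeric vector to the interval [0, 1].
    # Parameters:
    #     x:  numeric vector with at least two distinct values
    # Value:
    #     numeric vector of the same length as x
    xMin <- min(x)
    xMax <- max(x)
    scaled <- (x - xMin) / (xMax - xMin)
    return(scaled)
}

heights <- c(150, 175, 200)    # made-up example values
weights <- c(50, 60, 90)       # made-up example values

heightsScaled <- rescaleToUnitRange(heights)
weightsScaled <- rescaleToUnitRange(weights)
```

If the formula ever needs to change, there is now exactly one place to change it.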


 

Headers

  • Give your sources headers stating purpose, author, date and version information, and note bugs and issues.
  • Give your functions headers that describe purpose, arguments (including required datatypes), and return values. Callers should be able to work with the function without having to read the code.
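
What such headers might look like is sketched below. The layout of the header fields is only one possibility, and the file name, the parameter MINLENGTH and the function countLongSequences() are all hypothetical; angle brackets mark placeholders to fill in.

```r
# summarizeSequences.R    (hypothetical file name)
#
# Purpose:  Count sequences that reach a minimum length.
# Version:  <version number>
# Date:     <date of last change>
# Author:   <author name and contact>
#
# Notes:    <known bugs and open issues>

# --- PARAMETERS ---------------------------------------------------------
MINLENGTH <- 4    # shortest sequence that still counts as "long"

# --- FUNCTIONS ----------------------------------------------------------
countLongSequences <- function(sequences, minLength) {
    # Purpose:
    #     Count the elements of sequences that have at least minLength
    #     characters.
    # Parameters:
    #     sequences:  character vector, one sequence per element
    #     minLength:  numeric, length 1; inclusive length threshold
    # Value:
    #     integer, length 1; the number of sufficiently long sequences
    isLong <- (nchar(sequences) >= minLength)
    nLong <- sum(isLong)
    return(nLong)
}

# --- PROCESS ------------------------------------------------------------
mySequences <- c("ACGT", "ACGTACGT", "AC")
nLong <- countLongSequences(mySequences, minLength = MINLENGTH)
print(nLong)

# [END]
```

A caller who reads only the function header knows what to pass in and what comes back, which is the whole point.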


 

Sections

  • Use separators (# --- SECTION -----------------) to structure your code.



 


Parentheses and Braces

  • In mathematical expressions, always use parentheses to define priority explicitly. Never rely on implicit operator priority. (( 1 + 2 ) / 3 ) * 4
  • Always use braces {}, even if you write single-line if statements and loops.
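
A short sketch of why explicit parentheses matter; the variable names are mine. In R the power operator binds more tightly than the unary minus, which regularly surprises people.

```r
# Implicit priority: ^ binds more tightly than the unary minus,
# so this is evaluated as -(2 ^ 2).
implicitResult <- -2 ^ 2

# Explicit priority: the parentheses say what is intended.
explicitResult <- (-2) ^ 2

# A longer expression with every step made explicit.
fahrenheit <- 212
celsius <- ((fahrenheit - 32) * 5) / 9

# Braces, even for a single-line conditional.
isSilent <- FALSE
if (! isSilent) {
    print(c(implicitResult, explicitResult, celsius))
}
```

The two results differ in sign, and only the second line tells the reader which one the author wanted.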


 

Spaces

if and for are language keywords, not functions. Separate the following parenthesis from the keyword with a space.

Good:

if (silent) { ...

Bad:

if(silent) { ...


  • Always separate operators and arguments with spaces.[2][3]
  • Never separate function names and their following parentheses with spaces.
  • Always use a space after a comma, and never before a comma.
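
The difference that spaces around <- can make is worth seeing once. This is a minimal sketch with made-up values; the line without spaces deliberately violates the rule.

```r
myPreciousData <- c(-5, -1, 3)

# With spaces, this is a comparison: myPreciousData itself is unchanged.
isBelowMinusTwo <- myPreciousData < -2

# Without spaces, the parser reads an assignment. We work on a copy so
# that the original survives the demonstration.
myCopy <- myPreciousData
myCopy<-2

print(isBelowMinusTwo)
print(myCopy)
```

After the last assignment, myCopy no longer holds three values: the whole vector has been replaced by a single number.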


 

Names

There are only two hard things in Computer Science: cache invalidation and naming things.

- Phil Karlton[4]

  • Use informative and specific filenames for code sources; give them the extension .R
  • Periods have a syntactic meaning in object-oriented classes. I consider their use in normal variable names wrong, even though this is not a syntax error.
  • Choose names so that related items sort together alphabetically; code autocomplete will then be more useful.
  • Use the concise camelCaseStyle for variable names; don't use the confusing.dot.style or the rambling pothole_style.
  • Don't abbreviate argument names. You can, but you shouldn't.
  • Never reassign reserved words.
  • Don't use c as a variable name since c() is a function.
  • Don't call your data frames df since df() is a function.[5]
  • Name length should be commensurate with the scope of a variable.

Specific naming conventions I like:

  • isValid, hasNeighbour ... Boolean variables
  • findRange(), getLimits() ... simple function names (verbs!)
  • initializeTable() ... not initTab()
  • node ... for one element; nodes ... for more elements
  • nPoints ... for number-of
  • isError ... not isNotError: avoid double negation
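
Here is a small sketch of two of the naming rules in action; all variable names and values are invented for the example. Abbreviated argument names work because R matches them partially, but the reader has to guess what was meant.

```r
# Abbreviated argument name: works through partial matching, but the
# reader has to guess what "len" stands for.
terse <- seq(1, 10, len = 4)

# Spelled out, the call documents itself.
explicit <- seq(from = 1, to = 10, length.out = 4)

# Naming conventions applied to a small example.
nodes <- c("alpha", "beta", "gamma")    # plural: more than one element
nNodes <- length(nodes)                 # n...: number-of
hasNeighbour <- (nNodes > 1)            # has...: Boolean

print(explicit)
```

Both calls to seq() return the same vector; only one of them can be read without looking up the argument list.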


 

Conditionals

 

Indent Style

No need for much discussion. Follow the One True Brace Style and we will both be happy. If you don't immediately see why: read about indent style here.


Indentation of long function declarations

  • Use spaces to align repeating parts of code, so errors become easier to spot.
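
As a sketch of what alignment buys you (the data frame and its values are made up): when the repeating parts line up, a missing argument or a vector of the wrong length stands out at a glance.

```r
# One argument per line, with the = signs and values aligned.
samplePoints <- data.frame(x     = c(1.0, 2.0, 3.0),
                           y     = c(0.5, 1.5, 2.5),
                           label = c("a", "b", "c"),
                           stringsAsFactors = FALSE)

# Aligned assignments of related values.
xMin <- min(samplePoints$x)
xMax <- max(samplePoints$x)
yMin <- min(samplePoints$y)
yMax <- max(samplePoints$y)
```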


 

Loops

 

Functions

  • Always explicitly return values from functions, never rely on the implicit behaviour that returns the last expression.
  • In general, return only from the end of the function, not from multiple places.
  • Explicitly assign values to crucial function arguments, even if you think you know that that value is the default.
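
The three rules for functions can be sketched together as follows; describeSign() is a hypothetical function written for this illustration. The result is collected in one variable and returned explicitly, once, at the end.

```r
describeSign <- function(x) {
    # Purpose:
    #     Classify a single number by its sign.
    # Parameters:
    #     x:  numeric, length 1, not NA
    # Value:
    #     character, length 1: "negative", "zero" or "positive"
    if (x < 0) {
        label <- "negative"
    } else if (x > 0) {
        label <- "positive"
    } else {
        label <- "zero"
    }
    return(label)    # single, explicit exit point
}

# A crucial argument is spelled out even though FALSE is the default.
sortedValues <- sort(c(3, 1, 2), decreasing = FALSE)

print(describeSign(-7))
print(sortedValues)
```

Writing decreasing = FALSE costs a few keystrokes, but the reader no longer needs to remember what sort() does by default.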


 

Efficiency

If possible, do not grow data structures dynamically, but create the whole structure with "empty" values, then assign values to its elements. This is much faster.

 # This is bad: 
 v <- numeric()
 for (i in 1:100000) {
     v <- c(v, sqrt(i))
 }
    user  system elapsed 
 20.192   2.182  22.540 
 
 # This is marginally better: 
 v <- numeric()
 for (i in 1:100000) {
     v[i] <- sqrt(i)
 }
   user  system elapsed 
 14.185   2.036  16.230 

 # This is much, much better (200 times faster):
 
 v <- numeric(100000)
 for (i in seq_along(v)) {
     v[i] <- sqrt(i)
 }
   user  system elapsed 
  0.101   0.008   0.108
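
If you would like to measure this yourself, here is one way to set it up with system.time(). This is a sketch under my own assumptions: the two function names are invented, and N is smaller than in the listing above so that the comparison finishes quickly. The timings you get will depend on your machine and your version of R, so none are shown.

```r
N <- 10000    # smaller than above, so the slow version stays bearable

growByConcatenation <- function(n) {
    # Grows the vector by one element per iteration (slow).
    v <- numeric()
    for (i in seq_len(n)) {
        v <- c(v, sqrt(i))
    }
    return(v)
}

fillPreallocated <- function(n) {
    # Creates the full-length vector first, then assigns into it.
    v <- numeric(n)
    for (i in seq_along(v)) {
        v[i] <- sqrt(i)
    }
    return(v)
}

timeGrown <- system.time(vGrown <- growByConcatenation(N))
timePreallocated <- system.time(vPreallocated <- fillPreallocated(N))

# Elapsed seconds for both approaches, side by side.
print(c(grown        = timeGrown[["elapsed"]],
        preallocated = timePreallocated[["elapsed"]]))
```

Both functions return the same values; only the way the vector is built differs.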



 

# [END]

  • Always end your code with a # [END] comment. This way you can be sure it was copied or saved completely and nothing has been inadvertently omitted.



Sources and Notes

  1. I'm serious: I have reformatted major pieces of code more than once after learning of a better approach, and if that creates better code it is very satisfying.
  2. Separating operators with spaces is especially important for the assignment operator <-. Consider this: myPreciousData < -2 returns a vector of TRUE and FALSE, depending on whether the values in myPreciousData are less than -2. But myPreciousData<-2 replaces the entire object with the single number 2!
  3. The = sign is a bit of a special case. When I write e.g. a plot statement, or construct a dataframe, I prefer not to use spaces if the expression ends up all on one line, but to use spaces when the arguments are on separate lines.
  4. For a complementary perspective, see here.
  5. Here are more names that may seem attractive as variable names but that are in fact functions in the base R package and thus may cause confusion: all(), args(), attr(), beta(), body(), col(), date(), det(), diag(), diff(), dim(), dir(), dump(), eigen(), file(), gamma(), kappa(), length(), list(), load(), log(), max(), mean(), min(), open(), q(), raw(), row(), sample(), seq(), sub(), summary(), table(), url(), and vector(); version is also taken, although it is a built-in variable rather than a function. I'm sure you get the idea - composite names of the type proposed above in CamelCase are usually safe.