Difference between revisions of "BCH441 Code submisson instructions"

From "A B C"
Jump to navigation Jump to search
 
(One intermediate revision by the same user not shown)
Line 28: Line 28:
  
 
;Developing
 
;Developing
:You must develop your code in one or more '''functions''', not just ''ad hoc'' scripts. The analysis must be run from a function. Make sure that your variables are instantiated within the function body, or passed as parameters. Do not refer to variables in the global workspace, create everything you need in the local context. <small>(Ask, if you are not sure how to do that e.g. for a genetic code table or such.)</small> Please consider the [[http://steipe.biochemistry.utoronto.ca/abc/index.php/ABC-Rubrics#Code|evaluation rubrics]].
+
:You must develop your code in one or more '''functions''', not just ''ad hoc'' scripts. The analysis must be run from a function. Make sure that your variables are instantiated within the function body, or passed as parameters. Do not refer to variables in the global workspace, create everything you need in the local context. <small>(Ask, if you are not sure how to do that e.g. for a genetic code table or such.)</small> Please consider the [[ABC-Rubrics#Code|evaluation rubrics]].
  
 
;Final analysis
 
;Final analysis

Latest revision as of 23:25, 17 November 2020

Code submission instructions


 


 

This course has strict formal requirements for submitting code for credit. These requirements are designed to demonstrate that it was you who ran the code, that the code you have submitted is the exact code that was used to produce the submitted output, and that the results have not been altered after submission.[1] Just to remind you:

Once you have developed your analysis and submitted your results and code
  • The submitted code must be exactly the code that you have used to obtain your results. Altering the code in your documentation - even if only for cosmetic purposes - is an academic offence.
  • The reported results must be exactly the results your code has produced at the time you ran it. Altering the results you obtained - for any reason whatsoever - is an academic offence.
  • It must be possible to reproduce your exact results from your posted code.

Of course a situation might arise where you do need to alter the code - e.g. to format the output better, to add a reference to a source that you forgot, or to add a clarifying comment. Not a problem. Simply modify whatever you need, and replace the entire old code and output in your documentation.

When in doubt: be safe and ask; don't make assumptions.

Instructions

 
Developing
You must develop your code in one or more functions, not just ad hoc scripts. The analysis must be run from a function. Make sure that your variables are instantiated within the function body, or passed as parameters. Do not refer to variables in the global workspace, create everything you need in the local context. (Ask, if you are not sure how to do that e.g. for a genetic code table or such.) Please consider the evaluation rubrics.
Final analysis
Your final analysis should run from a function call without manual intervention. Your function needs to execute sealKey() as its first instruction and as its last instruction. sealKey() takes a snapshot of the function state, deposits it in an encrypted version on the course server, and returns a validating key. (Ask, if you feel this does not work well for the solution you had in mind and you need a different approach.)

Please follow the following code pattern:

mySubmissionCode <- function(<parameters>) {
  sealKey()  # run sealKey() as the first command

# put your code here ...
# make sure it is well commented
# make sure you reference your sources
# make your function draw your plots, compute averages, etc. and use print() or cat()
#   as required to output results.
 
  sealKey()  # run sealKey() again, as the last command
  return()
}

Finally: run your code as a single command (e.g. mySubmissionCode(a = "this", b="that",...)), and paste the entire code and all of its output into your submission


 

Example

Here is a formatted example, in case you need one.

Example code


The code I used
myAApairs <- function(mySeq, N=1000) {
  # compare observed with expected amino acid pair frequencies
  # input: char   a sequence vector of one-letter codes
  #        int    number of times to run a trial
  # value: NULL   (invisible). Results are printed as side-effect

  sealKey()  # run sealKey() as the first command

  aa <- c("a", "c", "d", "e", "f", "g", "h", "i", "k", "l",
          "m", "n", "p", "q", "r", "s", "t", "v", "w", "y")
  aap <- paste0(aa[rep(1:20, each = 20)], aa[rep(1:20, 20)])  # all pairs

  pairs <- function(s) {
    # Return a table of amino acid pair frequencies from a vector s.
    # We make all pairs from s. Then we convert the observed pairs into factors
    # and use table(). Since we define the levels with aap, table() also reports
    # zero-counts. (cf. https://stackoverflow.com/questions/1617061/ )
    l <- length(s)
    myPairs <- paste0(s[1:(l-1)], s[2:l])
    return(table(factor(myPairs, levels = aap)))
  }
  if (FALSE) { # test the function
    myS <- rep(c("a","c"), each = 3) #"a" "a" "a" "c" "c" "c"
    x <- pairs(myS)
    length(x) # 400
    x[x != 0] # look at all non-zero counts
    # aa ac cc
    # 2  1  2
  }
  # process ...
  mySeq <- tolower(mySeq)

  tObs <- pairs(mySeq)

  tSim <- pairs("") # initialize with empty string: all zero counts
  for (i in 1:N) {
    tSim <- tSim + pairs(sample(mySeq)) # add paircounts of shuffled sequence
  }
  tSim <- tSim / N   # average

  # Report differences as log-ratio.
  # First, add dummy observations instead of zero-counts to avoid infinities.
  tObs <- tObs + 0.0001
  tSim <- tSim + 0.0001
  # Get log ratios
  tDiff <- log(tObs / tSim)
  # print results
  cat("\nTen highest enriched pairs (tObs > tSim):\n")
  ord <- order(tDiff, decreasing = TRUE)
  cat(" ")  # IDK why this is needed, but it doesn't align otherwise
  cat(sprintf("Pair: %s  log(q): %5.3f (fObs: %5.2f fSim: %5.2f)\n",
              names(tObs[ord[1:10]]),
              tDiff[ord[1:10]], tObs[ord[1:10]], tSim[ord[1:10]]))

  cat("\nTen most depleted pairs (tObs < tSim):\n")
  ord <- order(tDiff, decreasing = FALSE)
  cat(" ")
  cat(sprintf("Pair: %s  log(q): %5.3f (fObs: %5.2f fSim: %5.2f)\n",
              names(tObs[ord[1:10]]),
              tDiff[ord[1:10]], tObs[ord[1:10]], tSim[ord[1:10]]))

  sealKey()  # run sealKey() again, as the last command
  return(invisible(NULL))
}
How it was run
sel <- grep("MBP1", myDB$protein$name)                    # Mbp1 orthologues
mySeq <- paste0(myDB$protein$sequence[sel], collapse="-") # fetch and separate
mySeq <- unlist(strsplit(mySeq, ""))                      # turn into vector

myAApairs(mySeq)
What it produced
[1] "sealKey> 2020-10-12 05:10:18 (6ca9e8aa-37cc-877f-ef98-f60fafec7ec1)"

Ten highest enriched pairs (tObs > tSim):
 Pair: hw  log(q): 2.053 (fObs: 11.00 fSim:  1.41)
 Pair: wi  log(q): 1.568 (fObs: 13.00 fSim:  2.71)
 Pair: tw  log(q): 1.527 (fObs: 14.00 fSim:  3.04)
 Pair: hh  log(q): 1.215 (fObs: 19.00 fSim:  5.64)
 Pair: pp  log(q): 0.925 (fObs: 65.00 fSim: 25.77)
 Pair: wa  log(q): 0.921 (fObs: 13.00 fSim:  5.18)
 Pair: wv  log(q): 0.906 (fObs:  7.00 fSim:  2.83)
 Pair: dw  log(q): 0.889 (fObs:  9.00 fSim:  3.70)
 Pair: fh  log(q): 0.837 (fObs: 12.00 fSim:  5.19)
 Pair: yf  log(q): 0.750 (fObs: 11.00 fSim:  5.20)

Ten most depleted pairs (tObs < tSim):
 Pair: fy  log(q): -10.868 (fObs:  0.00 fSim:  5.25)
 Pair: aw  log(q): -10.864 (fObs:  0.00 fSim:  5.23)
 Pair: my  log(q): -10.854 (fObs:  0.00 fSim:  5.18)
 Pair: lw  log(q): -10.784 (fObs:  0.00 fSim:  4.82)
 Pair: ew  log(q): -10.616 (fObs:  0.00 fSim:  4.08)
 Pair: wd  log(q): -10.524 (fObs:  0.00 fSim:  3.72)
 Pair: wr  log(q): -10.498 (fObs:  0.00 fSim:  3.62)
 Pair: pw  log(q): -10.330 (fObs:  0.00 fSim:  3.07)
 Pair: ic  log(q): -10.323 (fObs:  0.00 fSim:  3.04)
 Pair: wq  log(q): -10.215 (fObs:  0.00 fSim:  2.73)
[1] "sealKey> 2020-10-12 05:10:22 (5151048f-1ba2-e074-3281-18a0e0a080e7)"



 

Notes

  1. This all is in response to an academic offence that occurred when a student could not get their code to run correctly and then confabulated the results.