Difference between revisions of "BCH441 Code submisson instructions"

From "A B C"
Jump to navigation Jump to search
m
 
(6 intermediate revisions by the same user not shown)
Line 9: Line 9:
  
 
{{Vspace}}
 
{{Vspace}}
 +
 +
This course has strict formal requirements  for submitting code for credit. These requirements are designed to demonstrate that <u>it was you who ran the code</u>, that <u>the code you have submitted is the exact code that was used to produce the submitted output</u>, and that <u>the results have not been altered after submission</u>.<ref>This all is in response to an academic offence that occurred when a student could not get their code to run correctly and then confabulated the results.</ref> Just to remind you:
 +
 +
<div class="note">
 +
;Once you have developed your analysis and submitted your results and code:
 +
* The submitted code must be exactly the code that you have used to obtain your results. Altering the code in your documentation - even if only for cosmetic purposes - is an academic offence.
 +
* The reported results must be exactly the results your code has produced at the time you ran it. Altering the results you obtained - for any reason whatsoever - is an academic offence.
 +
* It must be possible to reproduce your exact results from your posted code.
 +
</div>
 +
 +
Of course a situation might arise where you '''do''' need to alter the code - e.g. to format the output better, to add a reference to a source that you forgot, or to add a clarifying comment. Not a problem. Simply modify whatever you need, and '''replace the entire old code and output''' in your documentation.
 +
 +
When in doubt: be safe and ask; don't make assumptions.
 +
 
== Instructions ==
 
== Instructions ==
  
 
{{Smallvspace}}
 
{{Smallvspace}}
  
<div class="alert">
+
;Developing
Coming soon.
+
:You must develop your code in one or more '''functions''', not just ''ad hoc'' scripts. The analysis must be run from a function. Make sure that your variables are instantiated within the function body, or passed as parameters. Do not refer to variables in the global workspace, create everything you need in the local context. <small>(Ask, if you are not sure how to do that e.g. for a genetic code table or such.)</small> Please consider the [[ABC-Rubrics#Code|evaluation rubrics]].
 +
 
 +
;Final analysis
 +
:Your final analysis should run from a function call without manual intervention. Your function needs to execute <code>sealKey()</code> as its first instruction '''and''' as its last instruction. <code>sealKey()</code> takes a snapshot of the function state, deposits it in an encrypted version on the course server, and returns a validating key. <small>(Ask, if you feel this does not work well for the solution you had in mind and you need a different approach.)</small>
 +
 
 +
Please follow the following code pattern:
 +
<pre>
 +
mySubmissionCode <- function(<parameters>) {
 +
  sealKey()  # run sealKey() as the first command
 +
 
 +
# put your code here ...
 +
# make sure it is well commented
 +
# make sure you reference your sources
 +
# make your function draw your plots, compute averages, etc. and use print() or cat()
 +
#  as required to output results.
 +
 +
  sealKey()  # run sealKey() again, as the last command
 +
  return()
 +
}
 +
</pre>
 +
 
 +
Finally: run your code as a single command (e.g. <code>mySubmissionCode(a = "this", b="that",...)</code>), and paste the entire code and all of its output into your submission
 +
 
 +
{{Vspace}}
 +
 
 +
== Example ==
 +
 
 +
Here is a formatted example, in case you need one.
 +
 
 +
<div class="mw-collapsible mw-collapsed" data-expandtext="Expand" data-collapsetext="Collapse" style="width:80%; ">
 +
Example code
 +
<div class="mw-collapsible-content" style="padding:10px;">
 +
 
 +
----
 +
 
 +
;The code I used:
 +
<pre>
 +
myAApairs <- function(mySeq, N=1000) {
 +
  # compare observed with expected amino acid pair frequencies
 +
  # input: char  a sequence vector of one-letter codes
 +
  #        int    number of times to run a trial
 +
  # value: NULL  (invisible). Results are printed as side-effect
 +
 
 +
  sealKey()  # run sealKey() as the first command
 +
 
 +
  aa <- c("a", "c", "d", "e", "f", "g", "h", "i", "k", "l",
 +
          "m", "n", "p", "q", "r", "s", "t", "v", "w", "y")
 +
  aap <- paste0(aa[rep(1:20, each = 20)], aa[rep(1:20, 20)])  # all pairs
 +
 
 +
  pairs <- function(s) {
 +
    # Return a table of amino acid pair frequencies from a vector s.
 +
    # We make all pairs from s. Then we convert the observed pairs into factors
 +
    # and use table(). Since we define the levels with aap, table() also reports
 +
    # zero-counts. (cf. https://stackoverflow.com/questions/1617061/ )
 +
    l <- length(s)
 +
    myPairs <- paste0(s[1:(l-1)], s[2:l])
 +
    return(table(factor(myPairs, levels = aap)))
 +
  }
 +
  if (FALSE) { # test the function
 +
    myS <- rep(c("a","c"), each = 3) #"a" "a" "a" "c" "c" "c"
 +
    x <- pairs(myS)
 +
    length(x) # 400
 +
    x[x != 0] # look at all non-zero counts
 +
    # aa ac cc
 +
    # 2  1  2
 +
  }
 +
  # process ...
 +
  mySeq <- tolower(mySeq)
 +
 
 +
  tObs <- pairs(mySeq)
 +
 
 +
  tSim <- pairs("") # initialize with empty string: all zero counts
 +
  for (i in 1:N) {
 +
    tSim <- tSim + pairs(sample(mySeq)) # add paircounts of shuffled sequence
 +
  }
 +
  tSim <- tSim / N  # average
 +
 
 +
  # Report differences as log-ratio.
 +
  # First, add dummy observations instead of zero-counts to avoid infinities.
 +
  tObs <- tObs + 0.0001
 +
  tSim <- tSim + 0.0001
 +
  # Get log ratios
 +
  tDiff <- log(tObs / tSim)
 +
  # print results
 +
  cat("\nTen highest enriched pairs (tObs > tSim):\n")
 +
  ord <- order(tDiff, decreasing = TRUE)
 +
  cat(" ")  # IDK why this is needed, but it doesn't align otherwise
 +
  cat(sprintf("Pair: %s  log(q): %5.3f (fObs: %5.2f fSim: %5.2f)\n",
 +
              names(tObs[ord[1:10]]),
 +
              tDiff[ord[1:10]], tObs[ord[1:10]], tSim[ord[1:10]]))
 +
 
 +
  cat("\nTen most depleted pairs (tObs < tSim):\n")
 +
  ord <- order(tDiff, decreasing = FALSE)
 +
  cat(" ")
 +
  cat(sprintf("Pair: %s  log(q): %5.3f (fObs: %5.2f fSim: %5.2f)\n",
 +
              names(tObs[ord[1:10]]),
 +
              tDiff[ord[1:10]], tObs[ord[1:10]], tSim[ord[1:10]]))
 +
 
 +
  sealKey()  # run sealKey() again, as the last command
 +
  return(invisible(NULL))
 +
}
 +
</pre>
 +
 
 +
;How it was run:
 +
<pre>
 +
sel <- grep("MBP1", myDB$protein$name)                    # Mbp1 orthologues
 +
mySeq <- paste0(myDB$protein$sequence[sel], collapse="-") # fetch and separate
 +
mySeq <- unlist(strsplit(mySeq, ""))                      # turn into vector
 +
 
 +
myAApairs(mySeq)
 +
</pre>
 +
 
 +
;What it produced:
 +
<pre>
 +
[1] "sealKey> 2020-10-12 05:10:18 (6ca9e8aa-37cc-877f-ef98-f60fafec7ec1)"
 +
 
 +
Ten highest enriched pairs (tObs > tSim):
 +
Pair: hw  log(q): 2.053 (fObs: 11.00 fSim:  1.41)
 +
Pair: wi  log(q): 1.568 (fObs: 13.00 fSim:  2.71)
 +
Pair: tw  log(q): 1.527 (fObs: 14.00 fSim:  3.04)
 +
Pair: hh  log(q): 1.215 (fObs: 19.00 fSim:  5.64)
 +
Pair: pp  log(q): 0.925 (fObs: 65.00 fSim: 25.77)
 +
Pair: wa  log(q): 0.921 (fObs: 13.00 fSim:  5.18)
 +
Pair: wv  log(q): 0.906 (fObs:  7.00 fSim:  2.83)
 +
Pair: dw  log(q): 0.889 (fObs:  9.00 fSim:  3.70)
 +
Pair: fh  log(q): 0.837 (fObs: 12.00 fSim:  5.19)
 +
Pair: yf  log(q): 0.750 (fObs: 11.00 fSim:  5.20)
 +
 
 +
Ten most depleted pairs (tObs < tSim):
 +
Pair: fy  log(q): -10.868 (fObs:  0.00 fSim:  5.25)
 +
Pair: aw  log(q): -10.864 (fObs:  0.00 fSim:  5.23)
 +
Pair: my  log(q): -10.854 (fObs:  0.00 fSim:  5.18)
 +
Pair: lw  log(q): -10.784 (fObs:  0.00 fSim:  4.82)
 +
Pair: ew  log(q): -10.616 (fObs:  0.00 fSim:  4.08)
 +
Pair: wd  log(q): -10.524 (fObs:  0.00 fSim:  3.72)
 +
Pair: wr  log(q): -10.498 (fObs:  0.00 fSim:  3.62)
 +
Pair: pw  log(q): -10.330 (fObs:  0.00 fSim:  3.07)
 +
Pair: ic  log(q): -10.323 (fObs:  0.00 fSim:  3.04)
 +
Pair: wq  log(q): -10.215 (fObs:  0.00 fSim:  2.73)
 +
[1] "sealKey> 2020-10-12 05:10:22 (5151048f-1ba2-e074-3281-18a0e0a080e7)"
 +
</pre>
 +
----
 +
</div>
 
</div>
 
</div>
  
<!--  
+
 
<ul>
+
<!-- IA specific instructions ???
<li>If there is any programming involved, create another page as a code appendix and link to it from your submission page. All code that you have used must be on this appendix.</li>
+
== For ABC-INT Mutation Impact ==
<li>Make sure that the exact code that you have used is posted to your code appendix. This is a crucial part of the submission. See below for a formatting example. '''If I cannot exactly reproduce your results from your posted code, you will receive 0 marks for this unit'''. Please ask in case there are  technical issues with this point.</li>
+
{{Smallvspace}}
</ul>
+
 
 +
...
 +
 
 
  -->
 
  -->
  
 +
{{Vspace}}
  
 
====Notes====
 
====Notes====

Latest revision as of 23:25, 17 November 2020

Code submission instructions


 


 

This course has strict formal requirements for submitting code for credit. These requirements are designed to demonstrate that it was you who ran the code, that the code you have submitted is the exact code that was used to produce the submitted output, and that the results have not been altered after submission.[1] Just to remind you:

Once you have developed your analysis and submitted your results and code
  • The submitted code must be exactly the code that you have used to obtain your results. Altering the code in your documentation - even if only for cosmetic purposes - is an academic offence.
  • The reported results must be exactly the results your code has produced at the time you ran it. Altering the results you obtained - for any reason whatsoever - is an academic offence.
  • It must be possible to reproduce your exact results from your posted code.

Of course a situation might arise where you do need to alter the code - e.g. to format the output better, to add a reference to a source that you forgot, or to add a clarifying comment. Not a problem. Simply modify whatever you need, and replace the entire old code and output in your documentation.

When in doubt: be safe and ask; don't make assumptions.

Instructions

 
Developing
You must develop your code in one or more functions, not just ad hoc scripts. The analysis must be run from a function. Make sure that your variables are instantiated within the function body, or passed as parameters. Do not refer to variables in the global workspace, create everything you need in the local context. (Ask, if you are not sure how to do that e.g. for a genetic code table or such.) Please consider the evaluation rubrics.
Final analysis
Your final analysis should run from a function call without manual intervention. Your function needs to execute sealKey() as its first instruction and as its last instruction. sealKey() takes a snapshot of the function state, deposits it in an encrypted version on the course server, and returns a validating key. (Ask, if you feel this does not work well for the solution you had in mind and you need a different approach.)

Please follow the following code pattern:

mySubmissionCode <- function(<parameters>) {
  sealKey()  # run sealKey() as the first command

# put your code here ...
# make sure it is well commented
# make sure you reference your sources
# make your function draw your plots, compute averages, etc. and use print() or cat()
#   as required to output results.
 
  sealKey()  # run sealKey() again, as the last command
  return()
}

Finally: run your code as a single command (e.g. mySubmissionCode(a = "this", b="that",...)), and paste the entire code and all of its output into your submission


 

Example

Here is a formatted example, in case you need one.

Example code


The code I used
myAApairs <- function(mySeq, N=1000) {
  # compare observed with expected amino acid pair frequencies
  # input: char   a sequence vector of one-letter codes
  #        int    number of times to run a trial
  # value: NULL   (invisible). Results are printed as side-effect

  sealKey()  # run sealKey() as the first command

  aa <- c("a", "c", "d", "e", "f", "g", "h", "i", "k", "l",
          "m", "n", "p", "q", "r", "s", "t", "v", "w", "y")
  aap <- paste0(aa[rep(1:20, each = 20)], aa[rep(1:20, 20)])  # all pairs

  pairs <- function(s) {
    # Return a table of amino acid pair frequencies from a vector s.
    # We make all pairs from s. Then we convert the observed pairs into factors
    # and use table(). Since we define the levels with aap, table() also reports
    # zero-counts. (cf. https://stackoverflow.com/questions/1617061/ )
    l <- length(s)
    myPairs <- paste0(s[1:(l-1)], s[2:l])
    return(table(factor(myPairs, levels = aap)))
  }
  if (FALSE) { # test the function
    myS <- rep(c("a","c"), each = 3) #"a" "a" "a" "c" "c" "c"
    x <- pairs(myS)
    length(x) # 400
    x[x != 0] # look at all non-zero counts
    # aa ac cc
    # 2  1  2
  }
  # process ...
  mySeq <- tolower(mySeq)

  tObs <- pairs(mySeq)

  tSim <- pairs("") # initialize with empty string: all zero counts
  for (i in 1:N) {
    tSim <- tSim + pairs(sample(mySeq)) # add paircounts of shuffled sequence
  }
  tSim <- tSim / N   # average

  # Report differences as log-ratio.
  # First, add dummy observations instead of zero-counts to avoid infinities.
  tObs <- tObs + 0.0001
  tSim <- tSim + 0.0001
  # Get log ratios
  tDiff <- log(tObs / tSim)
  # print results
  cat("\nTen highest enriched pairs (tObs > tSim):\n")
  ord <- order(tDiff, decreasing = TRUE)
  cat(" ")  # IDK why this is needed, but it doesn't align otherwise
  cat(sprintf("Pair: %s  log(q): %5.3f (fObs: %5.2f fSim: %5.2f)\n",
              names(tObs[ord[1:10]]),
              tDiff[ord[1:10]], tObs[ord[1:10]], tSim[ord[1:10]]))

  cat("\nTen most depleted pairs (tObs < tSim):\n")
  ord <- order(tDiff, decreasing = FALSE)
  cat(" ")
  cat(sprintf("Pair: %s  log(q): %5.3f (fObs: %5.2f fSim: %5.2f)\n",
              names(tObs[ord[1:10]]),
              tDiff[ord[1:10]], tObs[ord[1:10]], tSim[ord[1:10]]))

  sealKey()  # run sealKey() again, as the last command
  return(invisible(NULL))
}
How it was run
sel <- grep("MBP1", myDB$protein$name)                    # Mbp1 orthologues
mySeq <- paste0(myDB$protein$sequence[sel], collapse="-") # fetch and separate
mySeq <- unlist(strsplit(mySeq, ""))                      # turn into vector

myAApairs(mySeq)
What it produced
[1] "sealKey> 2020-10-12 05:10:18 (6ca9e8aa-37cc-877f-ef98-f60fafec7ec1)"

Ten highest enriched pairs (tObs > tSim):
 Pair: hw  log(q): 2.053 (fObs: 11.00 fSim:  1.41)
 Pair: wi  log(q): 1.568 (fObs: 13.00 fSim:  2.71)
 Pair: tw  log(q): 1.527 (fObs: 14.00 fSim:  3.04)
 Pair: hh  log(q): 1.215 (fObs: 19.00 fSim:  5.64)
 Pair: pp  log(q): 0.925 (fObs: 65.00 fSim: 25.77)
 Pair: wa  log(q): 0.921 (fObs: 13.00 fSim:  5.18)
 Pair: wv  log(q): 0.906 (fObs:  7.00 fSim:  2.83)
 Pair: dw  log(q): 0.889 (fObs:  9.00 fSim:  3.70)
 Pair: fh  log(q): 0.837 (fObs: 12.00 fSim:  5.19)
 Pair: yf  log(q): 0.750 (fObs: 11.00 fSim:  5.20)

Ten most depleted pairs (tObs < tSim):
 Pair: fy  log(q): -10.868 (fObs:  0.00 fSim:  5.25)
 Pair: aw  log(q): -10.864 (fObs:  0.00 fSim:  5.23)
 Pair: my  log(q): -10.854 (fObs:  0.00 fSim:  5.18)
 Pair: lw  log(q): -10.784 (fObs:  0.00 fSim:  4.82)
 Pair: ew  log(q): -10.616 (fObs:  0.00 fSim:  4.08)
 Pair: wd  log(q): -10.524 (fObs:  0.00 fSim:  3.72)
 Pair: wr  log(q): -10.498 (fObs:  0.00 fSim:  3.62)
 Pair: pw  log(q): -10.330 (fObs:  0.00 fSim:  3.07)
 Pair: ic  log(q): -10.323 (fObs:  0.00 fSim:  3.04)
 Pair: wq  log(q): -10.215 (fObs:  0.00 fSim:  2.73)
[1] "sealKey> 2020-10-12 05:10:22 (5151048f-1ba2-e074-3281-18a0e0a080e7)"



 

Notes

  1. This all is in response to an academic offence that occurred when a student could not get their code to run correctly and then confabulated the results.