Difference between revisions of "ABC-INT-Mutation impact"
(30 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
− | <div id=" | + | <div id="ABC"> |
− | + | <div style="padding:5px; border:4px solid #000000; background-color:#e19fa7; font-size:300%; font-weight:400; color: #000000; width:100%;"> | |
− | + | Integrator Unit: Mutation Impact | |
− | + | <div style="padding:5px; margin-top:20px; margin-bottom:10px; background-color:#e19fa7; font-size:30%; font-weight:200; color: #000000; "> | |
− | + | (Integrator unit: assess the impact of mutations in a gene) | |
− | + | </div> | |
− | |||
− | <div | ||
− | |||
− | |||
</div> | </div> | ||
− | {{ | + | {{Smallvspace}} |
− | + | <div style="padding:5px; border:1px solid #000000; background-color:#e19fa733; font-size:85%;"> | |
+ | <div style="font-size:118%;"> | ||
+ | <b>Abstract:</b><br /> | ||
+ | <section begin=abstract /> | ||
+ | This page integrates material from the learning units for R programming, working with sequences and the genetic code, and probability and significance, in a task for evaluation. | ||
+ | <section end=abstract /> | ||
+ | </div> | ||
+ | <!-- ============================ --> | ||
+ | <hr> | ||
+ | <b>Deliverables:</b><br /> | ||
+ | <section begin=deliverables /> | ||
+ | <li><b>Integrator unit</b>: Deliverables can be submitted for course marks. See below for details.</li> | ||
+ | <section end=deliverables /> | ||
+ | <!-- ============================ --> | ||
+ | <hr> | ||
+ | <section begin=prerequisites /> | ||
+ | <b>Prerequisites:</b><br /> | ||
+ | This unit builds on material covered in the following prerequisite units:<br /> | ||
+ | *[[FND-STA-Significance|FND-STA-Significance (Significance)]] | ||
+ | *[[RPR-Genetic_code_optimality|RPR-Genetic_code_optimality (Optimality of the Genetic Code: an R Exploration)]] | ||
+ | *[[RPR-Unit_testing|RPR-Unit_testing (Testing R code)]] | ||
+ | <section end=prerequisites /> | ||
+ | <!-- ============================ --> | ||
+ | </div> | ||
− | {{ | + | {{Smallvspace}} |
− | |||
− | {{ | + | {{Smallvspace}} |
− | + | __TOC__ | |
{{Vspace}} | {{Vspace}} | ||
− | == | + | === Evaluation === |
− | + | This "Integrator Unit" should be submitted for evaluation for a maximum of 13 marks if you choose one of the written deliverables, or 24 marks if you choose it for your oral test<ref>Note: the oral test is cumulative. It will focus on the content of this unit but will also cover other material that leads up to it.</ref>. |
− | + | :Please note the evaluation types that are available as options for this unit. | |
− | + | :Be mindful of the [[ABC-Rubrics| '''Marking rubrics''']]. | |
− | + | :If this is submitted for your oral test, please read the [[BCH441 Oral Test instructions|Oral test instructions]] before you begin. | |
− | + | :If your submission includes R code, please read the [[BCH441 Code submisson instructions|Code submission instructions]] before you begin. | |
− | |||
− | |||
− | + | Once you have chosen an option ... | |
+ | <ol> | ||
+ | <li>Create a new page on the student Wiki as a subpage of your User Page.</li> | ||
+ | <li>Put all of your writing to submit on this one page.</li> | ||
+ | <li>When you are done with everything, go to the [https://q.utoronto.ca/courses/180416/assignments Quercus '''Assignments''' page] and open the appropriate '''Integrator Unit''' assignment. Paste the URL of your Wiki page into the form, and click on '''Submit Assignment'''.</li> | ||
+ | </ol> | ||
− | + | Your link can be submitted only once and cannot be edited afterwards. You may still change your Wiki page at any time, but only the last version before the due date will be marked; all later edits will be silently ignored. |
− | |||
− | |||
− | |||
− | {{ | + | {{Smallvspace}} |
+ | ;Report option | ||
+ | * Work through the tasks described in the scenario below. | ||
+ | * Document your results in a short technical report on a subpage of your User page on the Student Wiki. Describe your methods (R-code!) in [[BCH441 Code submisson instructions|an appendix]] linked from your report. | ||
+ | * When you are done, submit the link to your page via Quercus as described above. | ||
<!-- | <!-- | ||
{{Smallvspace}} | {{Smallvspace}} | ||
;Interview option | ;Interview option | ||
− | : Identify a laboratory whose work | + | : Identify a laboratory whose work has recently included producing and interpreting a homology model. Get in touch with the PI, a postdoc or senior graduate student in the laboratory and interview them in person or by eMail. Find out |
− | * why this work is | + | * why this work is important; |
* how they approach it methodologically; | * how they approach it methodologically; | ||
− | * in particular, how they | + | * in particular, how they interpret the model and what the model tells them that a sequence alignment alone would not have; |
* what they have recently learned. | * what they have recently learned. | ||
* write up your interview on a subpage of your User page of the Student Wiki; | * write up your interview on a subpage of your User page of the Student Wiki; | ||
− | * add information that may be required to understand the | + | * add information that may be required to understand the context; |
− | * make sure that you | + | * make sure that you included important literature references. |
− | *When you are done with everything, add the following category tag to the page: | + | * If this is well done and interesting, parts of this may be used to augment the learning unit. Make sure your interviewee is aware of what the interview is for, and has given her or his consent. |
− | ::<code><nowiki>[[Category:EVAL-INT-Mutation_impact]]</nowiki></code> | + | * Make sure contact information for your interviewee is included on your submission page. |
− | + | * Add a CC-BY tag to your submission. | |
+ | * When you are done with everything, add the following category tag '''to the end of page''': | ||
+ | ::<code><nowiki>[[Category:EVAL-INT-Mutation_impact]]</nowiki></code>. | ||
+ | |||
+ | Once the page has been saved with this tag, it is considered "submitted". | ||
+ | '''Do not''' change your submission after this tag has been added. The page will be marked and the category tag will be removed by the instructor. | ||
--> | --> | ||
<!-- | <!-- | ||
Line 91: | Line 97: | ||
--> | --> | ||
{{Smallvspace}} | {{Smallvspace}} | ||
− | + | ;Oral test option | |
+ | * Work through the tasks described below. Remember to document your work in your journal, but there is no need to format this specially as a report. | ||
+ | * Part of your task will involve writing an R script; refer to the [[BCH441 Code submisson instructions|Code submission instructions]] and link to your page from your Journal. | ||
+ | * Note that the work must be completed [[BCH441 Oral Test instructions| '''before''' your actual test date.]] | ||
+ | {{Smallvspace}} | ||
− | + | <!-- | |
− | + | ;R code option | |
− | + | * Work through the tasks described in the scenario below and develop code as required. | |
− | + | * Put your code and other documentation on a subpage of your User page on the Student Wiki; | |
− | + | * When you are done, submit the link to your page via Quercus as described above. | |
+ | --> | ||
== Contents == | == Contents == | ||
− | |||
− | == | + | == Biological Context == |
− | Cancer is a genetic disease and one aspect that makes cancer hard to treat is that cancer cells | + | Cancer is a genetic disease, and one aspect that makes it hard to treat is that cancer cells adapt and evolve: as a result of selective pressure from the body's own defenses, they become progressively more aggressive and treatment-resistant. Since the cancer phenotype is ultimately based on genetic alterations, it is important to understand which genes contribute. Unfortunately this is not as simple as just sequencing a few cancers: one of the hallmarks of the disease is genome instability (this contributes to the accelerated evolution), and it is very difficult to distinguish causal mutations from incidental mutations, or '''driver genes''' from '''passenger genes'''. |
However, an analysis of the distribution of mutations may help. Passenger mutations are expected to be randomly distributed throughout the genome, whereas driver mutations are expected to have either a '''gain of function''' or '''loss of function''' effect. Gain of function mutations are expected to be very specific, targeting only a small number of amino acids in a defined region of the protein. We actually expect purifying selection '''against''' mutations elsewhere. Loss of function mutations are expected to include nonsense mutations and frameshifts; above all, they should be enriched in missense and nonsense mutations relative to silent mutations. | However, an analysis of the distribution of mutations may help. Passenger mutations are expected to be randomly distributed throughout the genome, whereas driver mutations are expected to have either a '''gain of function''' or '''loss of function''' effect. Gain of function mutations are expected to be very specific, targeting only a small number of amino acids in a defined region of the protein. We actually expect purifying selection '''against''' mutations elsewhere. Loss of function mutations are expected to include nonsense mutations and frameshifts; above all, they should be enriched in missense and nonsense mutations relative to silent mutations. | ||
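The idea of enrichment "relative to silent mutations" can be made concrete with a small calculation. The following is a minimal sketch only: the counts are invented for illustration (they are not IntOGen values), and the helper name <code>changeToSilent()</code> is hypothetical.

```r
# Hypothetical counts, for illustration only (NOT taken from IntOGen).
expected <- c(silent = 240, missense = 680, nonsense = 80)  # e.g. a random background
observed <- c(silent = 10,  missense = 95,  nonsense = 15)  # e.g. tumour samples

# Ratio of protein-changing (missense + nonsense) to silent mutations.
changeToSilent <- function(x) {
  unname((x["missense"] + x["nonsense"]) / x["silent"])
}

# A value well above 1 would suggest enrichment of protein-changing
# mutations over the random background.
enrichment <- changeToSilent(observed) / changeToSilent(expected)
print(enrichment)
```

A ratio near 1 would be compatible with passenger-like, random mutation; how far from 1 it has to be before it is meaningful is a question of significance that this sketch does not address.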
Line 129: | Line 127: | ||
{{Smallvspace}} | {{Smallvspace}} | ||
− | {{FullImage|Ras_Cycle.jpg|Sketch of the Ras activation cycle. (PRE) Ras is translated, and C-terminally farnesylated, palmitoylated and localized to the plasma membrane. (I) The GEF Sos is activated in its complex with active EGFR. It binds to Ras and removes GDP. (II) Apo-Ras is ready to bind a nucleotide. (III) Upon GTP binding, Ras acquires its active conformation. (IV) Ras binds its effectors such as Raf1. This switches the MAPK signalling cascade on and leads to cell proliferation. (V) Src phosphorylates Ras Y32. This reduces the affinity of Raf1 by ~1000-fold. (VI) GAPs can now displace the effector and stimulate Ras GTPase activity. GTP is | + | {{FullImage|Ras_Cycle.jpg|Sketch of the Ras activation cycle. (PRE) Ras is translated, and C-terminally farnesylated, palmitoylated and localized to the plasma membrane. (I) The GEF Sos is activated in its complex with active EGFR. It binds to Ras and removes GDP. (II) Apo-Ras is ready to bind a nucleotide. (III) Upon GTP binding, Ras acquires its active conformation. (IV) Ras binds its effectors such as Raf1. This switches the MAPK signalling cascade on and leads to cell proliferation. (V) Src phosphorylates Ras Y32. This reduces the affinity of Raf1 by ~1000-fold. (VI) GAPs can now displace the effector and stimulate Ras GTPase activity. GTP is hydrolyzed to GDP. (VII) With bound GDP, Ras acquires the inactive conformation. PTPN11 removes the Y32 phosphate and regenerates the effector binding site. The cycle can begin anew.}} |
{{Smallvspace}} | {{Smallvspace}} | ||
− | Nucleotide binding domains are among the oldest known protein families and one family in particular, the G-proteins has diverse roles in all domains of life. These are collectively called GTP hydrolases, or GTPases – a misnomer, since even though they do catalyze the hydrolysis of GTP to GDP, their role in the cell has nothing to do with GTP metabolism, but comes from a conformational change that accompanies binding to either GTP or GDP. As far as enzymes go, GTPases are rather slow. | + | Nucleotide binding domains are among the oldest known protein families and one family in particular, the G-proteins, has diverse roles in all domains of life. These are collectively called GTP hydrolases, or GTPases – a misnomer, since even though they do catalyze the hydrolysis of GTP to GDP, their role in the cell has nothing to do with GTP metabolism, but comes from a conformational change that accompanies binding to either GTP or GDP. As far as enzymes go, GTPases are rather slow. They are switches, not gears. |
− | A large family among these G-proteins are the Ras proteins | + | A large family among these G-proteins are the Ras proteins. In humans, there are three isoforms of Ras called HRas, KRas and NRas. These are differentially expressed in tissues and have slightly different C-termini through which they are localized to different membrane subdomains. When Ras binds GTP, it adopts a stable, active '''ON''' conformation through which it activates effector proteins. But then the Ras protein slowly hydrolyzes GTP to GDP, undergoes a conformational change and enters the '''OFF''' state. Once GDP dissociates from the binding site, Ras can re-bind GTP and is once again switched '''ON'''. This cycle is modified by interactors: GEF proteins (Guanine Nucleotide Exchange factors such as Sos) catalyze the dissociation of GDP and thus speed up the re-uptake of GTP and re-activation of Ras. Thus they shift the cycle to an active state. GAP proteins (GTPase activating proteins such as P120GAP) speed up the conversion of GTP to GDP. This shifts the cycle towards its inactive state. |
− | One of the most important pathways for cell proliferation is the EGFR pathway that feeds into the MAPK cascade. Under physiological conditions, the active EGFR activates the Sos protein, which shifts a pool of Ras molecules into their active state. Active Ras then | + | One of the most important pathways for cell proliferation is the EGFR pathway that feeds into the MAPK cascade. Under physiological conditions, the active EGFR activates the Sos protein, which shifts a pool of Ras molecules into their active state. Active Ras then switches its effectors on – among them Raf1 – which activates a signalling cascade that induces cell proliferation. This is limited by GAPs that speed up Ras GTPase activity, which turns Ras '''OFF''' again. Deactivation of Sos when the EGFR is inactive ensures that GDP remains bound and the Ras protein pool remains '''OFF'''. This matches our expectations about the roles of these proteins well. |
− | The problem is that this system can go terribly wrong if Ras gets mutated in a way that damages its catalytic activity and | + | The problem is that this system can go terribly wrong if Ras gets mutated in a way that damages its catalytic activity and prevents GTP hydrolysis. Activating GAPs no longer works to switch Ras '''OFF''', because if the Ras active site is dead, GAPs have no way of inducing it. And inhibiting GEFs does not switch Ras '''OFF''' either, because GTP does not get hydrolyzed to GDP and there is no need for GEFs to clear the active site of GDP. The switch is '''ON''' and stays '''ON'''. The EGFR pathway is on and stays on. The cell proliferates out of control. This can be the first step of transforming a cell into a cancer cell and this exact mutation in the KRas protein is the second-most frequent mutation seen in cancer genome studies (behind p53) and possibly the most powerful cancer driver mutation of all. The big issue about all this is that mutant Ras is generally considered "undruggable":<ref>{{#pmid:32723567}}</ref> we can't imagine small molecule drugs that would restore Ras' catalytic activity, and the affinity of GTP to the molecule is so high that we haven't found competitive antagonists that don't have dramatic side effects. An interesting new development therefore was the recent discovery that a phosphatase - PTPN11 - somehow works synergistically with Ras to facilitate its activation of effectors: inhibition of PTPN11 suppressed oncogenesis<ref>{{#pmid:26617336}}{{#pmid:30644389}}</ref>. If this is a pathophysiologically relevant effect, we expect cancer mutations to spare PTPN11, or even to deregulate it to enhance its activity. Do they? |
{{Vspace}} | {{Vspace}} | ||
Line 145: | Line 143: | ||
===Cancer gene data=== | ===Cancer gene data=== | ||
− | Knowledge about the mutations of cancer comes from large-scale genome sequencing efforts of cancer tissue samples, and is collected and curated by a small number of databases. These databases sift through the massive volumes of sequence changes, distinguish natural variation from novel somatic mutations, and map the | + | Knowledge about the mutations of cancer comes from large-scale genome sequencing efforts of cancer tissue samples, and is collected and curated by a small number of databases. These databases sift through the massive volumes of sequence changes, distinguish natural variation from novel somatic mutations, and map the nucleotide changes to individual genes. One of these resources is the [http://www.intogen.org/ '''IntOGen database'''] in Barcelona. |
{{Smallvspace}} | {{Smallvspace}} | ||
Line 152: | Line 150: | ||
* visit [http://www.intogen.org/ '''IntOGen''']. | * visit [http://www.intogen.org/ '''IntOGen''']. | ||
* find the KRas information page and briefly explore the information that is available. | * find the KRas information page and briefly explore the information that is available. | ||
+ | * then visit the information pages for three other genes: | ||
+ | ** Rab39B, a small GTPase, homologous to KRAS, but '''not''' associated with cancer; | ||
+ | ** PTPN11, the non-receptor type protein-phosphatase discussed above; | ||
+ | ** PTPN5, a homologous phosphatase that appears '''not''' to be associated with cancer. | ||
+ | * Here, KRas and PTPN11 are the proteins whose role in cancer is to be studied; Rab39B and PTPN5 serve as controls: they have about the same size and similar domain composition - but are not active in cancer-relevant pathways. | ||
+ | }} | ||
+ | |||
+ | {{Vspace}} | ||
+ | |||
+ | == (All Options) Loading Data == | ||
+ | |||
+ | {{Smallvspace}} | ||
+ | |||
+ | {{task|1= | ||
+ | All options initially work with the same data: | ||
+ | * Open the RStudio course project. | ||
+ | * Begin a new R script to explore KRas, Rab39B, PTPN11 and PTPN5 mutations. | ||
+ | * Start by collecting four FASTA sequences that I have provided in the <tt>data/</tt> directory into a data frame. Something like: | ||
+ | <pre> | ||
+ | myFA <- readFASTA("data/RAB39B_HSa_coding.fa") | ||
+ | myFA <- rbind(myFA, readFASTA("data/PTPN5_HSa_coding.fa")) | ||
+ | myFA <- rbind(myFA, readFASTA("data/PTPN11_HSa_coding.fa")) | ||
+ | myFA <- rbind(myFA, readFASTA("data/KRAS_HSa_coding.fa")) | ||
+ | </pre> | ||
+ | : ... should work fine. Give your data frame convenient, meaningful row names - do not refer to the rows simply by row index in your script. Gene names will work fine. | ||
}} | }} | ||
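Assigning row names could look something like the sketch below. It uses a stand-in data frame, because the exact structure returned by <tt>readFASTA()</tt> is not shown here: the column names <tt>head</tt> and <tt>seq</tt> and the placeholder contents are assumptions, so check <tt>names(myFA)</tt> on your real data first.

```r
# Stand-in for the data frame built with rbind() in the task above.
# Column names and contents are placeholders (assumptions), not real data.
myFA <- data.frame(
  head = c(">RAB39B", ">PTPN5", ">PTPN11", ">KRAS"),
  seq  = c("ATG", "ATG", "ATG", "ATG"),
  stringsAsFactors = FALSE
)

# Name the rows in the same order in which the files were read ...
rownames(myFA) <- c("RAB39B", "PTPN5", "PTPN11", "KRAS")

# ... then refer to a gene by name rather than by row index.
myFA["KRAS", "seq"]
```

Referring to rows by name keeps the script correct even if the order in which the files are read changes later.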
Line 157: | Line 180: | ||
===For the Report Option... === | ===For the Report Option... === | ||
+ | |||
+ | {{Smallvspace}} | ||
{{task|1= | {{task|1= | ||
− | + | * Write code that executes a loop <code>N</code> times (for <code>N <- 100000</code>), each time creating a random point mutation in a gene. Keep track of the number of missense, silent ("synonymous"), and nonsense ("truncating") mutations you find. To develop your code, use a smaller N. Then put your code '''into a function''' that takes as parameters a single string of nucleotides as input, and the number of trials for your simulation. Refer to the [[BCH441 Code submisson instructions|Code submission instructions]] regarding more detailed specifications and additional validation code that the run must produce. (There are several ways to handle string input for this purpose; you need to figure out what works best for you. You could consider producing a vector of nucleotides for which you keep track of the indices of the codons, a matrix with three columns in which every row is a codon, or you could leave the string intact and use <tt>substring()</tt> to work with it, or come up with a different solution. Also you could use <tt>Biostrings::</tt> functions or <tt>seqinr::</tt> for the translation. Or just use code that is similar to the learning units. Don't forget to reference your sources!) |
− | + | * Tests: validate your function as follows. | |
− | + | ** The sequence <tt>ATGATGATGATGATGATG</tt> has no silent mutations; | |
− | + | ** The sequence <tt>CCCCCCCCCCCCCCCCCC</tt> has no truncating mutations; | |
− | + | ** The sequence <tt>TATTACTATTACTATTAC</tt> has some truncating mutations, at about the same frequency as <tt>TGGTGGTGGTGGTGGTGGTGGTGG</tt>; both of those sequences have about twice as many truncating mutations as <tt>TGTTGTTGTTGTTGTTGTTGTTGT</tt>. (Explain what these tests demonstrate.) |
− | + | * Once your function is ready, run your simulation once for each of the four genes: KRas, Rab39B, PTPN5, and PTPN11. Paste the exact output of each of the four runs into your report. |
− | * Write code that executes a loop <code>N</code> times (for <code>N <- 100000</code>) to create a point mutation randomly in | + | * For each of the four genes, discuss the relative frequency of the mutations you have observed in each category and compare it to the frequency reported on the IntOGen Web site. |
− | * | + | * Explain whether you think there is an important difference between the expected categories of mutations (i.e. the stochastic background that you simulated), and categories of mutations that were observed in cancer genomes. |
− | * | + | * Write a short report that interprets your results against the context of cancer biology outlined above: what would you expect if any of these genes were cancer drivers, what do you observe, what can you conclude from your observation? <!-- Include a histogram of the expected distribution and the observed values. --> |
− | * Write a short report that interprets your results against the context outlined above: what would you expect if any of these genes were cancer drivers, what do you observe, what can you conclude from your observation? Include a histogram of the expected distribution and the observed values. | ||
}} | }} | ||
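One possible shape for such a simulation is sketched below, using the "vector of nucleotides" approach mentioned in the task. This is an illustration under stated assumptions, not the function specified in the Code submission instructions: the names <code>BASES</code>, <code>GENETIC_CODE</code> and <code>simulatePointMutations()</code> are hypothetical, the standard genetic code is built by hand rather than taken from a package, every substitution is treated as equally likely, and a change of a stop codon into an amino acid is simply counted as missense.

```r
# Standard genetic code (NCBI translation table 1), codons in TCAG order,
# stored as a named character vector: codon -> amino acid ("*" = stop).
BASES <- c("T", "C", "A", "G")
grid <- expand.grid(third = BASES, second = BASES, first = BASES,
                    stringsAsFactors = FALSE)
GENETIC_CODE <- strsplit(
  "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
  "")[[1]]
names(GENETIC_CODE) <- paste0(grid$first, grid$second, grid$third)

# Hypothetical helper: apply N independent random single-nucleotide
# substitutions to a coding sequence and count their effects.
# Each trial mutates the ORIGINAL sequence; mutations do not accumulate.
simulatePointMutations <- function(cds, N) {
  nuc <- strsplit(toupper(cds), "")[[1]]
  stopifnot(length(nuc) %% 3 == 0, all(nuc %in% BASES))
  counts <- c(silent = 0, missense = 0, nonsense = 0)
  for (i in seq_len(N)) {
    pos   <- sample.int(length(nuc), 1)      # random position in the gene
    start <- pos - ((pos - 1) %% 3)          # first base of its codon
    oldCodon <- nuc[start:(start + 2)]
    newCodon <- oldCodon
    # replace the base with one of the three other bases
    newCodon[pos - start + 1] <- sample(setdiff(BASES, nuc[pos]), 1)
    oldAA <- GENETIC_CODE[[paste(oldCodon, collapse = "")]]
    newAA <- GENETIC_CODE[[paste(newCodon, collapse = "")]]
    if (newAA == oldAA) {
      type <- "silent"
    } else if (newAA == "*") {
      type <- "nonsense"
    } else {
      type <- "missense"
    }
    counts[type] <- counts[type] + 1
  }
  counts
}

set.seed(112358)  # arbitrary seed, only to make the example reproducible
simulatePointMutations("ATGATGATGATGATGATG", 1000)
simulatePointMutations("CCCCCCCCCCCCCCCCCC", 1000)
```

With these inputs the first call should report no silent mutations and the second no nonsense mutations, in line with the validation tests listed in the task. Dividing the returned counts by <code>N</code> gives the relative frequencies to compare against the observed values.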
{{Vspace}} | {{Vspace}} | ||
− | ===For the Oral | + | ===For the Oral Test Option... === |
{{task|1= | {{task|1= | ||
* Open the RStudio course project. | * Open the RStudio course project. | ||
− | + | ||
− | * | + | * Proceed as described for the short-report option. However there is no need to write a formal report, just document your activities and results in your Journal. Add a brief conclusion / interpretation. |
}} | }} | ||
{{Vspace}} | {{Vspace}} | ||
+ | |||
+ | <!-- R Code option needs to be re-thought ... not significantly different | ||
===For the R Code Option... === | ===For the R Code Option... === | ||
{{task|1= | {{task|1= | ||
* Open the RStudio course project. | * Open the RStudio course project. | ||
− | * In a new R script, develop a function that explores mutation effects, given | + | * In a new R script, develop a function that explores mutation effects, given cDNA and mutation data. You will find the following three cDNA files in the course project's <code>./data</code> directory, Use them to develop your function: |
./data/KRAS_HSa_coding.fa | ./data/KRAS_HSa_coding.fa | ||
./data/PTPN11_HSa_coding.fa | ./data/PTPN11_HSa_coding.fa | ||
Line 200: | Line 219: | ||
* Here is a header that specifies the function, its parameters and its value. | * Here is a header that specifies the function, its parameters and its value. | ||
− | < | + | <pre> |
evalMut <- function(FA, N) { | evalMut <- function(FA, N) { | ||
# Purpose: evaluate the distribution of silent, missense and nonsense | # Purpose: evaluate the distribution of silent, missense and nonsense | ||
− | # codon changes in | + | # codon changes in cDNA read from FA for N random mutation trials. |
# Parameters: | # Parameters: | ||
− | # FA chr Filename of a FASTA formatted sequence file of | + | # FA chr Filename of a FASTA formatted sequence file of cDNA |
# beginning with a start codon. | # beginning with a start codon. | ||
# N integer The number of point mutation trials to perform | # N integer The number of point mutation trials to perform | ||
Line 211: | Line 230: | ||
# FA chr the input file | # FA chr the input file | ||
# N num same as the input parameter | # N num same as the input parameter | ||
− | # nSilent num the | + | # nSilent num the number of silent mutations |
− | # nMissense num the | + | # nMissense num the number of missense mutations |
− | # nNonsense num the | + | # nNonsense num the number of nonsense mutations |
} | } | ||
− | </ | + | </pre> |
* The IntOGen Website lists the counts and frequencies of silent, missense, and nonsense mutations, but that includes point mutations, splice-site mutations, insertions and deletions. However your method above only simulates the frequency of point mutations; thus, for a correct comparison of observation and expectation we need to distinguish. IntOGen provides data downloads that list the exact mutation and categorize it. Write a second function that reads an IntOGen mutation-distribution file and returns counts for the three categories of point mutations that you are simulating. You will find three files in the course project's <code>./data</code> directory that you can use to develop your function: | * The IntOGen Website lists the counts and frequencies of silent, missense, and nonsense mutations, but that includes point mutations, splice-site mutations, insertions and deletions. However your method above only simulates the frequency of point mutations; thus, for a correct comparison of observation and expectation we need to distinguish. IntOGen provides data downloads that list the exact mutation and categorize it. Write a second function that reads an IntOGen mutation-distribution file and returns counts for the three categories of point mutations that you are simulating. You will find three files in the course project's <code>./data</code> directory that you can use to develop your function: | ||
Line 224: | Line 243: | ||
* Here is a header that specifies the function, its parameters and its value. | * Here is a header that specifies the function, its parameters and its value. | ||
− | < | + | <pre> |
readIntOGen <- function(IN) { | readIntOGen <- function(IN) { | ||
# Purpose: read and parse an IntOGen mutation data file. Return only the | # Purpose: read and parse an IntOGen mutation data file. Return only the | ||
Line 232: | Line 251: | ||
# IN chr Filename of an IntOGen mutation data file. | # IN chr Filename of an IntOGen mutation data file. | ||
# Value: list List with the following elements: | # Value: list List with the following elements: | ||
− | # nSilent num the | + | # nSilent num the number of silent mutations |
− | # nMissense num the | + | # nMissense num the number of missense mutations |
− | # nNonsense num the | + | # nNonsense num the number of nonsense mutations |
} | } | ||
− | </ | + | </pre> |
* You may find the function <code>read.delim()</code>, or <code>read_tsv()</code> from the <code>readr</code> package useful. | * You may find the function <code>read.delim()</code>, or <code>read_tsv()</code> from the <code>readr</code> package useful. | ||
Line 245: | Line 264: | ||
* Write tests for your function. Place them in a protected block of code that will not get executed when the file gets sourced, like so: | * Write tests for your function. Place them in a protected block of code that will not get executed when the file gets sourced, like so: | ||
− | < | + | <pre> |
if (FALSE) { | if (FALSE) { | ||
# Code that won't get executed goes here... | # Code that won't get executed goes here... | ||
# ... but it's easy to manually step through the script and execute it. | # ... but it's easy to manually step through the script and execute it. | ||
} | } | ||
− | </ | + | </pre> |
− | * Write a brief script that simulates 10000 point mutations of PTPN11 and compares the relative frequencies with the values reported in the distribution-data file. | + | * Write a brief script that simulates 10000 point mutations of PTPN11 and compares the relative frequencies with the values reported in the distribution-data file. Describe whether you think there is an important difference between the expected categories of mutations (i.e. the stochastic background that you simulated), and categories of mutations that were observed in cancer genomes. Place this script too in a protected block of code that will not get executed. |
− | + | * Note: with the following FASTA file saved as <tt>GCsample.fa</tt>... | |
+ | >GCsample | ||
+ | ATGAAAAACAAGAATACAACCACGACTAGAAGCAGGAGTATAATCATTCAACACCAGCATCCACCCCCGCCTCGACGCCG | ||
+ | GCGTCTACTCCTGCTTGAAGACGAGGATGCAGCCGCGGCTGGAGGCGGGGGTGTAGTCGTGGTTTAATACTAGTATTCAT | ||
+ | CCTCGTCTTGATGCTGGTGTTTATTCTTGTTT | ||
+ | ... my implementation returns: | ||
+ | <pre> | ||
+ | > evalMut("GCsample.fa", 100000) | ||
+ | $FA | ||
+ | [1] "GCsample.fa" | ||
− | + | $N | |
+ | [1] 1e+05 | ||
+ | $nSilent | ||
+ | [1] 24114 | ||
− | + | $nMissense | |
+ | [1] 67878 | ||
+ | $nNonsense | ||
+ | [1] 8008 | ||
+ | </pre> | ||
− | + | NB... | |
− | < | + | We could do chisq.test() rather than distributions, but we have not properly |
− | + | introduced hypothesis testing with statistical tests... | |
− | + | myDat <- c(23, 82, 7) # silent, missense, frame+stop ... from IntOGen | |
+ | myDat <- rbind(myDat, c(24114, 67878, 8008)) | ||
− | + | (x <- chisq.test(myDat)) | |
+ | x$observed | ||
+ | x$expected | ||
+ | }} | ||
+ | --> | ||
== Notes == | == Notes == | ||
− | |||
− | |||
<references /> | <references /> | ||
{{Vspace}} | {{Vspace}} | ||
<div class="about"> | <div class="about"> | ||
Line 326: | Line 324: | ||
:2017-08-05 | :2017-08-05 | ||
<b>Modified:</b><br /> | <b>Modified:</b><br /> | ||
− | : | + | :2020-11-11 |
<b>Version:</b><br /> | <b>Version:</b><br /> | ||
− | :1. | + | :1.5.2 |
<b>Version history:</b><br /> | <b>Version history:</b><br /> | ||
+ | *1.5.2 Removed erroneous reference to old "R-Code option" with partial (and wrong) instructions. | ||
+ | *1.5.1 Fixed error in control sequences | ||
+ | *1.5 Edit policy update | ||
+ | *1.4 2020 rewrite. Different input modes, different genes, more structured requirements, validation of simulation code. | ||
+ | *1.3 Remove "significance" requirements since we didn't simulate distributions and we never introduced chisq.test() | ||
+ | *1.2 Corrected posted marks, which were not consistent with the description in the syllabus. | ||
+ | *1.1 Added sample output | ||
*1.0 New unit | *1.0 New unit | ||
*0.1 First stub | *0.1 First stub | ||
</div> | </div> | ||
− | |||
− | |||
{{CC-BY}} | {{CC-BY}} | ||
+ | [[Category:ABC-units]] | ||
+ | {{INTEGRATOR}} | ||
+ | {{LIVE}} | ||
+ | {{EVAL}} | ||
</div> | </div> | ||
<!-- [END] --> | <!-- [END] --> |
Latest revision as of 03:23, 11 November 2020
Integrator Unit: Mutation Impact
(Integrator unit: assess the impact of mutations in a gene)
Abstract:
This page integrates material from the learning units for R programming, working with sequences and the genetic code, and probability and significance, in a task for evaluation.
Deliverables:
Prerequisites:
This unit builds on material covered in the following prerequisite units:
Contents
Evaluation
This "Integrator Unit" should be submitted for evaluation for a maximum of 13 marks if one of the written deliverables is chosen, resp. 24 marks if you choose this for your oral test[1].
- Please note the evaluation types that are available as options for this unit.
- Be mindful of the Marking rubrics.
- If this is submitted for your oral test, please read the Oral test instructions before you begin.
- If your submission includes R code, please read the Code submission instructions before you begin.
Once you have chosen an option ...
- Create a new page on the student Wiki as a subpage of your User Page.
- Put all of your writing to submit on this one page.
- When you are done with everything, go to the Quercus Assignments page and open the appropriate Integrator Unit assignment. Paste the URL of your Wiki page into the form, and click on Submit Assignment.
Your link can be submitted only once and cannot be edited afterwards, but you may change your Wiki page at any time. However, only the last version before the due date will be marked; all later edits will be silently ignored.
- Report option
- Work through the tasks described in the scenario below.
- Document your results in a short technical report on a subpage of your User page on the Student Wiki. Describe your methods (R-code!) in an appendix linked from your report.
- When you are done, submit the link to your page via Quercus as described above.
- Oral test option
- Work through the tasks described below. Remember to document your work in your journal, but there is no need to format this specially as a report.
- Part of your task will involve writing an R script; refer to the Code submission instructions and link to your page from your Journal.
- Note that the work must be completed before your actual test date.
Biological Context
Cancer is a genetic disease. One aspect that makes cancer hard to treat is that cancer cells adapt and evolve: as a result of selective pressure from the body's own defenses, they become progressively more aggressive and treatment-resistant. Since the cancer phenotype is ultimately based on genetic alterations, it is important to understand which genes contribute. Unfortunately this is not as simple as just sequencing a few cancers: one of the hallmarks of the disease is genome instability (this contributes to the accelerated evolution), and it is very difficult to distinguish causal mutations from incidental mutations, or driver genes from passenger genes.
However, an analysis of the distribution of mutations may help. Passenger mutations are expected to be randomly distributed throughout the genome, whereas driver mutations are expected to have either a gain of function or a loss of function effect. Gain of function mutations are expected to be very specific, targeting only a small number of amino acids in a defined region of the protein; we actually expect purifying selection against mutations elsewhere. Loss of function mutations are expected to include nonsense mutations and frameshifts, but above all, they should be enriched in missense and nonsense mutations relative to silent mutations.
The task of this unit is to analyze the relative frequencies of silent, missense and nonsense mutations in a gene, and to contrast them with the frequencies one would expect if the distribution of mutations were purely due to chance. This analysis should work on an actual sequence and consider actually observed mutations. We will develop it to evaluate mutations of the KRas gene, a known cancer driver; an olfactory receptor (OR1A1), most likely not involved in cancer; and the PTPN11 phosphatase, a gene of interest whose role in cancer we would like to understand better.
KRas and cancer
Sketch of the Ras activation cycle. (PRE) Ras is translated, C-terminally farnesylated and palmitoylated, and localized to the plasma membrane. (I) The GEF Sos is activated in its complex with active EGFR. It binds to Ras and removes GDP. (II) Apo-Ras is ready to bind a nucleotide. (III) Upon GTP binding, Ras acquires its active conformation. (IV) Ras binds its effectors such as Raf1. This switches the MAPK signalling cascade on and leads to cell proliferation. (V) Src phosphorylates Ras Y32. This reduces the affinity for Raf1 by ~1000-fold. (VI) GAPs can now displace the effector and stimulate Ras GTPase activity. GTP is hydrolyzed to GDP. (VII) With bound GDP, Ras acquires the inactive conformation. PTPN11 removes the Y32 phosphate and regenerates the effector binding site. The cycle can begin anew.
Nucleotide binding domains are among the oldest known protein families and one family in particular, the G-proteins, has diverse roles in all domains of life. These are collectively called GTP hydrolases, or GTPases – a misnomer, since even though they do catalyze the hydrolysis of GTP to GDP, their role in the cell has nothing to do with GTP metabolism, but comes from a conformational change that accompanies binding to either GTP or GDP. As far as enzymes go, GTPases are rather slow. They are switches, not gears.
A large family among these G-proteins are the Ras proteins. In humans, there are three isoforms of Ras called HRas, KRas and NRas. These are differentially expressed in tissues and have slightly different C-termini through which they are localized to different membrane subdomains. When Ras binds GTP, it adopts a stable, active ON conformation through which it activates effector proteins. But the Ras protein then slowly hydrolyses GTP to GDP, undergoes a conformational change, and enters the OFF state. Then GDP dissociates from the binding site, Ras can re-bind GTP and is once again switched ON. This cycle is modified by interactors: GEF proteins (Guanine Nucleotide Exchange factors such as Sos) catalyze the dissociation of GDP and thus speed up the re-uptake of GTP and re-activation of Ras. Thus they shift the cycle towards its active state. GAP proteins (GTPase activating proteins such as P120GAP) speed up the conversion of GTP to GDP. This shifts the cycle towards its inactive state.
One of the most important pathways for cell proliferation is the EGFR pathway that feeds into the MAPK cascade. Under physiological conditions, the active EGFR activates the Sos protein, which shifts a pool of Ras molecules into their active state. Active Ras then switches its effectors on – among them Raf1 – which activates a signalling cascade that induces cell proliferation. This is limited by GAPs that speed up Ras GTPase activity, which turns Ras OFF again. Deactivation of Sos when the EGFR is inactive ensures that GDP remains bound and the Ras protein pool remains OFF. This matches our expectations about the roles of these proteins well.
The problem is that this system can go terribly wrong if Ras gets mutated in a way that damages its catalytic activity and prevents GTP hydrolysis. Activating GAPs no longer works to switch Ras OFF, because if the Ras active site is dead, GAPs have no way of stimulating it. And inhibiting GEFs does not switch Ras OFF either, because GTP does not get hydrolyzed to GDP and there is no need for GEFs to clear the active site of GDP. The switch is ON and stays ON. The EGFR pathway is on and stays on. The cell proliferates out of control. This can be the first step of transforming a cell into a cancer cell, and this exact mutation in the KRas protein is the second-most frequent mutation seen in cancer genome studies (behind p53) and possibly the most powerful cancer driver mutation of all. The big issue about all this is that mutant Ras is generally considered "undruggable":[2] we can't imagine small molecule drugs that would restore Ras' catalytic activity, and the affinity of GTP to the molecule is so high that we haven't found competitive antagonists that don't have dramatic side effects. An interesting new development therefore was the recent discovery that a phosphatase - PTPN11 - somehow works synergistically with Ras to facilitate its activation of effectors: inhibition of PTPN11 suppressed oncogenesis[3]. If this is a pathophysiologically relevant effect, we expect cancer mutations to spare PTPN11, or even to deregulate it to enhance its activity. Do they?
Cancer gene data
Knowledge about the mutations of cancer comes from large-scale genome sequencing efforts of cancer tissue samples, and is collected and curated by a small number of databases. These databases sift through the massive volumes of sequence changes, distinguish natural variation from novel somatic mutations, and map the nucleotide changes to individual genes. One of these resources is the IntOGen database in Barcelona.
Task:
- visit IntOGen.
- find the KRas information page and briefly explore the information that is available.
- then visit the information pages for three other genes:
- Rab39B, a small GTPase, homologous to KRAS, but not associated with cancer;
- PTPN11, the non-receptor type protein-phosphatase discussed above;
- PTPN5, a homologous phosphatase that appears not to be associated with cancer.
- Here, KRas and PTPN11 are the proteins whose role in cancer is to be studied; Rab39B and PTPN5 serve as controls: they have about the same size and similar domain composition, but are not active in cancer-relevant pathways.
(All Options) Loading Data
Task:
All options initially work with the same data:
- Open the RStudio course project.
- Begin a new R script to explore KRas, Rab39B, PTPN11 and PTPN5 mutations.
- Start by collecting four FASTA sequences that I have provided in the data/ directory into a data frame. Something like:
myFA <- readFASTA("data/RAB39B_HSa_coding.fa")
myFA <- rbind(myFA, readFASTA("data/PTPN5_HSa_coding.fa"))
myFA <- rbind(myFA, readFASTA("data/PTPN11_HSa_coding.fa"))
myFA <- rbind(myFA, readFASTA("data/KRAS_HSa_coding.fa"))
- ... should work fine. Give your data frame convenient, meaningful row names - do not refer to the rows simply by row index in your script. Gene names will work fine.
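As a minimal sketch of the row-naming step: the data frame below is a toy stand-in for the result of the rbind() calls, and its column names (head, seq) as well as its headers and sequences are placeholders, not the course data. Check what readFASTA() actually returns in your project before adapting this.

```r
# Toy stand-in for the data frame built from the four FASTA files.
# ASSUMPTION: one row per gene, with columns `head` and `seq`; the
# headers and sequences below are placeholders, not real gene data.
myFA <- data.frame(
  head = c("RAB39B (toy header)", "PTPN5 (toy header)",
           "PTPN11 (toy header)", "KRAS (toy header)"),
  seq  = c("ATGGAGGCC", "ATGAATTAT", "ATGACATCG", "ATGACTGAA"),
  stringsAsFactors = FALSE
)

# Name the rows in the same order in which the files were read.
rownames(myFA) <- c("RAB39B", "PTPN5", "PTPN11", "KRAS")

# Rows can now be addressed by gene name rather than by index.
myFA["KRAS", "seq"]
nchar(myFA["PTPN11", "seq"])
```

Addressing rows by name keeps the script correct even if the order in which the files are read changes later.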
For the Report Option...
Task:
- Write code that executes a loop N times (for N <- 100000) to create a point mutation randomly in a gene. Keep track of the number of missense, silent ("synonymous"), and nonsense ("truncating") mutations you find. To develop your code, use a smaller N. Then put your code into a function that takes a single string of nucleotides and the number of trials for your simulation as parameters. Refer to the Code submission instructions regarding more detailed specifications and additional validation code that the run must produce. (There are several possibilities to handle string input for this purpose; you need to figure out what works best for you. You could produce a vector of nucleotides for which you keep track of the indices of the codons, or a matrix with three columns in which every row is a codon, or you could leave the string intact and use substring() to work with it, or come up with a different solution. You could also use Biostrings:: or seqinr:: functions for the translation, or just use code that is similar to the learning units. Don't forget to reference your sources!)
- Tests: validate your function as follows.
- The sequence ATGATGATGATGATGATG has no silent mutations;
- The sequence CCCCCCCCCCCCCCCCCC has no truncating mutations;
- The sequence TATTACTATTACTATTAC has some truncations, about the same frequency as TGGTGGTGGTGGTGGTGGTGGTGG; both those sequences have about twice as many truncating mutations as TGTTGTTGTTGTTGTTGTTGTTGT. (Explain what these tests demonstrate.)
- Once your function is ready, run your simulation once for each of the four genes: KRas, Rab39B, PTPN5, and PTPN11. Paste the exact output of each of the four runs into your report.
- For each of the four genes, discuss the relative frequency of the mutations you have observed in each category and compare it to the frequency reported on the IntOGen Web site.
- Explain whether you think there is an important difference between the expected categories of mutations (i.e. the stochastic background that you simulated), and categories of mutations that were observed in cancer genomes.
- Write a short report that interprets your results against the context of cancer biology outlined above: what would you expect if any of these genes were cancer drivers, what do you observe, what can you conclude from your observation?
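One possible shape for such a simulation, together with an exact cross-check, is sketched here in base R. All function and variable names (genCode, classifyMutation, toNucleotides, simulateMutations, exactFractions) are hypothetical, and the mutation model is an assumption: every position is equally likely to be hit and each of the three alternative nucleotides is equally likely. It does not produce the additional validation output that the Code submission instructions require.

```r
# Standard genetic code; codons ordered T, C, A, G at each position
# (first base changes slowest, third base fastest).
nuc  <- c("T", "C", "A", "G")
grid <- expand.grid(b3 = nuc, b2 = nuc, b1 = nuc, stringsAsFactors = FALSE)
aa   <- "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
genCode <- setNames(strsplit(aa, "")[[1]], paste0(grid$b1, grid$b2, grid$b3))

# Classify one substitution: replace the nucleotide at index `pos` of the
# character vector `nt` with `alt`, then compare old and new codon.
classifyMutation <- function(nt, pos, alt) {
  first <- pos - ((pos - 1) %% 3)        # index of the codon's first base
  oldCodon <- nt[first:(first + 2)]
  newCodon <- oldCodon
  newCodon[pos - first + 1] <- alt
  aaOld <- genCode[[paste(oldCodon, collapse = "")]]
  aaNew <- genCode[[paste(newCodon, collapse = "")]]
  if (aaNew == aaOld) {
    "silent"
  } else if (aaNew == "*") {
    "nonsense"
  } else {
    "missense"   # simplification: also covers loss of a stop codon
  }
}

# Split an in-frame coding sequence into single nucleotides and check it.
toNucleotides <- function(nucSeq) {
  nt <- strsplit(toupper(nucSeq), "")[[1]]
  stopifnot(length(nt) > 0, length(nt) %% 3 == 0, all(nt %in% nuc))
  nt
}

# Stochastic background: N random point mutations, counted by category.
simulateMutations <- function(nucSeq, N = 1000) {
  nt <- toNucleotides(nucSeq)
  counts <- c(silent = 0, missense = 0, nonsense = 0)
  for (i in seq_len(N)) {
    pos <- sample.int(length(nt), 1)              # uniform over positions
    alt <- sample(setdiff(nuc, nt[pos]), 1)       # uniform over alternatives
    category <- classifyMutation(nt, pos, alt)
    counts[category] <- counts[category] + 1
  }
  counts
}

# Exact reference: enumerate all 3 * length possible substitutions.
exactFractions <- function(nucSeq) {
  nt <- toNucleotides(nucSeq)
  counts <- c(silent = 0, missense = 0, nonsense = 0)
  for (pos in seq_along(nt)) {
    for (alt in setdiff(nuc, nt[pos])) {
      category <- classifyMutation(nt, pos, alt)
      counts[category] <- counts[category] + 1
    }
  }
  counts / sum(counts)
}

set.seed(112358)
simulateMutations("TGGTGGTGGTGGTGGTGGTGGTGG", N = 10000) / 10000
exactFractions("TGGTGGTGGTGGTGGTGGTGGTGG")
exactFractions("TGTTGTTGTTGTTGTTGTTGTTGT")
```

Under this uniform model the simulated fractions should approach the enumerated ones as N grows: two of the nine possible substitutions in a TGG codon create a stop codon, but only one of nine in a TGT codon. Real mutation spectra are not uniform, so the model is itself a point to address in the interpretation.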
For the Oral Test Option...
Task:
- Open the RStudio course project.
- Proceed as described for the short-report option. However, there is no need to write a formal report; just document your activities and results in your Journal. Add a brief conclusion / interpretation.
Notes
- ↑ Note: the oral test is cumulative. It will focus on the content of this unit but will also cover other material that leads up to it.
- ↑
Rhett et al. (2020) Biology, pathology, and therapeutic targeting of RAS. Adv Cancer Res 148:69-146. (pmid: 32723567) [ PubMed ] [ DOI ] RAS was identified as a human oncogene in the early 1980s and subsequently found to be mutated in nearly 30% of all human cancers. More importantly, RAS plays a central role in driving tumor development and maintenance. Despite decades of effort, there remain no FDA approved drugs that directly inhibit RAS. The prevalence of RAS mutations in cancer and the lack of effective anti-RAS therapies stem from RAS' core role in growth factor signaling, unique structural features, and biochemistry. However, recent advances have brought promising new drugs to clinical trials and shone a ray of hope in the field. Here, we will exposit the details of RAS biology that illustrate its key role in cell signaling and shed light on the difficulties in therapeutically targeting RAS. Furthermore, past and current efforts to develop RAS inhibitors will be discussed in depth.
- ↑
Bunda et al. (2015) Inhibition of SHP2-mediated dephosphorylation of Ras suppresses oncogenesis. Nat Commun 6:8859. (pmid: 26617336) [ PubMed ] [ DOI ] Ras is phosphorylated on a conserved tyrosine at position 32 within the switch I region via Src kinase. This phosphorylation inhibits the binding of effector Raf while promoting the engagement of GTPase-activating protein (GAP) and GTP hydrolysis. Here we identify SHP2 as the ubiquitously expressed tyrosine phosphatase that preferentially binds to and dephosphorylates Ras to increase its association with Raf and activate downstream proliferative Ras/ERK/MAPK signalling. In comparison to normal astrocytes, SHP2 activity is elevated in astrocytes isolated from glioblastoma multiforme (GBM)-prone H-Ras(12V) knock-in mice as well as in glioma cell lines and patient-derived GBM specimens exhibiting hyperactive Ras. Pharmacologic inhibition of SHP2 activity attenuates cell proliferation, soft-agar colony formation and orthotopic GBM growth in NOD/SCID mice and decelerates the progression of low-grade astrocytoma to GBM in a spontaneous transgenic glioma mouse model. These results identify SHP2 as a direct activator of Ras and a potential therapeutic target for cancers driven by a previously 'undruggable' oncogenic or hyperactive Ras.
Kano et al. (2019) Tyrosyl phosphorylation of KRAS stalls GTPase cycle via alteration of switch I and II conformation. Nat Commun 10:224. (pmid: 30644389) [ PubMed ] [ DOI ] Deregulation of the RAS GTPase cycle due to mutations in the three RAS genes is commonly associated with cancer development. Protein tyrosine phosphatase SHP2 promotes RAF-to-MAPK signaling pathway and is an essential factor in RAS-driven oncogenesis. Despite the emergence of SHP2 inhibitors for the treatment of cancers harbouring mutant KRAS, the mechanism underlying SHP2 activation of KRAS signaling remains unclear. Here we report tyrosyl-phosphorylation of endogenous RAS and demonstrate that KRAS phosphorylation via Src on Tyr32 and Tyr64 alters the conformation of switch I and II regions, which stalls multiple steps of the GTPase cycle and impairs binding to effectors. In contrast, SHP2 dephosphorylates KRAS, a process that is required to maintain dynamic canonical KRAS GTPase cycle. Notably, Src- and SHP2-mediated regulation of KRAS activity extends to oncogenic KRAS and the inhibition of SHP2 disrupts the phosphorylation cycle, shifting the equilibrium of the GTPase cycle towards the stalled 'dark state'.
About ...
Author:
- Boris Steipe <boris.steipe@utoronto.ca>
Created:
- 2017-08-05
Modified:
- 2020-11-11
Version:
- 1.5.2
Version history:
- 1.5.2 Removed erroneous reference to old "R-Code option" with partial (and wrong) instructions.
- 1.5.1 Fixed error in control sequences
- 1.5 Edit policy update
- 1.4 2020 rewrite. Different input modes, different genes, more structured requirements, validation of simulation code.
- 1.3 Remove "significance" requirements since we didn't simulate distributions and we never introduced chisq.test()
- 1.2 Corrected posted marks, which were not consistent with the description in the syllabus.
- 1.1 Added sample output
- 1.0 New unit
- 0.1 First stub
This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.