ABC-INT-Mutation impact

From "A B C"
Revision as of 23:43, 26 January 2018 by Boris (talk | contribs)

Integrator Unit: Mutation Impact

(Integrator unit: assess the impact of mutations in a gene)


 


Abstract:

This page integrates material from the learning units for R programming, working with sequences and the genetic code, and probability and significance, in a task for evaluation.


Deliverables:

  • Integrator unit: Deliverables can be submitted for course marks. See below for details.

Prerequisites:
This unit builds on material covered in the following prerequisite units:


 



 



 


Evaluation

This "Integrator Unit" should be submitted for evaluation for a maximum of 8 marks if one of the written deliverables is chosen, or 16 marks if the oral exam is chosen[1].

Please note the evaluation types that are available as options for this unit. Choose one evaluation type that you have not chosen for another Integrator Unit. (Each submitted Integrator Unit must be evaluated in a different way and one of your evaluations - but not your first one - must be an oral exam).
 
Report option
  • Work through the tasks described in the scenario.
  • Document your results in a short report on a subpage of your User page on the Student Wiki. Describe your methods (R code!) in an appendix.
  • When you are done with everything, add the following category tag to the page:
[[Category:EVAL-INT-Mutation_impact]]
Do not change your submission page after this tag has been added. The page will be marked and the category tag will be removed by the instructor.
 
Oral exam option
  • Work through the tasks described in the scenario. Remember to document your work in your journal.
  • Part of your task will involve writing an R script. Place that code in a subpage of your User page on the Student Wiki and link to it from your Journal. (Do not add an evaluation category tag to that code.)
  • Your work must be complete before 21:00 on the day before your exam.
  • Schedule an oral exam by editing the signup page on the Student Wiki. Enter the unit that you are signing up for, and your name. You must have signed up for an exam slot before 21:00 on the day before your exam.
 
R code option
  • Work through the tasks described in the scenario and develop code as required.
  • Put your code on a subpage of your User page on the Student Wiki.
  • When you are done with everything, add the following category tag to the page:
[[Category:EVAL-INT-Mutation_impact]]
Do not change your submission page after this tag has been added. The page will be marked and the category tag will be removed by the instructor.


Scenario background

Cancer is a genetic disease, and one aspect that makes it hard to treat is that cancer cells progress through their own micro-evolution and become progressively more aggressive and treatment-resistant. But since the cancer phenotype is ultimately based on genetic alterations, it is important to understand which genes contribute. Unfortunately this is not as simple as just sequencing a few cancers: one of the hallmarks of the disease is genome instability (which contributes to the accelerated evolution), and it is very difficult to distinguish causal mutations from incidental mutations, that is, driver genes from passenger genes.

However, an analysis of the distribution of mutations may help. Passenger mutations are expected to be randomly distributed throughout the genome, whereas driver mutations are expected to have either a gain-of-function or a loss-of-function effect. Gain-of-function mutations are expected to be very specific, targeting only a small number of amino acids in a defined region of the protein; elsewhere in such a gene we actually expect purifying selection against mutations. Loss-of-function mutations are expected to include nonsense mutations and frameshifts; above all, they should be enriched in missense and nonsense mutations relative to silent mutations.

The task of this unit is to analyze the relative frequencies of silent, missense and nonsense mutations in a gene, and to contrast them with the frequencies one would expect if the distribution of mutations were purely due to chance. This analysis should work on an actual sequence, and consider actually observed mutations. We will develop it to evaluate mutations of three genes: KRas, a known cancer driver; an olfactory receptor (OR1A1), most likely not involved in cancer; and the PTPN11 phosphatase, a gene of interest whose role in cancer we would like to understand better.


 

KRas and cancer

 
Figure (Ras Cycle.jpg): Sketch of the Ras activation cycle. (PRE) Ras is translated, C-terminally farnesylated and palmitoylated, and localized to the plasma membrane. (I) The GEF Sos is activated in its complex with active EGFR. It binds to Ras and removes GDP. (II) Apo-Ras is ready to bind a nucleotide. (III) Upon GTP binding, Ras acquires its active conformation. (IV) Ras binds its effectors such as Raf1. This switches the MAPK signalling cascade on and leads to cell proliferation. (V) Src phosphorylates Ras Y32. This reduces the affinity for Raf1 by ~1000-fold. (VI) GAPs can now displace the effector and stimulate Ras GTPase activity. GTP is hydrolyzed to GDP. (VII) With bound GDP, Ras acquires the inactive conformation. PTPN11 removes the Y32 phosphate and regenerates the effector binding site. The cycle can begin anew.


 

Nucleotide binding domains are among the oldest known protein families, and one family in particular, the G-proteins, has diverse roles in all domains of life. These are collectively called GTP hydrolases, or GTPases – a misnomer, since even though they do catalyze the hydrolysis of GTP to GDP, their role in the cell has nothing to do with GTP metabolism, but comes from a conformational change that accompanies binding to either GTP or GDP. As far as enzymes go, GTPases are rather slow.

A large family among these G-proteins is the Ras family, whose members act as molecular switches. In humans, there are three isoforms of Ras called HRas, KRas and NRas. These are differentially expressed in tissues and have slightly different C-termini through which they are localized to different membrane subdomains. When Ras binds GTP, it adopts a stable, active ON conformation through which it activates effector proteins. But then the Ras protein slowly hydrolyzes GTP to GDP, undergoes a conformational change and enters the OFF state. Once GDP dissociates from the binding site, Ras can re-bind GTP and is switched on again. This cycle is modified by interactors: GEF proteins (Guanine Nucleotide Exchange Factors such as Sos) catalyze the dissociation of GDP and thus speed up the re-uptake of GTP and the re-activation of Ras. Thus they shift the cycle towards its active state. GAP proteins (GTPase Activating Proteins such as P120GAP) speed up the conversion of GTP to GDP. This shifts the cycle towards its inactive state.

One of the most important pathways for cell proliferation is the EGFR pathway that feeds into the MAPK cascade. Under physiological conditions, the active EGFR activates the Sos protein, which shifts a pool of Ras molecules into their active state. Active Ras then turns on its effectors – among them Raf1 – which activates a signalling cascade that induces cell proliferation. This is limited by GAPs that speed up Ras GTPase activity which turns the protein off. Deactivation of Sos when the EGFR is inactive ensures that GDP remains bound and the Ras protein pool remains off. This matches our expectations about the roles of these proteins well.

The problem is that this system can go terribly wrong if Ras gets mutated in a way that damages its catalytic activity and prevents GTP hydrolysis. Activating GAPs no longer works to switch Ras off, because if the Ras active site is dead, GAPs have no way of stimulating it. And inhibiting GEFs does not switch Ras off either, because GTP does not get hydrolyzed to GDP and there is no need for GEFs to clear the active site of GDP. The switch is ON and stays ON. The EGFR pathway is on and stays on. The cell proliferates out of control. This can be the first step of transforming a cell into a cancer cell, and this kind of mutation in the KRas protein is the third-most frequent mutation seen in cancer genome studies and possibly the most powerful cancer driver mutation of all. The big problem is that mutant Ras is generally considered "undruggable": we can't imagine small molecule drugs that would restore Ras' catalytic activity, and the affinity of GTP for the molecule is so high that we haven't found competitive antagonists that don't have dramatic side effects. An interesting new development therefore was the recent discovery that a phosphatase - PTPN11, also known as SHP2 - works synergistically with Ras to facilitate its activation of effectors: inhibition of PTPN11 suppressed oncogenesis[2]. If this is a pathophysiologically relevant effect, we expect cancer mutations to spare PTPN11, or even to deregulate it to enhance its activity. Do they?


 

Cancer gene data

Knowledge about the mutations of cancer comes from large-scale genome sequencing efforts of cancer tissue samples, and is collected and curated by a small number of databases. These databases sift through the massive volumes of sequence changes, distinguish natural variation from novel somatic mutations, and map the nucleotide changes to individual genes. One of these resources is the IntOGen database in Barcelona.


 

Task:

  • Visit IntOGen.
  • Find the KRas information page and briefly explore the information that is available.


 

For the Report Option...

Task:

  • Open the RStudio course project.
  • Begin a new R script to explore KRas, PTPN11 and OR1A1 mutations.
  • Load the data file of mRNAs I have prepared for you. This will create the three R objects, KRascodons, PTPN11codons, and OR1A1codons:
load(file = "./data/ABC-INT-Mutation_impact.RData")
  • Write code that executes a loop N times (for N <- 100000) to create a random point mutation in each of the three genes. Keep track of the number of missense, silent ("synonymous"), and nonsense ("truncating") mutations you find.
  • Contrast that with the relative frequency of the mutations in each category reported on the IntOGen Web page for each of the three genes.
  • Describe whether you think there is an important difference between the expected categories of mutations (i.e. the stochastic background that you simulated), and categories of mutations that were observed in cancer genomes.
  • Write a short report that interprets your results against the context outlined above: what would you expect if any of these genes were cancer drivers, what do you observe, what can you conclude from your observation?
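
As a rough illustration of the simulation step described above, here is a minimal sketch in base R. It is not the course's reference solution. The helper names (geneticCode, classifyChange, mutateOnce) and the toy codon vector are hypothetical, and the sketch assumes that the codon objects are character vectors of three-letter DNA codons. It also assumes a particular classification rule: a change is silent if the encoded amino acid stays the same, nonsense if a stop codon is created, and missense otherwise. How you treat changes that start from a stop codon is a choice you need to make and document yourself.

```r
# Standard genetic code in base R; codons are generated in T, C, A, G order
# to match the amino acid string below (NCBI translation table 1).
bases <- c("T", "C", "A", "G")
g <- expand.grid(b3 = bases, b2 = bases, b1 = bases, stringsAsFactors = FALSE)
geneticCode <- setNames(
  strsplit("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
           "")[[1]],
  paste0(g$b1, g$b2, g$b3))

# Hypothetical helper: classify a single codon change (assumed rule, see text)
classifyChange <- function(old, new) {
  aaOld <- geneticCode[[old]]
  aaNew <- geneticCode[[new]]
  if (aaNew == aaOld) return("silent")
  if (aaNew == "*")   return("nonsense")
  "missense"
}

# Hypothetical helper: apply one random point mutation to a codon vector
# and return its category
mutateOnce <- function(codons) {
  i   <- sample(length(codons), 1)               # pick a random codon ...
  pos <- sample(3, 1)                            # ... and a random position in it
  old <- codons[i]
  nt  <- substr(old, pos, pos)
  new <- old
  substr(new, pos, pos) <- sample(setdiff(bases, nt), 1)  # a different base
  classifyChange(old, new)
}

# Toy input for illustration only; this is NOT one of the course objects
toyCodons <- c("ATG", "GGT", "TGG", "TAC", "CTG", "AAA")

set.seed(112358)                                 # make the run reproducible
N <- 1000
counts <- c(silent = 0, missense = 0, nonsense = 0)
for (k in seq_len(N)) {
  type <- mutateOnce(toyCodons)
  counts[type] <- counts[type] + 1
}
counts / N                                       # relative frequencies
```

Picking a codon uniformly and then a position within it uniformly is equivalent to picking a nucleotide uniformly from the whole sequence. Replacing toyCodons with one of the loaded codon objects and raising N should give the stochastic background for that gene, provided the objects have the assumed structure.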


 

For the Oral Exam Option...

Task:

  • Open the RStudio course project.
  • Begin a new R script to explore KRas, PTPN11 and OR1A1 mutations.
  • Load the data file of mRNAs I have prepared for you. This will create the three R objects, KRascodons, PTPN11codons, and OR1A1codons:
load(file = "./data/ABC-INT-Mutation_impact.RData")
  • Write code that executes a loop N times (for N <- 100000) to create a random point mutation in each of the three genes. Keep track of the number of missense, silent ("synonymous"), and nonsense ("truncating") mutations you find.
  • Contrast that with the relative frequency of the mutations in each category reported on the IntOGen Web page for each of the three genes.
  • Describe whether you think there is an important difference between the expected categories of mutations (i.e. the stochastic background that you simulated), and categories of mutations that were observed in cancer genomes.
  • Document your activities and results in your Journal. Add a brief conclusion / interpretation.
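
To contrast simulated and observed categories, it can help to put the relative frequencies side by side. The following is a minimal sketch under stated assumptions: compareFreq is a hypothetical helper, the simulated counts are the ones printed in the GCsample.fa sample output further down this page (a test sequence, not a gene), and the observed counts are placeholders that you must replace with the numbers you read from IntOGen.

```r
# Hypothetical helper: convert two named count vectors into a table of
# relative frequencies with one row per source.
compareFreq <- function(simulated, observed) {
  cats <- c("silent", "missense", "nonsense")
  rbind(simulated = simulated[cats] / sum(simulated[cats]),
        observed  = observed[cats]  / sum(observed[cats]))
}

# Counts from the GCsample.fa sample output on this page
simCounts <- c(silent = 24114, missense = 67878, nonsense = 8008)

# PLACEHOLDER values for illustration only; these are NOT IntOGen data
obsCounts <- c(silent = 10, missense = 80, nonsense = 10)

round(compareFreq(simCounts, obsCounts), 3)
```

The table only describes the two distributions. Whether a difference is important is the interpretation you are asked to provide.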


 

For the R Code Option...

Task:

  • Open the RStudio course project.
  • In a new R script, develop a function that explores mutation effects, given cDNA and mutation data. You will find the following three cDNA files in the course project's ./data directory. Use them to develop your function:
./data/KRAS_HSa_coding.fa
./data/PTPN11_HSa_coding.fa
./data/OR1A1_HSa_coding.fa
  • Here is a header that specifies the function, its parameters and its value.
evalMut <- function(FA, N) {
    # Purpose: evaluate the distribution of silent, missense and nonsense
    # codon changes in cDNA read from FA for N random mutation trials.
    # Parameters:
    #     FA   chr      Filename of a FASTA formatted sequence file of cDNA
    #                     beginning with a start codon.
    #     N    integer  The number of point mutation trials to perform
    # Value:   list     List with the following elements:
    #                      FA    chr  the input file
    #                      N     num  same as the input parameter
    #                      nSilent    num  the number of silent mutations
    #                      nMissense  num  the number of missense mutations
    #                      nNonsense  num  the number of nonsense mutations

}
  • The IntOGen Website lists the counts and frequencies of silent, missense, and nonsense mutations, but these figures include point mutations, splice-site mutations, insertions and deletions. However, your method above only simulates point mutations; thus, for a correct comparison of observation and expectation we need to separate the point mutations from the other mutation types. IntOGen provides data downloads that list each exact mutation and categorize it. Write a second function that reads an IntOGen mutation-distribution file and returns counts for the three categories of point mutations that you are simulating. You will find three files in the course project's ./data directory that you can use to develop your function:
./data/intogen-KRAS-distribution-data.fa
./data/intogen-PTPN11-distribution-data.fa
./data/intogen-OR1A1-distribution-data.fa
  • Here is a header that specifies the function, its parameters and its value.
readIntOGen <- function(IN) {
    # Purpose: read and parse an IntOGen mutation data file. Return only the
    #            number of silent, missense, and nonsense point mutations.
    #            All indels are ignored.
    # Parameters:
    #     IN   chr      Filename of an IntOGen mutation data file.
    # Value:   list     List with the following elements:
    #                      nSilent    num the number of silent mutations
    #                      nMissense  num the number of missense mutations
    #                      nNonsense  num the number of nonsense mutations

}
  • You may find the function read.delim(), or read_tsv() from the readr package useful.
  • Ensure that the script is "clean" in the sense that source()'ing the file has no effects other than loading the functions and any packages they need.
  • Write tests for your functions. Place them in a protected block of code that will not get executed when the file gets sourced, like so:
if (FALSE) {
   # Code that won't get executed goes here...
   # ... but it's easy to manually step through the script and execute it.
}
  • Write a brief script that simulates 10000 point mutations of PTPN11 and compares the relative frequencies with the values reported in the distribution-data file. Describe whether you think there is an important difference between the expected categories of mutations (i.e. the stochastic background that you simulated), and categories of mutations that were observed in cancer genomes. Place this script too in a protected block of code that will not get executed.
  • Note: with the following FASTA file saved as GCsample.fa...
>GCsample
ATGAAAAACAAGAATACAACCACGACTAGAAGCAGGAGTATAATCATTCAACACCAGCATCCACCCCCGCCTCGACGCCG
GCGTCTACTCCTGCTTGAAGACGAGGATGCAGCCGCGGCTGGAGGCGGGGGTGTAGTCGTGGTTTAATACTAGTATTCAT
CCTCGTCTTGATGCTGGTGTTTATTCTTGTTT

... my implementation returns:

> evalMut("GCsample.fa", 100000)
$FA
[1] "GCsample.fa"

$N
[1] 1e+05

$nSilent
[1] 24114

$nMissense
[1] 67878

$nNonsense
[1] 8008
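
Before any mutations can be simulated, the FASTA file has to be turned into codons. Here is a minimal sketch of that first step in base R, assuming a file with a single sequence whose length is a multiple of three. The helper name fastaToCodons is hypothetical and not part of the specified interface. To stay self-contained, the example first writes the GCsample sequence shown above to a temporary file.

```r
# Hypothetical helper: read a single-sequence FASTA file and return its
# sequence as a character vector of codons.
fastaToCodons <- function(FA) {
  lines <- readLines(FA)
  s <- paste(lines[!grepl("^>", lines)], collapse = "")  # drop header, join lines
  s <- toupper(gsub("\\s", "", s))                       # remove any whitespace
  stopifnot(nchar(s) %% 3 == 0)                          # complete codons only
  starts <- seq(1, nchar(s), by = 3)
  substring(s, starts, starts + 2)
}

# Write the GCsample FASTA from this page to a temporary file
tmp <- tempfile(fileext = ".fa")
writeLines(c(
  ">GCsample",
  "ATGAAAAACAAGAATACAACCACGACTAGAAGCAGGAGTATAATCATTCAACACCAGCATCCACCCCCGCCTCGACGCCG",
  "GCGTCTACTCCTGCTTGAAGACGAGGATGCAGCCGCGGCTGGAGGCGGGGGTGTAGTCGTGGTTTAATACTAGTATTCAT",
  "CCTCGTCTTGATGCTGGTGTTTATTCTTGTTT"), tmp)

codons <- fastaToCodons(tmp)
length(codons)           # number of codons in the sample
length(unique(codons))   # number of distinct codons in the sample
head(codons)
```

A vector like this can then be passed to whatever mutation and classification code you write inside evalMut(). Because the sample contains stop codons in the middle of the sequence, your counts will match the sample output only if you treat changes involving stop codons in the same way as that implementation does, which this page does not specify.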

Self-evaluation

Notes

  1. Note: the oral exam will focus on the unit content but will also cover other material that leads up to it.
  2. Bunda et al. (2015) Inhibition of SHP2-mediated dephosphorylation of Ras suppresses oncogenesis. Nat Commun 6:8859. (pmid: 26617336)

    Ras is phosphorylated on a conserved tyrosine at position 32 within the switch I region via Src kinase. This phosphorylation inhibits the binding of effector Raf while promoting the engagement of GTPase-activating protein (GAP) and GTP hydrolysis. Here we identify SHP2 as the ubiquitously expressed tyrosine phosphatase that preferentially binds to and dephosphorylates Ras to increase its association with Raf and activate downstream proliferative Ras/ERK/MAPK signalling. In comparison to normal astrocytes, SHP2 activity is elevated in astrocytes isolated from glioblastoma multiforme (GBM)-prone H-Ras(12V) knock-in mice as well as in glioma cell lines and patient-derived GBM specimens exhibiting hyperactive Ras. Pharmacologic inhibition of SHP2 activity attenuates cell proliferation, soft-agar colony formation and orthotopic GBM growth in NOD/SCID mice and decelerates the progression of low-grade astrocytoma to GBM in a spontaneous transgenic glioma mouse model. These results identify SHP2 as a direct activator of Ras and a potential therapeutic target for cancers driven by a previously 'undruggable' oncogenic or hyperactive Ras.

Further reading, links and resources

 




 

If in doubt, ask! If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.



 

About ...
 
Author:

Boris Steipe <boris.steipe@utoronto.ca>

Created:

2017-08-05

Modified:

2017-08-09

Version:

1.3

Version history:

  • 1.3 Remove "significance" requirements since we didn't simulate distributions and we never introduced chisq.test()
  • 1.2 Corrected posted marks, which were not consistent with the description in the syllabus.
  • 1.1 Added sample output
  • 1.0 New unit
  • 0.1 First stub

This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License.