Difference between revisions of "RPR-GEO2R"

From "A B C"
 
(12 intermediate revisions by the same user not shown)
Line 1: Line 1:
<div id="BIO">
+
<div id="ABC">
  <div class="b1">
+
<div style="padding:5px; border:1px solid #000000; background-color:#b3dbce; font-size:300%; font-weight:400; color: #000000; width:100%;">
 
GEO2R
 
GEO2R
  </div>
+
<div style="padding:5px; margin-top:20px; margin-bottom:10px; background-color:#b3dbce; font-size:30%; font-weight:200; color: #000000; ">
 
+
(Programming for analysis of GEO datasets)
  {{Vspace}}
+
</div>
 
 
<div class="keywords">
 
<b>Keywords:</b>&nbsp;
 
Programming for analysis of GEO datasets
 
 
</div>
 
</div>
  
{{Vspace}}
+
{{Smallvspace}}
 
 
 
 
__TOC__
 
 
 
{{Vspace}}
 
 
 
 
 
{{STUB}}
 
  
{{Vspace}}
 
  
 
+
<div style="padding:5px; border:1px solid #000000; background-color:#b3dbce33; font-size:85%;">
</div>
+
<div style="font-size:118%;">
<div id="ABC-unit-framework">
+
<b>Abstract:</b><br />
== Abstract ==
 
 
<section begin=abstract />
 
<section begin=abstract />
<!-- included from "../components/RPR-GEO2R.components.wtxt", section: "abstract" -->
+
This unit demonstrates accessing and working with datasets downloaded from NCBI GEO.
...
 
 
<section end=abstract />
 
<section end=abstract />
 
+
</div>
{{Vspace}}
+
<!-- ============================  -->
 
+
<hr>
 
+
<table>
== This unit ... ==
+
<tr>
=== Prerequisites ===
+
<td style="padding:10px;">
<!-- included from "../components/RPR-GEO2R.components.wtxt", section: "prerequisites" -->
+
<b>Objectives:</b><br />
<!-- included from "ABC-unit_components.wtxt", section: "notes-external_prerequisites" -->
+
This unit will ...
You need the following preparation before beginning this unit. If you are not familiar with this material from courses you took previously, you need to prepare yourself from other information sources:
+
* ... teach downloading and annotating GEO data, and performing differential expression analysis.
<!-- included from "FND-prerequisites.wtxt", section: "central_dogma" -->
+
</td>
 +
<td style="padding:10px;">
 +
<b>Outcomes:</b><br />
 +
After working through this unit you ...
 +
* ... can access GEO data;
 +
* ... are familiar with the structure of GEO expression sets;
 +
* ... can annotate the data, perform differential expression analysis and critically evaluate the results.
 +
</td>
 +
</tr>
 +
</table>
 +
<!-- ============================ -->
 +
<hr>
 +
<b>Deliverables:</b><br />
 +
<section begin=deliverables />
 +
<li><b>Time management</b>: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.</li>
 +
<li><b>Journal</b>: Document your progress in your [[FND-Journal|Course Journal]]. Some tasks may ask you to include specific items in your journal. Don't overlook these.</li>
 +
<li><b>Insights</b>: If you find something particularly noteworthy about this unit, make a note in your [[ABC-Insights|'''insights!''' page]].</li>
 +
<section end=deliverables />
 +
<!-- ============================  -->
 +
<hr>
 +
<section begin=prerequisites />
 +
<b>Prerequisites:</b><br />
 +
You need the following preparation before beginning this unit. If you are not familiar with this material from courses you took previously, you need to prepare yourself from other information sources:<br />
 
*<b>The Central Dogma</b>: Regulation of transcription and translation; protein biosynthesis and degradation; quality control.
 
*<b>The Central Dogma</b>: Regulation of transcription and translation; protein biosynthesis and degradation; quality control.
<!-- included from "ABC-unit_components.wtxt", section: "notes-prerequisites" -->
+
This unit builds on material covered in the following prerequisite units:<br />
You need to complete the following units before beginning this one:
+
*[[BIN-EXPR-Multiple_testing|BIN-EXPR-Multiple_testing (Multiple Testing and Significance)]]
*[[BIN-EXPR-DE|BIN-EXPR-DE (Discovering Differentially Expressed Genes)]]
+
<section end=prerequisites />
 +
<!-- ============================  -->
 +
</div>
  
{{Vspace}}
+
{{Smallvspace}}
  
  
=== Objectives ===
 
<!-- included from "../components/RPR-GEO2R.components.wtxt", section: "objectives" -->
 
...
 
  
{{Vspace}}
+
{{Smallvspace}}
  
  
=== Outcomes ===
+
__TOC__
<!-- included from "../components/RPR-GEO2R.components.wtxt", section: "outcomes" -->
 
...
 
 
 
{{Vspace}}
 
 
 
 
 
=== Deliverables ===
 
<!-- included from "../components/RPR-GEO2R.components.wtxt", section: "deliverables" -->
 
<!-- included from "ABC-unit_components.wtxt", section: "deliverables-time_management" -->
 
*<b>Time management</b>: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
 
<!-- included from "ABC-unit_components.wtxt", section: "deliverables-journal" -->
 
*<b>Journal</b>: Document your progress in your [[FND-Journal|Course Journal]]. Some tasks may ask you to include specific items in your journal. Don't overlook these.
 
<!-- included from "ABC-unit_components.wtxt", section: "deliverables-insights" -->
 
*<b>Insights</b>: If you find something particularly noteworthy about this unit, make a note in your [[ABC-Insights|'''insights!''' page]].
 
  
 
{{Vspace}}
 
{{Vspace}}
Line 76: Line 68:
  
 
=== Evaluation ===
 
=== Evaluation ===
<!-- included from "../components/RPR-GEO2R.components.wtxt", section: "evaluation" -->
+
This learning unit can be evaluated for a maximum of 5 marks. If you choose to submit tasks from this unit for credit:
<!-- included from "ABC-unit_components.wtxt", section: "eval-INT-TBD" -->
+
<ol>
<b>Evaluation: Integrated Unit</b><br />
+
<li>Create a new page on the student Wiki as a subpage of your User Page.</li>
:This unit should be submitted for evaluation for a maximum of 10 marks. Details TBD.
+
<li>The R-script for this unit contains a number of tasks in which you are explicitly asked to submit code or results for credit. Put everything you submit on this one page.</li>
 +
<li>When you are done with everything, go to the [https://q.utoronto.ca/courses/180416/assignments Quercus '''Assignments''' page] and open the first Learning Unit that you have not submitted yet. Paste the URL of your Wiki page into the form, and click on '''Submit Assignment'''.</li>
 +
</ol>
  
{{Vspace}}
+
Your link can be submitted only once and cannot be edited afterwards. However, you may change your Wiki page at any time; only the last version before the due date will be marked, and all later edits will be silently ignored.
  
 +
{{Smallvspace}}
  
</div>
 
<div id="BIO">
 
 
== Contents ==
 
== Contents ==
<!-- included from "../components/RPR-GEO2R.components.wtxt", section: "contents" -->
 
 
 
GEO regex example
 
 
    ===Labeling===
 
    <div class="mw-collapsible mw-collapsed  exercise-box" data-expandtext="Hint" data-collapsetext="Collapse">
 
    Write an '''R''' script that creates ''meaningful'' labels for data elements from metadata and shows them in a plot. Use the sample data below - or any other data you are interested in.
 
 
 
 
  <div class="mw-collapsible mw-collapsed exercise-box" data-expandtext="Expand" data-collapsetext="Collapse" style="background-color:#EEEEF9;">
 
    Sample input data from GEO, and task description ...
 
  <div class="mw-collapsible-content">
 
    These data were downloaded from the NCBI GEO database using the GEO2R tool. They come from a microarray expression study that compares tumor and metastasis tissue. You can access the dataset [http://www.ncbi.nlm.nih.gov/geo/geo2r/?acc=GSE42952 '''here'''.] Grouping primary PDAC (pancreatic ductal adenocarcinoma) as "tumor" and liver/peritoneal metastasis as "metastasis", an '''R''' script on the server calculates significantly differentially expressed genes using the [http://www.bioconductor.org/packages/2.12/bioc/html/limma.html Bioconductor limma package]. I have selected the top 100 genes, and now would like to plot significance (adjusted P value) vs. level of differential expression (logFC). Moreover, I would like to roughly identify the function of each gene, if that is discernible from the "Gene title".
 
 
      <source lang="text">
 
        "ID" "adj.P.Val" "P.Value" "t" "B" "logFC" "Gene.symbol" "Gene.title"
 
      "238376_at" "3.69e-19" "4.53e-23" "-49.138515" "42.43328" "-2.202043" "LOC100505564///DEXI" "uncharacterized LOC100505564///Dexi homolog (mouse)"
 
      "214041_x_at" "2.36e-17" "8.74e-21" "38.089228" "37.60995" "4.541989" "RPL37A" "ribosomal protein L37a"
 
      "241662_x_at" "2.36e-17" "1.03e-20" "-37.793765" "37.45851" "-2.105123" "" ""
 
      "231628_s_at" "2.36e-17" "1.16e-20" "-37.574182" "37.34507" "-1.97516" "SERPINB6" "serpin peptidase inhibitor, clade B (ovalbumin), member 6"
 
      "224760_at" "3.23e-17" "2.10e-20" "36.500909" "36.77932" "3.798724" "SP1" "Sp1 transcription factor"
 
      "214149_s_at" "3.23e-17" "2.38e-20" "36.282193" "36.66167" "4.246787" "ATP6V0E1" "ATPase, H+ transporting, lysosomal 9kDa, V0 subunit e1"
 
      "243177_at" "4.15e-17" "3.57e-20" "-35.573827" "36.275" "-1.801709" "" ""
 
      "243800_at" "5.63e-17" "5.52e-20" "-34.825113" "35.85663" "-2.018088" "NR1H4" "nuclear receptor subfamily 1, group H, member 4"
 
      "238398_s_at" "1.10e-16" "1.21e-19" "-33.519208" "35.10201" "-2.245806" "" ""
 
      "1569856_at" "1.48e-16" "1.82e-19" "-32.860752" "34.70891" "-1.810438" "TPP2" "tripeptidyl peptidase II"
 
      "1555116_s_at" "1.51e-16" "2.14e-19" "-32.598656" "34.55" "-1.990665" "SLC11A1" "solute carrier family 11 (proton-coupled divalent metal ion transporters), member 1"
 
      "218733_at" "1.51e-16" "2.23e-19" "32.535823" "34.51169" "2.764663" "MSL2" "male-specific lethal 2 homolog (Drosophila)"
 
      "201225_s_at" "2.72e-16" "4.33e-19" "31.497695" "33.86667" "3.447828" "SRRM1" "serine/arginine repetitive matrix 1"
 
      "217052_x_at" "4.45e-16" "7.64e-19" "30.636232" "33.31345" "1.601527" "" ""
 
      "1569348_at" "5.24e-16" "9.65e-19" "-30.289176" "33.08577" "-1.793925" "TPTEP1" "transmembrane phosphatase with tensin homology pseudogene 1"
 
      "219492_at" "6.96e-16" "1.37e-18" "29.777415" "32.74483" "3.586919" "CHIC2" "cysteine-rich hydrophobic domain 2"
 
      "215047_at" "7.51e-16" "1.58e-18" "-29.567379" "32.60307" "-2.033635" "TRIM58" "tripartite motif containing 58"
 
      "232877_at" "7.51e-16" "1.66e-18" "-29.491388" "32.55151" "-1.65225" "" ""
 
      "229265_at" "7.51e-16" "1.75e-18" "29.419139" "32.50236" "3.933071" "SKI" "v-ski sarcoma viral oncogene homolog (avian)"
 
      "1553842_at" "8.16e-16" "2.00e-18" "-29.226409" "32.37061" "-1.832581" "BEND2" "BEN domain containing 2"
 
      "220791_x_at" "1.11e-15" "2.87e-18" "-28.71601" "32.01715" "-1.969381" "SCN11A" "sodium channel, voltage-gated, type XI, alpha subunit"
 
      "212911_at" "1.17e-15" "3.15e-18" "28.584094" "31.92471" "2.143175" "DNAJC16" "DnaJ (Hsp40) homolog, subfamily C, member 16"
 
      "243464_at" "1.22e-15" "3.43e-18" "-28.463254" "31.83963" "-1.675747" "" ""
 
      "243823_at" "1.30e-15" "3.81e-18" "-28.316669" "31.7359" "-1.499823" "" ""
 
      "201533_at" "1.56e-15" "4.80e-18" "27.999089" "31.5092" "4.054743" "CTNNB1" "catenin (cadherin-associated protein), beta 1, 88kDa"
 
      "210878_s_at" "1.59e-15" "5.06e-18" "27.927536" "31.45775" "2.982033" "KDM3B" "lysine (K)-specific demethylase 3B"
 
      "227712_at" "3.18e-15" "1.05e-17" "26.938855" "30.73223" "2.426311" "LYRM2" "LYR motif containing 2"
 
      "228520_s_at" "3.56e-15" "1.22e-17" "26.742683" "30.58495" "3.744881" "APLP2" "amyloid beta (A4) precursor-like protein 2"
 
      "210242_x_at" "3.80e-15" "1.36e-17" "26.605262" "30.48111" "1.815311" "ST20" "suppressor of tumorigenicity 20"
 
      "217301_x_at" "3.80e-15" "1.40e-17" "26.565414" "30.45089" "3.275566" "RBBP4" "retinoblastoma binding protein 4"
 
      "1557551_at" "6.17e-15" "2.35e-17" "-25.892664" "29.93351" "-1.78824" "" ""
 
      "201392_s_at" "6.17e-15" "2.42e-17" "25.856344" "29.90519" "3.283483" "IGF2R" "insulin-like growth factor 2 receptor"
 
      "210371_s_at" "7.18e-15" "2.91e-17" "25.62344" "29.72255" "3.463431" "RBBP4" "retinoblastoma binding protein 4"
 
      "204252_at" "9.08e-15" "3.79e-17" "25.291186" "29.45902" "2.789842" "CDK2" "cyclin-dependent kinase 2"
 
      "243200_at" "1.04e-14" "4.48e-17" "-25.082134" "29.29138" "-1.539093" "" ""
 
      "201140_s_at" "1.16e-14" "5.13e-17" "24.916407" "29.15746" "2.834707" "RAB5C" "RAB5C, member RAS oncogene family"
 
      "1559066_at" "1.23e-14" "5.57e-17" "-24.813534" "29.07387" "-1.595061" "" ""
 
      "201123_s_at" "1.27e-14" "5.91e-17" "24.741268" "29.01494" "4.870779" "EIF5A" "eukaryotic translation initiation factor 5A"
 
      "218291_at" "1.41e-14" "6.83e-17" "24.565645" "28.87099" "2.605328" "LAMTOR2" "late endosomal/lysosomal adaptor, MAPK and MTOR activator 2"
 
      "217704_x_at" "1.41e-14" "6.91e-17" "-24.550405" "28.85845" "-1.711476" "SUZ12P1" "suppressor of zeste 12 homolog pseudogene 1"
 
      "227338_at" "1.44e-14" "7.22e-17" "-24.498114" "28.81536" "-2.927581" "LOC440983" "hypothetical gene supported by BC066916"
 
      "210231_x_at" "1.64e-14" "8.47e-17" "24.305184" "28.65556" "4.548338" "SET" "SET nuclear oncogene"
 
      "225289_at" "1.86e-14" "9.82e-17" "24.127523" "28.50726" "3.062123" "STAT3" "signal transducer and activator of transcription 3 (acute-phase response factor)"
 
      "204658_at" "1.93e-14" "1.04e-16" "24.056703" "28.44783" "2.868797" "TRA2A" "transformer 2 alpha homolog (Drosophila)"
 
      "208819_at" "2.54e-14" "1.40e-16" "23.705016" "28.15009" "2.593365" "RAB8A" "RAB8A, member RAS oncogene family"
 
      "210011_s_at" "2.58e-14" "1.46e-16" "23.660126" "28.11176" "2.309763" "EWSR1" "EWS RNA-binding protein 1"
 
      "202397_at" "2.58e-14" "1.48e-16" "23.638422" "28.0932" "4.332132" "NUTF2" "nuclear transport factor 2"
 
      "1552628_a_at" "2.86e-14" "1.68e-16" "23.492249" "27.96778" "2.892763" "HERPUD2" "HERPUD family member 2"
 
      "233757_x_at" "3.85e-14" "2.31e-16" "23.123802" "27.64812" "2.430056" "" ""
 
      "201545_s_at" "5.07e-14" "3.16e-16" "22.767216" "27.33385" "2.568005" "PABPN1" "poly(A) binding protein, nuclear 1"
 
      "1562463_at" "5.07e-14" "3.17e-16" "-22.763883" "27.33089" "-1.119718" "" ""
 
      "219859_at" "5.41e-14" "3.45e-16" "-22.669239" "27.24664" "-1.787549" "CLEC4E" "C-type lectin domain family 4, member E"
 
      "1569136_at" "6.91e-14" "4.50e-16" "-22.372385" "26.98011" "-1.95396" "MGAT4A" "mannosyl (alpha-1,3-)-glycoprotein beta-1,4-N-acetylglucosaminyltransferase, isozyme A"
 
      "208601_s_at" "7.15e-14" "4.74e-16" "-22.314594" "26.92781" "-1.323653" "TUBB1" "tubulin, beta 1 class VI"
 
      "226194_at" "1.11e-13" "7.47e-16" "21.813583" "26.46872" "2.331245" "CHAMP1" "chromosome alignment maintaining phosphoprotein 1"
 
      "217877_s_at" "1.15e-13" "7.93e-16" "21.748093" "26.40795" "2.862688" "GPBP1L1" "GC-rich promoter binding protein 1-like 1"
 
      "225371_at" "1.25e-13" "8.73e-16" "21.644444" "26.31139" "2.518013" "GLE1" "GLE1 RNA export mediator homolog (yeast)"
 
      "1563431_x_at" "1.44e-13" "1.02e-15" "21.472848" "26.15053" "1.874743" "CALM3" "calmodulin 3 (phosphorylase kinase, delta)"
 
      "211505_s_at" "1.45e-13" "1.06e-15" "21.437744" "26.11746" "2.642609" "STAU1" "staufen double-stranded RNA binding protein 1"
 
      "201585_s_at" "1.45e-13" "1.07e-15" "21.430113" "26.11027" "2.787833" "SFPQ" "splicing factor proline/glutamine-rich"
 
      "225197_at" "1.75e-13" "1.31e-15" "21.212989" "25.90451" "2.845005" "" ""
 
      "220336_s_at" "1.83e-13" "1.41e-15" "-21.132294" "25.82752" "-1.848273" "GP6" "glycoprotein VI (platelet)"
 
      "216515_x_at" "1.83e-13" "1.42e-15" "21.128023" "25.82343" "2.877477" "MIR1244-2///MIR1244-3///MIR1244-1///PTMAP5///PTMA" "microRNA 1244-2///microRNA 1244-3///microRNA 1244-1///prothymosin, alpha pseudogene 5///prothymosin, alpha"
 
      "241773_at" "3.49e-13" "2.74e-15" "-20.441442" "25.15639" "-1.835223" "" ""
 
      "1558011_at" "3.89e-13" "3.15e-15" "-20.297118" "25.01342" "-1.577874" "LOC100510697" "putative POM121-like protein 1-like"
 
      "215240_at" "3.89e-13" "3.15e-15" "-20.29699" "25.01329" "-1.613308" "ITGB3" "integrin, beta 3 (platelet glycoprotein IIIa, antigen CD61)"
 
      "233746_x_at" "3.95e-13" "3.25e-15" "20.265986" "24.98245" "2.364699" "HYPK///SERF2" "huntingtin interacting protein K///small EDRK-rich factor 2"
 
      "1555338_s_at" "4.10e-13" "3.42e-15" "-20.214797" "24.93143" "-1.280803" "AQP10" "aquaporin 10"
 
      "217714_x_at" "4.12e-13" "3.48e-15" "20.195128" "24.91179" "2.247023" "STMN1" "stathmin 1"
 
      "202276_at" "4.75e-13" "4.08e-15" "20.035595" "24.75183" "2.654202" "SHFM1" "split hand/foot malformation (ectrodactyly) type 1"
 
      "225414_at" "6.34e-13" "5.52e-15" "19.733786" "24.44585" "3.287225" "RNF149" "ring finger protein 149"
 
      "243930_x_at" "7.43e-13" "6.64e-15" "-19.55046" "24.2578" "-1.219467" "" ""
 
      "1569263_at" "7.43e-13" "6.66e-15" "-19.548534" "24.25581" "-1.662363" "" ""
 
      "1554876_a_at" "8.55e-13" "7.77e-15" "-19.397142" "24.09923" "-1.388081" "S100Z" "S100 calcium binding protein Z"
 
      "220001_at" "1.08e-12" "9.97e-15" "-19.15375" "23.84505" "-1.412727" "PADI4" "peptidyl arginine deiminase, type IV"
 
      "228170_at" "1.12e-12" "1.05e-14" "-19.106672" "23.79554" "-1.840114" "OLIG1" "oligodendrocyte transcription factor 1"
 
      "211445_x_at" "1.29e-12" "1.22e-14" "-18.959325" "23.63981" "-1.134266" "NACAP1" "nascent-polypeptide-associated complex alpha polypeptide pseudogene 1"
 
      "1555311_at" "1.33e-12" "1.27e-14" "-18.91869" "23.59666" "-1.45603" "" ""
 
      "201643_x_at" "1.47e-12" "1.43e-14" "18.808994" "23.47974" "1.867155" "KDM3B" "lysine (K)-specific demethylase 3B"
 
      "216449_x_at" "1.51e-12" "1.48e-14" "18.773094" "23.44134" "3.178009" "HSP90B1" "heat shock protein 90kDa beta (Grp94), member 1"
 
      "218680_x_at" "1.51e-12" "1.50e-14" "18.763896" "23.43149" "2.262739" "HYPK///SERF2" "huntingtin interacting protein K///small EDRK-rich factor 2"
 
      "225954_s_at" "1.65e-12" "1.67e-14" "18.662853" "23.32298" "2.405388" "MIDN" "midnolin"
 
      "203102_s_at" "1.65e-12" "1.68e-14" "18.658192" "23.31796" "2.476697" "MGAT2" "mannosyl (alpha-1,6-)-glycoprotein beta-1,2-N-acetylglucosaminyltransferase"
 
      "1569345_at" "1.69e-12" "1.74e-14" "18.624203" "23.28133" "1.236884" "" ""
 
      "214001_x_at" "1.71e-12" "1.78e-14" "18.598496" "23.25358" "2.570012" "" ""
 
      "231812_x_at" "1.72e-12" "1.81e-14" "18.583236" "23.2371" "1.678685" "PHAX" "phosphorylated adaptor for RNA export"
 
      "232075_at" "1.93e-12" "2.06e-14" "-18.462717" "23.10643" "-2.150701" "WDR61" "WD repeat domain 61"
 
      "200669_s_at" "1.96e-12" "2.12e-14" "18.438729" "23.08033" "1.891968" "UBE2D3" "ubiquitin-conjugating enzyme E2D 3"
 
      "236995_x_at" "2.04e-12" "2.23e-14" "-18.389604" "23.02677" "-1.879369" "TFEC" "transcription factor EC"
 
      "218008_at" "2.24e-12" "2.48e-14" "18.291537" "22.91946" "2.445428" "TMEM248" "transmembrane protein 248"
 
      "217140_s_at" "2.30e-12" "2.56e-14" "18.260017" "22.88485" "3.983721" "VDAC1" "voltage-dependent anion channel 1"
 
      "210183_x_at" "2.46e-12" "2.79e-14" "18.183339" "22.80044" "1.79105" "PNN" "pinin, desmosome associated protein"
 
      "216954_x_at" "2.46e-12" "2.80e-14" "-18.177967" "22.79451" "-1.090193" "ATP5O" "ATP synthase, H+ transporting, mitochondrial F1 complex, O subunit"
 
      "207688_s_at" "2.53e-12" "2.92e-14" "18.141153" "22.75385" "2.492309" "INHBC" "inhibin, beta C"
 
      "218020_s_at" "2.63e-12" "3.06e-14" "18.095669" "22.70351" "1.772689" "ZFAND3" "zinc finger, AN1-type domain 3"
 
      "217756_x_at" "3.12e-12" "3.67e-14" "17.930201" "22.51939" "1.914366" "SERF2" "small EDRK-rich factor 2"
 
      "214150_x_at" "3.42e-12" "4.07e-14" "-17.835551" "22.41336" "-1.177963" "ATP6V0E1" "ATPase, H+ transporting, lysosomal 9kDa, V0 subunit e1"
 
      "208750_s_at" "3.48e-12" "4.18e-14" "17.812279" "22.38721" "2.649599" "ARF1" "ADP-ribosylation factor 1"
 
      "201749_at" "3.59e-12" "4.42e-14" "17.761415" "22.32994" "1.917794" "ECE1" "endothelin converting enzyme 1"
 
      </source>
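
      The server-side analysis described above can be sketched in '''R''' roughly as follows. This is only an illustration under stated assumptions, not the script that GEO2R actually runs: the expression matrix is simulated, the sample names and group sizes are hypothetical, and the direction of the contrast (metastasis minus tumor) is an assumption. Only the limma calls (<code>lmFit()</code>, <code>makeContrasts()</code>, <code>contrasts.fit()</code>, <code>eBayes()</code>, <code>topTable()</code>) are taken from the package itself.

      <source lang="R">
      # Sketch only: simulated values stand in for a real GEO expression matrix.
      library(limma)   # Bioconductor package, assumed to be installed

      set.seed(112358)
      nGenes <- 200

      # Six hypothetical samples: three "tumor", three "metastasis"
      group <- factor(rep(c("tumor", "metastasis"), each = 3),
                      levels = c("tumor", "metastasis"))

      # Simulated log-scale expression values (rows: genes, columns: samples)
      ex <- matrix(rnorm(nGenes * 6, mean = 8, sd = 0.5), nrow = nGenes,
                   dimnames = list(sprintf("gene_%03d", 1:nGenes),
                                   sprintf("S%d", 1:6)))
      ex[1:10, 4:6] <- ex[1:10, 4:6] + 2   # shift ten genes upward in "metastasis"

      # One column per group, then the contrast of interest (assumed direction)
      design <- model.matrix(~ 0 + group)
      colnames(design) <- levels(group)
      fit  <- lmFit(ex, design)
      cont <- makeContrasts(metastasis - tumor, levels = design)
      fit2 <- eBayes(contrasts.fit(fit, cont))

      # Top 100 genes with FDR-adjusted P values, as in the table above
      top <- topTable(fit2, adjust.method = "fdr", sort.by = "B", number = 100)
      head(top[ , c("logFC", "t", "P.Value", "adj.P.Val", "B")])
      </source>

      The columns of <code>top</code> correspond to the "logFC", "t", "P.Value", "adj.P.Val" and "B" columns of the sample data; the gene symbols and titles come from a separate annotation step that is not shown here.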
 
        </div>
 
        </div>
 
 
 
 
 
        <div class="mw-collapsible-content  exercise-box">
 
        <div class="mw-collapsible mw-collapsed" data-expandtext="Solution" data-collapsetext="Collapse">
 
        Read the data into '''R'''. Plot the negative log of the adjusted P value against logFC. Define some regular expressions that identify keywords in the gene title: things like "X-ase", "Y factor", "Z gene" etc. Apply these to the gene titles using {{R|regex()||regexpr()}} and store the results by applying {{R|regmatches()}} to the text. Then use {{R|graphics|text()}} to plot the extracted strings.
 
 
 
      <div class="mw-collapsible-content  exercise-box">
 
 
        <source lang="R">
 
        #GEO-hits.R
 
        # bs - Sept. 2013
 
 
        dat <- read.table("GEO-hits_100.txt", header = TRUE) # this is a file of GEO
 
      # differential expression data
 
      head(dat)
 
 
      plot(-log(dat[,"adj.P.Val"]), dat[,"logFC"], cex=0.7, pch=16, col="#BB0000")
 
      # Note that all these genes have an absolute logFC of at

      # least 1, i.e. at least one log unit of differential

      # expression - up or down. As a trend, higher significance

      # is found for higher levels of differential expression.
 
 
      # In R versions before 4.0, read.table() by default

      # converted all character-containing columns to _factors_.

      # To process them as strings, we convert them to characters

      # (this is harmless if they already are characters).
 
 
      dat[,"Gene.title"] <- as.character(dat[,"Gene.title"])
 
 
      # First, let's define some regexes for keywords to guess
 
      # a function ...
 
 
      # (Note the need for doubled escape characters in R!)
 
 
      r <- c(  "\\b(\\w+ase)\\b")  # peptidase, kinase ...
 
      r <- c(r, "\\b(?!factor)(\\w+or)") # suppressor, adaptor ...
 
      r <- c(r, "\\b(\\w+)\\b\\s(factor|protein|homolog)") # the preceding word ...
 
 
 
      # Now iterate over the Gene.title column and for each row try all regular
 
      # expressions.
 
 
      for (i in seq_len(nrow(dat))) { # for all rows ...

        dat[i,"Function.guess"] <- "" # clear the guess once per row

        for (j in seq_along(r)) { # for all regular expressions
 
          M <- regexpr(r[j], dat[i, "Gene.title"], perl = TRUE)
 
          if (M[1] > 0) {
 
            dat[i,"Function.guess"] <- regmatches(dat[i,"Gene.title"], M)
 
            break  # stop regexing if something was found
 
          }
 
        }
 
      }
 
 
      dat[,"Function.guess"] # check what we found ...
 
      # ... and plot the strings to the right of its point.
 
      text(-log(dat[,"adj.P.Val"]), dat[,"logFC"], dat[,"Function.guess"], cex=0.4, pos=4)
 
 
      # I'm not sure we are actually learning anything important from this.
 
      # But the code was merely meant to illustrate how
 
      # to work with regular expressions in R (and introduce you to GEO
 
      # differential expression data on the side). Mission accomplished.
 
 
      </source>
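
      The row-by-row loop above can also be written so that each regular expression is applied to the whole column at once. The following is a minimal sketch of that idea, not a replacement for the solution: <code>guessFunction()</code> is a hypothetical helper name, the five titles are copied from the sample data for illustration, and <code>substring()</code> is swapped in for <code>regmatches()</code> to extract the matched text.

      <source lang="R">
      # A few gene titles copied from the sample data (plus one empty title)
      titles <- c("ribosomal protein L37a",
                  "tripeptidyl peptidase II",
                  "suppressor of tumorigenicity 20",
                  "Sp1 transcription factor",
                  "")

      # The same three patterns as in the solution, in order of priority
      patterns <- c("\\b(\\w+ase)\\b",
                    "\\b(?!factor)(\\w+or)",
                    "\\b(\\w+)\\b\\s(factor|protein|homolog)")

      # Hypothetical helper: returns the first match of the first pattern
      # that matches each string, or "" if no pattern matches.
      guessFunction <- function(x, patterns) {
        guess <- rep("", length(x))
        for (p in patterns) {
          m     <- regexpr(p, x, perl = TRUE)
          start <- as.integer(m)              # -1 where there is no match
          len   <- attr(m, "match.length")
          hit   <- start > 0 & guess == ""    # only fill slots that are still empty
          guess[hit] <- substring(x[hit], start[hit], start[hit] + len[hit] - 1)
        }
        guess
      }

      guess <- guessFunction(titles, patterns)
      guess
      </source>

      Applied to the full table, this would be <code>dat[,"Function.guess"] <- guessFunction(dat[,"Gene.title"], r)</code>. Because a slot is only filled while it is still empty, earlier patterns keep priority over later ones, just as the <code>break</code> does in the loop.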
 
 
 
 
 
 
 
{{Vspace}}
 
  
 +
{{ABC-unit|RPR-GEO2R.R}}
  
 
== Further reading, links and resources ==
 
== Further reading, links and resources ==
 
<!-- {{#pmid: 19957275}} -->
 
<!-- {{#pmid: 19957275}} -->
<!-- {{WWW|WWW_GMOD}} -->
+
<div class="reference-box">This unit has focussed on microarray analysis with GEO2R. For RNAseq experiments, refer to the excellent [https://www.bioconductor.org/help/workflows/RNAseq123/ '''Bioconductor RNAseq analysis tutorial'''].</div>
<!-- <div class="reference-box">[http://www.ncbi.nlm.nih.gov]</div> -->
 
 
 
{{Vspace}}
 
 
 
 
 
 
== Notes ==
 
== Notes ==
<!-- included from "../components/RPR-GEO2R.components.wtxt", section: "notes" -->
 
<!-- included from "ABC-unit_components.wtxt", section: "notes" -->
 
 
<references />
 
<references />
  
 
{{Vspace}}
 
{{Vspace}}
  
 
</div>
 
<div id="ABC-unit-framework">
 
== Self-evaluation ==
 
<!-- included from "../components/RPR-GEO2R.components.wtxt", section: "self-evaluation" -->
 
<!--
 
=== Question 1===
 
 
Question ...
 
 
<div class="toccolours mw-collapsible mw-collapsed" style="width:800px">
 
Answer ...
 
<div class="mw-collapsible-content">
 
Answer ...
 
 
</div>
 
  </div>
 
 
  {{Vspace}}
 
 
-->
 
 
{{Vspace}}
 
 
 
 
{{Vspace}}
 
 
 
<!-- included from "ABC-unit_components.wtxt", section: "ABC-unit_ask" -->
 
 
----
 
 
{{Vspace}}
 
 
<b>If in doubt, ask!</b> If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.
 
 
----
 
 
{{Vspace}}
 
  
 
<div class="about">
 
<div class="about">
Line 346: Line 100:
 
:2017-08-05
 
:2017-08-05
 
<b>Modified:</b><br />
 
<b>Modified:</b><br />
:2017-08-05
+
:2020-10-07
 
<b>Version:</b><br />
 
<b>Version:</b><br />
:0.1
+
:1.2
 
<b>Version history:</b><br />
 
<b>Version history:</b><br />
 +
*1.2 Edit policy update
 +
*1.1 2020 Updates; all online
 +
*1.0 First live version
 
*0.1 First stub
 
*0.1 First stub
 
</div>
 
</div>
[[Category:ABC-units]]
 
<!-- included from "ABC-unit_components.wtxt", section: "ABC-unit_footer" -->
 
  
 
{{CC-BY}}
 
{{CC-BY}}
  
 +
[[Category:ABC-units]]
 +
{{UNIT}}
 +
{{LIVE}}
 
</div>
 
</div>
 
<!-- [END] -->
 
<!-- [END] -->

Latest revision as of 05:02, 10 October 2020

GEO2R

(Programming for analysis of GEO datasets)


 


Abstract:

This unit demonstrates accessing and working with datasets downloaded from NCBI GEO.


Objectives:
This unit will ...

  • ... teach downloading and annotating GEO data, and performing differential expression analysis.

Outcomes:
After working through this unit you ...

  • ... can access GEO data;
  • ... are familiar with the structure of GEO expression sets;
  • ... can annotate the data, perform differential expression analysis and critically evaluate the results.

Deliverables:

  • Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
  • Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.
  • Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.

  • Prerequisites:
    You need the following preparation before beginning this unit. If you are not familiar with this material from courses you took previously, you need to prepare yourself from other information sources:

    • The Central Dogma: Regulation of transcription and translation; protein biosynthesis and degradation; quality control.

    This unit builds on material covered in the following prerequisite units:


     



     



     


    Evaluation

    This learning unit can be evaluated for a maximum of 5 marks. If you choose to submit tasks from this unit for credit:

    1. Create a new page on the student Wiki as a subpage of your User Page.
    2. The R-script for this unit contains a number of tasks in which you are explicitly asked to submit code or results for credit. Put everything you submit on this one page.
    3. When you are done with everything, go to the Quercus Assignments page and open the first Learning Unit that you have not submitted yet. Paste the URL of your Wiki page into the form, and click on Submit Assignment.

    Your link can be submitted only once and cannot be edited afterwards. However, you may change your Wiki page at any time; only the last version before the due date will be marked, and all later edits will be silently ignored.


     

    Contents

    Task:

     
    • Open RStudio and load the ABC-units R project. If you have loaded it before, choose File → Recent projects → ABC-Units. If you have not loaded it before, follow the instructions in the RPR-Introduction unit.
    • Choose Tools → Version Control → Pull Branches to fetch the most recent version of the project from its GitHub repository with all changes and bug fixes included.
    • Type init() if requested.
    • Open the file RPR-GEO2R.R and follow the instructions.


     

    Note: take care that you understand all of the code in the script. Evaluation in this course is cumulative and you may be asked to explain any part of code.


     

    Further reading, links and resources

    This unit has focussed on microarray analysis with GEO2R. For RNAseq experiments, refer to the excellent Bioconductor RNAseq analysis tutorial.

    Notes


     


    About ...
     
    Author:

    Boris Steipe <boris.steipe@utoronto.ca>

    Created:

    2017-08-05

    Modified:

    2020-10-07

    Version:

    1.2

    Version history:

    • 1.2 Edit policy update
    • 1.1 2020 Updates; all online
    • 1.0 First live version
    • 0.1 First stub

    This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.