RPR-GEO2R

From "A B C"
Revision as of 17:08, 31 October 2017 by Boris

GEO2R


 

Keywords:  Programming for analysis of GEO datasets


 



 


Sorry!

This page is only a stub; it is here as a placeholder to establish the logical framework of the site but there is no significant content as yet. Do not work with this material until it is updated to "live" status.


 


Abstract

...


 


This unit ...

Prerequisites

You need the following preparation before beginning this unit. If you are not familiar with this material from courses you took previously, you need to prepare yourself from other information sources:

  • The Central Dogma: Regulation of transcription and translation; protein biosynthesis and degradation; quality control.

You need to complete the following units before beginning this one:


 


Objectives

...


 


Outcomes

...


 


Deliverables

  • Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
  • Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.
  • Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.


 


Evaluation

Evaluation: TBD

This unit can be submitted for evaluation for a maximum of 6 marks. Details TBD.


 


Contents

GEO regex example

   Labeling
   Write an R script that creates meaningful labels for data elements from metadata and shows them in a plot. Use the sample data below - or any other data you are interested in.


   Sample input data from GEO, and task description ...
   These data were downloaded from the NCBI GEO database using the GEO2R tool. They come from a microarray expression study that compares tumor and metastasis tissue. You can access the dataset here. Grouping primary PDAC (pancreatic ductal adenocarcinoma) as "tumor" and liver/peritoneal metastasis as "metastasis", an R script on the server calculates significantly differentially expressed genes using the Bioconductor limma package. I have selected the top 100 genes, and now would like to plot significance (adjusted P value) vs. level of differential expression (logFC). Moreover, I would like to roughly identify the function of each gene if that is discernible from the "Gene title".
        "ID"	"adj.P.Val"	"P.Value"	"t"	"B"	"logFC"	"Gene.symbol"	"Gene.title"
      "238376_at"	"3.69e-19"	"4.53e-23"	"-49.138515"	"42.43328"	"-2.202043"	"LOC100505564///DEXI"	"uncharacterized LOC100505564///Dexi homolog (mouse)"
      "214041_x_at"	"2.36e-17"	"8.74e-21"	"38.089228"	"37.60995"	"4.541989"	"RPL37A"	"ribosomal protein L37a"
      "241662_x_at"	"2.36e-17"	"1.03e-20"	"-37.793765"	"37.45851"	"-2.105123"	""	""
      "231628_s_at"	"2.36e-17"	"1.16e-20"	"-37.574182"	"37.34507"	"-1.97516"	"SERPINB6"	"serpin peptidase inhibitor, clade B (ovalbumin), member 6"
      "224760_at"	"3.23e-17"	"2.10e-20"	"36.500909"	"36.77932"	"3.798724"	"SP1"	"Sp1 transcription factor"
      "214149_s_at"	"3.23e-17"	"2.38e-20"	"36.282193"	"36.66167"	"4.246787"	"ATP6V0E1"	"ATPase, H+ transporting, lysosomal 9kDa, V0 subunit e1"
      "243177_at"	"4.15e-17"	"3.57e-20"	"-35.573827"	"36.275"	"-1.801709"	""	""
      "243800_at"	"5.63e-17"	"5.52e-20"	"-34.825113"	"35.85663"	"-2.018088"	"NR1H4"	"nuclear receptor subfamily 1, group H, member 4"
      "238398_s_at"	"1.10e-16"	"1.21e-19"	"-33.519208"	"35.10201"	"-2.245806"	""	""
      "1569856_at"	"1.48e-16"	"1.82e-19"	"-32.860752"	"34.70891"	"-1.810438"	"TPP2"	"tripeptidyl peptidase II"
      "1555116_s_at"	"1.51e-16"	"2.14e-19"	"-32.598656"	"34.55"	"-1.990665"	"SLC11A1"	"solute carrier family 11 (proton-coupled divalent metal ion transporters), member 1"
      "218733_at"	"1.51e-16"	"2.23e-19"	"32.535823"	"34.51169"	"2.764663"	"MSL2"	"male-specific lethal 2 homolog (Drosophila)"
      "201225_s_at"	"2.72e-16"	"4.33e-19"	"31.497695"	"33.86667"	"3.447828"	"SRRM1"	"serine/arginine repetitive matrix 1"
      "217052_x_at"	"4.45e-16"	"7.64e-19"	"30.636232"	"33.31345"	"1.601527"	""	""
      "1569348_at"	"5.24e-16"	"9.65e-19"	"-30.289176"	"33.08577"	"-1.793925"	"TPTEP1"	"transmembrane phosphatase with tensin homology pseudogene 1"
      "219492_at"	"6.96e-16"	"1.37e-18"	"29.777415"	"32.74483"	"3.586919"	"CHIC2"	"cysteine-rich hydrophobic domain 2"
      "215047_at"	"7.51e-16"	"1.58e-18"	"-29.567379"	"32.60307"	"-2.033635"	"TRIM58"	"tripartite motif containing 58"
      "232877_at"	"7.51e-16"	"1.66e-18"	"-29.491388"	"32.55151"	"-1.65225"	""	""
      "229265_at"	"7.51e-16"	"1.75e-18"	"29.419139"	"32.50236"	"3.933071"	"SKI"	"v-ski sarcoma viral oncogene homolog (avian)"
      "1553842_at"	"8.16e-16"	"2.00e-18"	"-29.226409"	"32.37061"	"-1.832581"	"BEND2"	"BEN domain containing 2"
      "220791_x_at"	"1.11e-15"	"2.87e-18"	"-28.71601"	"32.01715"	"-1.969381"	"SCN11A"	"sodium channel, voltage-gated, type XI, alpha subunit"
      "212911_at"	"1.17e-15"	"3.15e-18"	"28.584094"	"31.92471"	"2.143175"	"DNAJC16"	"DnaJ (Hsp40) homolog, subfamily C, member 16"
      "243464_at"	"1.22e-15"	"3.43e-18"	"-28.463254"	"31.83963"	"-1.675747"	""	""
      "243823_at"	"1.30e-15"	"3.81e-18"	"-28.316669"	"31.7359"	"-1.499823"	""	""
      "201533_at"	"1.56e-15"	"4.80e-18"	"27.999089"	"31.5092"	"4.054743"	"CTNNB1"	"catenin (cadherin-associated protein), beta 1, 88kDa"
      "210878_s_at"	"1.59e-15"	"5.06e-18"	"27.927536"	"31.45775"	"2.982033"	"KDM3B"	"lysine (K)-specific demethylase 3B"
      "227712_at"	"3.18e-15"	"1.05e-17"	"26.938855"	"30.73223"	"2.426311"	"LYRM2"	"LYR motif containing 2"
      "228520_s_at"	"3.56e-15"	"1.22e-17"	"26.742683"	"30.58495"	"3.744881"	"APLP2"	"amyloid beta (A4) precursor-like protein 2"
      "210242_x_at"	"3.80e-15"	"1.36e-17"	"26.605262"	"30.48111"	"1.815311"	"ST20"	"suppressor of tumorigenicity 20"
      "217301_x_at"	"3.80e-15"	"1.40e-17"	"26.565414"	"30.45089"	"3.275566"	"RBBP4"	"retinoblastoma binding protein 4"
      "1557551_at"	"6.17e-15"	"2.35e-17"	"-25.892664"	"29.93351"	"-1.78824"	""	""
      "201392_s_at"	"6.17e-15"	"2.42e-17"	"25.856344"	"29.90519"	"3.283483"	"IGF2R"	"insulin-like growth factor 2 receptor"
      "210371_s_at"	"7.18e-15"	"2.91e-17"	"25.62344"	"29.72255"	"3.463431"	"RBBP4"	"retinoblastoma binding protein 4"
      "204252_at"	"9.08e-15"	"3.79e-17"	"25.291186"	"29.45902"	"2.789842"	"CDK2"	"cyclin-dependent kinase 2"
      "243200_at"	"1.04e-14"	"4.48e-17"	"-25.082134"	"29.29138"	"-1.539093"	""	""
      "201140_s_at"	"1.16e-14"	"5.13e-17"	"24.916407"	"29.15746"	"2.834707"	"RAB5C"	"RAB5C, member RAS oncogene family"
      "1559066_at"	"1.23e-14"	"5.57e-17"	"-24.813534"	"29.07387"	"-1.595061"	""	""
      "201123_s_at"	"1.27e-14"	"5.91e-17"	"24.741268"	"29.01494"	"4.870779"	"EIF5A"	"eukaryotic translation initiation factor 5A"
      "218291_at"	"1.41e-14"	"6.83e-17"	"24.565645"	"28.87099"	"2.605328"	"LAMTOR2"	"late endosomal/lysosomal adaptor, MAPK and MTOR activator 2"
      "217704_x_at"	"1.41e-14"	"6.91e-17"	"-24.550405"	"28.85845"	"-1.711476"	"SUZ12P1"	"suppressor of zeste 12 homolog pseudogene 1"
      "227338_at"	"1.44e-14"	"7.22e-17"	"-24.498114"	"28.81536"	"-2.927581"	"LOC440983"	"hypothetical gene supported by BC066916"
      "210231_x_at"	"1.64e-14"	"8.47e-17"	"24.305184"	"28.65556"	"4.548338"	"SET"	"SET nuclear oncogene"
      "225289_at"	"1.86e-14"	"9.82e-17"	"24.127523"	"28.50726"	"3.062123"	"STAT3"	"signal transducer and activator of transcription 3 (acute-phase response factor)"
      "204658_at"	"1.93e-14"	"1.04e-16"	"24.056703"	"28.44783"	"2.868797"	"TRA2A"	"transformer 2 alpha homolog (Drosophila)"
      "208819_at"	"2.54e-14"	"1.40e-16"	"23.705016"	"28.15009"	"2.593365"	"RAB8A"	"RAB8A, member RAS oncogene family"
      "210011_s_at"	"2.58e-14"	"1.46e-16"	"23.660126"	"28.11176"	"2.309763"	"EWSR1"	"EWS RNA-binding protein 1"
      "202397_at"	"2.58e-14"	"1.48e-16"	"23.638422"	"28.0932"	"4.332132"	"NUTF2"	"nuclear transport factor 2"
      "1552628_a_at"	"2.86e-14"	"1.68e-16"	"23.492249"	"27.96778"	"2.892763"	"HERPUD2"	"HERPUD family member 2"
      "233757_x_at"	"3.85e-14"	"2.31e-16"	"23.123802"	"27.64812"	"2.430056"	""	""
      "201545_s_at"	"5.07e-14"	"3.16e-16"	"22.767216"	"27.33385"	"2.568005"	"PABPN1"	"poly(A) binding protein, nuclear 1"
      "1562463_at"	"5.07e-14"	"3.17e-16"	"-22.763883"	"27.33089"	"-1.119718"	""	""
      "219859_at"	"5.41e-14"	"3.45e-16"	"-22.669239"	"27.24664"	"-1.787549"	"CLEC4E"	"C-type lectin domain family 4, member E"
      "1569136_at"	"6.91e-14"	"4.50e-16"	"-22.372385"	"26.98011"	"-1.95396"	"MGAT4A"	"mannosyl (alpha-1,3-)-glycoprotein beta-1,4-N-acetylglucosaminyltransferase, isozyme A"
      "208601_s_at"	"7.15e-14"	"4.74e-16"	"-22.314594"	"26.92781"	"-1.323653"	"TUBB1"	"tubulin, beta 1 class VI"
      "226194_at"	"1.11e-13"	"7.47e-16"	"21.813583"	"26.46872"	"2.331245"	"CHAMP1"	"chromosome alignment maintaining phosphoprotein 1"
      "217877_s_at"	"1.15e-13"	"7.93e-16"	"21.748093"	"26.40795"	"2.862688"	"GPBP1L1"	"GC-rich promoter binding protein 1-like 1"
      "225371_at"	"1.25e-13"	"8.73e-16"	"21.644444"	"26.31139"	"2.518013"	"GLE1"	"GLE1 RNA export mediator homolog (yeast)"
      "1563431_x_at"	"1.44e-13"	"1.02e-15"	"21.472848"	"26.15053"	"1.874743"	"CALM3"	"calmodulin 3 (phosphorylase kinase, delta)"
      "211505_s_at"	"1.45e-13"	"1.06e-15"	"21.437744"	"26.11746"	"2.642609"	"STAU1"	"staufen double-stranded RNA binding protein 1"
      "201585_s_at"	"1.45e-13"	"1.07e-15"	"21.430113"	"26.11027"	"2.787833"	"SFPQ"	"splicing factor proline/glutamine-rich"
      "225197_at"	"1.75e-13"	"1.31e-15"	"21.212989"	"25.90451"	"2.845005"	""	""
      "220336_s_at"	"1.83e-13"	"1.41e-15"	"-21.132294"	"25.82752"	"-1.848273"	"GP6"	"glycoprotein VI (platelet)"
      "216515_x_at"	"1.83e-13"	"1.42e-15"	"21.128023"	"25.82343"	"2.877477"	"MIR1244-2///MIR1244-3///MIR1244-1///PTMAP5///PTMA"	"microRNA 1244-2///microRNA 1244-3///microRNA 1244-1///prothymosin, alpha pseudogene 5///prothymosin, alpha"
      "241773_at"	"3.49e-13"	"2.74e-15"	"-20.441442"	"25.15639"	"-1.835223"	""	""
      "1558011_at"	"3.89e-13"	"3.15e-15"	"-20.297118"	"25.01342"	"-1.577874"	"LOC100510697"	"putative POM121-like protein 1-like"
      "215240_at"	"3.89e-13"	"3.15e-15"	"-20.29699"	"25.01329"	"-1.613308"	"ITGB3"	"integrin, beta 3 (platelet glycoprotein IIIa, antigen CD61)"
      "233746_x_at"	"3.95e-13"	"3.25e-15"	"20.265986"	"24.98245"	"2.364699"	"HYPK///SERF2"	"huntingtin interacting protein K///small EDRK-rich factor 2"
      "1555338_s_at"	"4.10e-13"	"3.42e-15"	"-20.214797"	"24.93143"	"-1.280803"	"AQP10"	"aquaporin 10"
      "217714_x_at"	"4.12e-13"	"3.48e-15"	"20.195128"	"24.91179"	"2.247023"	"STMN1"	"stathmin 1"
      "202276_at"	"4.75e-13"	"4.08e-15"	"20.035595"	"24.75183"	"2.654202"	"SHFM1"	"split hand/foot malformation (ectrodactyly) type 1"
      "225414_at"	"6.34e-13"	"5.52e-15"	"19.733786"	"24.44585"	"3.287225"	"RNF149"	"ring finger protein 149"
      "243930_x_at"	"7.43e-13"	"6.64e-15"	"-19.55046"	"24.2578"	"-1.219467"	""	""
      "1569263_at"	"7.43e-13"	"6.66e-15"	"-19.548534"	"24.25581"	"-1.662363"	""	""
      "1554876_a_at"	"8.55e-13"	"7.77e-15"	"-19.397142"	"24.09923"	"-1.388081"	"S100Z"	"S100 calcium binding protein Z"
      "220001_at"	"1.08e-12"	"9.97e-15"	"-19.15375"	"23.84505"	"-1.412727"	"PADI4"	"peptidyl arginine deiminase, type IV"
      "228170_at"	"1.12e-12"	"1.05e-14"	"-19.106672"	"23.79554"	"-1.840114"	"OLIG1"	"oligodendrocyte transcription factor 1"
      "211445_x_at"	"1.29e-12"	"1.22e-14"	"-18.959325"	"23.63981"	"-1.134266"	"NACAP1"	"nascent-polypeptide-associated complex alpha polypeptide pseudogene 1"
      "1555311_at"	"1.33e-12"	"1.27e-14"	"-18.91869"	"23.59666"	"-1.45603"	""	""
      "201643_x_at"	"1.47e-12"	"1.43e-14"	"18.808994"	"23.47974"	"1.867155"	"KDM3B"	"lysine (K)-specific demethylase 3B"
      "216449_x_at"	"1.51e-12"	"1.48e-14"	"18.773094"	"23.44134"	"3.178009"	"HSP90B1"	"heat shock protein 90kDa beta (Grp94), member 1"
      "218680_x_at"	"1.51e-12"	"1.50e-14"	"18.763896"	"23.43149"	"2.262739"	"HYPK///SERF2"	"huntingtin interacting protein K///small EDRK-rich factor 2"
      "225954_s_at"	"1.65e-12"	"1.67e-14"	"18.662853"	"23.32298"	"2.405388"	"MIDN"	"midnolin"
      "203102_s_at"	"1.65e-12"	"1.68e-14"	"18.658192"	"23.31796"	"2.476697"	"MGAT2"	"mannosyl (alpha-1,6-)-glycoprotein beta-1,2-N-acetylglucosaminyltransferase"
      "1569345_at"	"1.69e-12"	"1.74e-14"	"18.624203"	"23.28133"	"1.236884"	""	""
      "214001_x_at"	"1.71e-12"	"1.78e-14"	"18.598496"	"23.25358"	"2.570012"	""	""
      "231812_x_at"	"1.72e-12"	"1.81e-14"	"18.583236"	"23.2371"	"1.678685"	"PHAX"	"phosphorylated adaptor for RNA export"
      "232075_at"	"1.93e-12"	"2.06e-14"	"-18.462717"	"23.10643"	"-2.150701"	"WDR61"	"WD repeat domain 61"
      "200669_s_at"	"1.96e-12"	"2.12e-14"	"18.438729"	"23.08033"	"1.891968"	"UBE2D3"	"ubiquitin-conjugating enzyme E2D 3"
      "236995_x_at"	"2.04e-12"	"2.23e-14"	"-18.389604"	"23.02677"	"-1.879369"	"TFEC"	"transcription factor EC"
      "218008_at"	"2.24e-12"	"2.48e-14"	"18.291537"	"22.91946"	"2.445428"	"TMEM248"	"transmembrane protein 248"
      "217140_s_at"	"2.30e-12"	"2.56e-14"	"18.260017"	"22.88485"	"3.983721"	"VDAC1"	"voltage-dependent anion channel 1"
      "210183_x_at"	"2.46e-12"	"2.79e-14"	"18.183339"	"22.80044"	"1.79105"	"PNN"	"pinin, desmosome associated protein"
      "216954_x_at"	"2.46e-12"	"2.80e-14"	"-18.177967"	"22.79451"	"-1.090193"	"ATP5O"	"ATP synthase, H+ transporting, mitochondrial F1 complex, O subunit"
      "207688_s_at"	"2.53e-12"	"2.92e-14"	"18.141153"	"22.75385"	"2.492309"	"INHBC"	"inhibin, beta C"
      "218020_s_at"	"2.63e-12"	"3.06e-14"	"18.095669"	"22.70351"	"1.772689"	"ZFAND3"	"zinc finger, AN1-type domain 3"
      "217756_x_at"	"3.12e-12"	"3.67e-14"	"17.930201"	"22.51939"	"1.914366"	"SERF2"	"small EDRK-rich factor 2"
      "214150_x_at"	"3.42e-12"	"4.07e-14"	"-17.835551"	"22.41336"	"-1.177963"	"ATP6V0E1"	"ATPase, H+ transporting, lysosomal 9kDa, V0 subunit e1"
      "208750_s_at"	"3.48e-12"	"4.18e-14"	"17.812279"	"22.38721"	"2.649599"	"ARF1"	"ADP-ribosylation factor 1"
      "201749_at"	"3.59e-12"	"4.42e-14"	"17.761415"	"22.32994"	"1.917794"	"ECE1"	"endothelin converting enzyme 1"
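Before working with the full file, it can help to check how R types the columns. The following is a minimal sketch, not part of the original unit: it parses the header and three rows copied from the sample above directly from a string, assuming the fields are tab-separated and double-quoted as shown. The variable names `txt` and `sampleDat` are made up for this illustration.

```r
# Header and three rows copied from the sample data; \t marks the tabs.
txt <- paste(
  '"ID"\t"adj.P.Val"\t"P.Value"\t"t"\t"B"\t"logFC"\t"Gene.symbol"\t"Gene.title"',
  '"214041_x_at"\t"2.36e-17"\t"8.74e-21"\t"38.089228"\t"37.60995"\t"4.541989"\t"RPL37A"\t"ribosomal protein L37a"',
  '"241662_x_at"\t"2.36e-17"\t"1.03e-20"\t"-37.793765"\t"37.45851"\t"-2.105123"\t""\t""',
  '"224760_at"\t"3.23e-17"\t"2.10e-20"\t"36.500909"\t"36.77932"\t"3.798724"\t"SP1"\t"Sp1 transcription factor"',
  sep = "\n")

# read.table() strips the quotes first and then converts each column,
# so the quoted numbers end up as numeric columns.
sampleDat <- read.table(text = txt, header = TRUE, sep = "\t",
                        quote = "\"", stringsAsFactors = FALSE)

str(sampleDat)  # inspect the column types
```

Probes without an annotation, such as the second row here, keep an empty string in Gene.symbol and Gene.title, so any keyword search on the titles has to tolerate empty input.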



       Read the data into R. Plot the negative log of the adjusted P value against logFC. Define some regular expressions that identify keywords in the gene title: things like "X-ase", "Y factor", "Z gene" etc. Apply these to the gene titles using regexpr() and extract the matched strings with regmatches(). Then use text() to plot the extracted strings.
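How regexpr() and regmatches() work together is easiest to see on a handful of strings. This is a small sketch using three gene titles taken from the sample data; the names `titles`, `pat`, `m` and `hits` are chosen only for this illustration.

```r
titles <- c("cyclin-dependent kinase 2",
            "Sp1 transcription factor",
            "midnolin")

# Words ending in "ase". The backslashes are doubled because R's string
# parser consumes one level of escaping before the regex engine sees it.
pat <- "\\b\\w+ase\\b"

# regexpr() returns the start position of the first match in each
# element, or -1 where nothing matched.
m <- regexpr(pat, titles, perl = TRUE)

# regmatches() returns only the matched substrings, so its result is
# shorter than the input when some elements have no match.
regmatches(titles, m)

# To keep one entry per title, assign the matches into a vector that
# is pre-filled with empty strings.
hits <- character(length(titles))
hits[as.vector(m) > 0] <- regmatches(titles, m)
hits
```

Only the first title contains a word ending in "ase", so `hits` holds "kinase" followed by two empty strings. Keeping the result aligned with the input is what makes it usable as a label vector for text().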


      # GEO-hits.R
      # bs - Sept. 2013

      # Read the file of GEO differential expression data.
      dat <- read.table("GEO-hits_100.txt", header = TRUE)
      head(dat)

      plot(-log(dat[,"adj.P.Val"]), dat[,"logFC"], cex=0.7, pch=16, col="#BB0000")
      # Note that all these genes have an absolute logFC of at
      # least 1, i.e. at least a twofold change, up or down
      # (limma reports logFC as log2). As a trend, higher
      # significance is found for higher levels of
      # differential expression.

      # In R versions before 4.0.0, read.table() turns all
      # character-containing columns into _factors_ by default.
      # To process them as strings, we convert them to
      # characters; this is harmless if they already are.

      dat[,"Gene.title"] <- as.character(dat[,"Gene.title"])

      # First, let's define some regexes for keywords to guess
      # a function ...

      # (Note the need for doubled escape characters in R!)

      r <- c(   "\\b(\\w+ase)\\b")  # peptidase, kinase ...
      r <- c(r, "\\b(?!factor)(\\w+or)\\b") # whole words: suppressor, adaptor ...
      r <- c(r, "\\b(\\w+)\\b\\s(factor|protein|homolog)") # the preceding word ...


      # Now iterate over the Gene.title column and for each row try all regular
      # expressions.

      for (i in seq_len(nrow(dat))) { # for all rows ...
        dat[i,"Function.guess"] <- "" # clear the contents of this row's cell
        for (j in seq_along(r)) { # for all regular expressions
          M <- regexpr(r[j], dat[i, "Gene.title"], perl = TRUE)
          if (M[1] > 0) {
            dat[i,"Function.guess"] <- regmatches(dat[i,"Gene.title"], M)
            break  # stop regexing if something was found
          }
        }
      }

      dat[,"Function.guess"] # check what we found ...
      # ... and plot each string to the right of its point.
      text(-log(dat[,"adj.P.Val"]), dat[,"logFC"], dat[,"Function.guess"], cex=0.4, pos=4)

      # I'm not sure we are actually learning anything important from this.
      # But the code was merely meant to illustrate how
      # to work with regular expressions in R (and introduce you to GEO
      # differential expression data on the side). Mission accomplished.
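The row-by-row loop can also be expressed with vectorized calls, since regexpr() accepts a whole character vector. The following is one possible sketch, not the author's script: `guessFunction()` is a hypothetical helper introduced here, and the patterns are written out again so the block runs on its own. As in the loop, the first pattern that matches a title wins.

```r
# guessFunction() is a hypothetical helper, not part of the original script.
# For each title it returns the match of the first pattern that hits, or "".
guessFunction <- function(titles, patterns) {
  guess <- character(length(titles))      # start with empty strings
  for (p in patterns) {
    open <- guess == ""                   # titles still without a guess
    m <- regexpr(p, titles[open], perl = TRUE)
    idx <- which(open)[as.vector(m) > 0]  # positions that matched this time
    guess[idx] <- regmatches(titles[open], m)
  }
  guess
}

patterns <- c("\\b\\w+ase\\b",                       # peptidase, kinase ...
              "\\b(?!factor)\\w+or\\b",              # suppressor, adaptor ...
              "\\b\\w+\\s(factor|protein|homolog)")  # the preceding word ...

# Four gene titles taken from the sample data.
titles <- c("tripeptidyl peptidase II",
            "phosphorylated adaptor for RNA export",
            "Sp1 transcription factor",
            "aquaporin 10")

guesses <- guessFunction(titles, patterns)
guesses
```

For these four titles the guesses are "peptidase", "adaptor", "transcription factor" and an empty string. With the data frame from the script, a call such as `dat$Function.guess <- guessFunction(dat$Gene.title, r)` would fill the column in one step.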




 


Further reading, links and resources

 


Notes


 


Self-evaluation

 



 




 

If in doubt, ask! If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.



 

About ...
 
Author:

Boris Steipe <boris.steipe@utoronto.ca>

Created:

2017-08-05

Modified:

2017-08-05

Version:

0.1

Version history:

  • 0.1 First stub

This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License.