Difference between revisions of "Workshops/Saskatoon 2015-Exploratory Data Analysis"

 
(15 intermediate revisions by the same user not shown)
Line 17: Line 17:
  
  
<table width="50%">
+
<table width="90%" cellpadding="5">
  
 
<tr class="sh">
 
<tr class="sh">
Line 28: Line 28:
  
 
<tr class="s2">
 
<tr class="s2">
<td height="72">09:00 &mdash; 10:30</td>
+
<td height="72">09:00 &ndash; 10:30</td>
<td height="72"<br />Lecture and practicals: EDA</td>
+
<td height="72">Lecture and practicals: EDA</td>
<td height="72"<br />Lecture and practicals: Dimension reduction</td>
+
<td height="72">Lecture and practicals: Dimension reduction</td>
 
</tr>
 
</tr>
  
 
<tr class="s1">
 
<tr class="s1">
<td height="24">10:30 &mdash; 11:00</td>
+
<td height="24">10:30 &ndash; 11:00</td>
 
<td height="24" colspan="2">Coffee break</td>
 
<td height="24" colspan="2">Coffee break</td>
 
</tr>
 
</tr>
  
 
<tr class="s2">
 
<tr class="s2">
<td height="72">11:00 &mdash; 12:30</td>
+
<td height="72">11:00 &ndash; 12:30</td>
 
<td height="72">Lecture and practicals: EDA</td>
 
<td height="72">Lecture and practicals: EDA</td>
 
<td height="72">Lecture and practicals: Clustering</td>
 
<td height="72">Lecture and practicals: Clustering</td>
Line 45: Line 45:
  
 
<tr class="s1">
 
<tr class="s1">
<td height="48">12:30 &mdash; 13:30</td>
+
<td height="48">12:30 &ndash; 13:30</td>
 
<td height="48" colspan="2">Lunch break</td>
 
<td height="48" colspan="2">Lunch break</td>
 
</tr>
 
</tr>
  
 
<tr class="s2">
 
<tr class="s2">
<td height="72">13:30 &mdash; 15:00</td>
+
<td height="72">13:30 &ndash; 15:00</td>
 
<td height="72">Lecture and practicals: Software development</td>
 
<td height="72">Lecture and practicals: Software development</td>
 
<td height="72">Lecture and practicals:Clustering</td>
 
<td height="72">Lecture and practicals:Clustering</td>
Line 56: Line 56:
  
 
<tr class="s1">
 
<tr class="s1">
<td height="24">15:00 &mdash; 15:30</td>
+
<td height="24">15:00 &ndash; 15:30</td>
 
<td height="24" colspan="2">Coffee break</td>
 
<td height="24" colspan="2">Coffee break</td>
 
</tr>
 
</tr>
  
 
<tr class="s2">
 
<tr class="s2">
<td height="72">13:30 &mdash; 15:00</td>
+
<td height="72">13:30 &ndash; 15:00</td>
 
<td height="72">Lecture and practicals: Regression</td>
 
<td height="72">Lecture and practicals: Regression</td>
 
<td height="72">Lecture and practicals: Hypothesis testing</td>
 
<td height="72">Lecture and practicals: Hypothesis testing</td>
Line 73: Line 73:
 
&nbsp;
 
&nbsp;
  
==Module 1==
 
  
 +
==General Resources==
 +
*[[Media:R_refcard-data-mining.pdf|A concise reference card of '''R''' functions for data mining (pdf)]]
 +
 +
 +
 +
 +
==Progress Notes Thursday==
 +
 +
Selected objectives we covered during the workshop:
 +
 +
* subsetting
 +
** selecting rows and columns by "index"
 +
** ... by rowname or columnname as string or vector of strings
 +
** ... using the $ sign for individual columns of a dataframe
 +
** using order() to get values sorted by some property
 +
   
 +
* filtering
 +
** finding elements that contain a string with grep() (and using that to select rows)
 +
** finding elements that match a logical expression, such as ==, <, > etc.
 +
 +
 +
----
 +
 +
 +
* simple descriptive statistics
 +
** mean() / median()
 +
** using as.numeric(), as.logical() etc. to force evaluation as a  particular type
 +
** sd() / IQR()  / summary()
 +
** theoretical and empirical quantiles; quantile()
 +
 
 +
* random numbers and seeded random numbers; set.seed()
 +
* normally distributed random numbers; rnorm()
 +
   
 +
* simple plots
 +
**  abline() to draw lines on plots with parameters h= ... or v = ...
 +
 +
 +
----
 +
 +
 +
* scatterplot
 +
** empty plots and overplotting with
 +
*** points()
 +
*** lines()
 +
*** segments()
 +
*** text()
 +
* boxplot
 +
* barplot
 +
* colors
 +
** color names
 +
** colors as hexcodes
 +
** color palettes
 +
** transparency
 +
* lines
 +
** linetypes (lty=) / line width (lwd=)
 +
* plotting characters (pch=)
 +
 +
* hexbin package
 +
 +
 +
----
 +
 +
 +
* synthetic data is useful
 +
* linear regression
 +
** retrieving parameters;  lm()
 +
** analyzing quality;  resid()
 +
**  plotting prediction and confidence intervals
 +
* non-linear regression
 +
** set up a formula
 +
** initialize with starting values
 +
** plot results
 +
* MIC (maximal information coefficient) as an alternative to Pearson correlation
 +
 +
 +
&nbsp;
 +
 +
 +
==Progress Notes Friday==
 +
 +
Selected objectives we covered during the workshop:
 +
 +
* principles of PCA
 +
* interpreting a PCA as a projection
 +
* calculated PCA for crabs morphometric data and for gene expression profiles
 +
* discussed the output of prcomp()
 +
* plotted rotations and discussed their interpretation ("eigengenes")
 +
 +
* parallel coordinate plots for high-dimensional data; matplot()
 +
 +
* normalizing data (subtract mean and divide by standard deviation)
 +
 +
* plotting an empty frame and using points(), lines(), text() or other functions to plot elements under detailed, individual control
 +
 +
* showing categories and features on a plot
 +
** using color
 +
** using shapes
 +
** using size
 +
** using text()
 +
 
 +
* discussed confounding factors and how to recognize them through PCA
 +
 +
* used PCA / Correlations etc. to discover similar elements (e.g. genes with similar expression profiles)
 +
 +
* correlating against models (e.g. sine wave) as an alternative to PCA
 +
 +
* using "embedding" (package "tsne") as an alternative to correlations
 +
   
 +
----
 +
 +
* clustering
 +
** ...needs a measurable notion of "similarity"
 +
** ... using distance "metrics" such as Euclidean, 1-correlation ... many more
 +
 +
* hierarchical clustering
 +
** different linkage algorithms possible
 +
** results in a dendrogram
 +
** needs to cut dendrogram to define clusters and retrieve elements
 +
 +
* many alternatives, need to try more than one, experiment and find (ideally) some external way to validate (e.g. functional annotation)
 +
 +
* discussed ways to interact with plots (pick points with identify() and locator())
 +
 +
----
 +
* discussed principles of Hypothesis testing
 +
** Error types
 +
** p-values
 +
** significance levels
 +
** multiple testing problem
 +
 +
* examples from GEO dataset analysis
 +
* discussed simulation / permutation testing as alternatives to classical statistical tests
 +
 +
 +
&nbsp;
 +
 +
==EDA==
 +
 +
 +
;Slides
 +
*[[Media:EDA.pdf|EDA.pdf (pdf of slides)]]
 +
 +
;Scripts
 +
*[[Media:EDA.R|'''EDA.R''' (the main script for this session)]]
 +
*<small>&nbsp;&nbsp;&nbsp;&nbsp;[[Media:SubsettingQuizAnswers.R|SubsettingQuizAnswers.R]]&nbsp;&nbsp;(Answers, in case you didn't already write them in your script)</small>
 +
*[[Media:PlottingReference.R|PlottingReference.R]]
 +
 +
 +
;Data
 +
*[[Media:GvHD.txt|GvHD.txt]]
 +
 +
 +
;Resources
 +
 +
 +
&nbsp;
 +
 +
==Software==
 +
 +
 +
;Links
 +
*[[Regular_Expressions]]
 +
*[[Software_Development]]
 +
*[[R_knitr]]
 +
 +
 +
 +
;Resources
 +
*[https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects '''Using Projects with R Studio''']
 +
*[http://software-carpentry.org/ '''Software Carpentry''']
 +
**[http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001745 Best Practices for Scientific Computing, Wilson ''et al.'', PLoS Biology, Jan. 2014]
 +
*[https://support.rstudio.com/hc/en-us/articles/200532077-Version-Control-with-Git-and-SVN '''Version control in R Studio''']
 +
 +
 +
&nbsp;
 +
 +
==Regression==
 +
 +
 +
;Slides
 +
*[[Media:EDA_Regression.pdf|EDA_Regression.pdf (pdf of slides)]]
  
 
;Scripts
 
;Scripts
<!-- *[[Media:ScriptTemplate.R|ScriptTemplate.R]] -->
+
*[[Media:EDA_Regression.R|'''EDA_Regression.R''' (the main script for this session)]]
 +
 
  
  
 
&nbsp;
 
&nbsp;
 +
 +
 +
==Dimension Reduction==
 +
 +
 +
;Slides
 +
*[[Media:EDA_DimensionReduction.pdf|EDA_DimensionReduction.pdf (pdf of slides)]]
 +
 +
 +
;Scripts
 +
*[[Media:EDA_DimensionReduction.R|EDA_DimensionReduction.R]]
 +
 +
 +
;Data
 +
*[[Media:Logcho_237_4class.txt|Logcho_237_4class.txt]]
 +
 +
 +
;Resources
 +
 +
 +
&nbsp;
 +
 +
==Clustering==
 +
 +
 +
;Slides
 +
*[[Media:EDA_Clustering.pdf|EDA_Clustering.pdf (pdf of slides)]]
 +
 +
 +
;Scripts
 +
*[[Media:EDA_ClusteringExpressionData.R|EDA_ClusteringExpressionData.R]]
 +
 +
 +
;Data
 +
*[[Media:GSE26922.dat|GSE26922.dat (Fallback data)]]
 +
 +
 +
 +
&nbsp;
 +
 +
 +
 +
==Hypothesis Testing==
 +
 +
 +
;Slides
 +
*[[Media:EDA_HypothesisTesting.pdf|EDA_HypothesisTesting.pdf (pdf of slides)]]
 +
 +
 +
;Scripts
 +
*[[Media:EDA_HypothesisTesting.R|EDA_HypothesisTesting.R]]
 +
 +
 +
;Resources
 +
*[[Media:Tan_2015-NGSdifferentialTranscription.pdf|Tan_2015-NGSdifferentialTranscription.pdf]]
 +
*[[Media:ErroneusAnalysesOfSignificance-NatureNeuroscience2011.pdf|ErroneusAnalysesOfSignificance-NatureNeuroscience2011.pdf]]
 +
 +
&nbsp;
 +
 +
 +
==Generally useful links==
 +
 +
;Help and Information
 +
* The '''R''' help mailing list:  https://stat.ethz.ch/mailman/listinfo/r-help
 +
* '''Rseek''': the specialized search engine for '''R''' topics: http://rseek.org/
 +
* '''R''' questions on stackoverflow:  http://stackoverflow.com/questions/tagged/r
 +
* The Comprehensive '''R''' Archive Network '''CRAN''':  http://cran.r-project.org/
 +
* The '''CRAN''' task-view collection: http://cran.r-project.org/web/views/
 +
* '''Bioconductor''' task views:  http://www.bioconductor.org/packages/release/BiocViews.html
 +
 +
 +
;Resources
 +
*[[Media:Weissgerber_(2015)_BeyondBarcharts.pdf|Weissgerber_(2015)_BeyondBarcharts.pdf]]
 +
 +
 +
&nbsp;
 +
  
 
==Notes==
 
==Notes==

Latest revision as of 19:04, 23 August 2015

Introduction to Exploratory Data Analysis with R



 

Schedule

Please note: this schedule is a rough guideline only; we will adapt flexibly to class needs as we proceed.


Time            Thursday's Activities                          Friday
09:00 – 10:30   Lecture and practicals: EDA                    Lecture and practicals: Dimension reduction
10:30 – 11:00   Coffee break
11:00 – 12:30   Lecture and practicals: EDA                    Lecture and practicals: Clustering
12:30 – 13:30   Lunch break
13:30 – 15:00   Lecture and practicals: Software development   Lecture and practicals: Clustering
15:00 – 15:30   Coffee break
13:30 – 15:00   Lecture and practicals: Regression             Lecture and practicals: Hypothesis testing


 


General Resources

  • A concise reference card of R functions for data mining (pdf): R_refcard-data-mining.pdf



Progress Notes Thursday

Selected objectives we covered during the workshop:

  • subsetting
    • selecting rows and columns by "index"
    • ... by rowname or columnname as string or vector of strings
    • ... using the $ sign for individual columns of a dataframe
    • using order() to get values sorted by some property
  • filtering
    • finding elements that contain a string with grep() (and using that to select rows)
    • finding elements that match a logical expression, such as ==, <, > etc.
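The subsetting and filtering idioms listed above can be sketched on a small data frame. The data frame and its column names (gene, expr, ratio) are invented for this illustration and are not taken from the workshop scripts.

```r
# Hypothetical toy data frame, made up for this illustration
df <- data.frame(
  gene  = c("YAL001C", "YBR020W", "YAL003W", "YCR012W"),
  expr  = c(2.5, 0.4, 7.1, 1.3),
  ratio = c(0.9, 1.8, 1.1, 0.2),
  stringsAsFactors = FALSE
)
rownames(df) <- df$gene

df[2, 3]                            # one element, by row and column index
df[c(1, 3), ]                       # rows 1 and 3, all columns
df["YBR020W", c("expr", "ratio")]   # by row name and a vector of column names
df$expr                             # a single column with the $ operator

# order() returns the permutation that sorts its argument;
# using it as a row index sorts the whole data frame by that column
sortedDf <- df[order(df$expr, decreasing = TRUE), ]

# grep() returns the indices of the elements that match a pattern;
# those indices then select rows
yalRows <- df[grep("^YAL", df$gene), ]

# a logical expression gives a TRUE/FALSE vector that selects rows
highExpr <- df[df$expr > 2, ]
```

Note that order() sorts rows without changing the original object; the result has to be assigned if it is to be kept.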




  • simple descriptive statistics
    • mean() / median()
    • using as.numeric(), as.logical() etc. to force evaluation as a particular type
    • sd() / IQR() / summary()
    • theoretical and empirical quantiles; quantile()
  • random numbers and seeded random numbers; set.seed()
  • normally distributed random numbers; rnorm()
  • simple plots
    • abline() to draw lines on plots with parameters h= ... or v = ...
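A minimal sketch of the descriptive statistics and random number functions named above. The seed, sample size and distribution parameters are arbitrary choices for illustration only.

```r
if (!interactive()) pdf(NULL)         # null device when run as a script

set.seed(112358)                      # arbitrary seed: makes the draw reproducible
x <- rnorm(1000, mean = 10, sd = 2)   # 1000 normally distributed values

mean(x); median(x)
sd(x); IQR(x)
summary(x)

# values read from text often arrive as strings; coerce before computing
as.numeric(c("1.5", "2", "3.5"))
as.logical(c("TRUE", "F"))

# empirical quantiles of the sample vs. theoretical quantiles of the model
probs <- c(0.25, 0.5, 0.75)
empirical   <- quantile(x, probs = probs)
theoretical <- qnorm(probs, mean = 10, sd = 2)

# setting the same seed again reproduces exactly the same numbers
set.seed(112358)
y <- rnorm(1000, mean = 10, sd = 2)
identical(x, y)

hist(x)
abline(v = mean(x), col = "red")      # vertical line; use h = ... for a horizontal one
```

With a sample this large, the empirical quartiles should lie close to the theoretical ones; how close depends on the sample size.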




  • scatterplot
    • empty plots and overplotting with
      • points()
      • lines()
      • segments()
      • text()
  • boxplot
  • barplot
  • colors
    • color names
    • colors as hexcodes
    • color palettes
    • transparency
  • lines
    • linetypes (lty=) / line width (lwd=)
  • plotting characters (pch=)
  • hexbin package
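The base graphics elements above combine as in the following sketch. The data are synthetic and the particular colors, symbols and coordinates are arbitrary; the hexbin package is not shown because it is an add-on package that needs to be installed separately.

```r
if (!interactive()) pdf(NULL)   # null device when run as a script

set.seed(11)
x <- rnorm(200)
y <- x + rnorm(200, sd = 0.5)
grp <- factor(sample(c("a", "b"), 200, replace = TRUE))

# empty frame: type = "n" sets up the axes without drawing any data
plot(x, y, type = "n", xlab = "x", ylab = "y",
     main = "Overplotting on an empty frame")

# transparency: hex codes with an alpha channel (#RRGGBBAA)
cols <- c(a = "#CC000055", b = "#0000CC55")
points(x, y, pch = c(16, 17)[grp], col = cols[as.character(grp)])

lines(sort(x), sort(x), lty = 2, lwd = 2, col = "grey30")      # dashed identity line
segments(x0 = -2, y0 = 0, x1 = 2, y1 = 0, col = "firebrick")   # a single segment
text(-2, 2, labels = "note", col = "darkgreen")                # text at (-2, 2)

# a palette interpolated between two named colors
pal <- colorRampPalette(c("white", "steelblue"))(5)

boxplot(y ~ grp, col = pal[c(2, 4)])
barplot(table(grp), col = c("seagreen", "orange"))
```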




  • synthetic data is useful
  • linear regression
    • retrieving parameters; lm()
    • analyzing quality; resid()
    • plotting prediction and confidence intervals
  • non-linear regression
    • set up a formula
    • initialize with starting values
    • plot results
  • MIC (maximal information coefficient) as an alternative to Pearson correlation
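The regression steps above can be tried on synthetic data, where the true parameters are known and the fit can be checked against them. The models, noise levels and starting values below are assumptions chosen for this sketch, not the ones used in EDA_Regression.R.

```r
if (!interactive()) pdf(NULL)   # null device when run as a script

set.seed(7)
# synthetic data: true intercept 3, true slope 2
x <- seq(1, 10, length.out = 50)
y <- 3 + 2 * x + rnorm(50, sd = 1)

fit <- lm(y ~ x)
coef(fit)                       # estimated intercept and slope
res <- resid(fit)               # residuals; should show no trend against x

# confidence and prediction intervals on a grid of new x values
newX <- data.frame(x = seq(1, 10, length.out = 100))
pc <- predict(fit, newdata = newX, interval = "confidence")
pp <- predict(fit, newdata = newX, interval = "prediction")

plot(x, y, pch = 20)
abline(fit, col = "red")
lines(newX$x, pc[, "lwr"], lty = 2); lines(newX$x, pc[, "upr"], lty = 2)
lines(newX$x, pp[, "lwr"], lty = 3); lines(newX$x, pp[, "upr"], lty = 3)

# non-linear regression: a formula plus starting values for its parameters
tm <- seq(0, 5, length.out = 60)
z <- 5 * exp(-0.8 * tm) + rnorm(60, sd = 0.1)   # true a = 5, k = 0.8
nlfit <- nls(z ~ a * exp(-k * tm), start = list(a = 4, k = 1))
coef(nlfit)

plot(tm, z, pch = 20)
lines(tm, predict(nlfit), col = "blue")
```

nls() can fail to converge when the starting values are far from the solution, so a rough estimate from a plot of the data is usually worth the effort.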


 


Progress Notes Friday

Selected objectives we covered during the workshop:

  • principles of PCA
  • interpreting a PCA as a projection
  • calculated PCA for crabs morphometric data and for gene expression profiles
  • discussed the output of prcomp()
  • plotted rotations and discussed their interpretation ("eigengenes")
  • parallel coordinate plots for high-dimensional data; matplot()
  • normalizing data (subtract mean and divide by standard deviation)
  • plotting an empty frame and using points(), lines(), text() or other functions to plot elements under detailed, individual control
  • showing categories and features on a plot
    • using color
    • using shapes
    • using size
    • using text()
  • discussed confounding factors and how to recognize them through PCA
  • used PCA / Correlations etc. to discover similar elements (e.g. genes with similar expression profiles)
  • correlating against models (e.g. sine wave) as an alternative to PCA
  • using "embedding" (package "tsne") as an alternative to correlations
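A short sketch of the PCA workflow above. It uses the built-in iris measurements as a stand-in, which is an assumption made so that the example runs without extra packages; the workshop itself used the crabs morphometric data and gene expression profiles.

```r
if (!interactive()) pdf(NULL)   # null device when run as a script

dat <- iris[, 1:4]              # numeric columns only (stand-in data)
pca <- prcomp(dat, center = TRUE, scale. = TRUE)

summary(pca)                    # proportion of variance per component
pca$rotation                    # each PC as a combination of the original variables
head(pca$x)                     # the data projected onto the PCs

# normalizing by hand (subtract mean, divide by sd) and projecting
# onto the rotation matrix reproduces pca$x: PCA as a projection
normDat <- scale(dat)
proj <- normDat %*% pca$rotation

# parallel coordinates: one line per observation across the variables
spCol <- c("firebrick", "seagreen", "steelblue")[iris$Species]
matplot(t(normDat), type = "l", lty = 1, col = spCol, xaxt = "n",
        ylab = "normalized value")
axis(1, at = 1:4, labels = colnames(dat))

# empty frame, then points under individual control:
# color and plotting character both encode the category
plot(pca$x[, 1], pca$x[, 2], type = "n", xlab = "PC1", ylab = "PC2")
points(pca$x[, 1], pca$x[, 2], col = spCol,
       pch = c(15, 16, 17)[iris$Species])
```

The sign of each principal component is arbitrary, so a plot may appear mirrored between runs on different platforms without changing the interpretation.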

  • clustering
    • ...needs a measurable notion of "similarity"
    • ... using distance "metrics" such as Euclidean, 1-correlation ... many more
  • hierarchical clustering
    • different linkage algorithms possible
    • results in a dendrogram
    • needs to cut dendrogram to define clusters and retrieve elements
  • many alternatives, need to try more than one, experiment and find (ideally) some external way to validate (e.g. functional annotation)
  • discussed ways to interact with plots (pick points with identify() and locator())
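The clustering steps above, from a distance matrix to cluster membership, can be sketched as follows. The data are synthetic with two deliberately well-separated groups, and the choice of average linkage and k = 2 is an assumption for illustration.

```r
if (!interactive()) pdf(NULL)   # null device when run as a script

set.seed(3)
# synthetic data: two groups of 10 "genes" measured under 5 conditions
m <- rbind(matrix(rnorm(50, mean = 0), ncol = 5),
           matrix(rnorm(50, mean = 6), ncol = 5))
rownames(m) <- paste0("g", 1:20)

dEuc <- dist(m, method = "euclidean")   # Euclidean distance between rows
dCor <- as.dist(1 - cor(t(m)))          # 1 - correlation between row profiles

# linkage is a choice: "complete", "single", "average", ...
hc <- hclust(dEuc, method = "average")
plot(hc)                                # the dendrogram

cl <- cutree(hc, k = 2)                 # cut the tree into two clusters
table(cl)
names(cl)[cl == 1]                      # retrieve the members of cluster 1

# picking points needs a live graphics window, so only run interactively
if (interactive()) {
  plot(m[, 1], m[, 2])
  identify(m[, 1], m[, 2], labels = rownames(m))   # click points to label them
  locator(1)                                       # click to read one coordinate
}
```

Repeating the hclust() call with dCor or with a different linkage method is a quick way to see how much the result depends on those choices.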

  • discussed principles of Hypothesis testing
    • Error types
    • p-values
    • significance levels
    • multiple testing problem
  • examples from GEO dataset analysis
  • discussed simulation / permutation testing as alternatives to classical statistical tests
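A sketch of a permutation test and of the multiple testing problem, under stated assumptions: the two samples are synthetic, the test statistic is the difference of means, and the number of permutations is an arbitrary choice. This is not the analysis from EDA_HypothesisTesting.R.

```r
set.seed(42)
a <- rnorm(20, mean = 0)
b <- rnorm(20, mean = 2)
obs <- mean(b) - mean(a)        # observed difference of means

# permutation test: shuffle the group labels to build the null distribution
pooled <- c(a, b)
nPerm <- 5000
permDiff <- replicate(nPerm, {
  idx <- sample(length(pooled), length(a))
  mean(pooled[-idx]) - mean(pooled[idx])
})
# two-sided p-value; the +1 counts the observed labelling itself
pPerm <- (sum(abs(permDiff) >= abs(obs)) + 1) / (nPerm + 1)

pT <- t.test(a, b)$p.value      # Welch t-test for comparison

# multiple testing: 1000 tests on pure noise at a significance level of 0.05
pNull <- replicate(1000, t.test(rnorm(10), rnorm(10))$p.value)
nRaw  <- sum(pNull < 0.05)                                     # roughly 50 expected
nBonf <- sum(p.adjust(pNull, method = "bonferroni") < 0.05)    # family-wise control
nBH   <- sum(p.adjust(pNull, method = "BH") < 0.05)            # false discovery rate
```

Every hit in nRaw is a false positive (a type I error) because both samples always come from the same distribution; the corrections trade some power for control of such errors.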


 

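EDA

Slides
  • EDA.pdf (pdf of slides)

Scripts
  • EDA.R (the main script for this session)
  • SubsettingQuizAnswers.R (Answers, in case you didn't already write them in your script)
  • PlottingReference.R

Data
  • GvHD.txt

Resources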


 

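Software

Links
  • Regular Expressions
  • Software Development
  • R knitr

Resources
  • Using Projects with R Studio: https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects
  • Software Carpentry: http://software-carpentry.org/
    • Best Practices for Scientific Computing, Wilson et al., PLoS Biology, Jan. 2014: http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001745
  • Version control in R Studio: https://support.rstudio.com/hc/en-us/articles/200532077-Version-Control-with-Git-and-SVN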


 

Regression

Slides
  • EDA_Regression.pdf (pdf of slides)

Scripts
  • EDA_Regression.R (the main script for this session)


 


Dimension Reduction

Slides
  • EDA_DimensionReduction.pdf (pdf of slides)

Scripts
  • EDA_DimensionReduction.R

Data
  • Logcho_237_4class.txt

Resources


 

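Clustering

Slides
  • EDA_Clustering.pdf (pdf of slides)

Scripts
  • EDA_ClusteringExpressionData.R

Data
  • GSE26922.dat (Fallback data)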


 


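Hypothesis Testing

Slides
  • EDA_HypothesisTesting.pdf (pdf of slides)

Scripts
  • EDA_HypothesisTesting.R

Resources
  • Tan_2015-NGSdifferentialTranscription.pdf
  • ErroneusAnalysesOfSignificance-NatureNeuroscience2011.pdf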

 


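Generally useful links

Help and Information
  • The R help mailing list: https://stat.ethz.ch/mailman/listinfo/r-help
  • Rseek: the specialized search engine for R topics: http://rseek.org/
  • R questions on stackoverflow: http://stackoverflow.com/questions/tagged/r
  • The Comprehensive R Archive Network CRAN: http://cran.r-project.org/
  • The CRAN task-view collection: http://cran.r-project.org/web/views/
  • Bioconductor task views: http://www.bioconductor.org/packages/release/BiocViews.html

Resources
  • Weissgerber_(2015)_BeyondBarcharts.pdf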


 


Notes