Difference between revisions of "Workshops/Saskatoon 2015-Exploratory Data Analysis"

 
(6 intermediate revisions by the same user not shown)
==EDA==
Line 90: Line 218:
;Scripts
*[[Media:EDA.R|'''EDA.R''' (the main script for this session)]]
+ *<small>&nbsp;&nbsp;&nbsp;&nbsp;[[Media:SubsettingQuizAnswers.R|SubsettingQuizAnswers.R]]&nbsp;&nbsp;(Answers, in case you didn't already write them in your script)</small>
*[[Media:PlottingReference.R|PlottingReference.R]]
Line 139: Line 268:
;Slides
+ *[[Media:EDA_DimensionReduction.pdf|EDA_DimensionReduction.pdf (pdf of slides)]]

;Scripts
+ *[[Media:EDA_DimensionReduction.R|EDA_DimensionReduction.R]]
+
+ ;Data
+ *[[Media:Logcho_237_4class.txt|Logcho_237_4class.txt]]
Line 148: Line 283:
&nbsp;

==Clustering==
Line 155: Line 288:
;Slides
+ *[[Media:EDA_Clustering.pdf|EDA_Clustering.pdf (pdf of slides)]]

;Scripts
+ *[[Media:EDA_ClusteringExpressionData.R|EDA_ClusteringExpressionData.R]]

;Resources
+ ;Data
+ *[[Media:GSE26922.dat|GSE26922.dat (Fallback data)]]
Line 171: Line 308:
;Slides
+ *[[Media:EDA_HypothesisTesting.pdf|EDA_HypothesisTesting.pdf (pdf of slides)]]

;Scripts
+ *[[Media:EDA_HypothesisTesting.R|EDA_HypothesisTesting.R]]

;Resources
+ *[[Media:Tan_2015-NGSdifferentialTranscription.pdf|Tan_2015-NGSdifferentialTranscription.pdf]]
+ *[[Media:ErroneusAnalysesOfSignificance-NatureNeuroscience2011.pdf|ErroneusAnalysesOfSignificance-NatureNeuroscience2011.pdf]]

&nbsp;

Latest revision as of 19:04, 23 August 2015

Introduction to Exploratory Data Analysis with R



 

Schedule

Please note: this schedule is only a rough guideline. We will adapt it to class needs as we proceed.


Time            Thursday                                       Friday
09:00 – 10:30   Lecture and practicals: EDA                    Lecture and practicals: Dimension reduction
10:30 – 11:00   Coffee break
11:00 – 12:30   Lecture and practicals: EDA                    Lecture and practicals: Clustering
12:30 – 13:30   Lunch break
13:30 – 15:00   Lecture and practicals: Software development   Lecture and practicals: Clustering
15:00 – 15:30   Coffee break
15:30 – 17:00   Lecture and practicals: Regression             Lecture and practicals: Hypothesis testing


 


General Resources



Progress Notes Thursday

Selected objectives we covered during the workshop:

  • subsetting
    • selecting rows and columns by "index"
    • ... by rowname or columnname as string or vector of strings
    • ... using the $ sign for individual columns of a dataframe
    • using order() to get values sorted by some property
  • filtering
    • finding elements that contain a string with grep() (and using that to select rows)
    • finding elements that match a logical expression, such as ==, <, > etc.
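
The subsetting and filtering idioms listed above can be sketched in a few lines of base R. The data frame `df`, its gene names, and its values are made up for illustration and are not workshop data.

```r
# Small illustrative data frame (hypothetical values)
df <- data.frame(
  gene = c("YAL001C", "YBR020W", "YAL038W", "YCR012W"),
  expr = c(2.5, 0.7, 5.1, 1.3),
  pval = c(0.04, 0.60, 0.001, 0.20),
  stringsAsFactors = FALSE
)
rownames(df) <- df$gene

# Subsetting
byIndex <- df[c(1, 3), 2:3]                   # rows 1 and 3, columns 2 to 3
byName  <- df["YBR020W", c("expr", "pval")]   # by rowname and column names
exprCol <- df$expr                            # one column, returned as a vector
sorted  <- df[order(df$expr), ]               # rows ordered by ascending expr

# Filtering
yalRows <- df[grep("^YAL", df$gene), ]        # rows whose gene name starts with "YAL"
sig     <- df[df$pval < 0.05, ]               # rows matching a logical expression
```

Note that `order()` returns the row indices in sorted order rather than the sorted values, which is why it is used inside the row position of `[ , ]`.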




  • simple descriptive statistics
    • mean() / median()
    • using as.numeric(), as.logical() etc. to force evaluation as a particular type
    • sd() / IQR() / summary()
    • theoretical and empirical quantiles; quantile()
  • random numbers and seeded random numbers; set.seed()
  • normally distributed random numbers; rnorm()
  • simple plots
    • abline() to draw horizontal or vertical lines on plots, with parameters h = ... or v = ...
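
As a rough sketch of the statistics and random-number items above: the seed value, sample size, and distribution parameters below are arbitrary choices for illustration.

```r
set.seed(112358)                      # arbitrary seed; makes the draw reproducible
x <- rnorm(1000, mean = 10, sd = 2)   # normally distributed random numbers

set.seed(112358)                      # same seed again ...
x2 <- rnorm(1000, mean = 10, sd = 2)  # ... gives exactly the same numbers

m   <- mean(x)
med <- median(x)
s   <- sd(x)
iqr <- IQR(x)
print(summary(x))

# Force evaluation as a particular type
asNum <- as.numeric(c("1.5", "2", "3e2"))   # character -> numeric

# Empirical quantiles of the sample vs. theoretical quantiles of N(10, 2)
probs <- c(0.25, 0.5, 0.75)
empQ  <- quantile(x, probs = probs)
thQ   <- qnorm(probs, mean = 10, sd = 2)

# Simple plot with reference lines
hist(x, breaks = 30, main = "1000 draws from N(10, 2)")
abline(v = m, col = "red")            # vertical line at the sample mean
abline(v = thQ, lty = 2)              # vertical lines at the theoretical quartiles
```

With 1000 draws the empirical quartiles should fall close to the theoretical ones; smaller samples will show larger deviations.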




  • scatterplot
    • empty plots and overplotting with
      • points()
      • lines()
      • segments()
      • text()
  • boxplot
  • barplot
  • colors
    • color names
    • colors as hexcodes
    • color palettes
    • transparency
  • lines
    • linetypes (lty=) / line width (lwd=)
  • plotting characters (pch=)
  • hexbin package
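
The plotting items above can be combined into one short base-graphics sketch. The data are synthetic and the colour choices are arbitrary; the hexbin package is left out here so that the example needs nothing beyond a standard R installation.

```r
set.seed(42)
x <- rnorm(50)
y <- x + rnorm(50, sd = 0.5)

# Empty frame first, then add elements one layer at a time
plot(x, y, type = "n", main = "Overplotting on an empty frame")
segments(x0 = x, y0 = y, x1 = x, y1 = x, col = "#00000033")   # hex colour with alpha
points(x, y, pch = 21, col = "firebrick", bg = "#FF000044")   # named border, transparent fill
lines(range(x), range(x), lty = 2, lwd = 2, col = "steelblue") # dashed diagonal y = x
text(min(x), max(y), labels = "dashed: y = x", pos = 4)

# A palette of five colours and a semi-transparent named colour
pal  <- colorRampPalette(c("white", "navy"))(5)
semi <- adjustcolor("seagreen", alpha.f = 0.4)

# Boxplot and barplot using the palette
boxplot(list(x = x, y = y), col = c(pal[2], semi))
barplot(table(cut(x, breaks = 5)), col = pal)
```

Drawing the faint segments before the points keeps the points on top; layers are painted in the order the functions are called.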




  • synthetic data is useful
  • linear regression
    • retrieving parameters; lm()
    • analyzing quality; resid()
    • plotting prediction and confidence intervals
  • non-linear regression
    • set up a formula
    • initiate with starting values
    • plot results
  • MIC (maximal information coefficient) as an alternative to Pearson correlation
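
The regression steps above might look as follows on synthetic data, where the true parameters are known and the fit can be checked against them. The models, noise levels, and starting values are assumptions chosen for illustration. MIC is not shown because it requires an add-on package.

```r
set.seed(7)

# --- Linear regression on synthetic data (true intercept 3, true slope 2)
x <- seq(0, 10, length.out = 60)
y <- 3 + 2 * x + rnorm(60, sd = 1.5)

fit <- lm(y ~ x)
print(coef(fit))                     # retrieve the fitted parameters
res <- resid(fit)                    # residuals, for judging fit quality

newX <- data.frame(x = seq(0, 10, length.out = 100))
pc <- predict(fit, newX, interval = "confidence")   # uncertainty of the fitted line
pp <- predict(fit, newX, interval = "prediction")   # range expected for new observations

plot(x, y)
abline(fit, col = "red")
matlines(newX$x, pc[, c("lwr", "upr")], lty = 2, col = "blue")
matlines(newX$x, pp[, c("lwr", "upr")], lty = 3, col = "grey40")

# --- Non-linear regression: exponential decay z = A * exp(-k * tm)
tm <- seq(0, 5, length.out = 50)
z  <- 10 * exp(-0.8 * tm) + rnorm(50, sd = 0.2)

nfit <- nls(z ~ A * exp(-k * tm),          # set up a formula
            start = list(A = 8, k = 0.5))  # initiate with starting values
plot(tm, z)
lines(tm, predict(nfit), col = "red")      # plot the fitted curve
```

The prediction band is always wider than the confidence band, because it includes the scatter of individual observations in addition to the uncertainty of the fitted line.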


 


Progress Notes Friday

Selected objectives we covered during the workshop:

  • principles of PCA
  • interpreting a PCA as a projection
  • calculated PCA for crabs morphometric data and for gene expression profiles
  • discussed the output of prcomp()
  • plotted rotations and discussed their interpretation ("eigengenes")
  • parallel coordinate plots for high-dimensional data; matplot()
  • normalizing data (subtract mean and divide by standard deviation)
  • plotting an empty frame and using points(), lines(), text() or other functions to plot elements under detailed, individual control
  • showing categories and features on a plot
    • using color
    • using shapes
    • using size
    • using text()
  • discussed confounding factors and how to recognize them through PCA
  • used PCA / Correlations etc. to discover similar elements (e.g. genes with similar expression profiles)
  • correlating against models (e.g. sine wave) as an alternative to PCA
  • using "embedding" (package "tsne") as an alternative to correlations
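
A minimal PCA sketch along the lines of the notes above. It uses the built-in `iris` measurements as a stand-in, since the crabs and expression data from the workshop are not reproduced here; that substitution is an assumption of this example.

```r
dat <- as.matrix(iris[, 1:4])      # stand-in data: 150 observations, 4 measurements
grp <- as.integer(iris$Species)    # category codes 1..3, used for colour and shape

# Normalize: subtract the column mean and divide by the column standard deviation
scaled <- scale(dat)

pca <- prcomp(scaled)
print(summary(pca))                # variance explained per component
print(pca$rotation)                # loadings: one column per principal component

# PCA as a projection: the scores are the centred data multiplied by the rotation
proj <- scaled %*% pca$rotation

# Empty frame, then points with category shown by colour and shape
plot(pca$x[, 1], pca$x[, 2], type = "n", xlab = "PC1", ylab = "PC2")
points(pca$x[, 1], pca$x[, 2], col = grp, pch = 15 + grp, cex = 0.8)
text(pca$x[1, 1], pca$x[1, 2], labels = "obs 1", pos = 3)

# Parallel coordinate plot: one line per observation across the four variables
matplot(t(scaled), type = "l", lty = 1, col = grp, xaxt = "n",
        xlab = "", ylab = "normalized value")
axis(1, at = 1:4, labels = colnames(dat))
```

Because every column was scaled to unit variance, the component variances (`pca$sdev^2`) sum to the number of columns, which gives a quick sanity check on the output of `prcomp()`.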

  • clustering
    • ...needs a measurable notion of "similarity"
    • ... using distance "metrics" such as Euclidean distance, 1 - correlation, and many more
  • hierarchical clustering
    • different linkage algorithms possible
    • results in a dendrogram
    • needs to cut dendrogram to define clusters and retrieve elements
  • many alternatives, need to try more than one, experiment and find (ideally) some external way to validate (e.g. functional annotation)
  • discussed ways to interact with plots (pick points with identify() and locator())
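
The clustering workflow above can be sketched with base R alone. The "expression profiles" below are synthetic (three groups of ten noisy copies of a template), and the linkage method and number of clusters are arbitrary choices for illustration.

```r
set.seed(3)

# Three template profiles over 8 time points, 10 noisy copies of each
templates <- rbind(sin(1:8), cos(1:8), seq(-1, 1, length.out = 8))
prof <- templates[rep(1:3, each = 10), ] +
        matrix(rnorm(30 * 8, sd = 0.1), nrow = 30)
rownames(prof) <- paste0("g", 1:30)

# Two notions of dissimilarity between rows
dEuc <- dist(prof)                    # Euclidean distance
dCor <- as.dist(1 - cor(t(prof)))     # 1 - correlation

# Hierarchical clustering; other linkage methods include "complete" and "single"
hc <- hclust(dCor, method = "average")
plot(hc)                              # the dendrogram

# Cut the dendrogram to define clusters, then retrieve the members
cl <- cutree(hc, k = 3)
print(table(cl))
sameAsG1 <- names(cl)[cl == cl["g1"]]

# Interactive picking only works in an interactive session with a screen device
if (interactive()) {
  plot(prof[, 1], prof[, 2])
  picked <- identify(prof[, 1], prof[, 2], labels = rownames(prof))
  xy <- locator(1)
}
```

Repeating the run with `dEuc` or a different `method` is a quick way to see how much the result depends on these choices.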

  • discussed principles of hypothesis testing
    • error types
    • p-values
    • significance levels
    • multiple testing problem
  • examples from GEO dataset analysis
  • discussed simulation / permutation testing as alternatives for statistical tests
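
A short sketch of the testing ideas above, on simulated data. The group sizes, effect size, and number of permutations are assumptions for illustration, and none of this is taken from the GEO analysis.

```r
set.seed(2015)

# Two simulated groups with a true difference in means of 1
a <- rnorm(20, mean = 0)
b <- rnorm(20, mean = 1)

tt  <- t.test(a, b)                  # classical two-sample test
obs <- mean(b) - mean(a)             # observed difference

# Permutation test: shuffle the group labels and recompute the difference
pooled <- c(a, b)
nPerm  <- 5000
permDiff <- replicate(nPerm, {
  idx <- sample(length(pooled), length(a))
  mean(pooled[-idx]) - mean(pooled[idx])
})
# Two-sided p-value; the +1 counts the observed arrangement itself
pPerm <- (sum(abs(permDiff) >= abs(obs)) + 1) / (nPerm + 1)

# Multiple testing: 1000 tests where the null hypothesis is true every time
pNull <- replicate(1000, t.test(rnorm(10), rnorm(10))$p.value)
nFalsePos <- sum(pNull < 0.05)       # roughly 5% are "significant" by chance

# Adjusted p-values
pBonf <- p.adjust(pNull, method = "bonferroni")
pBH   <- p.adjust(pNull, method = "BH")
```

Comparing `nFalsePos` with `sum(pBonf < 0.05)` and `sum(pBH < 0.05)` shows how the corrections suppress chance findings when no real effect exists.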


 

EDA

Slides
Scripts


Data


Resources


 

Software

Links


Resources


 

Regression

Slides
Scripts


 


Dimension Reduction

Slides


Scripts


Data


Resources


 

Clustering

Slides


Scripts


Data


 


Hypothesis Testing

Slides


Scripts


Resources

 


Generally useful links

Help and Information


Resources


 


Notes