Difference between revisions of "Workshops/Saskatoon 2015-Exploratory Data Analysis"
==EDA==
;Scripts
*[[Media:EDA.R|'''EDA.R''' (the main script for this session)]]
*<small>[[Media:SubsettingQuizAnswers.R|SubsettingQuizAnswers.R]] (Answers, in case you didn't already write them in your script)</small>
*[[Media:PlottingReference.R|PlottingReference.R]]

==Dimension Reduction==
;Slides
*[[Media:EDA_DimensionReduction.pdf|EDA_DimensionReduction.pdf (pdf of slides)]]
;Scripts
*[[Media:EDA_DimensionReduction.R|EDA_DimensionReduction.R]]
;Data
*[[Media:Logcho_237_4class.txt|Logcho_237_4class.txt]]

==Clustering==
;Slides
*[[Media:EDA_Clustering.pdf|EDA_Clustering.pdf (pdf of slides)]]
;Scripts
*[[Media:EDA_ClusteringExpressionData.R|EDA_ClusteringExpressionData.R]]
;Data
*[[Media:GSE26922.dat|GSE26922.dat (Fallback data)]]

==Hypothesis Testing==
;Slides
*[[Media:EDA_HypothesisTesting.pdf|EDA_HypothesisTesting.pdf (pdf of slides)]]
;Scripts
*[[Media:EDA_HypothesisTesting.R|EDA_HypothesisTesting.R]]
;Resources
*[[Media:Tan_2015-NGSdifferentialTranscription.pdf|Tan_2015-NGSdifferentialTranscription.pdf]]
*[[Media:ErroneusAnalysesOfSignificance-NatureNeuroscience2011.pdf|ErroneusAnalysesOfSignificance-NatureNeuroscience2011.pdf]]
Latest revision as of 19:04, 23 August 2015
Introduction to Exploratory Data Analysis with R
Schedule
Please note: this schedule is a rough guideline only; we will adapt flexibly to class needs as we proceed.
Time | Thursday's Activities | Friday |
09:00 – 10:30 | Lecture and practicals: EDA | Lecture and practicals: Dimension reduction |
10:30 – 11:00 | Coffee break | |
11:00 – 12:30 | Lecture and practicals: EDA | Lecture and practicals: Clustering |
12:30 – 13:30 | Lunch break | |
13:30 – 15:00 | Lecture and practicals: Software development | Lecture and practicals: Clustering |
15:00 – 15:30 | Coffee break | |
15:30 – 17:00 | Lecture and practicals: Regression | Lecture and practicals: Hypothesis testing |
General Resources
Progress Notes Thursday
Selected objectives we covered during the workshop:
- subsetting
  - selecting rows and columns by "index"
  - ... by rowname or columnname as string or vector of strings
  - ... using the $ sign for individual columns of a dataframe
  - using order() to get values sorted by some property
- filtering
  - finding elements that contain a string with grep() (and using that to select rows)
  - finding elements that match a logical expression, such as ==, <, > etc.
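The subsetting and filtering idioms listed above can be sketched as follows. This is a minimal illustration only: the data frame `genes`, its column names, and its values are made up for this example and are not taken from the workshop scripts.

```r
# Hypothetical example data; not part of the workshop material.
genes <- data.frame(
  name = c("CDC28", "CLN3", "SWI4", "MBP1", "CLB2"),
  expr = c(2.4, 0.7, 1.9, 3.1, 0.2),
  stringsAsFactors = FALSE
)
rownames(genes) <- genes$name

# Subsetting by index, by name, and with the $ operator
first_two <- genes[1:2, ]                      # rows 1 and 2, all columns
by_name   <- genes[c("SWI4", "MBP1"), "expr"]  # two rows, one column, by name
expr_col  <- genes$expr                        # a single column as a vector

# order() returns the permutation that sorts its argument,
# which can then be used as a row index
sorted <- genes[order(genes$expr, decreasing = TRUE), ]

# Filtering: grep() returns the indices of matching elements;
# a comparison returns a logical vector. Both can select rows.
c_genes <- genes[grep("^C", genes$name), ]
high    <- genes[genes$expr > 1.5, ]
```

Note that `order()` does not sort the data itself; it only produces the index vector, which is why it works for sorting one column "by some property" of another.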
- simple descriptive statistics
  - mean() / median()
  - using as.numeric(), as.logical() etc. to force evaluation as a particular type
  - sd() / IQR() / summary()
  - theoretical and empirical quantiles; quantile()
- random numbers and seeded random numbers; set.seed()
- normally distributed random numbers; rnorm()
- simple plots
  - abline() to draw lines on plots with parameters h= ... or v = ...
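As a rough sketch of how these functions fit together, the following draws seeded random numbers and summarizes them. The seed, sample size, and distribution parameters are arbitrary choices for illustration, not values used in the workshop.

```r
set.seed(112358)                     # arbitrary seed; makes the draw reproducible
x <- rnorm(1000, mean = 10, sd = 2)  # normally distributed random numbers

m   <- mean(x)
med <- median(x)
s   <- sd(x)
iqr <- IQR(x)
summary(x)

# as.numeric() forces character data to be evaluated as numbers
vals <- as.numeric(c("1.5", "2.5", "4"))

# Empirical quartiles of the sample vs. theoretical quartiles
# of the distribution the sample was drawn from
emp  <- quantile(x, probs = c(0.25, 0.5, 0.75))
theo <- qnorm(c(0.25, 0.5, 0.75), mean = 10, sd = 2)

# A simple plot with reference lines added by abline()
hist(x, main = "Seeded normal sample")
abline(v = m, col = "red", lwd = 2)  # vertical line at the sample mean
abline(h = 100, lty = 2)             # horizontal line at a count of 100
```

Re-running the block with the same seed reproduces the same `x`, which is what makes seeded random numbers useful for teaching and debugging.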
- scatterplot
  - empty plots and overplotting with
    - points()
    - lines()
    - segments()
    - text()
- boxplot
- barplot
- colors
  - color names
  - colors as hexcodes
  - color palettes
  - transparency
- lines
  - linetypes (lty=) / line width (lwd=)
- plotting characters (pch=)
- hexbin package
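The "empty plot plus overplotting" pattern and the color, line, and symbol parameters above might be combined as in the sketch below. The data are synthetic and the specific colors and labels are illustrative assumptions; only base graphics are used.

```r
set.seed(2015)                      # arbitrary seed
x <- rnorm(50)
y <- x + rnorm(50, sd = 0.5)

# Empty frame: type = "n" sets up the axes without drawing any data
plot(x, y, type = "n", xlab = "x", ylab = "y")

# Colors as hexcodes: the last two hex digits are the alpha (transparency) channel
blue_alpha <- "#0000FF55"
points(x, y, pch = 16, col = blue_alpha)                        # filled circles
lines(range(x), range(x), lty = 2, lwd = 2, col = "firebrick")  # dashed diagonal
segments(x0 = min(x), y0 = 0, x1 = max(x), y1 = 0, col = "grey50")
text(x[1], y[1], labels = "first point", pos = 4)

# The same transparent color built from a color name
rgb_vals    <- col2rgb("blue")
blue_alpha2 <- rgb(rgb_vals[1], rgb_vals[2], rgb_vals[3],
                   alpha = 85, maxColorValue = 255)
```

Transparency is particularly helpful for dense scatterplots, where overplotted points would otherwise hide the structure of the data.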
- synthetic data is useful
- linear regression
  - retrieving parameters; lm()
  - analyzing quality; resid()
  - plotting prediction and confidence intervals
- non-linear regression
  - set up a formula
  - initiate with starting values
  - plot results
- MIC as alternative to Pearson Correlation
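The regression steps above can be tried on synthetic data, where the true parameters are known and the fit can be checked against them. The sketch below is an assumption-laden illustration: the models, noise levels, and starting values are invented for this example and do not come from the workshop scripts.

```r
set.seed(42)                                 # arbitrary seed
x <- seq(1, 10, length.out = 40)
y <- 3 + 2 * x + rnorm(40, sd = 1)           # known truth: intercept 3, slope 2

# Linear regression: retrieve parameters and residuals
fit <- lm(y ~ x)
coef(fit)
res <- resid(fit)

# Confidence and prediction intervals on a grid of new x values
new_x     <- data.frame(x = seq(1, 10, length.out = 100))
conf_band <- predict(fit, new_x, interval = "confidence")
pred_band <- predict(fit, new_x, interval = "prediction")

plot(x, y)
abline(fit)
matlines(new_x$x, conf_band[, c("lwr", "upr")], lty = 3, col = "steelblue")
matlines(new_x$x, pred_band[, c("lwr", "upr")], lty = 2, col = "grey40")

# Non-linear regression: set up a formula, supply starting values, plot
y2   <- 5 * exp(-0.4 * x) + rnorm(40, sd = 0.05)  # known truth: a = 5, b = 0.4
nfit <- nls(y2 ~ a * exp(-b * x), start = list(a = 4, b = 0.3))
plot(x, y2)
lines(x, predict(nfit), col = "firebrick")
```

The prediction band is always wider than the confidence band, because it accounts for the scatter of individual observations in addition to the uncertainty of the fitted line.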
Progress Notes Friday
Selected objectives we covered during the workshop:
- principles of PCA
- interpreting a PCA as a projection
- calculated PCA for crabs morphometric data and for gene expression profiles
- discussed the output of prcomp()
- plotted rotations and discussed their interpretation ("eigengenes")
- parallel coordinate plots for high-dimensional data with matplot()
- normalizing data (subtract mean and divide by standard deviation)
- plotting an empty frame and using points(), lines(), text() or other functions to plot elements under detailed, individual control
- showing categories and features on a plot
  - using color
  - using shapes
  - using size
  - using text()
- discussed confounding factors and how to recognize them through PCA
- used PCA / Correlations etc. to discover similar elements (e.g. genes with similar expression profiles)
- correlating against models (e.g. sine wave) as an alternative to PCA
- using "embedding" (package "tsne") as alternative to correlations
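A minimal sketch of the PCA workflow described above follows. As an assumption for portability, it uses the built-in `iris` measurements as a stand-in for the crabs morphometric data, and the "expression profiles" in the second half are synthetic.

```r
dat <- iris[, 1:4]                  # stand-in data set (assumption, see above)

# Normalizing: subtract the column mean and divide by the column sd
z <- scale(dat)

pca <- prcomp(dat, center = TRUE, scale. = TRUE)
summary(pca)
pca$rotation                        # loadings of each variable on each PC
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# PCA as a projection: scores are the normalized data times the rotation
proj <- z %*% pca$rotation

# Showing categories with color on the first two components
plot(pca$x[, 1], pca$x[, 2], col = as.integer(iris$Species), pch = 16,
     xlab = "PC1", ylab = "PC2")

# Parallel coordinate plot: one line per observation across the variables
matplot(t(z), type = "l", lty = 1, col = as.integer(iris$Species),
        xaxt = "n", xlab = "", ylab = "z-score")
axis(1, at = 1:4, labels = colnames(z))

# Correlating profiles against a model (sine wave) instead of using PCA
set.seed(1)                         # arbitrary seed
t_pts    <- seq(0, 2 * pi, length.out = 12)
model    <- sin(t_pts)
profiles <- rbind(cyclic = sin(t_pts) + rnorm(12, sd = 0.2),
                  flat   = rnorm(12, sd = 0.2))
cors <- apply(profiles, 1, cor, y = model)
```

Correlating against an explicit model is useful when the pattern of interest is known in advance, whereas PCA finds the dominant directions of variation without such a prior.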
- clustering
  - ... needs a measurable notion of "similarity"
  - ... using distance "metrics" such as Euclidean distance or 1 - correlation, among many others
- hierarchical clustering
  - different linkage algorithms possible
  - results in a dendrogram
  - needs to cut dendrogram to define clusters and retrieve elements
- many alternatives, need to try more than one, experiment and find (ideally) some external way to validate (e.g. functional annotation)
- discussed ways to interact with plots (pick points with identify() and locator())
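The hierarchical clustering steps above can be sketched with base R as follows. The "expression profiles" are synthetic, and the gene names, linkage method, and number of clusters are illustrative assumptions rather than the workshop's choices.

```r
set.seed(7)                                   # arbitrary seed
# Hypothetical profiles: five "genes" rising and five falling over six conditions
up   <- matrix(rnorm(30, mean = rep(1:6, each = 5), sd = 0.3), nrow = 5)
down <- matrix(rnorm(30, mean = rep(6:1, each = 5), sd = 0.3), nrow = 5)
profiles <- rbind(up, down)
rownames(profiles) <- paste0("g", 1:10)

# Two notions of dissimilarity between rows
d_euc <- dist(profiles)                       # Euclidean distance
d_cor <- as.dist(1 - cor(t(profiles)))        # 1 - correlation

# Hierarchical clustering; the linkage is set with method =
hc <- hclust(d_cor, method = "average")
plot(hc)                                      # the dendrogram

# Cut the dendrogram to define clusters, then retrieve the elements
clusters   <- cutree(hc, k = 2)
same_as_g1 <- names(clusters)[clusters == clusters[["g1"]]]
```

Swapping `d_cor` for `d_euc`, or `"average"` for another linkage such as `"complete"` or `"single"`, is a quick way to see how much the result depends on these choices.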
- discussed principles of hypothesis testing
  - error types
  - p-values
  - significance levels
  - multiple testing problem
- examples from GEO dataset analysis
- discussed simulation / permutation testing as alternatives for statistical tests
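As a hedged illustration of permutation testing and the multiple testing problem, the sketch below uses simulated data only. The group sizes, effect size, number of permutations, and the Benjamini-Hochberg correction are choices made for this example, not results or methods taken from the GEO analysis.

```r
set.seed(99)                                  # arbitrary seed
a <- rnorm(20, mean = 0)
b <- rnorm(20, mean = 2)
obs <- mean(b) - mean(a)                      # observed difference of means

# Permutation test: shuffle the group labels and recompute the statistic
pooled <- c(a, b)
n_perm <- 5000
perm_diffs <- replicate(n_perm, {
  idx <- sample(length(pooled), length(a))
  mean(pooled[-idx]) - mean(pooled[idx])
})
# Two-sided p-value; the +1 terms keep the estimate away from exactly zero
p_perm <- (sum(abs(perm_diffs) >= abs(obs)) + 1) / (n_perm + 1)
p_t    <- t.test(a, b)$p.value                # parametric test for comparison

# Multiple testing: 1000 tests in which the null hypothesis is true
set.seed(100)
p_null <- replicate(1000, t.test(rnorm(10), rnorm(10))$p.value)
n_raw  <- sum(p_null < 0.05)                             # false positives, uncorrected
n_bh   <- sum(p.adjust(p_null, method = "BH") < 0.05)    # after correction
```

With a significance level of 0.05, roughly 5% of true-null tests are expected to fall below the threshold by chance, which is the essence of the multiple testing problem.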
EDA
- Slides
- Scripts
  - EDA.R (the main script for this session)
  - SubsettingQuizAnswers.R (Answers, in case you didn't already write them in your script)
  - PlottingReference.R
- Data
- Resources
Software
- Links
- Resources
Regression
- Slides
- Scripts
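Dimension Reduction
- Slides
  - EDA_DimensionReduction.pdf (pdf of slides)
- Scripts
  - EDA_DimensionReduction.R
- Data
  - Logcho_237_4class.txt
- Resources
Clustering
- Slides
  - EDA_Clustering.pdf (pdf of slides)
- Scripts
  - EDA_ClusteringExpressionData.R
- Data
  - GSE26922.dat (Fallback data)
Hypothesis Testing
- Slides
  - EDA_HypothesisTesting.pdf (pdf of slides)
- Scripts
  - EDA_HypothesisTesting.R
- Resources
  - Tan_2015-NGSdifferentialTranscription.pdf
  - ErroneusAnalysesOfSignificance-NatureNeuroscience2011.pdf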
Generally useful links
- Help and Information
- The R help mailing list: https://stat.ethz.ch/mailman/listinfo/r-help
- Rseek: the specialized search engine for R topics: http://rseek.org/
- R questions on stackoverflow: http://stackoverflow.com/questions/tagged/r
- The Comprehensive R Archive Network CRAN: http://cran.r-project.org/
- The CRAN task-view collection: http://cran.r-project.org/web/views/
- Bioconductor task views: http://www.bioconductor.org/packages/release/BiocViews.html
- Resources
Notes