Workshops/Saskatoon 2015-Exploratory Data Analysis
Latest revision as of 19:04, 23 August 2015
Introduction to Exploratory Data Analysis with R
Schedule
Please note: this schedule is a rough guideline only; we will adapt flexibly to class needs as we proceed.
Time | Thursday | Friday |
09:00 – 10:30 | Lecture and practicals: EDA | Lecture and practicals: Dimension reduction |
10:30 – 11:00 | Coffee break | |
11:00 – 12:30 | Lecture and practicals: EDA | Lecture and practicals: Clustering |
12:30 – 13:30 | Lunch break | |
13:30 – 15:00 | Lecture and practicals: Software development | Lecture and practicals: Clustering |
15:00 – 15:30 | Coffee break | |
15:30 – 17:00 | Lecture and practicals: Regression | Lecture and practicals: Hypothesis testing |
General Resources
Progress Notes Thursday
Selected objectives we covered during the workshop:
- subsetting
- selecting rows and columns by "index"
- ... by rowname or columnname as string or vector of strings
- ... using the $ sign for individual columns of a dataframe
- using order() to get values sorted by some property
- filtering
- finding elements that contain a string with grep() (and using that to select rows)
- finding elements that match a logical expression, such as ==, <, > etc.
- simple descriptive statistics
- mean() / median()
- using as.numeric(), as.logical() etc. to force evaluation as a particular type
- sd() / IQR() / summary()
- theoretical and empirical quantiles; quantile()
- random numbers and seeded random numbers; set.seed()
- normally distributed random numbers; rnorm()
- simple plots
- abline() to draw lines on plots with parameters h= ... or v = ...
- scatterplot
- empty plots and overplotting with
- points()
- lines()
- segments()
- text()
- boxplot
- barplot
- colors
- color names
- colors as hexcodes
- color palettes
- transparency
- lines
- linetypes (lty=) / line width (lwd=)
- plotting characters (pch=)
- hexbin package
- synthetic data is useful
- linear regression
- retrieving parameters; lm()
- analyzing quality; resid()
- plotting prediction and confidence intervals
- non-linear regression
- set up a formula
- initiate with starting values
- plot results
- MIC (maximal information coefficient) as an alternative to Pearson correlation
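The regression items above can be tried out on synthetic data. The following is a minimal sketch, not the workshop script: the seed, the "true" intercept and slope, and all variable names are assumptions chosen for illustration.

```r
# Synthetic data: a known linear relationship plus Gaussian noise.
# (All parameter values here are illustrative assumptions.)
set.seed(112358)                       # seeded, so the run is reproducible
x <- seq(0, 10, length.out = 50)
y <- 2 + 0.5 * x + rnorm(length(x), mean = 0, sd = 0.4)

fit <- lm(y ~ x)                       # fit the linear model
coef(fit)                              # retrieve intercept and slope
res <- resid(fit)                      # residuals, for judging fit quality

# Confidence and prediction intervals on a grid of new x values
newX <- data.frame(x = seq(0, 10, length.out = 100))
pc <- predict(fit, newX, interval = "confidence")
pp <- predict(fit, newX, interval = "prediction")

# Plot the data, the fitted line, and both interval bands
plot(x, y, pch = 16, col = "#00000088")
abline(fit, col = "firebrick", lwd = 2)
matlines(newX$x, pc[, c("lwr", "upr")], lty = 2, col = "steelblue")
matlines(newX$x, pp[, c("lwr", "upr")], lty = 3, col = "grey40")
```

Because the data are synthetic, the recovered coefficients can be compared with the values that generated them, which is one reason synthetic data is useful.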
Progress Notes Friday
Selected objectives we covered during the workshop:
- principles of PCA
- interpreting a PCA as a projection
- calculated PCA for crabs morphometric data and for gene expression profiles
- discussed the output of prcomp()
- plotted rotations and discussed their interpretation ("eigengenes")
- parallel coordinate plots for high-dimensional data; matplot()
- normalizing data (subtract mean and divide by standard deviation)
- plotting an empty frame and using points(), lines(), text() or other functions to plot elements under detailed, individual control
- showing categories and features on a plot
- using color
- using shapes
- using size
- using text()
- discussed confounding factors and how to recognize them through PCA
- used PCA / Correlations etc. to discover similar elements (e.g. genes with similar expression profiles)
- correlating against models (e.g. sine wave) as an alternative to PCA
- using "embedding" (package "tsne") as an alternative to correlations
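As a rough illustration of the PCA items above, here is a small sketch on synthetic data. It is a stand-in for the crabs and expression datasets used in class, not the workshop script; the column names, group labels and all numeric values are assumptions.

```r
# Synthetic morphometric-like data: one hidden "size" factor drives five
# correlated measurements. (Hypothetical stand-in for the class datasets.)
set.seed(2015)
n <- 100
size  <- rnorm(n, mean = 30, sd = 6)
group <- rep(c("A", "B"), each = n / 2)     # hypothetical category labels
dat <- cbind(v1 = 0.50 * size + rnorm(n),
             v2 = 0.40 * size + rnorm(n) + ifelse(group == "B", 2, 0),
             v3 = 1.00 * size + rnorm(n),
             v4 = 1.10 * size + rnorm(n),
             v5 = 0.45 * size + rnorm(n))

# Normalize: subtract the column mean, divide by the column standard deviation
z <- scale(dat, center = TRUE, scale = TRUE)

pca <- prcomp(z)
summary(pca)             # proportion of variance per component
pca$rotation             # weight of each variable in each component
head(pca$x)              # the observations expressed in component coordinates

# PCA as a projection: the scores are the centred data times the rotation
proj <- z %*% pca$rotation
all.equal(unname(proj), unname(pca$x))

# Empty frame first, then add points under individual control
plot(pca$x[, 1], pca$x[, 2], type = "n", xlab = "PC1", ylab = "PC2")
points(pca$x[, 1], pca$x[, 2],
       pch = ifelse(group == "A", 16, 17),
       col = ifelse(group == "A", "steelblue", "firebrick"))

# Parallel coordinates: one line per observation across the five variables
matplot(t(z), type = "l", lty = 1, col = "#00000033",
        xlab = "variable index", ylab = "z-score")
```

Since a single hidden factor drives every column here, most of the variance should end up in the first component; a dominant component of this kind is also what a strong confounding factor can look like in real data.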
- clustering
- ... needs a measurable notion of "similarity"
- ... using distance "metrics" such as Euclidean, 1 - correlation, and many more
- hierarchical clustering
- different linkage algorithms possible
- results in a dendrogram
- the dendrogram needs to be cut to define clusters and retrieve elements
- many alternatives exist: try more than one, experiment, and (ideally) find some external way to validate the results (e.g. functional annotation)
- discussed ways to interact with plots (pick points with identify() and locator())
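The clustering steps above (choose a distance, build a dendrogram, cut it, retrieve elements) might look as follows. This is a sketch under stated assumptions: the three synthetic groups, the row names `g01` ... `g30`, and the choice of average linkage are all illustrative, not taken from the workshop material.

```r
# Synthetic "profiles": three groups of 10 rows with 6 measurements each.
# (Group centres and noise level are illustrative assumptions.)
set.seed(42)
centers <- rbind(c(0, 0, 0, 3, 3, 3),
                 c(3, 3, 3, 0, 0, 0),
                 c(0, 3, 0, 3, 0, 3))
prof <- centers[rep(1:3, each = 10), ] +
        matrix(rnorm(30 * 6, sd = 0.3), ncol = 6)
rownames(prof) <- sprintf("g%02d", 1:30)

# Two possible notions of dissimilarity between rows
dEuc <- dist(prof, method = "euclidean")   # Euclidean distance
dCor <- as.dist(1 - cor(t(prof)))          # 1 - Pearson correlation

# Hierarchical clustering; other linkage methods include "complete", "single"
hc <- hclust(dEuc, method = "average")
plot(hc)                                   # the dendrogram

# Cut the dendrogram into k clusters and retrieve the members of one of them
cl <- cutree(hc, k = 3)
table(cl)
names(cl)[cl == cl["g01"]]                 # rows in the same cluster as g01

# Same procedure with the correlation-based distance, for comparison
hcCor <- hclust(dCor, method = "average")
clCor <- cutree(hcCor, k = 3)
table(cl, clCor)                           # do the two results agree?
```

Comparing the two cluster assignments in a cross-table is one simple way to follow the advice to try more than one approach.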
- discussed principles of hypothesis testing
- error types
- p-values
- significance levels
- multiple testing problem
- examples from GEO dataset analysis
- discussed simulation / permutation testing as alternatives to statistical tests
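A permutation test can be sketched in a few lines. The example below is not the GEO analysis from class: the two samples are synthetic, and the sample sizes, effect size, number of permutations and p-values passed to p.adjust() are assumptions for illustration.

```r
# Two synthetic samples with a true difference in means (assumed values)
set.seed(31415)
a <- rnorm(20, mean = 0)
b <- rnorm(20, mean = 1.5)
obs <- mean(b) - mean(a)               # observed difference in means

# Permutation test: shuffle the group labels many times and record the
# difference in means that arises by chance alone
pooled <- c(a, b)
nPerm <- 5000
permDiff <- numeric(nPerm)
for (i in seq_len(nPerm)) {
  idx <- sample(length(pooled), length(a))   # random relabelling
  permDiff[i] <- mean(pooled[-idx]) - mean(pooled[idx])
}

# Two-sided empirical p-value; the +1 counts the observed labelling itself
pPerm <- (sum(abs(permDiff) >= abs(obs)) + 1) / (nPerm + 1)
pPerm
t.test(b, a)$p.value                   # parametric test, for comparison

# Multiple testing: adjust a vector of (made-up) p-values
pvals <- c(0.001, 0.01, 0.02, 0.04, 0.2)
pBonf <- p.adjust(pvals, method = "bonferroni")
pBH   <- p.adjust(pvals, method = "BH")
```

The smallest p-value this procedure can report is 1 / (nPerm + 1), so the number of permutations limits the resolution of the test.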
EDA
- Slides
- Scripts
- EDA.R (the main script for this session)
- SubsettingQuizAnswers.R (Answers, in case you didn't already write them in your script)
- PlottingReference.R
- Data
- Resources
Software
- Links
- Resources
Regression
- Slides
- Scripts
Dimension Reduction
- Slides
- Scripts
- Data
- Resources
Clustering
- Slides
- Scripts
- Data
Hypothesis Testing
- Slides
- Scripts
- Resources
Generally useful links
- Help and Information
- The R help mailing list: https://stat.ethz.ch/mailman/listinfo/r-help
- Rseek: the specialized search engine for R topics: http://rseek.org/
- R questions on stackoverflow: http://stackoverflow.com/questions/tagged/r
- The Comprehensive R Archive Network CRAN: http://cran.r-project.org/
- The CRAN task-view collection: http://cran.r-project.org/web/views/
- Bioconductor task views: http://www.bioconductor.org/packages/release/BiocViews.html
- Resources
Notes