Workshops/Saskatoon 2015-Exploratory Data Analysis


Revision as of 19:04, 23 August 2015 by Boris (talk | contribs) (→‎Progress Notes Friday)


Introduction to Exploratory Data Analysis with R

Contents

1 Schedule
2 General Resources
3 Progress Notes Thursday
4 Progress Notes Friday
5 EDA
6 Software
7 Regression
8 Dimension Reduction
9 Clustering
10 Hypothesis Testing
11 Generally useful links
12 Notes

Schedule

Please note: this schedule is a rough guideline only; we will adapt it flexibly to class needs as we proceed.

Time	Thursday	Friday

09:00 – 10:30	Lecture and practicals: EDA	Lecture and practicals: Dimension reduction
10:30 – 11:00	Coffee break
11:00 – 12:30	Lecture and practicals: EDA	Lecture and practicals: Clustering
12:30 – 13:30	Lunch break
13:30 – 15:00	Lecture and practicals: Software development	Lecture and practicals: Clustering
15:00 – 15:30	Coffee break
15:30 – 17:00	Lecture and practicals: Regression	Lecture and practicals: Hypothesis testing

General Resources

A concise reference card of R functions for data mining (pdf)

Progress Notes Thursday

Selected objectives we covered during the workshop:

subsetting
- selecting rows and columns by "index"
- ... by rowname or columnname as string or vector of strings
- ... using the $ sign for individual columns of a dataframe
- using order() to get values sorted by some property
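
The subsetting idioms listed above can be sketched on a small, made-up data frame. The object df and its column names are hypothetical and only serve as an illustration; they are not taken from the workshop scripts.

```r
# Hypothetical toy data frame; names and values are illustrative only.
df <- data.frame(gene = c("a1", "b2", "c3", "d4"),
                 expr = c(2.5, 0.3, 7.1, 1.8),
                 row.names = c("r1", "r2", "r3", "r4"),
                 stringsAsFactors = FALSE)

df[2, ]                      # one row, selected by index
df[ , 2]                     # one column, selected by index
df[c("r1", "r3"), "expr"]    # rows and column selected by name
df$expr                      # one column, selected with $

# order() returns the row indices that sort a column; use them to subset.
df[order(df$expr), ]
sortedGenes <- df$gene[order(df$expr)]
```

Note that order() returns positions, not values, which is why it can be used to reorder any column (or the whole data frame) by the same property.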

filtering
- finding elements that contain a string with grep() (and using that to select rows)
- finding elements that match a logical expression, such as ==, <, > etc.
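
A minimal sketch of both filtering approaches, again on made-up data (the names and values below are assumptions for illustration):

```r
# Hypothetical data; names and values are illustrative only.
df <- data.frame(name  = c("CD4_pos", "CD8_pos", "CD4_neg", "CD3_pos"),
                 value = c(12, 47, 5, 30),
                 stringsAsFactors = FALSE)

# grep() returns the indices of elements that contain the pattern.
hits    <- grep("CD4", df$name)
cd4Rows <- df[hits, ]

# A logical expression yields a TRUE/FALSE vector that selects rows.
highRows <- df[df$value > 10, ]

# grepl() returns a logical vector, so it combines with other conditions.
both <- df[grepl("pos", df$name) & df$value > 10, ]
```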

simple descriptive statistics
- mean() / median()
- using as.numeric(), as.logical() etc. to force evaluation as a particular type
- sd() / IQR() / summary()
- theoretical and empirical quantiles; quantile()

random numbers and seeded random numbers; set.seed()
normally distributed random numbers; rnorm()
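
As a rough illustration of the descriptive statistics and seeded random numbers listed above, the following sketch draws synthetic values and summarizes them. The seed, sample size, and distribution parameters are arbitrary choices, not values from the workshop.

```r
# Seeding the generator makes the "random" numbers reproducible.
set.seed(112358)
x <- rnorm(1000, mean = 10, sd = 2)

mean(x); median(x)           # location
sd(x); IQR(x); summary(x)    # spread and five-number summary

# Empirical quantiles of the sample vs. theoretical quantiles of the model.
empQ  <- quantile(x, probs = c(0.25, 0.5, 0.75))
theoQ <- qnorm(c(0.25, 0.5, 0.75), mean = 10, sd = 2)

# Forcing a particular type.
as.numeric("3.14")
as.logical("TRUE")

# Same seed, same numbers.
set.seed(112358)
y <- rnorm(1000, mean = 10, sd = 2)
identical(x, y)
```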

simple plots
- abline() to draw lines on plots with parameters h= ... or v = ...

scatterplot
- empty plots and overplotting with
  - points()
  - lines()
  - segments()
  - text()
boxplot
barplot
colors
- color names
- colors as hexcodes
- color palettes
- transparency
lines
- linetypes (lty=) / line width (lwd=)
plotting characters (pch=)
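
The plotting elements above can be combined as in the following sketch, which sets up an empty frame and then adds elements one at a time. The data are synthetic and the colors, line types, and labels are arbitrary choices for illustration.

```r
# Draw to a null device when run non-interactively (e.g. via Rscript).
if (!interactive()) pdf(NULL)

set.seed(1)
x <- rnorm(50)
y <- x + rnorm(50, sd = 0.5)

# type = "n" sets up the axes but plots nothing.
plot(x, y, type = "n", main = "Overplotting on an empty frame")

abline(h = 0, v = 0, lty = 2, col = "grey50")     # reference lines

semi <- adjustcolor("firebrick", alpha.f = 0.4)    # named color + transparency
points(x, y, pch = 19, col = semi)                 # filled circles

lines(range(x), range(x), lwd = 2, col = "#0000FF")          # hex color
segments(x0 = min(x), y0 = 0, x1 = max(x), y1 = 0, lty = 3)
text(x[1], y[1], labels = "first", pos = 3)

pal <- rainbow(3)                                  # a simple color palette
```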

hexbin package

synthetic data is useful
linear regression
- retrieving parameters; lm()
- analyzing quality; resid()
- plotting prediction and confidence intervals
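
A minimal sketch of these linear regression steps on synthetic data follows. The "true" intercept (3), slope (2), and noise level are assumptions chosen for the example, so that the fit can be compared against known values.

```r
if (!interactive()) pdf(NULL)   # null device when run via Rscript

# Synthetic data with known parameters: y = 3 + 2x + noise.
set.seed(42)
x <- runif(100, 0, 10)
y <- 3 + 2 * x + rnorm(100, sd = 1)

fit <- lm(y ~ x)
coef(fit)            # retrieve intercept and slope
res <- resid(fit)    # residuals, for analyzing fit quality

# Confidence interval (for the mean response) and
# prediction interval (for new observations) along a grid of x values.
newX <- data.frame(x = seq(0, 10, length.out = 50))
pc <- predict(fit, newdata = newX, interval = "confidence")
pp <- predict(fit, newdata = newX, interval = "prediction")

plot(x, y)
abline(fit)
matlines(newX$x, pc[ , c("lwr", "upr")], lty = 2, col = "blue")
matlines(newX$x, pp[ , c("lwr", "upr")], lty = 3, col = "red")
```

The prediction interval is always wider than the confidence interval, because it includes the scatter of individual observations in addition to the uncertainty of the fitted line.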
non-linear regression
- set up a formula
- initiate with starting values
- plot results
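
The three steps above might look like this with nls(). The exponential decay model, its "true" parameters, and the starting values are assumptions made for this example only.

```r
if (!interactive()) pdf(NULL)   # null device when run via Rscript

# Synthetic data: exponential decay y = A * exp(-k * t) with A = 5, k = 0.4.
set.seed(7)
tt <- seq(0, 10, by = 0.25)
y  <- 5 * exp(-0.4 * tt) + rnorm(length(tt), sd = 0.1)

# 1. set up a formula; 2. initiate with starting values.
fit <- nls(y ~ A * exp(-k * tt), start = list(A = 3, k = 0.2))
coef(fit)

# 3. plot data and fitted curve.
plot(tt, y)
lines(tt, predict(fit), col = "red", lwd = 2)
```

Unlike lm(), nls() searches iteratively, so poorly chosen starting values can prevent convergence.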
MIC (maximal information coefficient) as an alternative to Pearson correlation

Progress Notes Friday

Selected objectives we covered during the workshop:

principles of PCA
interpreting a PCA as a projection
calculated PCA for crabs morphometric data and for gene expression profiles
discussed the output of prcomp()
plotted rotations and discussed their interpretation ("eigengenes")
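
To illustrate the output of prcomp() and the view of PCA as a projection, here is a sketch on synthetic data (not the crabs or expression data used in the workshop). Three made-up measurements are all driven by one underlying "size" factor, so most of the variance should end up in the first component.

```r
# Synthetic data: three correlated measurements; names are illustrative.
set.seed(11)
size <- rnorm(200)
m <- cbind(length = size + rnorm(200, sd = 0.2),
           width  = size + rnorm(200, sd = 0.2),
           depth  = size + rnorm(200, sd = 0.2))

pca <- prcomp(m, center = TRUE, scale. = TRUE)
summary(pca)

pca$rotation    # loadings: the directions of the principal components
head(pca$x)     # the data expressed in the rotated coordinates

# Proportion of variance explained by each component.
varExplained <- pca$sdev^2 / sum(pca$sdev^2)

# PCA as a projection: scaled data times the rotation matrix gives pca$x.
proj <- scale(m) %*% pca$rotation
```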

parallel coordinate plots for high-dimensional data; matplot()

normalizing data (subtract mean and divide by standard deviation)
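
A short sketch of this normalization, computed by hand and with the built-in scale() function. The matrix is made up for illustration.

```r
# Two columns on very different scales.
m <- matrix(c(1, 2, 3, 4, 10, 20, 30, 40), ncol = 2,
            dimnames = list(NULL, c("a", "b")))

# By hand: subtract the column mean, divide by the column standard deviation.
byHand <- apply(m, 2, function(v) (v - mean(v)) / sd(v))

# scale() does the same, column by column.
z <- scale(m)

colMeans(z)        # all (numerically) zero
apply(z, 2, sd)    # all one
```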

plotting an empty frame and using points(), lines(), text() or other functions to plot elements under detailed, individual control

showing categories and features on a plot
- using color
- using shapes
- using size
- using text()

discussed confounding factors and how to recognize them through PCA

used PCA / Correlations etc. to discover similar elements (e.g. genes with similar expression profiles)

correlating against models (e.g. sine wave) as an alternative to PCA
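
One way to sketch correlation against a model: build a sine wave over the sampled time points and correlate each profile with it. The profiles, the period of 12 time points, and the noise level below are assumptions made for illustration.

```r
set.seed(21)
tp    <- 0:11                        # hypothetical time points
model <- sin(2 * pi * tp / 12)       # one full cycle as the model

# Three synthetic "expression profiles" (rows).
expr <- rbind(cyclic = model  + rnorm(12, sd = 0.2),
              flat   =          rnorm(12, sd = 0.2),
              anti   = -model + rnorm(12, sd = 0.2))

# Pearson correlation of each row with the model.
r <- apply(expr, 1, cor, y = model)
r

# Profiles ranked by similarity to the model.
names(sort(r, decreasing = TRUE))
```

Strongly negative correlations are also informative: here they identify profiles in antiphase with the model.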

using "embedding" (package "tsne") as alternative to correlations

clustering
- ... needs a measurable notion of "similarity"
- ... using distance "metrics" such as Euclidean, 1 - correlation, and many more

hierarchical clustering
- different linkage algorithms possible
- results in a dendrogram
- needs to cut dendrogram to define clusters and retrieve elements
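
A minimal sketch of this workflow on synthetic data with two well-separated groups. The group means, the "average" linkage method, and the choice of k = 2 are assumptions for this example.

```r
if (!interactive()) pdf(NULL)   # null device when run via Rscript

# Synthetic data: 20 rows around 0 and 20 rows around 5, five columns each.
set.seed(3)
m <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 5),
           matrix(rnorm(100, mean = 5, sd = 0.3), ncol = 5))

d  <- dist(m)                         # Euclidean distance between rows
hc <- hclust(d, method = "average")   # other linkage methods are possible
plot(hc)                              # the dendrogram

# Cut the dendrogram to define clusters and retrieve their elements.
cl <- cutree(hc, k = 2)
table(cl)
which(cl == 1)

# An alternative dissimilarity: 1 - correlation between rows.
dCor <- as.dist(1 - cor(t(m)))
```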

many alternatives exist: try more than one, experiment, and (ideally) find some external way to validate the result (e.g. functional annotation)

discussed ways to interact with plots (pick points with identify() and locator())

discussed principles of Hypothesis testing
- Error types
- p-values
- significance levels
- multiple testing problem
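
The multiple testing problem can be illustrated with a small simulation: many tests on data with no real effect still produce "significant" p-values by chance. The number of tests, the sample sizes, and the correction methods shown are choices made for this sketch.

```r
# 1000 t-tests in which both samples come from the same distribution,
# i.e. the null hypothesis is true every time.
set.seed(5)
p <- replicate(1000, t.test(rnorm(10), rnorm(10))$p.value)

# At a significance level of 0.05 we expect about 5% false positives.
sum(p < 0.05)

# Adjusting for multiple testing.
pBonf <- p.adjust(p, method = "bonferroni")   # controls family-wise error
pBH   <- p.adjust(p, method = "BH")           # controls false discovery rate

sum(pBonf < 0.05)
sum(pBH < 0.05)
```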

examples from GEO dataset analysis
discussed simulation / permutation testing as alternatives to standard statistical tests
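
As a sketch of permutation testing: shuffle the group labels many times and ask how often the shuffled difference of means is at least as extreme as the observed one. The data are synthetic, and the group sizes, effect size, and number of permutations are arbitrary assumptions.

```r
set.seed(9)
a <- rnorm(20, mean = 0)
b <- rnorm(20, mean = 1.5)

obs    <- mean(b) - mean(a)    # observed difference of means
pooled <- c(a, b)
nPerm  <- 5000

# Each permutation assigns 20 random values to group "a", the rest to "b".
permDiff <- replicate(nPerm, {
  idx <- sample(length(pooled), length(a))
  mean(pooled[-idx]) - mean(pooled[idx])
})

# Two-sided empirical p-value; the +1 avoids reporting exactly zero.
pPerm <- (sum(abs(permDiff) >= abs(obs)) + 1) / (nPerm + 1)
pPerm
```

The approach makes no assumption about the shape of the underlying distribution, only that the labels are exchangeable under the null hypothesis.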

EDA

Slides

EDA.pdf (pdf of slides)

Scripts

EDA.R (the main script for this session)
SubsettingQuizAnswers.R (Answers, in case you didn't already write them in your script)
PlottingReference.R

Data

GvHD.txt

Resources

Software

Links

Resources

Regression

Slides

EDA_Regression.pdf (pdf of slides)

Scripts

EDA_Regression.R (the main script for this session)

Dimension Reduction

Slides

EDA_DimensionReduction.pdf (pdf of slides)

Scripts

EDA_DimensionReduction.R

Data

Logcho_237_4class.txt

Resources

Clustering

Slides

EDA_Clustering.pdf (pdf of slides)

Scripts

EDA_ClusteringExpressionData.R

Data

GSE26922.dat (Fallback data)

Hypothesis Testing

Slides

EDA_HypothesisTesting.pdf (pdf of slides)

Scripts

EDA_HypothesisTesting.R

Resources

Generally useful links

Help and Information

The R help mailing list: https://stat.ethz.ch/mailman/listinfo/r-help
Rseek: the specialized search engine for R topics: http://rseek.org/
R questions on stackoverflow: http://stackoverflow.com/questions/tagged/r
The Comprehensive R Archive Network CRAN: http://cran.r-project.org/
The CRAN task-view collection: http://cran.r-project.org/web/views/
Bioconductor task views: http://www.bioconductor.org/packages/release/BiocViews.html

Resources

Weissgerber_(2015)_BeyondBarcharts.pdf

Notes
