Workshops/Saskatoon 2015-Exploratory Data Analysis
Introduction to Exploratory Data Analysis with R
Schedule
Please note: this schedule is only a rough guideline; we will adapt it flexibly to class needs as we proceed.
Time | Thursday's Activities | Friday |
09:00 – 10:30 | Lecture and practicals: EDA | Lecture and practicals: Dimension reduction |
10:30 – 11:00 | Coffee break | |
11:00 – 12:30 | Lecture and practicals: EDA | Lecture and practicals: Clustering |
12:30 – 13:30 | Lunch break | |
13:30 – 15:00 | Lecture and practicals: Software development | Lecture and practicals: Clustering |
15:00 – 15:30 | Coffee break | |
15:30 – 17:00 | Lecture and practicals: Regression | Lecture and practicals: Hypothesis testing |
General Resources
Progress Notes Thursday
Selected objectives we covered during the workshop:
- subsetting
- selecting rows and columns by "index"
- ... by rowname or columnname as string or vector of strings
- ... using the $ sign for individual columns of a dataframe
- using order() to get values sorted by some property
- filtering
- finding elements that contain a string with grep() (and using that to select rows)
- finding elements that match a logical expression, such as ==, <, > etc.
- simple descriptive statistics
- mean() / median()
- using as.numeric(), as.logical() etc. to force evaluation as a particular type
- sd() / IQR() / summary()
- theoretical and empirical quantiles; quantile()
- random numbers and seeded random numbers; set.seed()
- normally distributed random numbers; rnorm()
- simple plots
- abline() to draw lines on plots with parameters h= ... or v = ...
- scatterplot
- empty plots and overplotting with
- points()
- lines()
- segments()
- text()
- boxplot
- barplot
- colors
- color names
- colors as hexcodes
- color palettes
- transparency
- lines
- linetypes (lty=) / line width (lwd=)
- plotting characters (pch=)
- hexbin package
- synthetic data is useful
- linear regression
- retrieving parameters; lm()
- analyzing quality; resid()
- plotting prediction and confidence intervals
- non-linear regression
- set up a formula
- initiate with starting values
- plot results
- MIC (maximal information coefficient) as an alternative to Pearson correlation
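As a reminder of how several of Thursday's items fit together, here is a minimal sketch on synthetic data. It is not taken from EDA.R: the data frame `dat`, its columns `id`, `x` and `y`, the seed, and all parameter values are assumptions made up for illustration.

```r
# Synthetic data, seeded so the numbers are reproducible.
# All names and parameter values here are illustrative assumptions.
set.seed(112358)
n <- 50
dat <- data.frame(
  id = paste0("gene", 1:n),
  x  = rnorm(n, mean = 10, sd = 2),
  stringsAsFactors = FALSE
)
dat$y <- 3 + 0.5 * dat$x + rnorm(n, sd = 0.3)   # linear signal plus noise
rownames(dat) <- dat$id

# Subsetting
dat[1:3, c(2, 3)]                # rows and columns by index
dat["gene7", c("x", "y")]        # by row name and vector of column names
head(dat$x)                      # a single column with $
sorted <- dat[order(dat$x), ]    # rows sorted by the value of x

# Filtering
teens <- dat[grep("^gene1", dat$id), ]   # ids that start with "gene1"
high  <- dat[dat$x > mean(dat$x), ]      # rows matching a logical expression

# Simple descriptive statistics
summary(dat$x)
sd(dat$x)
IQR(dat$x)
quantile(dat$x, probs = c(0.025, 0.5, 0.975))    # empirical quantiles
qnorm(c(0.025, 0.5, 0.975), mean = 10, sd = 2)   # theoretical quantiles

# Linear regression: retrieve parameters and residuals
fit <- lm(y ~ x, data = dat)
coef(fit)
res <- resid(fit)

# Confidence and prediction intervals along a grid of x values
newx <- data.frame(x = seq(min(dat$x), max(dat$x), length.out = 100))
ci <- predict(fit, newdata = newx, interval = "confidence")
pr <- predict(fit, newdata = newx, interval = "prediction")

# Scatterplot with transparency (hexcode with alpha), then overplotting
plot(dat$x, dat$y, pch = 19, col = "#0000FF55", xlab = "x", ylab = "y")
abline(fit, col = "firebrick", lwd = 2)          # fitted line
abline(h = mean(dat$y), lty = 2)                 # horizontal reference line
lines(newx$x, ci[, "lwr"], lty = 2)
lines(newx$x, ci[, "upr"], lty = 2)
lines(newx$x, pr[, "lwr"], lty = 3)
lines(newx$x, pr[, "upr"], lty = 3)
```

Because the data are synthetic, the true slope and intercept are known, so the output of coef(fit) can be checked against them.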
Progress Notes Friday
Selected objectives we covered during the workshop:
- principles of PCA
- interpreting a PCA as a projection
- calculated PCA for crabs morphometric data and for gene expression profiles
- discussed the output of prcomp()
- plotted rotations and discussed their interpretation ("eigengenes")
- parallel coordinate plots for high-dimensional data; matplot()
- normalizing data (subtract mean and divide by standard deviation)
- plotting an empty frame and using points(), lines(), text() or other functions to plot elements under detailed, individual control
- showing categories and features on a plot
- using color
- using shapes
- using size
- using text()
- discussed confounding factors and how to recognize them through PCA
- used PCA / correlations etc. to discover similar elements (e.g. genes with similar expression profiles)
- correlating against models (e.g. sine wave) as an alternative to PCA
- using "embedding" (package "tsne") as alternative to correlations
- clustering
- ... needs a measurable notion of "similarity"
- ... using distance "metrics" such as Euclidean, 1 - correlation, and many more
- hierarchical clustering
- different linkage algorithms possible
- results in a dendrogram
- needs to cut dendrogram to define clusters and retrieve elements
- many alternatives, need to try more than one, experiment and find (ideally) some external way to validate (e.g. functional annotation)
- discussed ways to interact with plots (pick points with identify() and locator())
- discussed principles of hypothesis testing
- error types
- p-values
- significance levels
- multiple testing problem
- examples from GEO dataset analysis
- discussed simulation / permutation testing as alternatives for statistical tests
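The following sketch strings together some of Friday's steps on synthetic data: normalizing, PCA with prcomp(), plotting into an empty frame, hierarchical clustering, and a simple permutation test. It is not one of the workshop scripts. The two-group matrix `m`, the labels in `group`, the choice of average linkage, and the number of permutations are all assumptions chosen for illustration.

```r
# Synthetic two-group data: 40 samples (rows) x 5 variables (columns).
# All names and parameter values here are illustrative assumptions.
set.seed(2015)
n <- 20
m <- rbind(matrix(rnorm(n * 5, mean = 0), ncol = 5),
           matrix(rnorm(n * 5, mean = 5), ncol = 5))
colnames(m) <- paste0("v", 1:5)
group <- rep(c("A", "B"), each = n)
grpCol <- ifelse(group == "A", "steelblue", "firebrick")

# Normalize: subtract the column mean and divide by the column sd
z <- scale(m)

# PCA
pca <- prcomp(z)
summary(pca)     # proportion of variance per component
pca$rotation     # loadings of the variables on each component
head(pca$x)      # the samples projected onto the components

# Empty frame, then points and text under individual control
plot(pca$x[, 1], pca$x[, 2], type = "n", xlab = "PC1", ylab = "PC2")
points(pca$x[, 1], pca$x[, 2], col = grpCol,
       pch = ifelse(group == "A", 16, 17))
text(pca$x[, 1], pca$x[, 2], labels = group, pos = 3, cex = 0.6)

# Parallel coordinate plot: one line per sample across the 5 variables
matplot(t(z), type = "l", lty = 1, col = grpCol,
        xlab = "variable", ylab = "normalized value")

# Hierarchical clustering on Euclidean distances
d  <- dist(z)
hc <- hclust(d, method = "average")   # other linkage methods are possible
plot(hc)                              # the dendrogram
cl <- cutree(hc, k = 2)               # cut the dendrogram into two clusters
table(cl, group)                      # compare clusters with known labels

# 1 - correlation as an alternative distance between samples
dCor <- as.dist(1 - cor(t(z)))

# Permutation test for the difference of group means in variable v1
meanDiff <- function(g) mean(m[g == "A", "v1"]) - mean(m[g == "B", "v1"])
obs   <- meanDiff(group)
nPerm <- 2000
perm  <- replicate(nPerm, meanDiff(sample(group)))   # shuffle the labels
pEmp  <- (sum(abs(perm) >= abs(obs)) + 1) / (nPerm + 1)
pEmp
```

Here table(cl, group) plays the role of the external validation mentioned above, which is only possible because the group labels of the synthetic data are known.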
EDA
- Slides
- Scripts
- EDA.R (the main script for this session)
- SubsettingQuizAnswers.R (Answers, in case you didn't already write them in your script)
- PlottingReference.R
- Data
- Resources
Software
- Links
- Resources
Regression
- Slides
- Scripts
Dimension Reduction
- Slides
- Scripts
- Data
- Resources
Clustering
- Slides
- Scripts
- Data
Hypothesis Testing
- Slides
- Scripts
- Resources
Generally useful links
- Help and Information
- The R help mailing list: https://stat.ethz.ch/mailman/listinfo/r-help
- Rseek: the specialized search engine for R topics: http://rseek.org/
- R questions on stackoverflow: http://stackoverflow.com/questions/tagged/r
- The Comprehensive R Archive Network CRAN: http://cran.r-project.org/
- The CRAN task-view collection: http://cran.r-project.org/web/views/
- Bioconductor task views: http://www.bioconductor.org/packages/release/BiocViews.html
- Resources
Notes