Difference between revisions of "Workshops/Saskatoon 2015-Exploratory Data Analysis"

From "A B C"

Latest revision as of 19:04, 23 August 2015

Introduction to Exploratory Data Analysis with R

1 Schedule
2 General Resources
3 Progress Notes Thursday
4 Progress Notes Friday
5 EDA
6 Software
7 Regression
8 Dimension Reduction
9 Clustering
10 Hypothesis Testing
11 Generally useful links
12 Notes

Schedule

Please note: this schedule is a rough guideline only; we will adapt it to class needs as we proceed.

Time	Thursday	Friday

09:00 – 10:30	Lecture and practicals: EDA	Lecture and practicals: Dimension reduction
10:30 – 11:00	Coffee break
11:00 – 12:30	Lecture and practicals: EDA	Lecture and practicals: Clustering
12:30 – 13:30	Lunch break
13:30 – 15:00	Lecture and practicals: Software development	Lecture and practicals: Clustering
15:00 – 15:30	Coffee break
15:30 – 17:00	Lecture and practicals: Regression	Lecture and practicals: Hypothesis testing

General Resources

A concise reference card of R functions for data mining (pdf)

Progress Notes Thursday

Selected objectives we covered during the workshop:

subsetting
- selecting rows and columns by "index"
- ... by rowname or columnname as string or vector of strings
- ... using the $ sign for individual columns of a dataframe
- using order() to get values sorted by some property
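
As a minimal sketch of the subsetting idioms listed above (the data frame, its column names, and its values are invented here for illustration and are not taken from the workshop scripts):

```r
# Hypothetical toy data frame; names and values are made up.
df <- data.frame(gene = c("g1", "g2", "g3", "g4"),
                 expr = c(5.2, 1.1, 8.7, 3.3),
                 stringsAsFactors = FALSE)
rownames(df) <- df$gene

byIndex <- df[2:3, 2]                  # rows 2 and 3, column 2 (a numeric vector)
byName  <- df[c("g1", "g4"), "expr"]   # select by row name and column name
column  <- df$expr                     # a single column via the $ sign
sorted  <- df[order(df$expr), ]        # rows reordered by increasing expr
```

order() returns the row indices in sorted order rather than the sorted values, which is why it can be used directly inside the square brackets.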

filtering
- finding elements that contain a string with grep() (and using that to select rows)
- finding elements that match a logical expression, such as ==, <, > etc.
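
The two filtering approaches above might look like this in practice; the sample table is hypothetical and only serves to show the mechanics:

```r
# Hypothetical sample table, invented for illustration.
samples <- data.frame(id    = c("ctrl_1", "ctrl_2", "treat_1", "treat_2"),
                      value = c(0.4, 1.9, 2.5, 0.7),
                      stringsAsFactors = FALSE)

hits    <- grep("treat", samples$id)     # integer indices of elements containing "treat"
treated <- samples[hits, ]               # use those indices to select rows
high    <- samples[samples$value > 1, ]  # a logical vector also selects rows

# grepl() returns a logical vector, so it combines with other conditions.
both <- samples[grepl("ctrl", samples$id) & samples$value > 1, ]
```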

simple descriptive statistics
- mean() / median()
- using as.numeric(), as.logical() etc. to force evaluation as a particular type
- sd() / IQR() / summary()
- theoretical and empirical quantiles; quantile()
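
A short sketch of these functions on a made-up vector. The comparison of empirical and theoretical quantiles assumes a normal model fitted by mean and standard deviation, which is only one possible reading of "theoretical":

```r
x <- c("3", "7", "1", "12", "5")   # hypothetical numbers that were read in as strings
v <- as.numeric(x)                 # force evaluation as numeric

mean(v)                            # arithmetic mean
median(v)                          # middle value
sd(v)                              # sample standard deviation
IQR(v)                             # interquartile range
summary(v)                         # minimum, quartiles, mean, maximum

probs       <- c(0.25, 0.5, 0.75)
empirical   <- quantile(v, probs = probs)                 # computed from the data
theoretical <- qnorm(probs, mean = mean(v), sd = sd(v))   # from a normal model (assumption)
```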

random numbers and seeded random numbers; set.seed()
normally distributed random numbers; rnorm()
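
To illustrate why seeding matters, the following sketch draws the same normally distributed numbers twice; the seed value is arbitrary:

```r
set.seed(112358)                  # arbitrary seed value
a <- rnorm(5, mean = 0, sd = 1)

set.seed(112358)                  # resetting the same seed replays the sequence
b <- rnorm(5, mean = 0, sd = 1)

d <- rnorm(5)                     # continuing the stream gives new numbers
identical(a, b)                   # the seeded draws are reproducible
```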

simple plots
- abline() to draw lines on plots with parameters h= ... or v = ...

scatterplot
- empty plots and overplotting with
  - points()
  - lines()
  - segments()
  - text()
boxplot
barplot
colors
- color names
- colors as hexcodes
- color palettes
- transparency
lines
- linetypes (lty=) / line width (lwd=)
plotting characters (pch=)
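
The plotting elements listed above can be combined roughly as follows. The data values, colors, and label positions are arbitrary choices for illustration, not those of PlottingReference.R:

```r
if (!interactive()) pdf(NULL)      # draw to a null device when run as a script

x <- 1:10
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 20.2)  # made-up values

plot(x, y, type = "n", xlab = "x", ylab = "y")        # empty frame
abline(h = 10, v = 5, lty = 3, col = "grey50")        # horizontal and vertical guides
segments(x, y, x, 2 * x, col = "grey70")              # connect each point to the line y = 2x
lines(x, 2 * x, lty = 2, lwd = 2, col = "steelblue")  # dashed line, double width
points(x, y, pch = 21, col = "firebrick",
       bg = "#FF000055")                              # hexcode with alpha gives transparency
text(2, 18, labels = "y = 2x", pos = 4)               # annotation to the right of (2, 18)

pal <- colorRampPalette(c("white", "darkblue"))(5)    # a five-step color palette
```

Starting from an empty frame (type = "n") keeps the axes under control while each layer is added separately.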

hexbin package

synthetic data is useful
linear regression
- retrieving parameters; lm()
- analyzing quality; resid()
- plotting prediction and confidence intervals
non-linear regression
- set up a formula
- initiate with starting values
- plot results
MIC (maximal information coefficient) as an alternative to Pearson correlation
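
A sketch of the linear and non-linear regression steps above, using synthetic data so that the true parameters are known. The model forms, parameter values, and starting values are assumptions chosen for illustration; they are not taken from EDA_Regression.R:

```r
set.seed(42)                                   # arbitrary seed
if (!interactive()) pdf(NULL)                  # null device when run as a script

# Linear case: synthetic data with known intercept 3 and slope 2.
x <- runif(50, min = 0, max = 10)
y <- 3 + 2 * x + rnorm(50, sd = 1)
fitLin <- lm(y ~ x)
coef(fitLin)                                   # retrieve the fitted parameters
res <- resid(fitLin)                           # residuals, for judging fit quality

newX    <- data.frame(x = seq(0, 10, length.out = 20))
confInt <- predict(fitLin, newX, interval = "confidence")  # columns fit, lwr, upr
predInt <- predict(fitLin, newX, interval = "prediction")

plot(x, y)
abline(fitLin, col = "red")
matlines(newX$x, confInt[, c("lwr", "upr")], lty = 2, col = "blue")
matlines(newX$x, predInt[, c("lwr", "upr")], lty = 3, col = "grey40")

# Non-linear case: exponential decay with known A = 5 and k = 0.4.
tm <- seq(0, 10, by = 0.5)
z  <- 5 * exp(-0.4 * tm) + rnorm(length(tm), sd = 0.05)
fitNls <- nls(z ~ A * exp(-k * tm),            # set up a formula
              start = list(A = 4, k = 0.5))    # initiate with starting values
coef(fitNls)

plot(tm, z)
lines(tm, predict(fitNls), col = "red")        # plot the fitted curve
```

Because the data are synthetic, the recovered coefficients can be checked against the values used to generate them, which is the point of the "synthetic data is useful" note.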

Progress Notes Friday

Selected objectives we covered during the workshop:

principles of PCA
interpreting a PCA as a projection
calculated PCA for crabs morphometric data and for gene expression profiles
discussed the output of prcomp()
plotted rotations and discussed their interpretation ("eigengenes")
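
As a hedged illustration of the prcomp() output discussed above, the following uses the iris measurements that ship with R as a stand-in; the workshop itself used the crabs data and gene expression profiles, which are not reproduced here:

```r
if (!interactive()) pdf(NULL)              # null device when run as a script

dat <- iris[, 1:4]                         # stand-in morphometric data (assumption)
pca <- prcomp(dat, center = TRUE, scale. = TRUE)

summary(pca)                               # proportion of variance per component
pca$sdev                                   # standard deviations of the components
pca$rotation                               # loadings: one column per component
head(pca$x)                                # the data projected onto the components

# PCA as a projection: centred, scaled data multiplied by the rotation matrix.
projected <- scale(dat) %*% pca$rotation

plot(pca$x[, 1], pca$x[, 2], col = as.integer(iris$Species),
     xlab = "PC1", ylab = "PC2")
```

The manual product reproduces pca$x, which makes the "projection" interpretation concrete.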

parallel coordinate plots for high-dimensional data; matplot()

normalizing data (subtract mean and divide by standard deviation)
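
The normalization described above, written out by hand and compared with the built-in scale(); the matrix is a made-up example with two columns on different scales:

```r
m <- matrix(c(1, 2, 3, 4,
              10, 20, 30, 40), ncol = 2)   # two made-up columns

# Subtract the column mean and divide by the column standard deviation.
manual   <- apply(m, 2, function(v) (v - mean(v)) / sd(v))
viaScale <- scale(m)                       # built-in equivalent

colMeans(viaScale)                         # approximately 0 for each column
apply(viaScale, 2, sd)                     # 1 for each column
```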

plotting an empty frame and using points(), lines(), text() or other functions to plot elements under detailed, individual control

showing categories and features on a plot
- using color
- using shapes
- using size
- using text()

discussed confounding factors and how to recognize them through PCA

used PCA / Correlations etc. to discover similar elements (e.g. genes with similar expression profiles)

correlating against models (e.g. sine wave) as an alternative to PCA
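
A minimal sketch of correlating profiles against a model, assuming a sine wave sampled at twelve hypothetical time points and three synthetic profiles:

```r
set.seed(3)                                            # arbitrary seed
tp    <- seq(0, 2 * pi, length.out = 12)               # 12 hypothetical time points
model <- sin(tp)                                       # the model profile

profiles <- rbind(cyclic = sin(tp)  + rnorm(12, sd = 0.1),   # follows the model
                  anti   = -sin(tp) + rnorm(12, sd = 0.1),   # opposite phase
                  flat   = rnorm(12, sd = 0.1))              # noise only

r <- apply(profiles, 1, cor, y = model)                # one correlation per row
r[order(r, decreasing = TRUE)]                         # best-matching profiles first
```

Unlike PCA, this approach only finds what the model describes, so the choice of model is itself a hypothesis.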

using "embedding" (package "tsne") as alternative to correlations

clustering
- ...needs a measurable notion of "similarity"
- ... using distance "metrics" such as Euclidean distance, 1 − correlation, and many more

hierarchical clustering
- different linkage algorithms possible
- results in a dendrogram
- needs to cut dendrogram to define clusters and retrieve elements
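
The clustering steps above could be sketched like this. The data are synthetic (two well-separated groups), and the choice of average linkage and k = 2 is an assumption made for the example:

```r
set.seed(11)                                   # arbitrary seed
if (!interactive()) pdf(NULL)                  # null device when run as a script

# Two hypothetical groups of five elements each, four measurements per element.
m <- rbind(matrix(rnorm(5 * 4, mean = 0, sd = 0.3), nrow = 5),
           matrix(rnorm(5 * 4, mean = 3, sd = 0.3), nrow = 5))
rownames(m) <- paste0("e", 1:10)

dEuc <- dist(m, method = "euclidean")          # Euclidean distance between rows
dCor <- as.dist(1 - cor(t(m)))                 # 1 - correlation as a dissimilarity

hc <- hclust(dEuc, method = "average")         # other linkages: "complete", "single", "ward.D2"
plot(hc)                                       # the dendrogram

clusters <- cutree(hc, k = 2)                  # cut the dendrogram into two clusters
names(clusters)[clusters == 1]                 # retrieve the elements of cluster 1
```

Swapping dEuc for dCor, or changing the linkage method, is a quick way to see how much the result depends on these choices.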

many alternative methods exist: try more than one, experiment, and ideally find some external way to validate the clusters (e.g. functional annotation)

discussed ways to interact with plots (pick points with identify() and locator())

discussed principles of Hypothesis testing
- Error types
- p-values
- significance levels
- multiple testing problem
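
To make the multiple testing problem concrete, this sketch runs 1000 hypothetical t-tests in which the null hypothesis is true every time, then applies two standard corrections available through p.adjust():

```r
set.seed(5)                                       # arbitrary seed

# Both samples come from the same distribution, so every "hit" is a false positive.
p <- replicate(1000, t.test(rnorm(10), rnorm(10))$p.value)

nRaw  <- sum(p < 0.05)                                   # uncorrected
nBonf <- sum(p.adjust(p, method = "bonferroni") < 0.05)  # family-wise error control
nBH   <- sum(p.adjust(p, method = "BH") < 0.05)          # false discovery rate control
c(raw = nRaw, bonferroni = nBonf, BH = nBH)
```

At a significance level of 0.05, roughly 5% of the uncorrected tests are expected to appear significant by chance alone.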

examples from GEO dataset analysis
discussed simulation / permutation testing as alternatives for statistical tests
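
A permutation test can be sketched in a few lines. The two groups below are synthetic, and the difference of means is just one possible test statistic:

```r
set.seed(19)                       # arbitrary seed
a <- rnorm(20, mean = 0)           # hypothetical control values
b <- rnorm(20, mean = 2)           # hypothetical treatment values, shifted upward

observed <- mean(b) - mean(a)
pooled   <- c(a, b)

# Shuffle the group labels many times and record the difference each time.
permDiff <- replicate(5000, {
  idx <- sample(length(pooled), length(a))
  mean(pooled[-idx]) - mean(pooled[idx])
})

# Two-sided empirical p-value; adding 1 avoids reporting exactly zero.
pPerm <- (sum(abs(permDiff) >= abs(observed)) + 1) / (length(permDiff) + 1)
pPerm
```

The permutation distribution stands in for the theoretical null distribution, so no assumption about normality is needed.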

EDA

Slides

EDA.pdf (pdf of slides)

Scripts

EDA.R (the main script for this session)
SubsettingQuizAnswers.R (Answers, in case you didn't already write them in your script)
PlottingReference.R

Data

GvHD.txt

Resources

Software

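Links

Regular_Expressions
Software_Development
R_knitr

Resources

Using Projects with R Studio: https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects
Software Carpentry: http://software-carpentry.org/
Best Practices for Scientific Computing, Wilson et al., PLoS Biology, Jan. 2014: http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001745
Version control in R Studio: https://support.rstudio.com/hc/en-us/articles/200532077-Version-Control-with-Git-and-SVN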

Regression

Slides

EDA_Regression.pdf (pdf of slides)

Scripts

EDA_Regression.R (the main script for this session)

Dimension Reduction

Slides

EDA_DimensionReduction.pdf (pdf of slides)

Scripts

EDA_DimensionReduction.R

Data

Logcho_237_4class.txt

Resources

Clustering

Slides

EDA_Clustering.pdf (pdf of slides)

Scripts

EDA_ClusteringExpressionData.R

Data

GSE26922.dat (Fallback data)

Hypothesis Testing

Slides

EDA_HypothesisTesting.pdf (pdf of slides)

Scripts

EDA_HypothesisTesting.R

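Resources

Tan_2015-NGSdifferentialTranscription.pdf
ErroneusAnalysesOfSignificance-NatureNeuroscience2011.pdf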

Generally useful links

Help and Information

The R help mailing list: https://stat.ethz.ch/mailman/listinfo/r-help
Rseek: the specialized search engine for R topics: http://rseek.org/
R questions on stackoverflow: http://stackoverflow.com/questions/tagged/r
The Comprehensive R Archive Network CRAN: http://cran.r-project.org/
The CRAN task-view collection: http://cran.r-project.org/web/views/
Bioconductor task views: http://www.bioconductor.org/packages/release/BiocViews.html

Resources

Weissgerber_(2015)_BeyondBarcharts.pdf

Notes

Retrieved from "http://steipe.biochemistry.utoronto.ca/abc/index.php?title=Workshops/Saskatoon_2015-Exploratory_Data_Analysis&oldid=7637"

@@ Line 17: / Line 17: @@
-<table width="50%">
+<table width="90%" cellpadding="5">
 <tr class="sh">
@@ Line 28: / Line 28: @@
 <tr class="s2">
-<td height="72">09:00 &mdash; 10:30</td>
+<td height="72">09:00 &ndash; 10:30</td>
-<td height="72"<br />Lecture and practicals: EDA</td>
+<td height="72">Lecture and practicals: EDA</td>
-<td height="72"<br />Lecture and practicals: Dimension reduction</td>
+<td height="72">Lecture and practicals: Dimension reduction</td>
 </tr>
 <tr class="s1">
-<td height="24">10:30 &mdash; 11:00</td>
+<td height="24">10:30 &ndash; 11:00</td>
 <td height="24" colspan="2">Coffee break</td>
 </tr>
 <tr class="s2">
-<td height="72">11:00 &mdash; 12:30</td>
+<td height="72">11:00 &ndash; 12:30</td>
 <td height="72">Lecture and practicals: EDA</td>
 <td height="72">Lecture and practicals: Clustering</td>
@@ Line 45: / Line 45: @@
 <tr class="s1">
-<td height="48">12:30 &mdash; 13:30</td>
+<td height="48">12:30 &ndash; 13:30</td>
 <td height="48" colspan="2">Lunch break</td>
 </tr>
 <tr class="s2">
-<td height="72">13:30 &mdash; 15:00</td>
+<td height="72">13:30 &ndash; 15:00</td>
 <td height="72">Lecture and practicals: Software development</td>
 <td height="72">Lecture and practicals:Clustering</td>
@@ Line 56: / Line 56: @@
 <tr class="s1">
-<td height="24">15:00 &mdash; 15:30</td>
+<td height="24">15:00 &ndash; 15:30</td>
 <td height="24" colspan="2">Coffee break</td>
 </tr>
 <tr class="s2">
-<td height="72">13:30 &mdash; 15:00</td>
+<td height="72">13:30 &ndash; 15:00</td>
 <td height="72">Lecture and practicals: Regression</td>
 <td height="72">Lecture and practicals: Hypothesis testing</td>
@@ Line 73: / Line 73: @@
 &nbsp;
-==Module 1==
+==General Resources==
+*[[Media:R_refcard-data-mining.pdf|A concise reference card of '''R''' functions for data mining (pdf)]]
+==Progress Notes Thursday==
+Selected objectives we covered during the workshop:
+* subsetting
+** selecting rows and columns by "index"
+** ... by rowname or columnname as string or vector of strings
+** ... using the $ sign for individual columns of a dataframe
+** using order() to get values sorted by some property
+* filtering
+** finding elements that contain a string with grep() (and using that to select rows)
+** finding elements that match a logical expression, such as ==, <, > etc.
+----
+* simple descriptive statistics
+** mean() / median()
+** using as.numeric(), as.logical() etc. to force evaluation as a  particular type
+** sd() / IQR()  / summary()
+** theoretical and empirical quantiles; quantile()
+* random numbers and seeded random numbers; set.seed()
+* normally distributed random numbers; rnorm()
+* simple plots
+**  abline() to draw lines on plots with parameters h= ... or v = ...
+----
+* scatterplot
+** empty plots and overplotting with
+*** points()
+*** lines()
+*** segments()
+*** text()
+* boxplot
+* barplot
+* colors
+** color names
+** colors as hexcodes
+** color palettes
+** transparency
+* lines
+** linetypes (lty=) / line width (lwd=)
+* plotting characters (pch=)
+* hexbin package
+----
+* synthetic data is useful
+* linear regression
+** retrieving parameters;  lm()
+** analyzing quality;  resid()
+**  plotting prediction and confidence intervals
+* non-linear regression
+** set up a formula
+** initiate with starting values
+** plot results
+* MIC as alternative to Pearson Correlation
+&nbsp;
+==Progress Notes Friday==
+Selected objectives we covered during the workshop:
+* principles of PCA
+* interpreting a PCA as a projection
+* calculated PCA for crabs morphometric data and for gene expression profiles
+* discussed the output of prcomp()
+* plotted rotations and discussed their interpretation ("eigengenes")
+* parallel coordinate plots for high-dimensional data matplot()
+* normalizing data (subtract mean and divide by standard deviation)
+* plotting an empty frame and using points(), lines(), text() or other functions to plot elements under detailed, individual control
+* showing categories and features on a plot
+** using color
+** using shapes
+** using size
+** using text()
+* discussed confounding factors and how to recognize them through PCA
+* used PCA / Correlations etc. to discover similar elements (e.g. genes with similar expression profiles)
+* correlating against models (e.g. sine wave) as an alternative to PCA
+* using "embedding" (package "tsne") as alternative to correlations
+----
+* clustering
+** ...needs a measurable notion of "similarity"
+** ... using distance "metrics" such euclidian, 1-correlation ... many more
+* hierarchical clustering
+** different linkage algorithms possible
+** results in a dendrogram
+** needs to cut dendrogram to define clusters and retrieve elements
+* many alternatives, need to try more than one, experiment and find (ideally) some external way to validate (e.g. functional annotation)
+* discussed ways to interact with plots (pick points with identify() and locate())
+----
+* discussed principles of Hypothesis testing
+** Error types
+** pValues
+** significance levels
+** multiple testing problem
+* examples from GEO dataset analysis
+* discussed simulation / permutation testing as alternatives for statistical tests
+&nbsp;
+==EDA==
+;Slides
+*[[Media:EDA.pdf|EDA.pdf (pdf of slides)]]
+;Scripts
+*[[Media:EDA.R|'''EDA.R''' (the main script for this session)]]
+*<small>&nbsp;&nbsp;&nbsp;&nbsp;[[Media:SubsettingQuizAnswers.R|SubsettingQuizAnswers.R]]&nbsp;&nbsp;(Answers, in case you didn't already write them in your script)</small>
+*[[Media:PlottingReference.R|PlottingReference.R]]
+;Data
+*[[Media:GvHD.txt|GvHD.txt]]
+;Resources
+&nbsp;
+==Software==
+;Links
+*[[Regular_Expressions]]
+*[[Software_Development]]
+*[[R_knitr]]
+;Resources
+*[https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects '''Using Projects with R Studio''']
+*[http://software-carpentry.org/ '''Software Carpentry''']
+**[http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001745 Best Practices for Scientific Computing, Wilson ''et al.'', PLoS Biology, Jan. 2014]
+*[https://support.rstudio.com/hc/en-us/articles/200532077-Version-Control-with-Git-and-SVN '''Version control in R Studio''']
+&nbsp;
+==Regression==
+;Slides
+*[[Media:EDA_Regression.pdf|EDA_Regression.pdf (pdf of slides)]]
 ;Scripts
-<!-- *[[Media:ScriptTemplate.R|ScriptTemplate.R]] -->
+*[[Media:EDA_Regression.R|'''EDA_Regression.R''' (the main script for this session)]]
 &nbsp;
+==Dimension Reduction==
+;Slides
+*[[Media:EDA_DimensionReduction.pdf|EDA_DimensionReduction.pdf (pdf of slides)]]
+;Scripts
+*[[Media:EDA_DimensionReduction.R|EDA_DimensionReduction.R]]
+;Data
+*[[Media:Logcho_237_4class.txt|Logcho_237_4class.txt]]
+;Resources
+&nbsp;
+==Clustering==
+;Slides
+*[[Media:EDA_Clustering.pdf|EDA_Clustering.pdf (pdf of slides)]]
+;Scripts
+*[[Media:EDA_ClusteringExpressionData.R|EDA_ClusteringExpressionData.R]]
+;Data
+*[[Media:GSE26922.dat|GSE26922.dat (Fallback data)]]
+&nbsp;
+==Hypothesis Testing==
+;Slides
+*[[Media:EDA_HypothesisTesting.pdf|EDA_HypothesisTesting.pdf (pdf of slides)]]
+;Scripts
+*[[Media:EDA_HypothesisTesting.R|EDA_HypothesisTesting.R]]
+;Resources
+*[[Media:Tan_2015-NGSdifferentialTranscription.pdf|Tan_2015-NGSdifferentialTranscription.pdf]]
+*[[Media:ErroneusAnalysesOfSignificance-NatureNeuroscience2011.pdf|ErroneusAnalysesOfSignificance-NatureNeuroscience2011.pdf]]
+&nbsp;
+==Generally useful links==
+;Help and Information
+* The '''R''' help mailing list:  https://stat.ethz.ch/mailman/listinfo/r-help
+* '''Rseek''': the specialized search engine for '''R''' topics: http://rseek.org/
+* '''R''' questions on stackoverflow:   http://stackoverflow.com/questions/tagged/r
+* The Comprehensive '''R''' Archive Network '''CRAN''':  http://cran.r-project.org/
+* The '''CRAN''' task-view collection: http://cran.r-project.org/web/views/
+* '''Bioconductor''' task views:  http://www.bioconductor.org/packages/release/BiocViews.html
+;Resources
+*[[Media:Weissgerber_(2015)_BeyondBarcharts.pdf|Weissgerber_(2015)_BeyondBarcharts.pdf]]
+&nbsp;
 ==Notes==