Workshops/Saskatoon 2015-Exploratory Data Analysis


Introduction to Exploratory Data Analysis with R



 

Schedule

Please note: this schedule is a rough guideline only. We will adapt it to class needs as we proceed.


| Time | Thursday | Friday |
|---|---|---|
| 09:00 – 10:30 | Lecture and practicals: EDA | Lecture and practicals: Dimension reduction |
| 10:30 – 11:00 | Coffee break | Coffee break |
| 11:00 – 12:30 | Lecture and practicals: EDA | Lecture and practicals: Clustering |
| 12:30 – 13:30 | Lunch break | Lunch break |
| 13:30 – 15:00 | Lecture and practicals: Software development | Lecture and practicals: Clustering |
| 15:00 – 15:30 | Coffee break | Coffee break |
| 15:30 – 17:00 | Lecture and practicals: Regression | Lecture and practicals: Hypothesis testing |


 


General Resources



Progress Notes Thursday

Selected objectives we covered during the workshop:

  • subsetting
    • selecting rows and columns by index
    • selecting by row name or column name, given as a string or a vector of strings
    • selecting individual columns of a data frame with the $ operator
    • using order() to get values sorted by some property
  • filtering
    • finding elements that contain a string with grep() (and using that to select rows)
    • finding elements that match a logical expression, such as ==, <, > etc.
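
The subsetting and filtering idioms listed above can be sketched as follows. This is a minimal illustration, not the workshop's own script: the data frame `df`, its column names, and its values are made up for the example.

```r
# Hypothetical toy data frame; names and values are invented for illustration.
df <- data.frame(
  gene = c("abc1", "xyz2", "abc3", "def4"),
  expr = c(2.5, 0.3, 1.8, 4.1),
  stringsAsFactors = FALSE
)
rownames(df) <- c("r1", "r2", "r3", "r4")

# Subsetting by index: rows 1 and 3 of column 2
byIndex <- df[c(1, 3), 2]

# Subsetting by row name and column name (a string or a vector of strings)
byName <- df[c("r2", "r4"), "expr"]

# A single column of a data frame with the $ operator
exprCol <- df$expr

# order() returns the row indices that sort the data by a property
sorted <- df[order(df$expr), ]

# Filtering with grep(): indices of elements that contain the string "abc"
abcRows <- df[grep("abc", df$gene), ]

# Filtering with a logical expression
highRows <- df[df$expr > 2, ]
```

Note that `order()` and `grep()` return index vectors, whereas a comparison such as `df$expr > 2` returns a logical vector. Both kinds of vector can be used inside the square brackets to select rows.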




  • simple descriptive statistics
    • mean() / median()
    • using as.numeric(), as.logical() etc. to coerce values to a particular type
    • sd() / IQR() / summary()
    • theoretical and empirical quantiles; quantile()
  • random numbers and seeded random numbers; set.seed()
  • normally distributed random numbers; rnorm()
  • simple plots
    • abline() to draw horizontal or vertical lines on plots with the parameters h = ... or v = ...
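
A short sketch of the descriptive statistics and random number functions listed above. The seed, the sample size, and the distribution parameters are arbitrary choices for illustration. `qnorm()` is not named in the notes; it is used here on the assumption that "theoretical quantiles" refers to the quantiles of the normal distribution.

```r
# Seeded random numbers: the same seed reproduces the same sample
set.seed(112358)                      # arbitrary seed
x <- rnorm(1000, mean = 10, sd = 2)   # normally distributed random numbers

set.seed(112358)
y <- rnorm(1000, mean = 10, sd = 2)   # identical to x because the seed was reset

# Simple descriptive statistics
xMean   <- mean(x)
xMedian <- median(x)
xSD     <- sd(x)
xIQR    <- IQR(x)
summary(x)

# Empirical quantiles of the sample vs. theoretical quantiles of N(10, 2)
probs       <- c(0.25, 0.5, 0.75)
empirical   <- quantile(x, probs = probs)
theoretical <- qnorm(probs, mean = 10, sd = 2)

# Coercing values to a particular type
asNum <- as.numeric("3.14")
asLog <- as.logical(c("TRUE", "FALSE", "T"))

# A simple plot with reference lines added by abline()
hist(x)
abline(v = xMean, col = "red")   # vertical line at the sample mean
abline(h = 50, lty = 2)          # horizontal line at a count of 50
```

With a sample of this size, the empirical quantiles should lie close to the theoretical ones, but they will not match exactly.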




  • scatterplot
    • empty plots and overplotting with
      • points()
      • lines()
      • segments()
      • text()
  • boxplot
  • barplot
  • colors
    • color names
    • colors as hexcodes
    • color palettes
    • transparency
  • lines
    • linetypes (lty=) / line width (lwd=)
  • plotting characters (pch=)
  • hexbin package
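
The plotting items above can be combined in one small sketch that uses base R graphics only. The data are synthetic and all colors, line types, and labels are arbitrary choices for illustration. The hexbin package is left out because it is an add-on that may not be installed.

```r
set.seed(42)                       # arbitrary seed
x <- rnorm(50)
y <- x + rnorm(50, sd = 0.5)

# Empty plot: type = "n" sets up the axes without drawing any data
plot(x, y, type = "n", main = "Overplotting on an empty frame")

# Color as a hex code; the last two digits (alpha) set the transparency
semiRed <- "#FF000055"
points(x, y, pch = 16, col = semiRed)       # pch selects the plotting character

# Connect the points in order of x, with a dashed, thin line
ord <- order(x)
lines(x[ord], y[ord], lty = 2, lwd = 0.5, col = "steelblue")   # a color name

# A single line segment along the diagonal
segments(x0 = min(x), y0 = min(x), x1 = max(x), y1 = max(x), lwd = 2)

# Label the first point, to the right of it
text(x[1], y[1], labels = "first point", pos = 4)

# A color palette: colorRampPalette() returns a function that
# interpolates the requested number of colors
pal <- colorRampPalette(c("white", "darkblue"))(5)

# Boxplot and barplot using colors from the palette
boxplot(list(x = x, y = y), col = pal[2:3])
barplot(table(cut(x, breaks = 4)), col = pal[2:5])
```

Starting from an empty frame makes the drawing order explicit: each later call to `points()`, `lines()`, `segments()`, or `text()` paints over what is already on the plot.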




  • synthetic data is useful for testing an analysis, because the true parameters are known
  • linear regression
    • fitting a model and retrieving its parameters; lm()
    • analyzing the quality of the fit; resid()
    • plotting prediction and confidence intervals
  • non-linear regression
    • set up a formula
    • initiate with starting values
    • plot results
  • MIC (maximal information coefficient) as an alternative to Pearson correlation
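
The regression steps above can be sketched on synthetic data, where the true parameters are known and can be compared with the fitted ones. The models, parameter values, noise levels, and starting values below are assumptions chosen for illustration, not the workshop's own examples. `coef()`, `predict()`, and `nls()` are not named in the notes; they are the base R functions used here to retrieve parameters, compute intervals, and fit the non-linear model. MIC is not shown because it needs an add-on package.

```r
set.seed(2015)                                   # arbitrary seed

# Linear regression on synthetic data: true intercept 3, true slope 0.5
x <- seq(1, 20, length.out = 40)
y <- 3 + 0.5 * x + rnorm(length(x), sd = 1)

fit <- lm(y ~ x)
linCoef <- coef(fit)          # retrieve the fitted parameters
res <- resid(fit)             # residuals, for checking the quality of the fit

# Confidence and prediction intervals along the range of x
newX <- data.frame(x = seq(min(x), max(x), length.out = 100))
pc <- predict(fit, newdata = newX, interval = "confidence")
pp <- predict(fit, newdata = newX, interval = "prediction")

plot(x, y)
abline(fit, col = "red")
lines(newX$x, pc[, "lwr"], lty = 2)   # confidence band (dashed)
lines(newX$x, pc[, "upr"], lty = 2)
lines(newX$x, pp[, "lwr"], lty = 3)   # prediction band (dotted)
lines(newX$x, pp[, "upr"], lty = 3)

# Non-linear regression: exponential decay, true a = 5, true k = 0.4
tm <- seq(0, 10, length.out = 50)
z <- 5 * exp(-0.4 * tm) + rnorm(length(tm), sd = 0.1)

# Set up a formula and initiate the fit with starting values
nfit <- nls(z ~ a * exp(-k * tm), start = list(a = 4, k = 0.5))
nlCoef <- coef(nfit)

# Plot the results
plot(tm, z)
lines(tm, predict(nfit), col = "red")
```

The prediction band is always wider than the confidence band, because it includes the scatter of individual observations in addition to the uncertainty of the fitted line.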


 

EDA

Slides
Scripts


Data


Resources


 

Software

Links


Resources


 

Regression

Slides
Scripts


 


Dimension Reduction

Slides


Scripts


Data


Resources


 

Clustering

Slides


Scripts


Data


 


Hypothesis Testing

Slides


Scripts


Resources

 


Generally useful links

Help and Information


Resources


 


Notes