Expected Preparations:
Keywords: Subsetting concept; the [ ] operator; the [[ ]] operator; the $ operator; filtering
Objectives:
This unit will …
Outcomes:
After working through this unit you …
Deliverables:
Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don’t overlook these.
Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.
Evaluation: NA: This unit is not evaluated for course marks.
Subsetting and filtering are among the most important operations with data. R provides powerful syntax for these operations. Learn about and practice them in this unit.
Task…
Open the R-Exercise_BasicSetup project in RStudio if you don’t already have it open.
Type init() as instructed after the project has loaded.
Re-create the plasmidData data frame that you worked with in the RPR-Objects-Data_frames unit (if it is not still defined in your Workspace), or use the code below:
plasmidData <- data.frame(Name = c("pUC19", "pBR322", "pACYC184", "pMAL-p5x"),
                          Size = c(2686, 4361, 4245, 5752),
                          Marker = c("Amp", "Amp, Tet", "Cam", "Amp"),
                          Ori = c("ColE1", "ColE1", "p15A", "pMB1"),
                          Sites = c("EcoRI, SacI, SmaI, BamHI, HindIII",
                                    "EcoRI, ClaI, HindIII",
                                    "ClaI, HindIII",
                                    "SacI, AvaI, HindIII"))
We have encountered “subsetting” before, but we really need to discuss this in more detail. It is one of the most important topics in R, since it is indispensable for selecting, transforming, and otherwise modifying data to prepare it for analysis. You have seen that we use square brackets to indicate individual elements in vectors and matrices. These square brackets are actually “operators”, and you can find more information about them in the help pages:
> ?"[" # Note that you need quotation marks around the operator for this.
Note especially:
[ ] “extracts” one or more elements defined within the brackets;
[[ ]] “extracts” a single element defined within the brackets;
$ “extracts” a single named element.
An “element” is not necessarily a scalar: it can be a row, a column, or a more complex data structure. But a “single element” cannot be a range or a collection.
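The difference between the three operators is easiest to see by checking what each one returns. The following is a minimal sketch; the small data frame `df` and the variable names `a`, `b`, and `d` are assumptions made up for this illustration and are not part of the course project.

```r
# A small data frame, assumed here only for illustration
df <- data.frame(Name = c("pUC19", "pBR322", "pACYC184"),
                 Size = c(2686, 4361, 4245))

a <- df["Size"]      # [ ] with one index: still a data frame (with one column)
b <- df[["Size"]]    # [[ ]]: the column itself, a numeric vector
d <- df$Size         # $: the same single element, selected by name

class(a)             # "data.frame"
class(b)             # "numeric"
identical(b, d)      # TRUE

# [ ] can return several elements at once ...
df[c("Name", "Size")]
# ... whereas [[ ]] and $ always return exactly one element.
```

In short: use `[ ]` when the result should keep the container's type, and `[[ ]]` or `$` when you want the contents of one element.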
Here are some examples of subsetting data from the plasmidData data frame we constructed previously. For the most part, this is review:
plasmidData[1, ]
plasmidData[2, ]
# we can extract more than one row by specifying
# the rows we want in a vector ...
plasmidData[c(1, 2), ]
# ... this works in any order ...
plasmidData[c(3, 1), ]
# ... and for any number of rows ...
plasmidData[c(1, 2, 1, 2, 1, 2), ]
# Same for columns
plasmidData[ , 2 ]
# We can select rows and columns by name if a name has been defined...
plasmidData[, "Name"]
plasmidData$Name # different syntax, same thing. This is the syntax I use most frequently.
# Watch this!
plasmidData$Name[plasmidData$Ori != "ColE1"]
# What happened here?
# plasmidData$Ori != "ColE1" is a logical expression, it gives a vector of TRUE/FALSE values:
plasmidData$Ori != "ColE1"
# ... insert this vector into the square brackets. R then returns all elements for
# which the corresponding vector element is TRUE.
# With this, we can "filter" for values
plasmidData$Size > 3000
plasmidData$Name[plasmidData$Size > 3000]
# Any operation that has TRUE or FALSE as a result can be used for filtering:
# - the equality operators == and !=
# - the comparison operators >, <, >=, and <=
# - %in%
# - grepl()
# - as.logical()
# plasmids that have only the Amp marker
plasmidData[plasmidData$Marker == "Amp", ]
# plasmids that don't have the Amp marker
plasmidData[plasmidData$Marker != "Amp", ] # is this correct?
# No! We need to search for the string "Amp" _within_ the elements of
# plasmidData$Marker. grep() makes this possible:
grep("Amp", plasmidData$Marker)
plasmidData[- grep("Amp", plasmidData$Marker), ] # note the "-" sign: _exclude_
# rows with this index
# We can use the same idea - accessing the rows of a data frame by
# a vector of indices - when we want to "sort" an object
# by some value. The function order() returns the indices that
# would put the values into sorted order. Remember this: not sort() but order().
order(plasmidData$Size)
plasmidData[order(plasmidData$Size), ]
# grep() matches substrings in strings and returns a vector of indices
grep("Tet", plasmidData$Marker)
plasmidData[grep("Tet", plasmidData$Marker), ]
plasmidData[grep("Tet", plasmidData$Marker), "Ori"]
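The examples above list `%in%` and `grepl()` as filtering tools without showing them in use. Here is a minimal sketch of both, plus a caution about negative indices. It works on a copy named `plasmids` (a name assumed for this illustration, so that your `plasmidData` is left untouched), and the search string "Kan" is a made-up marker that does not occur in the data.

```r
plasmids <- data.frame(Name   = c("pUC19", "pBR322", "pACYC184", "pMAL-p5x"),
                       Size   = c(2686, 4361, 4245, 5752),
                       Marker = c("Amp", "Amp, Tet", "Cam", "Amp"),
                       Ori    = c("ColE1", "ColE1", "p15A", "pMB1"))

# %in%: TRUE where the value is found in a set of candidates
sel <- plasmids$Ori %in% c("p15A", "pMB1")
plasmids$Name[sel]                        # "pACYC184" "pMAL-p5x"

# grepl(): like grep(), but returns TRUE/FALSE for every element
hasAmp <- grepl("Amp", plasmids$Marker)   # TRUE TRUE FALSE TRUE
noAmp  <- plasmids$Name[!hasAmp]          # "pACYC184"

# Logical vectors can be combined with & (and) and | (or)
bigAmp <- plasmids$Name[hasAmp & plasmids$Size > 4000]  # "pBR322" "pMAL-p5x"

# Caution with negative indices: if grep() finds nothing, it returns
# integer(0), and indexing with -integer(0) selects NO rows, not all rows.
nrow(plasmids[-grep("Kan", plasmids$Marker), ])    # 0
nrow(plasmids[!grepl("Kan", plasmids$Marker), ])   # 4
```

For "exclude" filters, negating a logical vector with `!grepl()` is therefore the safer habit, since it behaves sensibly even when nothing matches.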
Elements that can be extracted from an object also can be replaced. Simply assign the new value to the element.
( x <- sample(1:10) )
x[4] <- 99
x
( x <- x[order(x)] )
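Replacement also works with a logical filter or with row and column indices, not only with a single position. The sketch below assumes a small made-up data frame called `sizes`; the replacement values are hypothetical and chosen only to make the effect visible.

```r
x <- c(3, 8, 1, 10, 6)
x[x > 5] <- 0            # replace every element for which the condition is TRUE
x                        # 3 0 1 0 0

# The same idea applies to the cells of a data frame
sizes <- data.frame(Name = c("pUC19", "pBR322"),
                    Size = c(2686, 4361))
sizes$Size[sizes$Name == "pUC19"] <- 2700   # hypothetical value, selected by filter
sizes[2, "Size"] <- 4400                    # hypothetical value, row 2, column "Size"
sizes$Size                                  # 2700 4400
```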
Try your own subsetting ideas. Play with this. I find that even seasoned investigators have problems with subsetting their data, and if you become comfortable with the many ways of subsetting, you will be ahead of the game right away.
Task…
Practice, practice, practice. You need to develop an intuition about subsetting. This will help you tremendously to understand our code examples - but it will also help you develop ways to think about data.
The R-Exercise_BasicSetup project contains a file subsettingPractice.R
If in doubt, ask! If anything about this content is not clear to you, do not proceed but ask for clarification. If you have ideas about how to make this material better, let’s hear them. We are aiming to compile a list of FAQs for all learning units, and your contributions will count towards your participation marks.
Improve this page! If you have questions or comments, please post them on the Quercus Discussion board with a subject line that includes the name of the unit.
[END]