Difference between revisions of "RPR-Subsetting"

From "A B C"
Line 1: Line 1:
 
<div id="ABC">
 
<div id="ABC">
<div style="padding:5px; border:1px solid #000000; background-color:#f4d7b7; font-size:300%; font-weight:400; color: #000000; width:100%;">
+
<div style="padding:5px; border:1px solid #000000; background-color:#b3dbce; font-size:300%; font-weight:400; color: #000000; width:100%;">
 
Subsetting and filtering R objects
 
Subsetting and filtering R objects
<div style="padding:5px; margin-top:20px; margin-bottom:10px; background-color:#f4d7b7; font-size:30%; font-weight:200; color: #000000; ">
+
<div style="padding:5px; margin-top:20px; margin-bottom:10px; background-color:#b3dbce; font-size:30%; font-weight:200; color: #000000; ">
 
(Subsetting with the [], [[]], and $ operators, filtering)
 
(Subsetting with the [], [[]], and $ operators, filtering)
 
</div>
 
</div>
Line 10: Line 10:
  
  
<div style="padding:5px; border:1px solid #000000; background-color:#f4d7b733; font-size:85%;">
+
<div style="padding:5px; border:1px solid #000000; background-color:#b3dbce33; font-size:85%;">
 
<div style="font-size:118%;">
 
<div style="font-size:118%;">
 
<b>Abstract:</b><br />
 
<b>Abstract:</b><br />
Line 38: Line 38:
 
<b>Deliverables:</b><br />
 
<b>Deliverables:</b><br />
 
<section begin=deliverables />
 
<section begin=deliverables />
<!-- included from "./data/ABC-unit_components.txt", section: "deliverables-time_management" -->
+
<ul>
 
<li><b>Time management</b>: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.</li>
 
<li><b>Time management</b>: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.</li>
<!-- included from "./data/ABC-unit_components.txt", section: "deliverables-journal" -->
 
 
<li><b>Journal</b>: Document your progress in your [[FND-Journal|Course Journal]]. Some tasks may ask you to include specific items in your journal. Don't overlook these.</li>
 
<li><b>Journal</b>: Document your progress in your [[FND-Journal|Course Journal]]. Some tasks may ask you to include specific items in your journal. Don't overlook these.</li>
<!-- included from "./data/ABC-unit_components.txt", section: "deliverables-insights" -->
 
 
<li><b>Insights</b>: If you find something particularly noteworthy about this unit, make a note in your [[ABC-Insights|'''insights!''' page]].</li>
 
<li><b>Insights</b>: If you find something particularly noteworthy about this unit, make a note in your [[ABC-Insights|'''insights!''' page]].</li>
 +
</ul>
 
<section end=deliverables />
 
<section end=deliverables />
 
<!-- ============================  -->
 
<!-- ============================  -->
Line 49: Line 48:
 
<section begin=prerequisites />
 
<section begin=prerequisites />
 
<b>Prerequisites:</b><br />
 
<b>Prerequisites:</b><br />
<!-- included from "./data/ABC-unit_components.txt", section: "notes-prerequisites" -->
 
 
This unit builds on material covered in the following prerequisite units:<br />
 
This unit builds on material covered in the following prerequisite units:<br />
 
*[[RPR-Objects-Lists|RPR-Objects-Lists (R "Lists")]]
 
*[[RPR-Objects-Lists|RPR-Objects-Lists (R "Lists")]]
Line 59: Line 57:
  
  
{{REVISE}}
 
  
 
{{Smallvspace}}
 
{{Smallvspace}}
Line 69: Line 66:
  
  
 +
=== Evaluation ===
 +
<b>Evaluation: NA</b><br />
 +
<div style="margin-left: 2rem;">This unit is not evaluated for course marks.</div>
 
== Contents ==
 
== Contents ==
<!-- included from "./components/RPR-Subsetting.components.txt", section: "contents" -->
 
  
 
{{task| 1=
 
{{task| 1=
 
* Load the <code>R-Exercise_BasicSetup</code> project in RStudio if you don't already have it open.
 
* Load the <code>R-Exercise_BasicSetup</code> project in RStudio if you don't already have it open.
 
* Type <code>init()</code> as instructed after the project has loaded.
 
* Type <code>init()</code> as instructed after the project has loaded.
* Recreate the <code>plasmidData</code> data frame, if it is not still defined in your Workspace.
+
* Recreate the <code>plasmidData</code> data frame that you worked with in the [[RPR-Objects-Data_frames]] unit (if it is not still defined in your Workspace):
 +
<pre>
 +
plasmidData <- data.frame(Name = c("pUC19", "pBR322", "pACYC184", "pMAL-p5x"),
 +
                          Size = c(2686, 4361, 4245, 5752),
 +
                          Marker = c("Amp", "Amp, Tet", "Cam", "Amp"),
 +
                          Ori = c("ColE1", "ColE1", "p15A", "pMB1"),
 +
                          Sites = c("EcoRI, SacI, SmaI, BamHI, HindIII",
 +
                                    "EcoRI, ClaI, HindIII",
 +
                                    "ClaI, HindIII",
 +
                                    "SacI, AvaI, HindIII"))
 +
</pre>
 
* Continue below.
 
* Continue below.
 
}}
 
}}
Line 84: Line 93:
 
We have encountered "subsetting" before, but we really need to discuss this in more detail. It is one of the most important topics of '''R''' since it is indispensable to select, transform, and otherwise modify data to prepare it for analysis. You have seen that we use square brackets to indicate individual elements in vectors and matrices. These square brackets are actually "operators", and you can find more information about them in the help  pages:
 
We have encountered "subsetting" before, but we really need to discuss this in more detail. It is one of the most important topics of '''R''' since it is indispensable to select, transform, and otherwise modify data to prepare it for analysis. You have seen that we use square brackets to indicate individual elements in vectors and matrices. These square brackets are actually "operators", and you can find more information about them in the help  pages:
  
<source lang="rsplus">
+
<pre>
 
> ?"["    # Note that you need quotation marks around the operator for this.
 
> ?"["    # Note that you need quotation marks around the operator for this.
</source>
+
</pre>
  
 
Note especially:
 
Note especially:
* <code>[ ]</code> "extracts" one or more elements defined within the brackets;
+
- <code>[ ]</code> "extracts" one or more elements defined within the brackets;
* <code>[[ ]]</code> "extracts" a single element defined within the brackets;
+
- <code>[[ ]]</code> "extracts" a single element defined within the brackets;
* <code>$</code> "extracts" a single <u>named</u> element.
+
- <code>$</code> "extracts" a single <u>named</u> element.
  
 
"Elements" are not necessarily scalars, but can apply to a row, column, or more complex data structure. But a "single element" can't be a range, or collection.
 
"Elements" are not necessarily scalars, but can apply to a row, column, or more complex data structure. But a "single element" can't be a range, or collection.
Line 99: Line 108:
 
Here are some examples of subsetting data from the <code>plasmidData</code> data frame we constructed previously. For the most part, this is review:
 
Here are some examples of subsetting data from the <code>plasmidData</code> data frame we constructed previously. For the most part, this is review:
  
<source lang="rsplus">
+
<pre>
 
plasmidData[1, ]
 
plasmidData[1, ]
 
plasmidData[2, ]
 
plasmidData[2, ]
Line 125: Line 134:
 
plasmidData$Name[plasmidData$Ori != "ColE1"]
 
plasmidData$Name[plasmidData$Ori != "ColE1"]
 
# What happened here?
 
# What happened here?
# plasmidData$Ori != "ColE1" is a logical expression, it gives a vector of TRUE/FALSE values
+
# plasmidData$Ori != "ColE1" is a logical expression, it gives a vector of TRUE/FALSE values:
 
plasmidData$Ori != "ColE1"
 
plasmidData$Ori != "ColE1"
  
# We insert this vector into the square brackets. R then returns all rows for
+
# ... insert this vector into the square brackets. R then returns all rows for
# which the vector is TRUE.
+
# which the corresponding vector element is TRUE.
  
# In this way we can "filter" for values
+
# With this, we can "filter" for values
 
plasmidData$Size > 3000
 
plasmidData$Size > 3000
 
plasmidData$Name[plasmidData$Size > 3000]
 
plasmidData$Name[plasmidData$Size > 3000]
 +
 +
# Any operation that has TRUE or FALSE as a result can be used for filtering:
 +
#  - the equality operators == and !=
 +
#  - the comparison operators  >, <, >=, and <=
 +
#  - %in%
 +
#  - grepl()
 +
#  - as.logical()
 +
 +
# plasmids that have only the Amp marker
 +
plasmidData[plasmidData$Marker == "Amp", ]
 +
 +
# plasmids that don't have the Amp marker
 +
plasmidData[!grepl("Amp", plasmidData$Marker), ]
 +
 +
  
 
# This principle is what we use when we want to "sort" an object
 
# This principle is what we use when we want to "sort" an object
Line 146: Line 170:
 
plasmidData[grep("Tet", plasmidData$Marker), "Ori"]
 
plasmidData[grep("Tet", plasmidData$Marker), "Ori"]
  
</source>
+
</pre>
  
 
Elements that can be extracted from an object also can be replaced. Simply assign the new value to the element.
 
Elements that can be extracted from an object also can be replaced. Simply assign the new value to the element.
  
<source lang="rsplus">
+
<pre>
 
( x <- sample(1:10) )
 
( x <- sample(1:10) )
 
x[4] <- 99
 
x[4] <- 99
Line 156: Line 180:
 
( x <- x[order(x)] )
 
( x <- x[order(x)] )
  
</source>
+
</pre>
  
 
Try your own subsetting ideas. Play with this. I find that even seasoned investigators have problems with subsetting their data and if you become comfortable with the many ways of subsetting, you will be ahead of the game right away.
 
Try your own subsetting ideas. Play with this. I find that even seasoned investigators have problems with subsetting their data and if you become comfortable with the many ways of subsetting, you will be ahead of the game right away.
Line 164: Line 188:
 
===Subsetting practice===
 
===Subsetting practice===
  
{{task| 1=
+
{{task|1=
 +
Practice, practice, [https://www.gocomics.com/sarahs-scribbles/2017/12/20 practice]. You need to develop an intuition about subsetting. This will help you tremendously to understand our code examples - but it will also help you develop ways to '''think''' about data.
 +
 
 
* The <code>R-Exercise_BasicSetup</code> project contains a file <code>subsettingPractice.R</code>
 
* The <code>R-Exercise_BasicSetup</code> project contains a file <code>subsettingPractice.R</code>
 
* Open the file and work through it.
 
* Open the file and work through it.
 +
 
}}
 
}}
 
 
 
== Self-evaluation ==
 
<!--
 
=== Question 1===
 
 
Question ...
 
 
<div class="toccolours mw-collapsible mw-collapsed" style="width:800px">
 
Answer ...
 
<div class="mw-collapsible-content">
 
Answer ...
 
 
</div>
 
  </div>
 
 
  {{Vspace}}
 
 
-->
 
== Notes ==
 
<!-- included from "./components/RPR-Subsetting.components.txt", section: "notes" -->
 
<!-- included from "./data/ABC-unit_components.txt", section: "notes" -->
 
<references />
 
== Further reading, links and resources ==
 
<!-- {{#pmid: 19957275}} -->
 
<!-- {{WWW|WWW_GMOD}} -->
 
<!-- <div class="reference-box">[http://www.ncbi.nlm.nih.gov]</div> -->
 
  
 
{{Vspace}}
 
{{Vspace}}
  
 
<!-- included from "./data/ABC-unit_components.txt", section: "ABC-unit_ask" -->
 
 
----
 
 
{{Vspace}}
 
 
<b>If in doubt, ask!</b> If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.
 
 
----
 
 
{{Vspace}}
 
  
 
<div class="about">
 
<div class="about">
Line 220: Line 207:
 
:2017-08-05
 
:2017-08-05
 
<b>Modified:</b><br />
 
<b>Modified:</b><br />
:2018-05-05
+
:2020-09-18
 
<b>Version:</b><br />
 
<b>Version:</b><br />
:1.0.1
+
:1.0.2
 
<b>Version history:</b><br />
 
<b>Version history:</b><br />
 +
*1.0.2 Maintenance
 
*1.0.1 Maintenance
 
*1.0.1 Maintenance
 
*1.0 Completed to first live version
 
*1.0 Completed to first live version
 
*0.1 Material collected from previous tutorial
 
*0.1 Material collected from previous tutorial
 
</div>
 
</div>
[[Category:ABC-units]]
 
<!-- included from "./data/ABC-unit_components.txt", section: "ABC-unit_footer" -->
 
  
 
{{CC-BY}}
 
{{CC-BY}}
  
 +
[[Category:ABC-units]]
 
</div>
 
</div>
 
<!-- [END] -->
 
<!-- [END] -->

Revision as of 06:50, 19 September 2020

Subsetting and filtering R objects

(Subsetting with the [], [[]], and $ operators, filtering)


 


Abstract:

Subsetting and filtering are among the most important operations with data. R provides powerful syntax for these operations. Learn about and practice them in this unit.


Objectives:
This unit will ...

  • ... introduce subsetting principles;
  • ... practice them on data;

Outcomes:
After working through this unit you ...

  • ... can subset and filter data according to six different principles.

Deliverables:

  • Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
  • Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.
  • Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.

Prerequisites:
This unit builds on material covered in the following prerequisite units:

  • RPR-Objects-Lists (R "Lists")


Evaluation

Evaluation: NA

This unit is not evaluated for course marks.

Contents

Task:

  • Load the R-Exercise_BasicSetup project in RStudio if you don't already have it open.
  • Type init() as instructed after the project has loaded.
  • Recreate the plasmidData data frame that you worked with in the RPR-Objects-Data_frames unit (if it is not still defined in your Workspace):
plasmidData <- data.frame(Name = c("pUC19", "pBR322", "pACYC184", "pMAL-p5x"),
                          Size = c(2686, 4361, 4245, 5752),
                          Marker = c("Amp", "Amp, Tet", "Cam", "Amp"),
                          Ori = c("ColE1", "ColE1", "p15A", "pMB1"),
                          Sites = c("EcoRI, SacI, SmaI, BamHI, HindIII",
                                    "EcoRI, ClaI, HindIII",
                                    "ClaI, HindIII",
                                    "SacI, AvaI, HindIII"))
  • Continue below.


Subsetting

We have encountered "subsetting" before, but we really need to discuss this in more detail. It is one of the most important topics of R since it is indispensable to select, transform, and otherwise modify data to prepare it for analysis. You have seen that we use square brackets to indicate individual elements in vectors and matrices. These square brackets are actually "operators", and you can find more information about them in the help pages:

> ?"["     # Note that you need quotation marks around the operator for this.

Note especially:

  • [ ] "extracts" one or more elements defined within the brackets;
  • [[ ]] "extracts" a single element defined within the brackets;
  • $ "extracts" a single named element.
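The difference between the three operators is easiest to see side by side. The following is a minimal sketch: pd is a shortened stand-in for plasmidData (named differently here so that your own plasmidData is not overwritten), and pl is a hypothetical example list that is not part of the course material.

```r
# pd: a shortened stand-in for plasmidData (illustrative name)
pd <- data.frame(Name = c("pUC19", "pBR322", "pACYC184", "pMAL-p5x"),
                 Size = c(2686, 4361, 4245, 5752),
                 stringsAsFactors = FALSE)

# [ ] keeps the container: selecting a column this way returns a data frame
class(pd["Size"])        # "data.frame"

# [[ ]] and $ extract the single element itself: here, a numeric vector
class(pd[["Size"]])      # "numeric"
class(pd$Size)           # "numeric"

# The same holds for lists. pl is a hypothetical example list.
pl <- list(name = "pUC19", size = 2686, markers = c("Amp"))
pl[c("name", "size")]    # still a list, now with two elements
pl[["size"]]             # the number 2686
pl$size                  # same thing
```

A useful rule of thumb: use [ ] when you want "a smaller object of the same kind", and [[ ]] or $ when you want "the thing inside".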

"Elements" are not necessarily scalars, but can apply to a row, column, or more complex data structure. But a "single element" can't be a range, or collection.


 

Here are some examples of subsetting data from the plasmidData data frame we constructed previously. For the most part, this is review:

plasmidData[1, ]
plasmidData[2, ]

# we can extract more than one row by specifying
# the rows we want in a vector ...
plasmidData[c(1, 2), ]

# ... this works in any order ...
plasmidData[c(3, 1), ]

# ... and for any number of rows ...
plasmidData[c(1, 2, 1, 2, 1, 2), ]


# Same for columns
plasmidData[ , 2 ]

# We can select rows and columns by name if a name has been defined...
plasmidData[, "Name"]
plasmidData$Name      # different syntax, same thing. This is the syntax I use most frequently.


# Watch this!
plasmidData$Name[plasmidData$Ori != "ColE1"]
# What happened here?
# plasmidData$Ori != "ColE1" is a logical expression, it gives a vector of TRUE/FALSE values:
plasmidData$Ori != "ColE1"

# ... insert this vector into the square brackets. R then returns all elements
# for which the corresponding vector element is TRUE.

# With this, we can "filter" for values
plasmidData$Size > 3000
plasmidData$Name[plasmidData$Size > 3000]

# Any operation that has TRUE or FALSE as a result can be used for filtering:
#   - the equality operators == and !=
#   - the comparison operators  >, <, >=, and <=
#   - %in%
#   - grepl()
#   - as.logical()

# plasmids that have only the Amp marker
# (note the comma: the logical vector selects rows, and all columns are kept)
plasmidData[plasmidData$Marker == "Amp", ]

# plasmids that don't have the Amp marker
plasmidData[!grepl("Amp", plasmidData$Marker), ]



# This principle is what we use when we want to "sort" an object
# by some value. The function order() returns the indices that put
# the values into sorted order. Remember this: not sort() but order().
order(plasmidData$Size)
plasmidData[order(plasmidData$Size), ]

# grep() matches substrings in strings and returns a vector of indices
grep("Tet", plasmidData$Marker)
plasmidData[grep("Tet", plasmidData$Marker), ]
plasmidData[grep("Tet", plasmidData$Marker), "Ori"]
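The list of filtering operations above mentions %in% and grepl(), which the examples do not demonstrate. Here is a short sketch of both, together with negation and combined conditions. pd is a stand-in that holds the same values as plasmidData in these four columns; the name is chosen here only so that your own plasmidData is left untouched.

```r
# pd: a stand-in for plasmidData (illustrative name)
pd <- data.frame(Name   = c("pUC19", "pBR322", "pACYC184", "pMAL-p5x"),
                 Size   = c(2686, 4361, 4245, 5752),
                 Marker = c("Amp", "Amp, Tet", "Cam", "Amp"),
                 Ori    = c("ColE1", "ColE1", "p15A", "pMB1"),
                 stringsAsFactors = FALSE)

# %in% tests membership in a set of values
pd$Name[pd$Ori %in% c("p15A", "pMB1")]       # "pACYC184" "pMAL-p5x"

# grepl() is the logical counterpart of grep(): one TRUE/FALSE per element
grepl("Amp", pd$Marker)                       # TRUE TRUE FALSE TRUE
pd$Name[grepl("Amp", pd$Marker)]              # "pUC19" "pBR322" "pMAL-p5x"

# ! negates a logical vector: plasmids without an Amp marker
pd[!grepl("Amp", pd$Marker), ]                # the pACYC184 row

# & and | combine conditions element by element
pd$Name[pd$Size > 3000 & pd$Ori == "ColE1"]   # "pBR322"
```

grep() returns indices, whereas grepl() returns a logical vector. For exclusion, grepl() with ! is the safer choice: negating grep() indices with a minus sign selects nothing at all when the pattern has no match.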

Elements that can be extracted from an object also can be replaced. Simply assign the new value to the element.

( x <- sample(1:10) )
x[4] <- 99
x
( x <- x[order(x)] )
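Replacement is not limited to a single position: any index that works for extraction also works on the left-hand side of an assignment. The sketch below illustrates this with values chosen arbitrarily for the example; pd is a hypothetical two-row data frame, and 2700 is not a real plasmid size.

```r
x <- c(3, 8, 1, 10, 5)

# replace several positions at once
x[c(1, 2)] <- c(30, 80)
x                            # 30 80  1 10  5

# replace all elements selected by a logical filter
x[x > 20] <- 0
x                            #  0  0  1 10  5

# replace one cell of a data frame, selected by a condition on another column
pd <- data.frame(Name = c("pUC19", "pBR322"),
                 Size = c(2686, 4361),
                 stringsAsFactors = FALSE)
pd$Size[pd$Name == "pUC19"] <- 2700    # 2700 is an arbitrary illustrative value
pd
```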

Try your own subsetting ideas. Play with this. I find that even seasoned investigators have problems with subsetting their data; if you become comfortable with the many ways of subsetting, you will be ahead of the game right away.
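One experiment to start with is the relationship between sort() and order(). The sketch below uses the four plasmid sizes in a shuffled order; s, o, and pd are illustrative names that do not appear in the course material.

```r
s <- c(4361, 2686, 5752, 4245)

sort(s)          # the sorted values: 2686 4245 4361 5752
o <- order(s)    # the positions that sort s: 2 4 1 3
s[o]             # same result as sort(s)

# Because order() returns positions, it can rearrange a different object,
# such as the rows of a data frame.
pd <- data.frame(Name = c("pBR322", "pUC19", "pMAL-p5x", "pACYC184"),
                 Size = s,
                 stringsAsFactors = FALSE)
pd[order(pd$Size), ]                      # rows from smallest to largest
pd[order(pd$Size, decreasing = TRUE), ]   # rows from largest to smallest
```

This is why order() is the function to remember: sort() can only return the sorted values themselves, while the index vector from order() can be applied to every column of a data frame at once.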


 

Subsetting practice

Task:
Practice, practice, practice. You need to develop an intuition about subsetting. This will help you tremendously to understand our code examples - but it will also help you develop ways to think about data.

  • The R-Exercise_BasicSetup project contains a file subsettingPractice.R
  • Open the file and work through it.


 


About ...
 
Author:

Boris Steipe <boris.steipe@utoronto.ca>

Created:

2017-08-05

Modified:

2020-09-18

Version:

1.0.2

Version history:

  • 1.0.2 Maintenance
  • 1.0.1 Maintenance
  • 1.0 Completed to first live version
  • 0.1 Material collected from previous tutorial

This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License.