Difference between revisions of "RPR-OBJECTS-Data frames"
Latest revision as of 01:06, 6 September 2021
R "data frames"
(R data frames)
Abstract:
Introduction to data frames: how to create and modify them, and how to retrieve data from them.
Objectives:
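Outcomes:
After working through this unit you ...
- ... know how to create and manipulate data frames;
- ... can access and change individual elements;
- ... can extract rows, columns, and append new data rows.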
Deliverables:
- Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
- Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.
- Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.
Prerequisites:
This unit builds on material covered in the following prerequisite units:
- RPR-Objects-Vectors (R scalars and vectors)
Evaluation
Evaluation: NA
This unit is not evaluated for course marks.
Contents
Task:
- Load the R-Exercise_BasicSetup project in RStudio if you don't already have it open.
- Type init() as instructed after the project has loaded.
- Continue below.
Data frames
Data frames are the most frequently used type of data object for bioinformatics in R; they emulate our mental model of data in a spreadsheet and can be used to implement entity-relationship data models. They are more flexible than vectors or matrices, because different columns can hold different types of data, but they are easier to work with than lists.
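To make the comparison with vectors and matrices concrete, here is a minimal sketch that builds a small data frame by hand. The object name demo and its values are made up for illustration and are not part of the course project; the sketch assumes R 4.0 or later, where strings are not converted to factors by default.

```r
# A minimal sketch: build a small data frame by hand (values are made up).
demo <- data.frame(Name     = c("pA", "pB", "pC"),
                   Size     = c(2686L, 4361L, 4245L),
                   Circular = c(TRUE, TRUE, FALSE))

# Each column is a vector with its own type ...
colTypes <- sapply(demo, class)
print(colTypes)

# ... and all columns have the same length: the number of rows.
dim(demo)

# A matrix, in contrast, has to coerce everything to one common type.
m <- cbind(c("pA", "pB"), c(2686L, 4361L))
class(m[ , 2])    # "character": the numbers were turned into strings
```

The matrix at the end shows what the data frame avoids: mixing names and sizes in a matrix silently turns the sizes into strings.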
Usually the result of reading external data from an input file is a data frame. The file below is included with the R-Exercise-BasicSetup project files - it is called plasmidData.tsv,[1] and you can click on it in the Files Pane to open and inspect it.
    Name      Size  Marker    Ori    Sites
    pUC19     2686  Amp       ColE1  EcoRI, SacI, SmaI, BamHI, XbaI, PstI, HindIII
    pBR322    4361  Amp, Tet  ColE1  EcoRI, ClaI, HindIII
    pACYC184  4245  Tet, Cam  p15A   ClaI, HindIII
This data set uses tabs as value separators and it has a header line. Similar files can be exported from Excel or other spreadsheet programs. Note that the file contains commas within fields. Read this as a data frame as follows:
    ( plasmidData <- read.table("plasmidData.tsv",
                                sep = "\t",
                                header = TRUE) )

    objectInfo(plasmidData)
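If you want to see why the separator matters without touching the project file, the following sketch may help. It reads two records from a string instead of a file, using the text argument of read.table(). The variable names tsvText and demo are hypothetical, the string is only a stand-in for the real file, and str() is used as a base R alternative to the course's objectInfo() function. R 4.0 or later is assumed.

```r
# Hypothetical stand-in for the file contents: two records, tab-separated,
# with commas inside the Marker and Sites fields.
tsvText <- paste("Name\tSize\tMarker\tSites",
                 "pBR322\t4361\tAmp, Tet\tEcoRI, ClaI, HindIII",
                 "pACYC184\t4245\tTet, Cam\tClaI, HindIII",
                 sep = "\n")

# With sep = "\t" only tabs delimit fields; the commas stay inside the values.
demo <- read.table(text = tsvText, sep = "\t", header = TRUE)

str(demo)        # base R summary of the structure
demo$Sites[1]    # the complete field, commas included
```

Had the same text been read with sep = ",", the commas inside the fields would have been treated as column boundaries, which is why tab-separated files are convenient for data like this.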
You can view the data frame contents by clicking on the spreadsheet icon behind its name in the Environment Pane.
Basic operations
Here are some basic operations with the data frame. Try them, and experiment. If you break the object by mistake, you can just recreate it by reading the source file again:
    rownames(plasmidData) <- plasmidData[ , 1]   # assigns the contents of column 1 as rownames
    nrow(plasmidData)
    ncol(plasmidData)
    objectInfo(plasmidData)

    x <- plasmidData[2, ]     # assign one row to a variable
    objectInfo(x)             # This is also a data frame! One row. It has to be, because
                              # it contains elements of type chr and of type int!

    plasmidData["pBR322", ]   # retrieve one row: different syntax, same thing

    ( s <- plasmidData["pBR322", "Size"] )   # one element
    plasmidData["pBR322", "Size"] <- "???"   # change one element
    plasmidData["pBR322", ]                  # Note that this is now a string, not a number
    objectInfo(plasmidData)                  # In fact, the assignment has changed the
                                             # type of the whole column. Remember:
                                             # in a data.frame, all elements of one column
                                             # have the same type.

    plasmidData <- plasmidData[-2, ]         # remove one row
    objectInfo(plasmidData)

    plasmidData <- rbind(plasmidData, x)     # add it back at the end
    objectInfo(plasmidData)

    # add a new row from scratch:
    plasmidData <- rbind(plasmidData, data.frame(Name = "pMAL-p5x",
                                                 Size = 5752,
                                                 Marker = "Amp",
                                                 Ori = "pMB1",
                                                 Sites = "SacI, AvaI, HindIII"))
    objectInfo(plasmidData)

    ( x <- plasmidData[ , 2] )   # retrieve one column by index
    plasmidData[ , "Size"]       # retrieve one column by name
    objectInfo(x)                # now a vector!

    # That may be surprising behaviour. When you retrieve a single column from a
    # data frame it is (silently) turned into a vector (unless you explicitly
    # tell R not to do that - e.g. plasmidData[ , "Size", drop = FALSE]). To make the
    # nature of this data as a vector more explicit, I usually use a different
    # and equivalent syntax: the "$" operator

    plasmidData$Size
    objectInfo(plasmidData$Size)

    # Note: the $ operator returns the column itself, i.e. a vector, never a
    # one-column data frame. And, the column name is _NOT_ placed in quotation
    # marks. This is the syntax we usually will use throughout the course.
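Two of the points above, the change of column type after assigning a string and the silent conversion of a single column to a vector, can be checked in isolation. The sketch below uses a small made-up data frame called demo (not part of the course project) so that plasmidData is left alone. Using NA for a missing value is offered here as a common alternative to a placeholder string, not as something this unit prescribes.

```r
# A small made-up data frame, so the real plasmidData is left alone.
demo <- data.frame(Name = c("pA", "pB", "pC"),
                   Size = c(2686L, 4361L, 4245L))

# 1. Assigning a string coerces the whole column to character ...
asText <- demo
asText[2, "Size"] <- "???"
class(asText$Size)                 # "character"

# ... whereas NA marks a missing value and keeps the column numeric.
withNA <- demo
withNA[2, "Size"] <- NA
class(withNA$Size)                 # "integer"
mean(withNA$Size, na.rm = TRUE)    # 3465.5

# 2. Three ways to get the Size column as a vector ...
v1 <- demo[ , "Size"]
v2 <- demo$Size
v3 <- demo[["Size"]]
identical(v1, v2) && identical(v2, v3)    # TRUE

# ... and one way to keep it as a one-column data frame.
d1 <- demo[ , "Size", drop = FALSE]
class(d1)                          # "data.frame"
dim(d1)                            # 3 rows, 1 column
```

Keeping the column numeric matters as soon as you want to compute with it: mean() works on withNA$Size once missing values are excluded, but not on the character column in asText.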
Task:
The rowname of the new row of plasmidData is now "1". It should be "pMAL-p5x". Fix this.
Notes
1. The two most important formats for generic text-based data files are "tab"-separated values (.tsv) and "comma"-separated values (.csv).
About ...
Author:
- Boris Steipe <boris.steipe@utoronto.ca>
Created:
- 2017-08-05
Modified:
- 2020-09-18
Version:
- 1.1
Version history:
- 1.1 Remove stringsAsFactors, no longer an issue
- 1.0.1 Maintenance
- 1.0 Completed to first live version
- 0.1 Material collected from previous tutorial
This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License.