RPR-OBJECTS-Data frames
Revision as of 04:04, 8 May 2018
R "data frames"
(R data frames)
Abstract:
Introduction to data frames: how to create and modify them, and how to retrieve data from them.
Objectives:
Outcomes:
Deliverables:
- Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
- Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.
- Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.
Prerequisites:
This unit builds on material covered in the following prerequisite units:
Task:
- Load the R-Exercise_BasicSetup project in RStudio if you don't already have it open.
- Type init() as instructed after the project has loaded.
- Continue below.
Data frames
Data frames are probably the most important type of data object for bioinformatics in R; they emulate our mental model of data in a spreadsheet and can be used to implement entity-relationship data models.
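To make the spreadsheet analogy concrete, here is a minimal sketch that builds a small data frame by hand. The variable name plasmids is chosen only for illustration, and the values are copied from three columns of the plasmidData.tsv file shown further down; nothing here replaces the unit's own code.

```r
# Illustrative only: a hand-built data frame with values taken from
# plasmidData.tsv. The variable name "plasmids" is an assumption.
plasmids <- data.frame(Name = c("pUC19", "pBR322", "pACYC184"),
                       Size = c(2686L, 4361L, 4245L),
                       Ori  = c("ColE1", "ColE1", "p15A"),
                       stringsAsFactors = FALSE)

str(plasmids)               # one row per plasmid, one column per variable
colTypes <- sapply(plasmids, class)
colTypes                    # each column keeps its own type
is.list(plasmids)           # a data frame is a list of equal-length columns
```

Like a spreadsheet, every row describes one item and every column holds one kind of value, which is why columns of different types can sit side by side.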
Usually the result of reading external data from an input file is a data frame. The file below is included with the R-Exercise-BasicSetup project files - it is called plasmidData.tsv, and you can click on it in the Files Pane to open and inspect it.
Name      Size  Marker    Ori    Sites
pUC19     2686  Amp       ColE1  EcoRI, SacI, SmaI, BamHI, XbaI, PstI, HindIII
pBR322    4361  Amp, Tet  ColE1  EcoRI, ClaI, HindIII
pACYC184  4245  Tet, Cam  p15A   ClaI, HindIII
This data set uses tabs as column separators and it has a header line. Similar files can be exported from Excel or other spreadsheet programs. Read this as a data frame as follows:
( plasmidData <- read.table("plasmidData.tsv",
                            sep="\t",
                            header=TRUE,
                            stringsAsFactors = FALSE) )
objectInfo(plasmidData)
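Note the argument stringsAsFactors = FALSE. If this is TRUE instead, R will convert all strings in the input to factors. Factors are stored as integer codes with labels, so operations that expect character strings may then fail or give unexpected results. Make it a habit to turn this behaviour off; you can always turn a column of strings into factors when you actually mean to have factors. (Since R 4.0.0, FALSE is the default, but stating the argument explicitly keeps the code working the same way in older versions.)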
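The effect of stringsAsFactors can be seen in a small sketch that reads the same tab-separated text twice. Two things here are assumptions made for illustration: the data is passed inline through the text argument of read.table() instead of coming from a file, and the variable names (tsv, asFactors, asStrings, markerFactor) are invented. The values are taken from plasmidData.tsv.

```r
# Illustrative only: inline tab-separated text instead of a file.
tsv <- "Name\tSize\tMarker\npUC19\t2686\tAmp\npACYC184\t4245\tTet, Cam"

asFactors <- read.table(text = tsv, sep = "\t", header = TRUE,
                        stringsAsFactors = TRUE)
asStrings <- read.table(text = tsv, sep = "\t", header = TRUE,
                        stringsAsFactors = FALSE)

class(asFactors$Marker)       # factor
class(asStrings$Marker)       # character

# A factor stores integer codes that index into its levels:
as.integer(asFactors$Marker)
levels(asFactors$Marker)

# Convert explicitly in the cases where a factor is really intended:
markerFactor <- factor(asStrings$Marker)
```

Reading strings as strings and converting selected columns afterwards keeps the decision about factors visible in the code.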
You can view the data frame contents by clicking on the spreadsheet icon next to its name in the Environment Pane.
Basic operations
Here are some basic operations with the data frame. Try them and experiment. If you break it by mistake, you can just recreate it by reading the source file again:
rownames(plasmidData) <- plasmidData[ , 1] # use column 1 as rownames
nrow(plasmidData)
ncol(plasmidData)
objectInfo(plasmidData)
x <- plasmidData[2, ] # assign one row to a variable
objectInfo(x) # This is also a data frame! One row. It has to be, because
# it contains elements of type chr and of type int!
plasmidData["pBR322", ] # retrieve one row: different syntax, same thing
plasmidData[ , 2] # retrieve one column
plasmidData[ , "Size"] # retrieve one column: same principle
plasmidData <- plasmidData[-2, ] # remove one row
objectInfo(plasmidData)
plasmidData <- rbind(plasmidData, x) # add it back at the end
objectInfo(plasmidData)
# add a new row from scratch:
plasmidData <- rbind(plasmidData, data.frame(Name = "pMAL-p5x",
                                             Size = 5752,
                                             Marker = "Amp",
                                             Ori = "pMB1",
                                             Sites = "SacI, AvaI, ..., HindIII",
                                             stringsAsFactors = FALSE))
objectInfo(plasmidData)
Task:
The new row of plasmidData did not get a meaningful rowname: R assigned an automatic number instead. It should be "pMAL-p5x". Fix this.
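The row and column retrievals in the basic operations return different kinds of objects, which is easy to overlook. The following sketch, using a small stand-in data frame, makes this explicit. The names pd, oneRow, oneCol, sameCol, keepDF and large are invented for this illustration, and the size threshold of 4000 is arbitrary.

```r
# Illustrative stand-in for plasmidData, built inline so it runs on its own.
pd <- data.frame(Name = c("pUC19", "pBR322", "pACYC184"),
                 Size = c(2686L, 4361L, 4245L),
                 stringsAsFactors = FALSE)
rownames(pd) <- pd$Name

oneRow  <- pd["pBR322", ]                # one row: still a data frame
oneCol  <- pd[ , "Size"]                 # one column: a plain vector
sameCol <- pd$Size                       # shorthand for the same column
keepDF  <- pd[ , "Size", drop = FALSE]   # one column, kept as a data frame

# A logical vector in the row position selects the rows where it is TRUE:
large <- pd[pd$Size > 4000, "Name"]
large
```

A row can mix types and therefore stays a data frame, while a single column has one type and is simplified to a vector unless drop = FALSE is given.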
Self-evaluation
Notes
Further reading, links and resources
If in doubt, ask! If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.
About ...
Author:
- Boris Steipe <boris.steipe@utoronto.ca>
Created:
- 2017-08-05
Modified:
- 2018-05-05
Version:
- 1.0.1
Version history:
- 1.0.1 Maintenance
- 1.0 Completed to first live version
- 0.1 Material collected from previous tutorial
This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.