Expected Preparations:
Keywords: R data frames
Objectives:
This unit will …
Outcomes:
After working through this unit you …
Deliverables:
Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don’t overlook these.
Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.
Evaluation: NA: This unit is not evaluated for course marks.
Introduction to data frames: how to create and modify them, and how to retrieve data.
Task…
Open the R-Exercise_BasicSetup project in RStudio if you don’t already have it open.
Type init() as instructed after the project has loaded.
Data frames are the most frequently used type of data object for bioinformatics in R; they emulate our mental model of data in a spreadsheet and can be used to implement entity-relationship data models. They are more flexible than vectors or matrices, because each column can hold a different type of data, but they are easier to work with than lists.
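As a minimal sketch of creating a data frame from scratch, the block below builds one from three column vectors of equal length with data.frame(). The object name cloningVectors is a hypothetical name chosen for this illustration; the values are taken from the plasmid table used in this unit.

```r
# Hypothetical example: build a small data frame from column vectors.
# Each argument becomes one column; all columns must have the same length.
cloningVectors <- data.frame(Name = c("pUC19", "pBR322", "pACYC184"),
                             Size = c(2686, 4361, 4245),
                             Ori  = c("ColE1", "ColE1", "p15A"),
                             stringsAsFactors = FALSE)

str(cloningVectors)           # one line per column: name, type, first values
nrow(cloningVectors)          # number of rows
cloningVectors$Size > 4000    # each column is a vector: vectorized tests work
```

Each column has a single type (here chr and num), while different columns may have different types. This is what sets a data frame apart from a matrix.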
Usually the result of reading external data from an input file is a data frame. The file below is included with the R-Exercise-BasicSetup project files; it is called plasmidData.tsv [1], and you can click on it in the Files Pane to open and inspect it.
Name      Size  Marker    Ori    Sites
pUC19     2686  Amp       ColE1  EcoRI, SacI, SmaI, BamHI, XbaI, PstI, HindIII
pBR322    4361  Amp, Tet  ColE1  EcoRI, ClaI, HindIII
pACYC184  4245  Tet, Cam  p15A   ClaI, HindIII
This data set uses tabs as value separators and it has a header line. Similar files can be exported from Excel or other spreadsheet programs. Note that the file contains commas within fields, so the separator has to be specified explicitly as a tab. Read this as a data frame as follows:
( plasmidData <- read.table("plasmidData.tsv",
                            sep = "\t",
                            header = TRUE ) )
objectInfo(plasmidData)
You can view the data frame contents by clicking on the spreadsheet icon behind its name in the Environment Pane.
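To see the effect of the separator without depending on the file, the same data can be passed to read.table() through its text argument. This is only a sketch: the lines below are assumed to reproduce the contents of plasmidData.tsv, the object name plasmidDemo is hypothetical, and str() stands in for the course's objectInfo() helper.

```r
# Assumption: these lines reproduce plasmidData.tsv;
# "\t" marks the tab characters that separate the five fields.
tsvLines <- c("Name\tSize\tMarker\tOri\tSites",
              "pUC19\t2686\tAmp\tColE1\tEcoRI, SacI, SmaI, BamHI, XbaI, PstI, HindIII",
              "pBR322\t4361\tAmp, Tet\tColE1\tEcoRI, ClaI, HindIII",
              "pACYC184\t4245\tTet, Cam\tp15A\tClaI, HindIII")

plasmidDemo <- read.table(text = tsvLines,   # read from the strings, not a file
                          sep = "\t",
                          header = TRUE,
                          stringsAsFactors = FALSE)

str(plasmidDemo)          # structure: column names, types, first values
dim(plasmidDemo)          # 3 rows, 5 columns
plasmidDemo$Marker[2]     # "Amp, Tet": the comma stays inside the field
class(plasmidDemo$Size)   # "integer": an all-numeric column is converted
```

Because only the tab is treated as a separator, fields such as "Amp, Tet" are read as single values.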
Here are some basic operations with the data frame. Try them, and experiment. If you break the object by mistake, you can just recreate it by reading the source file again:
rownames(plasmidData) <- plasmidData[ , 1] # assigns the contents of column 1 as rownames
nrow(plasmidData)
ncol(plasmidData)
objectInfo(plasmidData)
x <- plasmidData[2, ] # assign one row to a variable
objectInfo(x) # This is also a data frame! One row. It has to be, because
# it contains elements of type chr and of type int!
plasmidData["pBR322", ] # retrieve one row: different syntax, same thing
( s <- plasmidData["pBR322", "Size"] ) # one element
plasmidData["pBR322", "Size"] <- "???" # change one element
plasmidData["pBR322", ] # Note that this is now a string, not a number
objectInfo(plasmidData) # In fact, the assignment has changed the
# type of the whole column. Remember:
# in a data.frame, all elements of one column
# have the same type.
plasmidData <- plasmidData[-2, ] # remove one row
objectInfo(plasmidData)
plasmidData <- rbind(plasmidData, x) # add it back at the end
objectInfo(plasmidData)
# add a new row from scratch:
plasmidData <- rbind(plasmidData, data.frame(Name = "pMAL-p5x",
                                             Size = 5752,
                                             Marker = "Amp",
                                             Ori = "pMB1",
                                             Sites = "SacI, AvaI, HindIII"))
objectInfo(plasmidData)
( x <- plasmidData[ , 2] ) # retrieve one column by index
plasmidData[ , "Size"] # retrieve one column by name
objectInfo(x) # now a vector!
# That may be surprising behaviour. When you retrieve a single column from a
# data frame it is (silently) turned into a vector (unless you explicitly
# tell R not to do that - e.g. plasmidData[ , "Size", drop = FALSE]). To make the
# nature of this data as a vector more explicit, I usually use a different
# and equivalent syntax: the "$" operator
plasmidData$Size
objectInfo(plasmidData$Size)
# Note: the $ operator returns the column itself (here a vector), never a
# one-column data frame. And, the column name is _NOT_ placed in quotation
# marks. This is the syntax we usually will use throughout the course.
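The two behaviours that tend to surprise in the walkthrough above, dropping a single column to a vector and converting the type of a whole column, can be checked side by side on a small object. This is a sketch only: the data frame df and the variables a, b, d and e are hypothetical names used for illustration.

```r
# Hypothetical two-column data frame, used only for this illustration.
df <- data.frame(Name = c("pUC19", "pBR322"),
                 Size = c(2686L, 4361L),
                 stringsAsFactors = FALSE)

a <- df[ , "Size"]                 # single column: silently dropped to a vector
b <- df[ , "Size", drop = FALSE]   # drop = FALSE keeps a one-column data frame
d <- df$Size                       # "$": the column itself, a vector
e <- df[["Size"]]                  # "[[": like "$", but the name is quoted

is.data.frame(a)    # FALSE
is.data.frame(b)    # TRUE
identical(a, d)     # TRUE
identical(d, e)     # TRUE

# Assigning a string into an integer column converts the whole column ...
df[2, "Size"] <- "???"
class(df$Size)      # "character"

# ... and it becomes numeric again only through an explicit conversion.
df[2, "Size"] <- "4361"
df$Size <- as.integer(df$Size)
class(df$Size)      # "integer"
```

The same applies to the Size column of plasmidData: once "???" has been assigned, the column stays of type chr until it is converted explicitly.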
Task…
The rowname of the new row of plasmidData is now “1”. It should be “pMAL-p5x”. Fix this.
If in doubt, ask! If anything about this content is not clear to you, do not proceed but ask for clarification. If you have ideas about how to make this material better, let’s hear them. We are aiming to compile a list of FAQs for all learning units, and your contributions will count towards your participation marks.
Improve this page! If you have questions or comments, please post them on the Quercus Discussion board with a subject line that includes the name of the unit.
[END]
[1] The two most important formats for generic text-based data files are “tab”-separated values (.tsv) and “comma”-separated values (.csv).