Difference between revisions of "RPR-OBJECTS-Data frames"

Revision as of 22:42, 18 September 2020

R "data frames"

(R data frames)

Abstract:

Introduction to data frames: how to create and modify them, and how to retrieve data.

Objectives:
This unit will ...

... introduce R data frames;
... cover a number of basic operations.

Outcomes:
After working through this unit you ...

... know how to create and manipulate data frames;
... can access and change individual elements;
... can extract rows and columns, and append new data rows.

Deliverables:

Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.
Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.

Prerequisites:
This unit builds on material covered in the following prerequisite units:

RPR-Objects-Vectors (R scalars and vectors)

rownames(plasmidData) <- plasmidData[ , 1]  # assigns the contents of column 1 as rownames
nrow(plasmidData)
ncol(plasmidData)
objectInfo(plasmidData)


x <- plasmidData[2, ]  # assign one row to a variable
objectInfo(x)  # This is also a data frame! One row. It has to be, because
               # it contains elements of type chr and of type int!

plasmidData["pBR322", ]  # retrieve one row: different syntax, same thing


plasmidData <- plasmidData[-2, ]  # remove one row
objectInfo(plasmidData)

plasmidData <- rbind(plasmidData, x)  # add it back at the end
objectInfo(plasmidData)

# add a new row from scratch:
plasmidData <- rbind(plasmidData, data.frame(Name = "pMAL-p5x",
                                                     Size = 5752,
                                                     Marker = "Amp",
                                                     Ori = "pMB1",
                                                     Sites = "SacI, AvaI, ..., HindIII",
                                                     stringsAsFactors = FALSE))
objectInfo(plasmidData)

( x <- plasmidData[ , 2] )    # retrieve one column by index
  plasmidData[ , "Size"]      # retrieve one column by name
objectInfo(x)                 # now a vector!

# That may be surprising behaviour. When you retrieve a single column from a
# data frame it is (silently) turned into a vector (unless you explicitly
# tell R not to do that - e.g. plasmidData[ , "Size", drop = FALSE]). To make the
# nature of this data as a vector more explicit, I usually use a different
# and equivalent syntax:
( x <- plasmidData$Size )
objectInfo(x)

# Note: the $ operator always returns a vector. And, the column name is _NOT_
# placed in quotation marks.
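
The column-retrieval variants discussed above can be compared side by side. The following is a minimal sketch on a toy data frame; the object name `toy` and its values are invented for illustration and are not part of the course data. It only uses base R, so it does not depend on `objectInfo()`.

```r
# Toy data frame; names and values are invented for illustration only.
toy <- data.frame(Name = c("pA", "pB", "pC"),
                  Size = c(1000L, 2000L, 3000L))

v1 <- toy[ , "Size"]                 # single column: simplified to a vector
v2 <- toy$Size                       # same column via the $ operator
v3 <- toy[["Size"]]                  # same column via [[ ]]; name is quoted
d1 <- toy[ , "Size", drop = FALSE]   # drop = FALSE keeps a data frame

is.data.frame(v1)    # FALSE: the column was simplified
is.data.frame(d1)    # TRUE: still a data frame, with one column
identical(v1, v2)    # TRUE
identical(v2, v3)    # TRUE
dim(d1)              # 3 rows, 1 column
```

Keeping `drop = FALSE` can be useful when later code expects a data frame regardless of how many columns were selected.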

Task:
The rowname of the new row of plasmidData is now "1". It should be "pMAL-p5x". Fix this.

Self-evaluation

Notes

The two most important formats for generic text-based datafiles are "tab"-separated values (.tsv) and "comma"-separated values (.csv).
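
As a small illustration of why the separator matters, the sketch below reads tab-separated text in which some fields contain commas. The header and data line are taken from the `plasmidData.tsv` excerpt shown in this unit; the variable names `tsvText` and `dat` are assumptions made for this example, and the text is read from a string rather than from the file.

```r
# One header line and one data line, tab-separated; commas occur inside fields.
tsvText <- paste("Name\tSize\tMarker\tOri\tSites",
                 "pACYC184\t4245\tTet, Cam\tp15A\tClaI, HindIII",
                 sep = "\n")

# Use the tab as the value separator, so commas stay part of a field.
dat <- read.table(text = tsvText,
                  sep = "\t",
                  header = TRUE)

ncol(dat)      # 5: one column per tab-separated field
dat$Marker     # the comma stays inside the single value "Tet, Cam"
```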

About ...

Author:

Boris Steipe <boris.steipe@utoronto.ca>

Created:

2017-08-05

Modified:

2020-09-18

Version:

1.1

Version history:

1.1 Remove stringsAsFactors, no longer an issue
1.0.1 Maintenance
1.0 Completed to first live version
0.1 Material collected from previous tutorial

This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.


@@ Line 1: / Line 1: @@
 <div id="ABC">
-<div style="padding:5px; border:1px solid #000000; background-color:#f4d7b7; font-size:300%; font-weight:400; color: #000000; width:100%;">
+<div style="padding:5px; border:1px solid #000000; background-color:#b3dbce; font-size:300%; font-weight:400; color: #000000; width:100%;">
 R "data frames""
-<div style="padding:5px; margin-top:20px; margin-bottom:10px; background-color:#f4d7b7; font-size:30%; font-weight:200; color: #000000; ">
+<div style="padding:5px; margin-top:20px; margin-bottom:10px; background-color:#b3dbce; font-size:30%; font-weight:200; color: #000000; ">
 (R data frames)
 </div>
@@ Line 10: / Line 10: @@
-<div style="padding:5px; border:1px solid #000000; background-color:#f4d7b733; font-size:85%;">
+<div style="padding:5px; border:1px solid #000000; background-color:#b3dbce33; font-size:85%;">
 <div style="font-size:118%;">
 <b>Abstract:</b><br />
@@ Line 31: / Line 31: @@
 After working through this unit you ...
 * ... know how to create and manipulate data frames;
+* ... can access and change individual elements;
 * ... can extract rows, columns, and append new data rows;
 </td>
@@ Line 39: / Line 40: @@
 <b>Deliverables:</b><br />
 <section begin=deliverables />
-<!-- included from "./data/ABC-unit_components.txt", section: "deliverables-time_management" -->
+<ul>
 <li><b>Time management</b>: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.</li>
-<!-- included from "./data/ABC-unit_components.txt", section: "deliverables-journal" -->
 <li><b>Journal</b>: Document your progress in your [[FND-Journal|Course Journal]]. Some tasks may ask you to include specific items in your journal. Don't overlook these.</li>
-<!-- included from "./data/ABC-unit_components.txt", section: "deliverables-insights" -->
 <li><b>Insights</b>: If you find something particularly noteworthy about this unit, make a note in your [[ABC-Insights|'''insights!''' page]].</li>
+</ul>
 <section end=deliverables />
 <!-- ============================  -->
@@ Line 50: / Line 50: @@
 <section begin=prerequisites />
 <b>Prerequisites:</b><br />
-<!-- included from "./data/ABC-unit_components.txt", section: "notes-prerequisites" -->
 This unit builds on material covered in the following prerequisite units:<br />
 *[[RPR-Objects-Vectors|RPR-Objects-Vectors (R scalars and vectors)]]
@@ Line 60: / Line 59: @@
-{{REVISE}}
 {{Smallvspace}}
@@ Line 70: / Line 68: @@
+=== Evaluation ===
+<b>Evaluation: NA</b><br />
+<div style="margin-left: 2rem;">This unit is not evaluated for course marks.</div>
 == Contents ==
-<!-- included from "./components/RPR-Objects-Data_frames.components.txt", section: "contents" -->
 {{task| 1=
@@ Line 82: / Line 82: @@
 ===Data frames===
-Data frames are probably the most important type of data object for bioinformatics in R; they emulate our mental model of data in a spreadsheet and can be used to implement entity-relationship datamodels.
+Data frames are the most frequently used type of data object for bioinformatics in R; they emulate our mental model of data in a spreadsheet and can be used to implement entity-relationship datamodels. They are more flexible than vectors or matrices, but they are easier to work with than lists.
-Usually the result of reading external data from an input file is a data frame. The file below is included with the <code>R-Exercise-BasicSetup</code> project files - it is called <code>plasmidData.tsv</code>, and you can click on it in the Files Pane to open and inspect it.
+Usually the result of reading external data from an input file is a data frame. The file below is included with the <code>R-Exercise-BasicSetup</code> project files - it is called <code>plasmidData.tsv</code>,<ref>The two most important formats for generic text-based datafiles are "'''tab'''"-separated values (<code>.tsv</code>) and "'''comma'''"-separated values (<code>.csv</code>).</ref> and you can click on it in the Files Pane to open and inspect it.
   Name	Size	Marker	Ori	Sites
@@ Line 91: / Line 91: @@
   pACYC184	4245	Tet, Cam	p15A	ClaI, HindIII
-This data set uses tabs as column separators and it has a header line. Similar files can be exported from Excel or other spreadsheet programs. Read this as a data frame as follows:
+This data set uses tabs as value separators and it has a header line. Similar files can be exported from Excel or other spreadsheet programs. Note that the file contains commas '''within''' fields. Read this as a data frame as follows:
-<source lang="rsplus">
+<pre>
 ( plasmidData <- read.table("plasmidData.tsv",
-                             sep="\t",
+                             sep = "\t",
-                             header=TRUE,
+                             header = TRUE )
-                            stringsAsFactors = FALSE) )
 objectInfo(plasmidData)
-</source>
+</pre>
-Note the argument {{c|stringsAsFactors {{=}} FALSE}}. If this is {{c|TRUE}} instead, '''R''' will convert all strings in the input to factors and this may lead to problems. Make it a habit to turn this behaviour off, you can always turn a column of strings into factors when you actually mean to have factors.
 You can view the data frame contents by clicking on the spreadsheet icon behind its name in the Environment Pane.
@@ Line 109: / Line 106: @@
 ===Basic operations===
-Here are some basic operations with the data frame. Try them and experiment. If you break it by mistake, you can just recreate it by reading the source file again:
+Here are some basic operations with the data frame. Try them, and experiment. If you break the object by mistake, you can just recreate it by reading the source file again:
-<source lang="rsplus">
+<pre>
-rownames(plasmidData) <- plasmidData[ , 1]  # use column 1 as rownames
+rownames(plasmidData) <- plasmidData[ , 1]  # assigns the contents of column 1 as rownames
 nrow(plasmidData)
 ncol(plasmidData)
@@ Line 123: / Line 120: @@
 plasmidData["pBR322", ]  # retrieve one row: different syntax, same thing
-plasmidData[ , 2]       # retrieve one column
-plasmidData[ , "Size"]  # retrieve one column: same principle
@@ Line 144: / Line 137: @@
 objectInfo(plasmidData)
-</source>
+( x <- plasmidData[ , 2] )    # retrieve one column by index
+  plasmidData[ , "Size"]      # retrieve one column by name
+objectInfo(x)                 # now a vector!
+# That may be surprising behaviour. When you retrieve a single column from a
+# dataframe it is (silently) turned into a vector (unless you explicitly
+# tell R not to do that - e.g. plasmidData[ , "Size", drop = FALSE]). To make the
+# nature of this data as a vector more explicit, I usually use a different
+# and equivalent syntax:
+( x <- plasmidData$Size )
+objectInfo(x)
+# Note: the $ operator always returns a vector. And, the column name is _NOT_
+# placed in quotation marks.
+</pre>
 {{task|1=
@@ Line 154: / Line 162: @@
 {{Vspace}}
 == Self-evaluation ==
@@ Line 174: / Line 181: @@
 -->
 == Notes ==
-<!-- included from "./components/RPR-Objects-Data_frames.components.txt", section: "notes" -->
-<!-- included from "./data/ABC-unit_components.txt", section: "notes" -->
 <references />
-== Further reading, links and resources ==
-<!-- {{#pmid: 19957275}} -->
-<!-- {{WWW|WWW_GMOD}} -->
-<!-- <div class="reference-box">[http://www.ncbi.nlm.nih.gov]</div> -->
 {{Vspace}}
-<!-- included from "./data/ABC-unit_components.txt", section: "ABC-unit_ask" -->
-----
-{{Vspace}}
-<b>If in doubt, ask!</b> If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.
-----
-{{Vspace}}
 <div class="about">
@@ Line 205: / Line 194: @@
 :2017-08-05
 <b>Modified:</b><br />
-:2018-05-05
+:2020-09-18
 <b>Version:</b><br />
-:1.0.1
+:1.1
 <b>Version history:</b><br />
+*1.1 Remove stringsAsFactors, no longer an issue
 *1.0.1 Maintenance
 *1.0 Completed to first live version
 *0.1 Material collected from previous tutorial
 </div>
-[[Category:ABC-units]]
-<!-- included from "./data/ABC-unit_components.txt", section: "ABC-unit_footer" -->
 {{CC-BY}}
+[[Category:ABC-units]]
 </div>
 <!-- [END] -->
