Difference between revisions of "RPR-Data-Import"

Line 28: Line 28:
 
== Abstract ==
 
== Abstract ==
 
<section begin=abstract />
 
<section begin=abstract />
<!-- included from "../components/RPR-Data-Import.components.wtxt", section: "abstract" -->
+
<!-- included from "./components/RPR-Data-Import.components.txt", section: "abstract" -->
 
Practical for data import from text files, spreadsheets and Web pages.
 
Practical for data import from text files, spreadsheets and Web pages.
 
<section end=abstract />
 
<section end=abstract />
Line 37: Line 37:
 
== This unit ... ==
 
== This unit ... ==
 
=== Prerequisites ===
 
=== Prerequisites ===
<!-- included from "../components/RPR-Data-Import.components.wtxt", section: "prerequisites" -->
+
<!-- included from "./components/RPR-Data-Import.components.txt", section: "prerequisites" -->
 
*[[RPR-Introduction|RPR-Introduction (Introduction to R)]]
 
*[[RPR-Introduction|RPR-Introduction (Introduction to R)]]
  
Line 44: Line 44:
  
 
=== Objectives ===
 
=== Objectives ===
<!-- included from "../components/RPR-Data-Import.components.wtxt", section: "objectives" -->
+
<!-- included from "./components/RPR-Data-Import.components.txt", section: "objectives" -->
 
...
 
...
  
Line 51: Line 51:
  
 
=== Outcomes ===
 
=== Outcomes ===
<!-- included from "../components/RPR-Data-Import.components.wtxt", section: "outcomes" -->
+
<!-- included from "./components/RPR-Data-Import.components.txt", section: "outcomes" -->
 
...
 
...
  
Line 58: Line 58:
  
 
=== Deliverables ===
 
=== Deliverables ===
<!-- included from "../components/RPR-Data-Import.components.wtxt", section: "deliverables" -->
+
<!-- included from "./components/RPR-Data-Import.components.txt", section: "deliverables" -->
<!-- included from "ABC-unit_components.wtxt", section: "deliverables-time_management" -->
+
<!-- included from "./data/ABC-unit_components.txt", section: "deliverables-time_management" -->
 
*<b>Time management</b>: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
 
*<b>Time management</b>: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
<!-- included from "ABC-unit_components.wtxt", section: "deliverables-journal" -->
+
<!-- included from "./data/ABC-unit_components.txt", section: "deliverables-journal" -->
 
*<b>Journal</b>: Document your progress in your [[FND-Journal|Course Journal]]. Some tasks may ask you to include specific items in your journal. Don't overlook these.
 
*<b>Journal</b>: Document your progress in your [[FND-Journal|Course Journal]]. Some tasks may ask you to include specific items in your journal. Don't overlook these.
<!-- included from "ABC-unit_components.wtxt", section: "deliverables-insights" -->
+
<!-- included from "./data/ABC-unit_components.txt", section: "deliverables-insights" -->
 
*<b>Insights</b>: If you find something particularly noteworthy about this unit, make a note in your [[ABC-Insights|'''insights!''' page]].
 
*<b>Insights</b>: If you find something particularly noteworthy about this unit, make a note in your [[ABC-Insights|'''insights!''' page]].
 
{{Vspace}}
 
 
 
=== Evaluation ===
 
<!-- included from "../components/RPR-Data-Import.components.wtxt", section: "evaluation" -->
 
<!-- included from "ABC-unit_components.wtxt", section: "eval-none" -->
 
<b>Evaluation: NA</b><br />
 
:This unit is not evaluated for course marks.
 
  
 
{{Vspace}}
 
{{Vspace}}
Line 81: Line 72:
 
<div id="BIO">
 
<div id="BIO">
 
== Contents ==
 
== Contents ==
<!-- included from "../components/RPR-Data-Import.components.wtxt", section: "contents" -->
+
<!-- included from "./components/RPR-Data-Import.components.txt", section: "contents" -->
 
* Line endings
 
* Line endings
 
* Unstructured text files, headers, skip, omit, and rownames
 
* Unstructured text files, headers, skip, omit, and rownames
Line 147: Line 138:
  
 
== Notes ==
 
== Notes ==
<!-- included from "../components/RPR-Data-Import.components.wtxt", section: "notes" -->
+
<!-- included from "./components/RPR-Data-Import.components.txt", section: "notes" -->
<!-- included from "ABC-unit_components.wtxt", section: "notes" -->
+
<!-- included from "./data/ABC-unit_components.txt", section: "notes" -->
 
<references />
 
<references />
  
Line 157: Line 148:
 
<div id="ABC-unit-framework">
 
<div id="ABC-unit-framework">
 
== Self-evaluation ==
 
== Self-evaluation ==
<!-- included from "../components/RPR-Data-Import.components.wtxt", section: "self-evaluation" -->
+
<!-- included from "./components/RPR-Data-Import.components.txt", section: "self-evaluation" -->
 
<!--
 
<!--
 
=== Question 1===
 
=== Question 1===
Line 182: Line 173:
  
  
<!-- included from "ABC-unit_components.wtxt", section: "ABC-unit_ask" -->
+
<!-- included from "./data/ABC-unit_components.txt", section: "ABC-unit_ask" -->
  
 
----
 
----
Line 210: Line 201:
 
</div>
 
</div>
 
[[Category:ABC-units]]
 
[[Category:ABC-units]]
<!-- included from "ABC-unit_components.wtxt", section: "ABC-unit_footer" -->
+
<!-- included from "./data/ABC-unit_components.txt", section: "ABC-unit_footer" -->
  
 
{{CC-BY}}
 
{{CC-BY}}

Revision as of 01:26, 6 January 2018

Importing data in R


 

Keywords:  Data import from unstructured text, structured text, spreadsheets, online repositories, and Web pages


 



 


Caution!

This unit is under development. There is some content here, but it is incomplete and/or may change significantly: links may lead nowhere, the content is likely to be rearranged, and objectives, deliverables, etc. may be incomplete or missing. Do not work with this material until it is updated to "live" status.


 


Abstract

Practical for data import from text files, spreadsheets and Web pages.


 


This unit ...

Prerequisites


 


Objectives

...


 


Outcomes

...


 


Deliverables

  • Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
  • Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.
  • Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.


 


Contents

  • Line endings
  • Unstructured text files, headers, skip, omit, and rownames
  • Caution with stringsAsFactors
  • Caution with coerced data
  • Text files with keywords
  • Text files: changing state
  • Text files: csv and tsv (and Excel sheets)
  • Text files: slurping, chunking and streaming
  • curl
  • httr GET and POST
  • XML - libraries and XPath
  • Binary Data
  • R objects: save() and load()


Regex for screenscraping example:

Screenscraping

Write a PHP program that screenscrapes resolution data for a protein structure file from the PDB.

Here is a link to a PDB record to illustrate the URL format.

Fetch the contents of the URL into a string. Use a regex that captures the data you want to retrieve as part of some unique pattern in the HTML source. Print the result.

;The regex:
:/\s*(\d+\.\d+)/
*   identifying tag for the information we are looking for, ...
*\s*   ... probably followed by whitespace, ...
*(\d+\.\d+)   ... the "payload" of the match: one or more digits, a literal dot, and one or more digits.
;The code:
<source lang="PHP">
<?php
// Base URL of the PDB structure summary page, and the structure ID to query.
$URLpath = "http://www.pdb.org/pdb/explore/explore.do?structureId=";
$PDBid = "2imm";

// Fetch the page's HTML source into a string.
$source = file_get_contents($URLpath . $PDBid);

// As written, the pattern has no identifying tag in front of \s*,
// so it captures the first decimal number found in the page.
if (preg_match('/\s*(\d+\.\d+)/', $source, $resolution)) {
    print($resolution[1]);   // first capture group: the "payload"
}
?>
</source>

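Since this unit is about importing data in R, the same capture-group idea can be sketched in base R. This is only an illustration under stated assumptions: the HTML fragment and the "Resolution:" identifying tag below are hypothetical, and the actual PDB page source will differ.

```r
# Hypothetical HTML fragment standing in for the fetched page source.
html <- '<td><b>Resolution:</b></td><td> 1.85 &Aring;</td>'

# Assumed identifying tag ("Resolution:" plus any markup right after it),
# then optional whitespace, then the captured payload (digits, dot, digits).
pattern <- "Resolution:(?:<[^>]+>)*\\s*(\\d+\\.\\d+)"

# regexec() records the positions of the whole match and of each capture group;
# regmatches() extracts the corresponding substrings.
m   <- regexec(pattern, html, perl = TRUE)
hit <- regmatches(html, m)[[1]]   # hit[1]: whole match, hit[2]: capture group

if (length(hit) == 2) {
  resolution <- as.numeric(hit[2])
  print(resolution)
} else {
  message("Pattern not found.")
}
```

For a live page, `html` could presumably be built from the page source, for example with `paste(readLines(url), collapse = "\n")`, and the identifying tag would have to be adapted to whatever uniquely precedes the value in that source.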




 


Further reading, links and resources

 


Notes


 


Self-evaluation

 



 




 

If in doubt, ask! If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.



 

About ...
 
Author:

Boris Steipe <boris.steipe@utoronto.ca>

Created:

2017-09-17

Modified:

2017-09-18

Version:

0.2

Version history:

  • 0.2 Contents outline
  • 0.1 First stub

This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.