Difference between revisions of "RPR-Data-Import"

Latest revision as of 01:41, 23 September 2020

Importing data in R

(Data import from unstructured text, structured text, spreadsheets, online repositories, and Web pages)

Abstract:

Practical for data import from text files, spreadsheets and Web pages.

Objectives:
...

Outcomes:
...

Deliverables:

Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.

Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.

Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.

Prerequisites:

RPR-Introduction (Introduction to R)

This page is not currently being maintained since it is not part of active learning sections.

Line endings
Unstructured text files, headers, skip, omit, and rownames
Caution with stringsAsFactors
Caution with coerced data
Text files with keywords
Text-files: changing state
Text-files: csv and tsv (and Excel sheets)
Text-files: slurping, chunking and streaming
curl
httr GET and POST
XML - libraries and XPath
Binary Data
text objects: readLines() and writeLines()
R objects: save() and load(); saveRDS() and readRDS().

Regex for screenscraping example:

Screenscraping

Write a PHP program that screenscrapes resolution data for a protein structure file from the PDB.

 Here is a link to a PDB record to illustrate the URL format.

   Fetch the contents of the URL into a string. Use a regex that captures the data you want to retrieve as part of some unique pattern in the HTML source. Print.

   ;The regex:

   :/<td id="se_xrayResolution">\s*(\d+\.\d+)/

   *<td id="se_xrayResolution">   identifying tag for the information we are looking for, ...
   *\s*   ... probably followed by whitespace, ...
   *(\d+\.\d+)   ... the "payload" of the match: one or more digits, a literal dot and one or more digits.
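
The pattern can be tried out offline before fetching anything from the Web. The following is a minimal Python sketch, not the unit's own solution. It applies the same regex (identifying tag `<td id="se_xrayResolution">`, optional whitespace, captured decimal number) to a hypothetical HTML fragment. The fragment, the value `2.00`, and the names `SAMPLE_HTML`, `RESOLUTION_PATTERN` and `extract_resolution` are assumptions made for illustration; the markup of the real PDB page may differ.

```python
import re

# Hypothetical HTML fragment for illustration only; the markup of the real
# PDB page and the resolution value shown here are assumptions.
SAMPLE_HTML = """
<tr>
  <td>Resolution:</td>
  <td id="se_xrayResolution">
    2.00
  </td>
</tr>
"""

# Identifying tag, then optional whitespace, then the captured "payload":
# one or more digits, a literal dot, and one or more digits.
RESOLUTION_PATTERN = re.compile(r'<td id="se_xrayResolution">\s*(\d+\.\d+)')


def extract_resolution(html):
    """Return the captured resolution as a string, or None if there is no match."""
    match = RESOLUTION_PATTERN.search(html)
    return match.group(1) if match else None


print(extract_resolution(SAMPLE_HTML))                    # the captured payload
print(extract_resolution("<td>no resolution here</td>"))  # no identifying tag
```

Returning `None` when the identifying tag is missing makes a changed page layout easy to detect, instead of failing later on a missing capture group.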

   ;The code:
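   <source lang="PHP">
   <?php
 // Build the URL of the PDB record from the base path and the structure ID.
 $URLpath = "http://www.pdb.org/pdb/explore/explore.do?structureId=";
 $PDBid = "2imm";
 $source = file_get_contents($URLpath . $PDBid);

 // Capture the number that follows the opening tag of the resolution cell.
 if ($source !== false &&
     preg_match('/<td id="se_xrayResolution">\s*(\d+\.\d+)/', $source, $resolution)) {
     print($resolution[1]);
 }
 ?>
   </source>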

Self-evaluation

Notes

About ...

Author:

Boris Steipe <boris.steipe@utoronto.ca>

Created:

2017-09-17

Modified:

2018-01-31

Version:

1.0

Version history:

1.0 first live version
0.2 Contents outline
0.1 First stub

This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.

@@ Line 1: / Line 1: @@
-<div id="BIO">
+<div id="ABC">
-  <div class="b1">
+<div style="padding:5px; border:1px solid #000000; background-color:#f2fafa; font-size:300%; font-weight:400; color: #000000; width:100%;">
 Importing data in R
-  </div>
+<div style="padding:5px; margin-top:20px; margin-bottom:10px; background-color:#f2fafa; font-size:30%; font-weight:200; color: #000000; ">
+(Data import from unstructured text, structured text, spreadsheets, online repositories, and Web pages)
-  {{Vspace}}
+</div>
-<div class="keywords">
-<b>Keywords:</b>&nbsp;
-Practical for data import
 </div>
-{{Vspace}}
+{{Smallvspace}}
-__TOC__
-{{Vspace}}
-{{STUB}}
-{{Vspace}}
+<div style="padding:5px; border:1px solid #000000; background-color:#f2fafa33; font-size:85%;">
+<div style="font-size:118%;">
+<b>Abstract:</b><br />
-</div>
-<div id="ABC-unit-framework">
-== Abstract ==
 <section begin=abstract />
-<!-- included from "../components/RPR-Data-Import.components.wtxt", section: "abstract" -->
 Practical for data import from text files, spreadsheets and Web pages.
 <section end=abstract />
+</div>
+<!-- ============================  -->
+<hr>
+<table>
+<tr>
+<td style="padding:10px;">
+<b>Objectives:</b><br />
+...
+</td>
+<td style="padding:10px;">
+<b>Outcomes:</b><br />
+...
+</td>
+</tr>
+</table>
+<!-- ============================  -->
+<hr>
+<b>Deliverables:</b><br />
+<section begin=deliverables />
+<li><b>Time management</b>: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.</li>
+<li><b>Journal</b>: Document your progress in your [[FND-Journal|Course Journal]]. Some tasks may ask you to include specific items in your journal. Don't overlook these.</li>
+<li><b>Insights</b>: If you find something particularly noteworthy about this unit, make a note in your [[ABC-Insights|'''insights!''' page]].</li>
+<section end=deliverables />
+<!-- ============================  -->
+<hr>
+<section begin=prerequisites />
+<b>Prerequisites:</b><br />
+*[[RPR-Introduction|RPR-Introduction (Introduction to R)]]
+<section end=prerequisites />
+<!-- ============================  -->
+</div>
-{{Vspace}}
+{{Smallvspace}}
-== This unit ... ==
+{{SLEEP}}
-=== Prerequisites ===
-<!-- included from "../components/RPR-Data-Import.components.wtxt", section: "prerequisites" -->
-*[[RPR-Introduction|RPR-Introduction (Introduction to R)]]
-{{Vspace}}
+{{Smallvspace}}
-=== Objectives ===
+__TOC__
-<!-- included from "../components/RPR-Data-Import.components.wtxt", section: "objectives" -->
-...
 {{Vspace}}
-=== Outcomes ===
+=== Evaluation ===
-<!-- included from "../components/RPR-Data-Import.components.wtxt", section: "outcomes" -->
+<b>Evaluation: NA</b><br />
-...
+<div style="margin-left: 2rem;">This unit is not evaluated for course marks.</div>
+== Contents ==
+* Line endings
+* Unstructured text files, headers, skip, omit, and rownames
+* Caution with stringsAsFactors
+* Caution with coerced data
+* Text files with keywords
+* Text-files: changing state
+* Text-files: csv and tsv (and Excel sheets)
+* Text-files: slurping, chunking and streaming
+* curl
+* httr GET and POST
+* XML - libraries and XPath
+* Binary Data
+* text objects: readLines() and writeLines()
+* R objects: save() and load(); saveRDS() and readRDS().
-{{Vspace}}
+Regex for screenscraping example:
-=== Deliverables ===
+===Screenscraping===
-<!-- included from "../components/RPR-Data-Import.components.wtxt", section: "deliverables" -->
-<!-- included from "ABC-unit_components.wtxt", section: "deliverables-time_management" -->
-*<b>Time management</b>: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
-<!-- included from "ABC-unit_components.wtxt", section: "deliverables-journal" -->
-*<b>Journal</b>: Document your progress in your [[FND-Journal|Course Journal]]. Some tasks may ask you to include specific items in your journal. Don't overlook these.
-<!-- included from "ABC-unit_components.wtxt", section: "deliverables-insights" -->
-*<b>Insights</b>: If you find something particularly noteworthy about this unit, make a note in your [[ABC-Insights|'''insights!''' page]].
-{{Vspace}}
+Write a PHP program that screenscrapes resolution data for a protein structure file from the PDB.
+  Here is a [http://www.pdb.org/pdb/explore/explore.do?structureId=2imm '''link to a PDB record'''] to illustrate the URL format.
-=== Evaluation ===
+  <div class="mw-collapsible-content exercise-box">
-<!-- included from "../components/RPR-Data-Import.components.wtxt", section: "evaluation" -->
+    <div class="mw-collapsible mw-collapsed" data-expandtext="Solution" data-collapsetext="Collapse" style="width:90%; padding:10px; margin:5px; border:solid 1px #99999;">
-<!-- included from "ABC-unit_components.wtxt", section: "eval-none" -->
+    Fetch the contents of the URL into a string. Use a regex that captures the data you want to retrieve as part of some unique pattern in the HTML source. Print.
-<b>Evaluation: NA</b><br />
-:This unit is not evaluated for course marks.
-{{Vspace}}
+  <div class="mw-collapsible-content exercise-box">
+    ;The regex:
+    :<code>/<td id="se_xrayResolution">\s*(\d+\.\d+)/</code>
+    *<code><td id="se_xrayResolution"></code>&nbsp;&nbsp;&nbsp;<small>identifying tag for the information we are looking for, ...</small>
+    *<code>\s*</code>&nbsp;&nbsp;&nbsp;<small>... probably followed by whitespace, ...</small>
+    *<code>(\d+\.\d+)</code>&nbsp;&nbsp;&nbsp;<small>... the "payload" of the match: one or more digits, a literal dot and one or more digits.</small>
-</div>
+    ;The code:
-<div id="BIO">
+    <source lang="PHP">
-== Contents ==
+    <?php
-<!-- included from "../components/RPR-Data-Import.components.wtxt", section: "contents" -->
+  $URLpath = "http://www.pdb.org/pdb/explore/explore.do?structureId=";
-...
+  $PDBid = "2imm";
+  $source = file_get_contents($URLpath . $PDBid);
+  preg_match('/<td id="se_xrayResolution">\s*?(\d+\.\d+)/', $source, $resolution);
+  print($resolution[1]);
+  ?>
-{{Vspace}}
-== Further reading, links and resources ==
-<!-- Formatting examples:
-{{#pmid: 19957275}}
-<div class="reference-box">[http://www.ncbi.nlm.nih.gov]</div>
--->
-{{Vspace}}
-== Notes ==
-<!-- included from "../components/RPR-Data-Import.components.wtxt", section: "notes" -->
-<!-- included from "ABC-unit_components.wtxt", section: "notes" -->
-<references />
-{{Vspace}}
-</div>
-<div id="ABC-unit-framework">
 == Self-evaluation ==
-<!-- included from "../components/RPR-Data-Import.components.wtxt", section: "self-evaluation" -->
 <!--
 === Question 1===
@@ Line 124: / Line 134: @@
 -->
+== Further reading, links and resources ==
+<!-- Formatting examples:
+{{#pmid: 19957275}}
+<div class="reference-box">[http://www.ncbi.nlm.nih.gov]</div>
+-->
+== Notes ==
+<references />
 {{Vspace}}
-{{Vspace}}
-<!-- included from "ABC-unit_components.wtxt", section: "ABC-unit_ask" -->
-----
-{{Vspace}}
-<b>If in doubt, ask!</b> If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.
-----
-{{Vspace}}
 <div class="about">
@@ Line 152: / Line 153: @@
 :2017-09-17
 <b>Modified:</b><br />
-:2017-09-18
+:2018-01-31
 <b>Version:</b><br />
-:0.1
+:1.0
 <b>Version history:</b><br />
+*1.0 first live version
+*0.2 Contents outline
 *0.1 First stub
 </div>
-[[Category:ABC-units]]
+<!-- included from "./data/ABC-unit_components.txt", section: "ABC-unit_footer" -->
-<!-- included from "ABC-unit_components.wtxt", section: "ABC-unit_footer" -->
 {{CC-BY}}
+[[Category:ABC-units]]
+{{UNIT}}
+{{SLEEP}}
 </div>
 <!-- [END] -->