Difference between revisions of "RPR-Data-Import"

From "A B C"
(6 intermediate revisions by the same user not shown)
Line 1: Line 1:
<div id="BIO">
+
<div id="ABC">
  <div class="b1">
+
<div style="padding:5px; border:1px solid #000000; background-color:#f2fafa; font-size:300%; font-weight:400; color: #000000; width:100%;">
 
Importing data in R
 
Importing data in R
  </div>
+
<div style="padding:5px; margin-top:20px; margin-bottom:10px; background-color:#f2fafa; font-size:30%; font-weight:200; color: #000000; ">
 
+
(Data import from unstructured text, structured text, spreadsheets, online repositories, and Web pages)
  {{Vspace}}
+
</div>
 
 
<div class="keywords">
 
<b>Keywords:</b>&nbsp;
 
Data import from unstructured text, structured text, spreadsheets, online repositories, and Web pages
 
 
</div>
 
</div>
  
{{Vspace}}
+
{{Smallvspace}}
 
 
  
__TOC__
 
  
{{Vspace}}
+
<div style="padding:5px; border:1px solid #000000; background-color:#f2fafa33; font-size:85%;">
 
+
<div style="font-size:118%;">
 
+
<b>Abstract:</b><br />
{{DEV}}
 
 
 
{{Vspace}}
 
 
 
 
 
</div>
 
<div id="ABC-unit-framework">
 
== Abstract ==
 
 
<section begin=abstract />
 
<section begin=abstract />
<!-- included from "../components/RPR-Data-Import.components.wtxt", section: "abstract" -->
 
 
Practical for data import from text files, spreadsheets and Web pages.
 
Practical for data import from text files, spreadsheets and Web pages.
 
<section end=abstract />
 
<section end=abstract />
 
+
</div>
{{Vspace}}
+
<!-- ============================  -->
 
+
<hr>
 
+
<table>
== This unit ... ==
+
<tr>
=== Prerequisites ===
+
<td style="padding:10px;">
<!-- included from "../components/RPR-Data-Import.components.wtxt", section: "prerequisites" -->
+
<b>Objectives:</b><br />
 +
...
 +
</td>
 +
<td style="padding:10px;">
 +
<b>Outcomes:</b><br />
 +
...
 +
</td>
 +
</tr>
 +
</table>
 +
<!-- ============================ -->
 +
<hr>
 +
<b>Deliverables:</b><br />
 +
<section begin=deliverables />
 +
<li><b>Time management</b>: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.</li>
 +
<li><b>Journal</b>: Document your progress in your [[FND-Journal|Course Journal]]. Some tasks may ask you to include specific items in your journal. Don't overlook these.</li>
 +
<li><b>Insights</b>: If you find something particularly noteworthy about this unit, make a note in your [[ABC-Insights|'''insights!''' page]].</li>
 +
<section end=deliverables />
 +
<!-- ============================  -->
 +
<hr>
 +
<section begin=prerequisites />
 +
<b>Prerequisites:</b><br />
 
*[[RPR-Introduction|RPR-Introduction (Introduction to R)]]
 
*[[RPR-Introduction|RPR-Introduction (Introduction to R)]]
 +
<section end=prerequisites />
 +
<!-- ============================  -->
 +
</div>
  
{{Vspace}}
+
{{Smallvspace}}
  
  
=== Objectives ===
+
{{SLEEP}}
<!-- included from "../components/RPR-Data-Import.components.wtxt", section: "objectives" -->
 
...
 
  
{{Vspace}}
+
{{Smallvspace}}
  
  
=== Outcomes ===
+
__TOC__
<!-- included from "../components/RPR-Data-Import.components.wtxt", section: "outcomes" -->
 
...
 
 
 
{{Vspace}}
 
 
 
 
 
=== Deliverables ===
 
<!-- included from "../components/RPR-Data-Import.components.wtxt", section: "deliverables" -->
 
<!-- included from "ABC-unit_components.wtxt", section: "deliverables-time_management" -->
 
*<b>Time management</b>: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
 
<!-- included from "ABC-unit_components.wtxt", section: "deliverables-journal" -->
 
*<b>Journal</b>: Document your progress in your [[FND-Journal|Course Journal]]. Some tasks may ask you to include specific items in your journal. Don't overlook these.
 
<!-- included from "ABC-unit_components.wtxt", section: "deliverables-insights" -->
 
*<b>Insights</b>: If you find something particularly noteworthy about this unit, make a note in your [[ABC-Insights|'''insights!''' page]].
 
  
 
{{Vspace}}
 
{{Vspace}}
Line 70: Line 62:
  
 
=== Evaluation ===
 
=== Evaluation ===
<!-- included from "../components/RPR-Data-Import.components.wtxt", section: "evaluation" -->
 
<!-- included from "ABC-unit_components.wtxt", section: "eval-none" -->
 
 
<b>Evaluation: NA</b><br />
 
<b>Evaluation: NA</b><br />
:This unit is not evaluated for course marks.
+
<div style="margin-left: 2rem;">This unit is not evaluated for course marks.</div>
 
 
{{Vspace}}
 
 
 
 
 
</div>
 
<div id="BIO">
 
 
== Contents ==
 
== Contents ==
<!-- included from "../components/RPR-Data-Import.components.wtxt", section: "contents" -->
 
 
* Line endings
 
* Line endings
 
* Unstructured text files, headers, skip, omit, and rownames
 
* Unstructured text files, headers, skip, omit, and rownames
Line 94: Line 77:
 
* XML - libraries and XPath
 
* XML - libraries and XPath
 
* Binary Data
 
* Binary Data
* R objects: save() and load()
+
* text objects: readLines() and writeLines()
 +
* R objects: save() and load(); saveRDS() and readRDS().
 +
 
 +
 
 +
Regex for screenscraping example:
 +
 
 +
===Screenscraping===
 +
 
 +
Write a PHP program that screenscrapes resolution data for a protein structure file from the PDB.
 +
 
 +
  Here is a [http://www.pdb.org/pdb/explore/explore.do?structureId=2imm '''link to a PDB record'''] to illustrate the URL format.
 +
 
 +
  <div class="mw-collapsible-content exercise-box">
 +
    <div class="mw-collapsible mw-collapsed" data-expandtext="Solution" data-collapsetext="Collapse" style="width:90%; padding:10px; margin:5px; border:solid 1px #999999;">
 +
    Fetch the contents of the URL into a string. Use a regex that captures the data you want to retrieve as part of some unique pattern in the HTML source. Print.
 +
 
 +
  <div class="mw-collapsible-content exercise-box">
 +
    ;The regex:
 +
    :<code>/<td id="se_xrayResolution">\s*(\d+\.\d+)/</code>
  
{{Vspace}}
+
    *<code><td id="se_xrayResolution"></code>&nbsp;&nbsp;&nbsp;<small>identifying tag for the information we are looking for, ...</small>
 +
    *<code>\s*</code>&nbsp;&nbsp;&nbsp;<small>... probably followed by whitespace, ...</small>
 +
    *<code>(\d+\.\d+)</code>&nbsp;&nbsp;&nbsp;<small>... the "payload" of the match: one or more digits, a literal dot, and one or more digits.</small>
  
 +
    ;The code:
 +
    <source lang="PHP">
 +
    <?php
 +
  $URLpath = "http://www.pdb.org/pdb/explore/explore.do?structureId=";
 +
  $PDBid = "2imm";
 +
  $source = file_get_contents($URLpath . $PDBid);
 +
  preg_match('/<td id="se_xrayResolution">\s*?(\d+\.\d+)/', $source, $resolution);
 +
  print($resolution[1]);
 +
  ?>
  
== Further reading, links and resources ==
 
<!-- Formatting examples:
 
{{#pmid: 19957275}}
 
<div class="reference-box">[http://www.ncbi.nlm.nih.gov]</div>
 
-->
 
  
{{Vspace}}
 
  
  
== Notes ==
 
<!-- included from "../components/RPR-Data-Import.components.wtxt", section: "notes" -->
 
<!-- included from "ABC-unit_components.wtxt", section: "notes" -->
 
<references />
 
  
{{Vspace}}
 
  
  
</div>
 
<div id="ABC-unit-framework">
 
 
== Self-evaluation ==
 
== Self-evaluation ==
<!-- included from "../components/RPR-Data-Import.components.wtxt", section: "self-evaluation" -->
 
 
<!--
 
<!--
 
=== Question 1===
 
=== Question 1===
Line 136: Line 134:
  
 
-->
 
-->
 +
== Further reading, links and resources ==
 +
<!-- Formatting examples:
 +
{{#pmid: 19957275}}
 +
<div class="reference-box">[http://www.ncbi.nlm.nih.gov]</div>
 +
-->
 +
== Notes ==
 +
<references />
  
 
{{Vspace}}
 
{{Vspace}}
  
 
 
{{Vspace}}
 
 
 
<!-- included from "ABC-unit_components.wtxt", section: "ABC-unit_ask" -->
 
 
----
 
 
{{Vspace}}
 
 
<b>If in doubt, ask!</b> If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.
 
 
----
 
 
{{Vspace}}
 
  
 
<div class="about">
 
<div class="about">
Line 164: Line 153:
 
:2017-09-17
 
:2017-09-17
 
<b>Modified:</b><br />
 
<b>Modified:</b><br />
:2017-09-18
+
:2018-01-31
 
<b>Version:</b><br />
 
<b>Version:</b><br />
:0.2
+
:1.0
 
<b>Version history:</b><br />
 
<b>Version history:</b><br />
*0.1 Contents outline
+
*1.0 first live version
 +
*0.2 Contents outline
 
*0.1 First stub
 
*0.1 First stub
 
</div>
 
</div>
[[Category:ABC-units]]
+
<!-- included from "./data/ABC-unit_components.txt", section: "ABC-unit_footer" -->
<!-- included from "ABC-unit_components.wtxt", section: "ABC-unit_footer" -->
 
  
 
{{CC-BY}}
 
{{CC-BY}}
  
 +
[[Category:ABC-units]]
 +
{{UNIT}}
 +
{{SLEEP}}
 
</div>
 
</div>
 
<!-- [END] -->
 
<!-- [END] -->

Latest revision as of 01:41, 23 September 2020

Importing data in R

(Data import from unstructured text, structured text, spreadsheets, online repositories, and Web pages)


 


Abstract:

Practical for data import from text files, spreadsheets and Web pages.


Objectives:
...

Outcomes:
...


Deliverables:

  • Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
  • Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.
  • Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.

Prerequisites:

  • RPR-Introduction (Introduction to R)


     


    This page is not currently being maintained since it is not part of active learning sections.


     



     


    Evaluation

    Evaluation: NA

    This unit is not evaluated for course marks.

    Contents

    • Line endings
    • Unstructured text files, headers, skip, omit, and rownames
    • Caution with stringsAsFactors
    • Caution with coerced data
    • Text files with keywords
    • Text-files: changing state
    • Text-files: csv and tsv (and Excel sheets)
    • Text-files: slurping, chunking and streaming
    • curl
    • httr GET and POST
    • XML - libraries and XPath
    • Binary Data
    • text objects: readLines() and writeLines()
    • R objects: save() and load(); saveRDS() and readRDS().


    Regex for screenscraping example:

    Screenscraping

    Write a PHP program that screenscrapes resolution data for a protein structure file from the PDB.

     Here is a link to a PDB record to illustrate the URL format: http://www.pdb.org/pdb/explore/explore.do?structureId=2imm
    
       Fetch the contents of the URL into a string. Use a regex that captures the data you want to retrieve as part of some unique pattern in the HTML source, then print the captured value.
    
       ;The regex:
    
       :/<td id="se_xrayResolution">\s*(\d+\.\d+)/
       *<td id="se_xrayResolution">   identifying tag for the information we are looking for, ...
       *\s*   ... probably followed by whitespace, ...
       *(\d+\.\d+)   ... the "payload" of the match: one or more digits, a literal dot, and one or more digits.
    
       ;The code:
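       <source lang="PHP">
       <?php
       // Base URL of the PDB record page and the ID of the structure to query.
       $URLpath = "http://www.pdb.org/pdb/explore/explore.do?structureId=";
       $PDBid = "2imm";
       // Fetch the HTML source of the page into a string.
       $source = file_get_contents($URLpath . $PDBid);
       // Capture the number that follows the opening tag of the resolution cell.
       if ($source !== false && preg_match('/<td id="se_xrayResolution">\s*?(\d+\.\d+)/', $source, $resolution)) {
           print($resolution[1]);
       }
       ?>
       </source>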
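
To see how the regex behaves without contacting the PDB, it can be applied to a small HTML string. The following is only a minimal sketch: the helper name `extractResolution` and the sample markup are assumptions made up for illustration, and the live page may be structured differently.

```php
<?php
// Hypothetical helper (not part of the original exercise): applies the
// regex from the text to an HTML string and returns the captured number,
// or null if the pattern does not occur.
function extractResolution(string $html): ?string
{
    $pattern = '/<td id="se_xrayResolution">\s*(\d+\.\d+)/';
    if (preg_match($pattern, $html, $match) === 1) {
        return $match[1];   // first capture group: the "payload"
    }
    return null;            // pattern not found
}

// Made-up sample fragment standing in for the fetched page source.
$sample = "<tr><td id=\"se_xrayResolution\">\n   2.00 &Aring;</td></tr>";

$resolution = extractResolution($sample);
print($resolution === null ? "no match\n" : $resolution . "\n");
```

Returning null when nothing matches keeps a missing or renamed tag from being mistaken for a valid resolution value.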
    




    Self-evaluation

    Further reading, links and resources

    Notes


     


    About ...
     
    Author:

    Boris Steipe <boris.steipe@utoronto.ca>

    Created:

    2017-09-17

    Modified:

    2018-01-31

    Version:

    1.0

    Version history:

    • 1.0 first live version
    • 0.2 Contents outline
    • 0.1 First stub

    This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License.
