Difference between revisions of "RPR-Data-Import"
m (Created page with "<div id="BIO"> <div class="b1"> Importing data in R </div> {{Vspace}} <div class="keywords"> <b>Keywords:</b> Practical for data import </div> {{Vspace}} __...") |
m |
||
(7 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
− | <div id=" | + | <div id="ABC"> |
− | + | <div style="padding:5px; border:1px solid #000000; background-color:#f2fafa; font-size:300%; font-weight:400; color: #000000; width:100%;"> | |
Importing data in R | Importing data in R | ||
− | + | <div style="padding:5px; margin-top:20px; margin-bottom:10px; background-color:#f2fafa; font-size:30%; font-weight:200; color: #000000; "> | |
− | + | (Data import from unstructured text, structured text, spreadsheets, online repositories, and Web pages) | |
− | + | </div> | |
− | |||
− | <div | ||
− | |||
− | |||
</div> | </div> | ||
− | {{ | + | {{Smallvspace}} |
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | + | <div style="padding:5px; border:1px solid #000000; background-color:#f2fafa33; font-size:85%;"> | |
− | + | <div style="font-size:118%;"> | |
− | + | <b>Abstract:</b><br /> | |
− | < | ||
− | <div | ||
− | |||
<section begin=abstract /> | <section begin=abstract /> | ||
− | |||
Practical for data import from text files, spreadsheets and Web pages. | Practical for data import from text files, spreadsheets and Web pages. | ||
<section end=abstract /> | <section end=abstract /> | ||
+ | </div> | ||
+ | <!-- ============================ --> | ||
+ | <hr> | ||
+ | <table> | ||
+ | <tr> | ||
+ | <td style="padding:10px;"> | ||
+ | <b>Objectives:</b><br /> | ||
+ | ... | ||
+ | </td> | ||
+ | <td style="padding:10px;"> | ||
+ | <b>Outcomes:</b><br /> | ||
+ | ... | ||
+ | </td> | ||
+ | </tr> | ||
+ | </table> | ||
+ | <!-- ============================ --> | ||
+ | <hr> | ||
+ | <b>Deliverables:</b><br /> | ||
+ | <section begin=deliverables /> | ||
+ | <li><b>Time management</b>: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.</li> | ||
+ | <li><b>Journal</b>: Document your progress in your [[FND-Journal|Course Journal]]. Some tasks may ask you to include specific items in your journal. Don't overlook these.</li> | ||
+ | <li><b>Insights</b>: If you find something particularly noteworthy about this unit, make a note in your [[ABC-Insights|'''insights!''' page]].</li> | ||
+ | <section end=deliverables /> | ||
+ | <!-- ============================ --> | ||
+ | <hr> | ||
+ | <section begin=prerequisites /> | ||
+ | <b>Prerequisites:</b><br /> | ||
+ | *[[RPR-Introduction|RPR-Introduction (Introduction to R)]] | ||
+ | <section end=prerequisites /> | ||
+ | <!-- ============================ --> | ||
+ | </div> | ||
− | {{ | + | {{Smallvspace}} |
− | + | {{SLEEP}} | |
− | |||
− | |||
− | |||
− | {{ | + | {{Smallvspace}} |
− | + | __TOC__ | |
− | |||
− | |||
{{Vspace}} | {{Vspace}} | ||
− | === | + | === Evaluation === |
− | < | + | <b>Evaluation: NA</b><br /> |
− | + | <div style="margin-left: 2rem;">This unit is not evaluated for course marks.</div> | |
+ | == Contents == | ||
+ | * Line endings | ||
+ | * Unstructured text files, headers, skip, omit, and rownames | ||
+ | * Caution with stringsAsFactors | ||
+ | * Caution with coerced data | ||
+ | * Text files with keywords | ||
+ | * Text-files: changing state | ||
+ | * Text-files: csv and tsv (and Excel sheets) | ||
+ | * Text-files: slurping, chunking and streaming | ||
+ | * curl | ||
+ | * httr GET and POST | ||
+ | * XML - libraries and cpath | ||
+ | * Binary Data | ||
+ | * text objects: readLines() and writeLines() | ||
+ | * R objects: save() and load(); saveRDS() and readRDS(). | ||
− | |||
+ | Regex for screenscraping example: | ||
− | === | + | ===Screenscraping=== |
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | + | Write a PHP program that screenscrapes resolution data for a protein structure file from the PDB. | |
+ | Here is a [http://www.pdb.org/pdb/explore/explore.do?structureId=2imm '''link to a PDB record'''] to illustrate the URL format. | ||
− | = | + | <div class="mw-collapsible-content exercise-box"> |
− | + | <div class="mw-collapsible mw-collapsed" data-expandtext="Solution" data-collapsetext="Collapse" style="width:90%; padding:10px; margin:5px; border:solid 1px #99999;"> | |
− | < | + | Fetch the contents of the URL into a string. Use a regex that captures the data you want to retrieve as part of some unique pattern in the HTML source. Print. |
− | |||
− | |||
− | + | <div class="mw-collapsible-content exercise-box"> | |
+ | ;The regex: | ||
+ | :<code>/<td id="se_xrayResolution">\s*(\d+\.\d+)/</code> | ||
+ | *<code><td id="se_xrayResolution"></code> <small>identifying tag for the information we are looking for, ...</small> | ||
+ | *<code>\s*</code> <small>... probably followed by whitespace, ...</small> | ||
+ | *<code>(\d+\.\d+)</code> <small>... the "payload" of the match: one or more digits, a literal dot, and one or more digits.</small> | ||
− | + | ;The code: | |
− | < | + | <source lang="PHP"> |
− | = | + | <?php |
− | + | $URLpath = "http://www.pdb.org/pdb/explore/explore.do?structureId="; | |
− | .. | + | $PDBid = "2imm"; |
+ | $source = file_get_contents($URLpath . $PDBid); | ||
+ | preg_match('/<td id="se_xrayResolution">\s*(\d+\.\d+)/', $source, $resolution); | ||
+ | print($resolution[1]); | ||
+ | ?> | ||
+ | </source> | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
== Self-evaluation == | == Self-evaluation == | ||
− | |||
<!-- | <!-- | ||
=== Question 1=== | === Question 1=== | ||
Line 124: | Line 134: | ||
--> | --> | ||
+ | == Further reading, links and resources == | ||
+ | <!-- Formatting examples: | ||
+ | {{#pmid: 19957275}} | ||
+ | <div class="reference-box">[http://www.ncbi.nlm.nih.gov]</div> | ||
+ | --> | ||
+ | == Notes == | ||
+ | <references /> | ||
{{Vspace}} | {{Vspace}} | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
<div class="about"> | <div class="about"> | ||
Line 152: | Line 153: | ||
:2017-09-17 | :2017-09-17 | ||
<b>Modified:</b><br /> | <b>Modified:</b><br /> | ||
− | : | + | :2018-01-31 |
<b>Version:</b><br /> | <b>Version:</b><br /> | ||
− | :0 | + | :1.0 |
<b>Version history:</b><br /> | <b>Version history:</b><br /> | ||
+ | *1.0 first live version | ||
+ | *0.2 Contents outline | ||
*0.1 First stub | *0.1 First stub | ||
</div> | </div> | ||
− | + | <!-- included from "./data/ABC-unit_components.txt", section: "ABC-unit_footer" --> | |
− | <!-- included from "ABC-unit_components. | ||
{{CC-BY}} | {{CC-BY}} | ||
+ | [[Category:ABC-units]] | ||
+ | {{UNIT}} | ||
+ | {{SLEEP}} | ||
</div> | </div> | ||
<!-- [END] --> | <!-- [END] --> |
Latest revision as of 01:41, 23 September 2020
Importing data in R
(Data import from unstructured text, structured text, spreadsheets, online repositories, and Web pages)
Abstract:
Practical for data import from text files, spreadsheets and Web pages.
Objectives: ...
Outcomes: ...
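Deliverables:
- Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
- Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.
- Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.
Prerequisites:
- RPR-Introduction (Introduction to R)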
This page is not currently being maintained since it is not part of active learning sections.
Evaluation
Evaluation: NA
This unit is not evaluated for course marks.
Contents
- Line endings
- Unstructured text files, headers, skip, omit, and rownames
- Caution with stringsAsFactors
- Caution with coerced data
- Text files with keywords
- Text-files: changing state
- Text-files: csv and tsv (and Excel sheets)
- Text-files: slurping, chunking and streaming
- curl
- httr GET and POST
- XML - libraries and cpath
- Binary Data
- text objects: readLines() and writeLines()
- R objects: save() and load(); saveRDS() and readRDS().
Regex for screenscraping example:
Screenscraping
Write a PHP program that screenscrapes resolution data for a protein structure file from the PDB.
Here is a link to a PDB record to illustrate the URL format.
Fetch the contents of the URL into a string. Use a regex that captures the data you want to retrieve as part of some unique pattern in the HTML source. Print.
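The regex:
/<td id="se_xrayResolution">\s*(\d+\.\d+)/
- <td id="se_xrayResolution"> identifying tag for the information we are looking for, ...
- \s* ... probably followed by whitespace, ...
- (\d+\.\d+) ... the "payload" of the match: one or more digits, a literal dot, and one or more digits.
The code:
<source lang="PHP">
<?php
// Build the record URL from the base path and the PDB ID.
$URLpath = "http://www.pdb.org/pdb/explore/explore.do?structureId=";
$PDBid = "2imm";
// Fetch the HTML source of the record into a string.
$source = file_get_contents($URLpath . $PDBid);
// Capture the resolution value that follows the identifying tag.
if (preg_match('/<td id="se_xrayResolution">\s*(\d+\.\d+)/', $source, $resolution)) {
    print($resolution[1]);
}
?>
</source>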
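The capture step can also be tried offline, without contacting the PDB. The following is a minimal sketch under stated assumptions: the HTML fragment in `$sample` and its value are invented for illustration and are not taken from an actual PDB record, and `extractResolution` is a hypothetical helper name, not part of the original solution.

```php
<?php
// Hypothetical HTML fragment standing in for a fetched page; the value is invented.
$sample = "<tr><td id=\"se_xrayResolution\">\n   2.00</td></tr>";

// Return the captured resolution as a string, or null when the pattern is absent.
function extractResolution(string $html): ?string {
    $pattern = '/<td id="se_xrayResolution">\s*(\d+\.\d+)/';
    // preg_match() returns 1 on a match; $m[0] holds the whole match, $m[1] the first group.
    if (preg_match($pattern, $html, $m) === 1) {
        return $m[1];
    }
    return null;
}

$resolution = extractResolution($sample);
print($resolution ?? "not found");
```

Passing the string returned by `file_get_contents()` instead of `$sample` would apply the same function to a live page. Returning `null` makes a missing tag explicit, rather than reading `$m[1]` when nothing matched.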
Self-evaluation
Further reading, links and resources
Notes
About ...
Author:
- Boris Steipe <boris.steipe@utoronto.ca>
Created:
- 2017-09-17
Modified:
- 2018-01-31
Version:
- 1.0
Version history:
- 1.0 first live version
- 0.2 Contents outline
- 0.1 First stub
This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.