Difference between revisions of "RPR-Data-Import"
Revision as of 02:39, 3 February 2018
Importing data in R
(Data import from unstructured text, structured text, spreadsheets, online repositories, and Web pages)
Abstract:
Practical for data import from text files, spreadsheets and Web pages.
Objectives:
Outcomes:
Deliverables:
- Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
- Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.
- Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.
Prerequisites:
Contents
- Line endings
- Unstructured text files, headers, skip, omit, and rownames
- Caution with stringsAsFactors
- Caution with coerced data
- Text files with keywords
- Text-files: changing state
- Text-files: csv and tsv (and Excel sheets)
- Text-files: slurping, chunking and streaming
- curl
- httr GET and POST
- XML - libraries and XPath
- Binary Data
- R objects: save() and load()
Regex for screenscraping example:
Screenscraping
Write a PHP program that screenscrapes resolution data for a protein structure file from the PDB.
Here is a link to a PDB record to illustrate the URL format.
Fetch the contents of the URL into a string. Use a regex that captures the data you want to retrieve as part of some unique pattern in the HTML source. Print.
;The regex:
:/\s*(\d+\.\d+)/
* the identifying tag for the information we are looking for, ...
* \s* ... probably followed by whitespace, ...
* (\d+\.\d+) ... the "payload" of the match: one or more digits, a literal dot, and one or more digits.
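To see how the three parts of such a pattern work together, here is a minimal sketch in Python that can be run offline. The identifying tag is not shown in the pattern above, so the marker `<td id="resolution">` and the HTML fragment below are made-up stand-ins, not the actual PDB page markup. The helper name `extract_resolution` is likewise hypothetical.

```python
import re

# Assumed marker: a made-up tag standing in for whatever uniquely
# precedes the resolution value in the real page source.
PATTERN = re.compile(r'<td id="resolution">\s*(\d+\.\d+)')


def extract_resolution(source):
    """Return the captured number as a string, or None if nothing matches."""
    match = PATTERN.search(source)
    return match.group(1) if match else None


# Made-up fragment for illustration only. Note that another decimal
# number (0.25) appears earlier in the text.
sample = '<td>R-free</td><td>0.25</td><td id="resolution">\n  2.00</td>'

print(extract_resolution(sample))         # the number that follows the marker
print(extract_resolution('<p>none</p>'))  # marker absent, so no match
```

Without the identifying tag, the pattern would capture the first decimal number anywhere in the page (here 0.25), which is why the tag needs to be unique in the HTML source.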
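;The code:
<source lang="PHP">
<?php
// Base URL of the PDB record page; the structure ID is appended to it.
$URLpath = "http://www.pdb.org/pdb/explore/explore.do?structureId=";
$PDBid = "2imm";

// Fetch the contents of the URL into a string.
$source = file_get_contents($URLpath . $PDBid);

// Print the captured number only if the fetch and the match both succeeded.
if ($source !== false && preg_match('/\s*?(\d+\.\d+)/', $source, $resolution)) {
    print($resolution[1]);
}
?>
</source>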
Self-evaluation
Notes
Further reading, links and resources
If in doubt, ask! If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.
About ...
Author:
- Boris Steipe <boris.steipe@utoronto.ca>
Created:
- 2017-09-17
Modified:
- 2018-01-31
Version:
- 1.0
Version history:
- 1.0 first live version
- 0.2 Contents outline
- 0.1 First stub
This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.