Difference between revisions of "RPR-Data-Import"

From "A B C"
Modified: 2018-01-31

Version: 1.0

Version history:

  • 1.0 First live version
  • 0.2 Contents outline
  • 0.1 First stub

Revision as of 02:39, 3 February 2018

Importing data in R

(Data import from unstructured text, structured text, spreadsheets, online repositories, and Web pages)


Abstract:

A practical unit on importing data into R from text files, spreadsheets, and Web pages.


Objectives:
...

Outcomes:
...


Deliverables:

  • Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
  • Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.
  • Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.

Prerequisites:


Contents

  • Line endings
  • Unstructured text files, headers, skip, omit, and rownames
  • Caution with stringsAsFactors
  • Caution with coerced data
  • Text files with keywords
  • Text files: changing state
  • Text files: csv and tsv (and Excel sheets)
  • Text files: slurping, chunking and streaming
  • curl
  • httr GET and POST
  • XML: libraries and XPath
  • Binary Data
  • R objects: save() and load()


Regex for screenscraping example:

Screenscraping

Write a PHP program that screenscrapes resolution data for a protein structure file from the PDB.

Here is a link to a PDB record to illustrate the URL format.

Fetch the contents of the URL into a string. Use a regex that captures the data you want to retrieve as part of some unique pattern in the HTML source. Print the captured value.
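
A minimal sketch of the fetch-and-capture steps, written in R (the language of this unit) rather than PHP. The helper names fetchPage() and extractResolution(), the sample HTML fragment, and the regex pattern are all assumptions made for illustration. The actual PDB page source may mark up the resolution differently, so the pattern would need to be adapted after inspecting the real HTML.

```r
# Fetch the contents of a URL into a single string.
# readLines() accepts a URL; the lines are joined with newlines.
fetchPage <- function(url) {
  paste(readLines(url, warn = FALSE), collapse = "\n")
}

# Capture the number that follows the label "Resolution" in the HTML.
# ASSUMPTION: label and value look like "Resolution:</strong> 1.90";
# any tags between label and number are skipped by (?:<[^>]+>\\s*)*.
extractResolution <- function(html) {
  patt <- "Resolution:?\\s*(?:<[^>]+>\\s*)*([0-9]+\\.?[0-9]*)"
  m <- regmatches(html, regexec(patt, html, perl = TRUE))[[1]]
  if (length(m) < 2) {
    return(NA_real_)   # pattern not found
  }
  as.numeric(m[2])     # m[1] is the full match, m[2] the captured group
}

# Hypothetical HTML fragment standing in for a fetched page
sampleHTML <- '<li id="res"><strong>Resolution:</strong> 1.90 &Aring;</li>'
print(extractResolution(sampleHTML))
```

To try this on a live page, pass the record URL to fetchPage() and hand the returned string to extractResolution(). Anchoring the capture to a label that occurs only once in the source is what makes the pattern "unique" in the sense of the task.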