RPR-Data-Import
Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.
Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.
Importing data in R
(Data import from unstructured text, structured text, spreadsheets, online repositories, and Web pages)
Abstract:
Practical for data import from text files, spreadsheets and Web pages.
Objectives:
Outcomes:
Deliverables:
Prerequisites:
This page is not currently being maintained since it is not part of active learning sections.
Evaluation
Evaluation: NA
This unit is not evaluated for course marks.
Contents
- Line endings
- Unstructured text files, headers, skip, omit, and rownames
- Caution with stringsAsFactors
- Caution with coerced data
- Text files with keywords
- Text-files: changing state
- Text-files: csv and tsv (and Excel sheets)
- Text-files: slurping, chunking and streaming
- curl
- httr GET and POST
- XML - libraries and XPath
- Binary Data
- text objects: readLines() and writeLines()
- R objects: save() and load(); saveRDS() and readRDS().
Regex for screenscraping example:
Screenscraping
Write a PHP program that screenscrapes resolution data for a protein structure file from the PDB.
Here is a link to a PDB record to illustrate the URL format.
Fetch the contents of the URL into a string. Use a regex that captures the data you want to retrieve as part of some unique pattern in the HTML source. Print the captured value.
;The regex:
/\s*(\d+\.\d+)/
* identifying tag for the information we are looking for, placed before the rest of the pattern, ...
* \s* ... probably followed by whitespace, ...
* (\d+\.\d+) ... the "payload" of the match: one or more digits, a literal dot, and one or more digits.
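;The code:
<source lang="PHP">
<?php
$URLpath = "http://www.pdb.org/pdb/explore/explore.do?structureId=";
$PDBid = "2imm";

// Fetch the HTML source of the PDB record into a string.
$source = file_get_contents($URLpath . $PDBid);

// Prefix the pattern with the identifying tag from the HTML source;
// without it, the first decimal number on the page is captured.
if ($source !== false && preg_match('/\s*(\d+\.\d+)/', $source, $resolution)) {
    print($resolution[1]);
}
?>
</source>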
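The role of the identifying tag can be tried out offline, without fetching a page. The following is a minimal sketch under stated assumptions: the helper name `extractAfterTag`, the sample markup, the tag text `Resolution:` and all values are invented for illustration and are not taken from an actual PDB page. It shows that the pattern without a tag captures the first decimal number it finds, while a tag in front of `\s*(\d+\.\d+)` anchors the match to the intended field.

```php
<?php
// Return the first decimal number that follows $tag in $html, or null if
// there is no match. $tag is treated as literal text, so it is escaped
// with preg_quote() before it is placed in front of the pattern.
function extractAfterTag(string $html, string $tag): ?string {
    $pattern = '/' . preg_quote($tag, '/') . '\s*(\d+\.\d+)/';
    return preg_match($pattern, $html, $m) === 1 ? $m[1] : null;
}

// Hypothetical fragment: markup and values are invented for illustration.
$sample = '<li>Version: 1.2</li> <li>Resolution: 2.00</li>';

var_dump(extractAfterTag($sample, ''));            // no tag: first decimal number in the string
var_dump(extractAfterTag($sample, 'Resolution:')); // tag anchors the match to the intended field
var_dump(extractAfterTag($sample, 'R-free:'));     // tag not present: no match
```

To use this on a real page, the tag would have to be copied from that page's HTML source, and it has to be unique enough that no earlier part of the page matches.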
Self-evaluation
Further reading, links and resources
Notes
About ...
Author:
- Boris Steipe <boris.steipe@utoronto.ca>
Created:
- 2017-09-17
Modified:
- 2018-01-31
Version:
- 1.0
Version history:
- 1.0 first live version
- 0.2 Contents outline
- 0.1 First stub
This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.