RPR-Data-Import

From "A B C"

Importing data in R

(Data import from unstructured text, structured text, spreadsheets, online repositories, and Web pages)


 


Abstract:

Practical for data import from text files, spreadsheets and Web pages.


Objectives:
...

Outcomes:
...


Deliverables:

  • Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
  • Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.
  • Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.

  • Prerequisites:


     


    This page is not currently being maintained since it is not part of active learning sections.


     



     


    Evaluation

    Evaluation: NA

    This unit is not evaluated for course marks.

    Contents

    • Line endings
    • Unstructured text files, headers, skip, omit, and rownames
    • Caution with stringsAsFactors
    • Caution with coerced data
    • Text files with keywords
    • Text-files: changing state
    • Text-files: csv and tsv (and Excel sheets)
    • Text-files: slurping, chunking and streaming
    • curl
    • httr GET and POST
    • XML - libraries and XPath
    • Binary Data
    • text objects: readLines() and writeLines()
    • R objects: save() and load(); saveRDS() and readRDS().


    Regex for screenscraping example:

    Screenscraping

    Write a PHP program that screenscrapes resolution data for a protein structure file from the PDB.

     Here is a link to a PDB record to illustrate the URL format.
    
       Fetch the contents of the URL into a string. Use a regex that captures the data you want to retrieve as part of some unique pattern in the HTML source. Print the captured value.
    
       ;The regex:
    
    :/\s*(\d+\.\d+)/
       *   identifying tag for the information we are looking for, ...
       *\s*   ... probably followed by whitespace, ...
       *(\d+\.\d+)   ... the "payload" of the match: one or more digits, a literal dot and one or more digits.
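
       The three parts of the pattern described above can be tried out on a string before fetching anything from the Web. The following is a minimal sketch in Python, not the PHP solution the task asks for. It makes two assumptions: the marker <td id="resolution"> is a hypothetical stand-in for the identifying tag, and the HTML fragment is made up for illustration and is not real PDB output. The function name extract_resolution is likewise only illustrative.

```python
import re

# Hypothetical marker: a stand-in for the identifying tag, chosen for illustration.
MARKER = '<td id="resolution">'

# Marker (escaped so it is matched literally), optional whitespace,
# then the "payload": digits, a literal dot, digits, captured as group 1.
PATTERN = re.compile(re.escape(MARKER) + r"\s*(\d+\.\d+)")


def extract_resolution(html):
    """Return the captured number as a string, or None if the pattern is absent."""
    match = PATTERN.search(html)
    return match.group(1) if match else None


# Made-up HTML fragment for illustration only; not real PDB output.
sample = '<tr><td>Resolution</td><td id="resolution">\n  1.85</td></tr>'
print(extract_resolution(sample))
```

       Anchoring the number to a marker is what makes the match specific: a bare (\d+\.\d+) would capture the first decimal number anywhere in the page.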
    
       ;The code:
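       <source lang="PHP">
       <?php
       // Build the URL of the PDB record and fetch its HTML source into a string.
       $URLpath = "http://www.pdb.org/pdb/explore/explore.do?structureId=";
       $PDBid = "2imm";
       $source = file_get_contents($URLpath . $PDBid);
       if ($source === false) {
           exit("Could not fetch the record for " . $PDBid . "\n");
       }

       // On a successful match, capture group 1 holds the resolution value.
       if (preg_match('/\s*?(\d+\.\d+)/', $source, $resolution)) {
           print($resolution[1]);
       }
       ?>
       </source>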
    




    Self-evaluation

    Further reading, links and resources

    Notes


     


    About ...
     
    Author:

    Boris Steipe <boris.steipe@utoronto.ca>

    Created:

    2017-09-17

    Modified:

    2018-01-31

    Version:

    1.0

    Version history:

    • 1.0 first live version
    • 0.2 Contents outline
    • 0.1 First stub

    This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License.
