<div id="ABC">
<div style="padding:5px; border:4px solid #000000; background-color:#b3dbce; font-size:300%; font-weight:400; color: #000000; width:100%;">
Storing Data
</div>
<div style="padding:5px; margin-top:20px; margin-bottom:10px; background-color:#b3dbce; font-size:30%; font-weight:200; color: #000000; ">
(Representation of data; common data formats; implementing a data model; JSON.)
</div>

{{Smallvspace}}
 
 
 
 
__TOC__
 
 
 
{{Vspace}}
 
  
  
<div style="padding:5px; border:1px solid #000000; background-color:#b3dbce33; font-size:85%;">
<div style="font-size:118%;">
<b>Abstract:</b><br />
<section begin=abstract />
This unit discusses options for storing and organizing data in a variety of formats, and develops R code for a protein data model, based on JSON-formatted source data.
<section end=abstract />
</div>
<!-- ============================  -->
<hr>
<table>
<tr>
<td style="padding:10px;">
<b>Objectives:</b><br />
This unit will ...
* introduce principles of storing data in different formats, and of supporting storage, maintenance, and retrieval with different tools;
* implement relational data models as lists of data frames in R;
* develop code for this purpose and show how to work with such code;
* find the closest homologue of Mbp1 in MYSPE.
</td>
<td style="padding:10px;">
<b>Outcomes:</b><br />
After working through this unit you ...
* can recommend alternatives for storing data, depending on the context and objectives;
* can create, edit and validate JSON-formatted data;
* have practiced creating a relational database as a list of data frames;
* can query the database directly, and via cross-referencing tables;
* have discovered the closest homologue to yeast Mbp1 in MYSPE;
* have added key information about this protein to the database.
</td>
</tr>
</table>
<!-- ============================  -->
<hr>
<b>Deliverables:</b><br />
<section begin=deliverables />
<ul>
<li><b>Time management</b>: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.</li>
<li><b>Journal</b>: Document your progress in your [[FND-Journal|Course Journal]]. Some tasks may ask you to include specific items in your journal. Don't overlook these.</li>
<li><b>Insights</b>: If you find something particularly noteworthy about this unit, make a note in your [[ABC-Insights|'''insights!''' page]].</li>
<li><b>Your protein database</b>: Write a database-generating script that loads a protein database from JSON files.</li>
</ul>
<section end=deliverables />
<!-- ============================  -->
<hr>
<section begin=prerequisites />
<b>Prerequisites:</b><br />
This unit builds on material covered in the following prerequisite unit:<br />
*[[BIN-Abstractions|BIN-Abstractions (Abstractions for Bioinformatics)]]
<section end=prerequisites />
<!-- ============================  -->
</div>
 
</div>
<div id="ABC-unit-framework">
 
=== Evaluation ===
<!-- included from "../components/BIN-Storing_data.components.wtxt", section: "evaluation" -->
This learning unit can be evaluated for a maximum of 5 marks. If you want to submit the tasks for this unit for credit:
<ol>
<li>Create a new page on the student Wiki as a subpage of your User Page.</li>
<li>There are a number of tasks in which you are explicitly asked to submit code or other text for credit. Put all of these submissions on this one page.</li>
<li>When you are done with everything, go to the [https://q.utoronto.ca/courses/180416/assignments Quercus '''Assignments''' page] and open the first Learning Unit that you have not submitted yet. Paste the URL of your Wiki page into the form, and click on '''Submit Assignment'''.</li>
</ol>

{{Vspace}}

Your link can be submitted only once and can not be edited afterwards. You may change your Wiki page at any time; however, only the last version before the due date will be marked, and all later edits will be silently ignored.

{{Smallvspace}}
  
</div>
 
<div id="BIO">
 
 
== Contents ==
<!-- included from "../components/BIN-Storing_data.components.wtxt", section: "contents" -->

{{Task|1=
  
  
Any software project requires modelling on many levels - data-flow models, logic models, user interaction models and more. But all of these ultimately rely on a '''data model''' that defines how the world is going to be represented in the computer for the project's purpose. The process of abstracting data entities and defining their relationships can (and should) take up a major part of the project definition, and often takes several iterations until you get it right. Whether your data can be completely described, consistently stored, and efficiently retrieved is determined in large part by your data model.

{{WP|Database|Databases}} can take many forms, from memories in your brain, to shoe cartons under your bed, to software applications on your computer, or warehouse-sized data centres. Fundamentally, these all do the same thing: collect information and make it available for retrieval.

Let us consider collecting information on APSES-domain transcription factors in various fungi, with the goal of being able to compare these transcription factors. Let's specify this as follows:
<div class="emphasis-box">Store data on APSES-domain proteins so that we can
* cross-reference the source databases;
* study whether they have the same features (e.g. domains);
* and compare the features across species.
</div>

The underlying information can easily be retrieved for a protein from its RefSeq or UniProt entry.
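The unit's source data will come in JSON format. To make that concrete, here is one possible way a protein record (here: yeast Mbp1) could be written as a JSON object - the field names are illustrative assumptions, not necessarily the schema of the course's actual data files:

<pre>
{
  "name": "Mbp1",
  "RefSeqID": "NP_010227",
  "UniProtID": "P39678",
  "species": "Saccharomyces cerevisiae",
  "seqLength": 833
}
</pre>

A format like this is easy for humans to read and edit, easy to validate, and easy to parse programmatically.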
  
{{Vspace}}

===Text files===

A first attempt to organize the data might be simply to write it down as text:
<pre>
name: Mbp1
refseq ID: NP_010227
uniprot ID: P39678
species: Saccharomyces cerevisiae
taxonomy ID: 4932
sequence:
MSNQIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKR
TRILEKEVLKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLF
DFTQTDGSASPPPAPKHHHASKVDRKKAIRSASTSAIMETKRNNKKAEEN
QFQSSKILGNPTAAPRKRGRPVGSTRGSRRKLGVNLQRSQSDMGFPRPAI
PNSSISTTQLPSIRSTMGPQSPTLGILEEERHDSRQQQPQQNNSAQFKEI
DLEDGLSSDVEPSQQLQQVFNQNTGFVPQQQSSLIQTQQTESMATSVSSS
PSLPTSPGDFADSNPFEERFPGGGTSPIISMIPRYPVTSRPQTSDINDKV
NKYLSKLVDYFISNEMKSNKSLPQVLLHPPPHSAPYIDAPIDPELHTAFH
WACSMGNLPIAEALYEAGTSIRSTNSQGQTPLMRSSLFHNSYTRRTFPRI
FQLLHETVFDIDSQSQTVIHHIVKRKSTTPSAVYYLDVVLSKIKDFSPQY
RIELLLNTQDKNGDTALHIASKNGDVVFFNTLVKMGALTTISNKEGLTAN
EIMNQQYEQMMIQNGTNQHVNSSNTDLNIHVNTNNIETKNDVNSMVIMSP
VSPSDYITYPSQIATNISRNIPNVVNSMKQMASIYNDLHEQHDNEIKSLQ
KTLKSISKTKIQVSLKTLEVLKESSKDENGEAQTNDDFEILSRLQEQNTK
KLRKRLIRYKRLIKQKLEYRQTVLLNKLIEDETQATTNNTVEKDNNTLER
LELAQELTMLQLQRKNKLSSLVKKFEDNAKIHKYRRIIREGTEMNIEEVD
SSLDVILQTLIANNNKNKGAEQIITISNANSHA
length: 833
Kila-N domain: 21-93
Ankyrin domains: 369-455, 505-549
</pre>
  
... and save it all in one large text file, and whenever you need to look something up, you just open the file, look for e.g. the name of the protein, and read what's there. Or - for a more structured approach - you could put this into several files in a folder.<ref>Your operating system can help you keep the files organized. The "file system" '''is''' a database.</ref> This is a perfectly valid approach, and for some applications it might not be worth the effort to think more deeply about how to structure the data and store it in a way that is robust and scales easily to large datasets. Alas, small projects have a tendency to grow into large projects, and if you work in this way it's almost guaranteed that you will end up doing many things by hand that could easily be automated. Imagine asking questions like:
* How many proteins do I have?
* What's the sequence of the Kila-N domain?
* What percentage of my proteins have an Ankyrin domain?
* Or two ...?

Answering these questions "by hand" is possible, but tedious.
  
{{Vspace}}

===Spreadsheets===

[[Image:DB_Excel-spreadsheet.jpg|frame|Data for three yeast APSES domain proteins in an Excel spreadsheet.]]

Many serious researchers keep their project data in spreadsheets. Often they use Excel, or an alternative like the free [https://www.openoffice.org/product/calc.html OpenOffice Calc] or [https://www.google.ca/sheets/about/ Google Sheets], both of which are compatible with Excel and have some interesting advantages. Here, all your data is in one place and easy to edit. You can even do simple calculations - although you should '''never''' use Excel for statistics<ref>For real: Excel is miserable and often wrong on statistics, and it makes horrible, ugly plots. See [http://www.practicalstats.com/xlstats/excelstats.html here] and [http://www.burns-stat.com/documents/tutorials/spreadsheet-addiction/ here] why Excel problems are not merely cosmetic.</ref>. You could answer ''What percentage of my proteins have an Ankyrin domain?'' quite easily<ref>At the bottom of the window there is a menu that says "sum = ..." by default. This provides simple calculations on the selected range. Set the choice to "count", select all Ankyrin domain entries, and the count shows you how many cells actually have a value.</ref>.

There are two major downsides to spreadsheets. For one, complex queries need programming - there is no way around this. You '''can''' program inside Excel with ''Visual Basic'', but you might as well export your data so you can work on it with a "real" programming language. The other issue is that Excel does not scale very well: once you have more than a hundred proteins in your spreadsheet, you can see how finding anything becomes tedious.
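To illustrate the export route, here is a minimal, self-contained sketch in base R: it writes a tiny stand-in for such a CSV export, reads it back, and answers the Ankyrin question programmatically. The file name and column names are assumptions for illustration, and the values are taken from the protein examples in this unit:

<pre>
# A tiny stand-in for a spreadsheet export: write a CSV, then read it back.
# "APSESproteins.csv" and its columns are illustrative assumptions.
csv <- 'name,refSeq,Ankyrin
Mbp1,NP_010227,"369-455, 505-549"
Swi4,NP_011036,516-662'
writeLines(csv, "APSESproteins.csv")

# read the file into a data frame ...
proteins <- read.csv("APSESproteins.csv", stringsAsFactors = FALSE)

# ... and query it
nrow(proteins)                # how many proteins? 2
sum(proteins$Ankyrin != "")   # how many have an Ankyrin annotation? 2
</pre>

Once the data is in a data frame, every question from the list above becomes a one-liner.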
  
However, just because Excel was built for business applications and designed for use by office assistants does not mean it is intrinsically unsuitable for our domain. It's important to be pragmatic, not dogmatic, when choosing tools: choose according to your real requirements. Sometimes "quick and dirty" is just fine - because quick.

{{Vspace}}

===R===

'''R''' can keep complex data in data frames and lists. If we do data analysis with '''R''', we have to load the data first. We can use any of the <code>read.table()</code> functions for structured data, read lines of raw text with <code>readLines()</code>, or slurp in entire files with <code>scan()</code>. Convenient packages exist to parse structured data like XML or JSON and import it. But we could also keep the data in an '''R''' object in the first place - one that we read from disk, analyze, modify, and write back. In this case, '''R''' becomes our database engine.
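For JSON, one such convenience package is <code>jsonlite</code>. A minimal sketch - this assumes the package is installed, and the field names are illustrative, not necessarily the schema of the course's data files:

<pre>
# Parse a JSON protein record into an R list (sketch; assumes the
# jsonlite package is installed; field names are illustrative).
json <- '{"name": "Mbp1", "RefSeqID": "NP_010227", "seqLength": 833}'
rec  <- jsonlite::fromJSON(json)

rec$name       # "Mbp1"
rec$seqLength  # 833
</pre>

The parsed record is an ordinary R list, so it can be validated, modified, and merged into a data frame like any other R object.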
<pre>
# Sample construction of an R database table as a data frame

# Data for the Mbp1 protein
proteins <- data.frame(
    name     = "Mbp1",
    refSeq   = "NP_010227",
    uniProt  = "P39678",
    species  = "Saccharomyces cerevisiae",
    taxId    = "4932",
    sequence = paste(
                    "MSNQIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKR",
                    "TRILEKEVLKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLF",
                    "DFTQTDGSASPPPAPKHHHASKVDRKKAIRSASTSAIMETKRNNKKAEEN",
                    "QFQSSKILGNPTAAPRKRGRPVGSTRGSRRKLGVNLQRSQSDMGFPRPAI",
                    "PNSSISTTQLPSIRSTMGPQSPTLGILEEERHDSRQQQPQQNNSAQFKEI",
                    "DLEDGLSSDVEPSQQLQQVFNQNTGFVPQQQSSLIQTQQTESMATSVSSS",
                    "PSLPTSPGDFADSNPFEERFPGGGTSPIISMIPRYPVTSRPQTSDINDKV",
                    "NKYLSKLVDYFISNEMKSNKSLPQVLLHPPPHSAPYIDAPIDPELHTAFH",
                    "WACSMGNLPIAEALYEAGTSIRSTNSQGQTPLMRSSLFHNSYTRRTFPRI",
                    "FQLLHETVFDIDSQSQTVIHHIVKRKSTTPSAVYYLDVVLSKIKDFSPQY",
                    "RIELLLNTQDKNGDTALHIASKNGDVVFFNTLVKMGALTTISNKEGLTAN",
                    "EIMNQQYEQMMIQNGTNQHVNSSNTDLNIHVNTNNIETKNDVNSMVIMSP",
                    "VSPSDYITYPSQIATNISRNIPNVVNSMKQMASIYNDLHEQHDNEIKSLQ",
                    "KTLKSISKTKIQVSLKTLEVLKESSKDENGEAQTNDDFEILSRLQEQNTK",
                    "KLRKRLIRYKRLIKQKLEYRQTVLLNKLIEDETQATTNNTVEKDNNTLER",
                    "LELAQELTMLQLQRKNKLSSLVKKFEDNAKIHKYRRIIREGTEMNIEEVD",
                    "SSLDVILQTLIANNNKNKGAEQIITISNANSHA",
                    sep=""),
    seqLen   = 833,
    KilAN    = "21-93",
    Ankyrin  = "369-455, 505-549")

# add data for the Swi4 protein
proteins <- rbind(proteins,
                  data.frame(
    name     = "Swi4",
    refSeq   = "NP_011036",
    uniProt  = "P25302",
    species  = "Saccharomyces cerevisiae",
    taxId    = "4932",
    sequence = paste(
                    "MPFDVLISNQKDNTNHQNITPISKSVLLAPHSNHPVIEIATYSETDVYEC",
                    "YIRGFETKIVMRRTKDDWINITQVFKIAQFSKTKRTKILEKESNDMQHEK",
                    "VQGGYGRFQGTWIPLDSAKFLVNKYEIIDPVVNSILTFQFDPNNPPPKRS",
                    "KNSILRKTSPGTKITSPSSYNKTPRKKNSSSSTSATTTAANKKGKKNASI",
                    "NQPNPSPLQNLVFQTPQQFQVNSSMNIMNNNDNHTTMNFNNDTRHNLINN",
                    "ISNNSNQSTIIQQQKSIHENSFNNNYSATQKPLQFFPIPTNLQNKNVALN",
                    "NPNNNDSNSYSHNIDNVINSSNNNNNGNNNNLIIVPDGPMQSQQQQQHHH",
                    "EYLTNNFNHSMMDSITNGNSKKRRKKLNQSNEQQFYNQQEKIQRHFKLMK",
                    "QPLLWQSFQNPNDHHNEYCDSNGSNNNNNTVASNGSSIEVFSSNENDNSM",
                    "NMSSRSMTPFSAGNTSSQNKLENKMTDQEYKQTILTILSSERSSDVDQAL",
                    "LATLYPAPKNFNINFEIDDQGHTPLHWATAMANIPLIKMLITLNANALQC",
                    "NKLGFNCITKSIFYNNCYKENAFDEIISILKICLITPDVNGRLPFHYLIE",
                    "LSVNKSKNPMIIKSYMDSIILSLGQQDYNLLKICLNYQDNIGNTPLHLSA",
                    "LNLNFEVYNRLVYLGASTDILNLDNESPASIMNKFNTPAGGSNSRNNNTK",
                    "ADRKLARNLPQKNYYQQQQQQQQPQNNVKIPKIIKTQHPDKEDSTADVNI",
                    "AKTDSEVNESQYLHSNQPNSTNMNTIMEDLSNINSFVTSSVIKDIKSTPS",
                    "KILENSPILYRRRSQSISDEKEKAKDNENQVEKKKDPLNSVKTAMPSLES",
                    "PSSLLPIQMSPLGKYSKPLSQQINKLNTKVSSLQRIMGEEIKNLDNEVVE",
                    "TESSISNNKKRLITIAHQIEDAFDSVSNKTPINSISDLQSRIKETSSKLN",
                    "SEKQNFIQSLEKSQALKLATIVQDEESKVDMNTNSSSHPEKQEDEEPIPK",
                    "STSETSSPKNTKADAKFSNTVQESYDVNETLRLATELTILQFKRRMTTLK",
                    "ISEAKSKINSSVKLDKYRNLIGITIENIDSKLDDIEKDLRANA",
                    sep=""),
    seqLen   = 1093,
    KilAN    = "56-122",
    Ankyrin  = "516-662")
    )

# how many proteins?
nrow(proteins)

# what are their names?
proteins[ , "name"]

# how many do not have an Ankyrin domain?
sum(proteins[ , "Ankyrin"] == "")

# save it to file ...
saveRDS(proteins, file = "proteinData.rds")

# ... delete it from memory ...
rm(proteins)

# ... check ...
proteins  # "object 'proteins' not found" ... yes, it's gone

# ... and read it back in:
proteins <- readRDS("proteinData.rds")

# did this work?
sum(proteins[ , "seqLen"])  # 1926 amino acids

# [END]
</pre>
  
&nbsp;

The third way to use '''R''' for data is to connect it to a "real" database:
*a relational database like {{WP|MySQL}}, {{WP|MariaDB}}, or {{WP|PostgreSQL}};
*an object/document database like {{WP|MongoDB}};
*or even a graph database like {{WP|Neo4j}}.

'''R''' "drivers" are available for all of these. However, all of them require installing extra software on your computer: the actual database, which runs as an independent application. If you need a rock-solid database with guaranteed integrity, multi-user support, {{WP|ACID}} transactional guarantees, industry-standard performance, and scalability to even '''very''' large datasets, don't think of rolling your own solution. One of the above is the way to go.
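To see what the driver route looks like without setting up a server, here is a sketch using the <code>DBI</code> interface with <code>RSQLite</code> - both assumed installed. SQLite is an embedded engine that runs in-process, so unlike the client-server systems above it needs no separate application; but the DBI calls are the same ones you would use with a MySQL or PostgreSQL driver:

<pre>
# Connecting R to a relational database via DBI (sketch; assumes the
# DBI and RSQLite packages are installed).
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")  # throwaway in-memory database

# store a small table ...
dbWriteTable(con, "proteins",
             data.frame(name   = c("Mbp1", "Swi4"),
                        seqLen = c(833, 1093)))

# ... and query it with SQL
longOnes <- dbGetQuery(con, "SELECT name FROM proteins WHERE seqLen > 1000;")
longOnes$name   # "Swi4"

dbDisconnect(con)
</pre>

Swapping SQLite for a server-based system changes only the <code>dbConnect()</code> call; the rest of the code stays the same.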
  
 
{{Vspace}}

===MySQL and friends===

[[Image:DB_MySQL-Workbench.jpg|frame|A "Schema" for a ''table'' that stores data for APSES domain proteins. This is a screenshot of the free MySQL Workbench application.]]

MySQL is a free, open '''relational database''' that powers some of the largest corporations as well as some of the smallest laboratories. It is based on a client-server model. The database engine runs as a daemon in the background and waits for connection attempts. When a connection is established, the '''server''' process opens a communication session with the client. The client sends requests, and the server responds. One can do this interactively, by running the '''client''' program <tt>/usr/local/mysql/bin/mysql</tt> (on Unix systems). Or, when you are using a program such as '''R''', Python, Perl, etc., you use the appropriate method calls or functions&mdash;the driver&mdash;to establish the connection.

These databases use their own language to describe actions: {{WP|SQL|'''SQL'''}}, which handles data definition, data manipulation, and data control.

Just for illustration, the Figure above shows a table for our APSES domain protein data, built in the [https://www.mysql.com/products/workbench/ MySQL Workbench] application and presented as an Entity Relationship Diagram (ERD). There is only one entity though - the protein "table". The application can generate the actual code that implements this model on an SQL-compliant database:
  
&nbsp;

<pre>
CREATE TABLE IF NOT EXISTS `mydb`.`proteins` (
  `name` VARCHAR(20) NULL,
  `refSeq` VARCHAR(20) NOT NULL,
  `uniProt` VARCHAR(20) NULL,
  `species` VARCHAR(45) NOT NULL COMMENT ' ',
  `taxId` VARCHAR(10) NULL,
  `sequence` BLOB NULL,
  `seqLen` INT NULL,
  `KilA-N` VARCHAR(45) NULL,
  `Ankyrin` VARCHAR(45) NULL,
  PRIMARY KEY (`refSeq`))
ENGINE = InnoDB
</pre>
  
  
This looks at least as complicated as putting the model into '''R''' in the first place. Why, then, would we do this, if we need to load the data into '''R''' for analysis anyway? There are several important reasons.

*Scalability: these systems are built to work with '''very''' large datasets and are optimized for performance. In theory '''R''' has very good performance with large data objects, but not when the data becomes larger than what the computer can keep in memory all at once.
*Concurrency: when several users need to access the data, potentially at the same time, you '''must''' use a "real" database system. Handling problems of concurrent access is what they are made for.
*ACID compliance: {{WP|ACID}} describes four properties that make a database robust. They are crucial for situations in which you have only partial control over your system or its input, and they would be quite laborious to implement for your hand-built '''R''' data model:
**'''A'''tomicity: each transaction is handled "indivisibly": it either succeeds fully, with all requested elements, or not at all.
**'''C'''onsistency: any transaction brings the database from one valid state to another; in particular, any data-validation rules have to be enforced.
**'''I'''solation: any concurrent execution of transactions results in exactly the same database state as if the transactions had been executed serially, one after the other.
**'''D'''urability: a committed transaction remains permanently committed, even in the event that the database crashes or later errors occur. You can think of this as an "autosave" function on every operation.

All the database systems I have mentioned above are ACID compliant<ref>For a list of relational Database Management Systems, see {{WP|Comparison of relational database management systems|here}}.</ref>.

Incidentally - RStudio has built-in database support via the '''Connections''' tab of the top-right pane. Read more about database connections directly from RStudio [https://db.rstudio.com/rstudio/connections/ '''here'''].

}}

{{Vspace}}
  
 
==Store Data==

{{Vspace}}
===A protein datamodel===

{{Vspace}}

<table>
<tr>
<td>
[[File:ProteinDBschema.svg|width=500px|link=https://docs.google.com/presentation/d/13vWaVcFpWEOGeSNhwmqugj2qTQuH1eZROgxWdHGEMr0]]
</td>
</tr>
<tr>
<td><small>
Entity-Relationship Diagram (ERD) for a protein data model that includes protein, taxonomy and feature annotations. Entity names are at the top of each box, attributes are listed below. If you think of an entity as an R data frame, or a spreadsheet table, the attributes are the column names, and each specific instance of an entity fills one row (or "record"). All relationships link to unique primary keys and are thus 1 to (0, n). The diagram was drawn as a "Google slide" and you can [https://docs.google.com/presentation/d/13vWaVcFpWEOGeSNhwmqugj2qTQuH1eZROgxWdHGEMr0 view it on the Web] and, if you have a Google account, you can make a copy to use for your own purposes.
</small></td>
</tr>
</table>

{{Vspace}}

===Implementing the Data Model in R===

To actually implement the data model in '''R''' we will create tables as data frames, and we will collect them in a list. We don't '''have''' to keep the tables in a list - we could also keep them as independent objects - but a list is neater and more flexible (we might want to have several of these), it reflects our intent about the model better, and doesn't require very much more typing.
  
 
{{Vspace}}
 
{{Vspace}}
  
 
+
{{ABC-unit|BIN-Storing_data.R}}
<!-- included from "ABC-unit_components.wtxt", section: "ABC-unit_ask" -->
 
 
 
----
 
  
 
{{Vspace}}
 
{{Vspace}}
  
<b>If in doubt, ask!</b> If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.
 
  
----
+
== Further reading, links and resources ==
 +
*{{WP|Database normalization}}
 +
<div class="reference-box">[https://www.lucidchart.com/pages/database-diagram/database-models Overview of data model types] (Lucidchart)</div>
 +
== Notes ==
 +
<references />
  
 
{{Vspace}}
 
{{Vspace}}
 +
  
 
<div class="about">
 
<div class="about">
Line 263: Line 375:
 
:2017-08-05
 
:2017-08-05
 
<b>Modified:</b><br />
 
<b>Modified:</b><br />
:2017-08-05
+
:2020-10-07
 
<b>Version:</b><br />
 
<b>Version:</b><br />
:0.1
+
:1.2
 
<b>Version history:</b><br />
 
<b>Version history:</b><br />
*0.1 First stub
+
*1.2 Edit policy update
 +
*1.1 Update to new marking scheme
 +
*1.0.1 Add link to data model types overview
 +
*1.0  First Live version
 +
*0.1   First stub
 
</div>
 
</div>
[[Category:ABC-units]]
 
<!-- included from "ABC-unit_components.wtxt", section: "ABC-unit_footer" -->
 
  
 
{{CC-BY}}
 
{{CC-BY}}
  
 +
[[Category:ABC-units]]
 +
{{EVAL}}
 +
{{LIVE}}
 +
{{EVAL}}
 
</div>
 
</div>
 
<!-- [END] -->
 
<!-- [END] -->

Latest revision as of 05:01, 10 October 2020

Storing Data

(Representation of data; common data formats; implementing a data model; JSON.)


 


Abstract:

This unit discusses options for storing and organizing data in a variety of formats, and develops R code for a protein data model, based on JSON formatted source data.


Objectives:
This unit will ...

  • Introduce principles of storing data in different formats and principles of supporting storage, maintenance, and retrieval with different tools;
  • Implement relational data models as lists of data frames in R;
  • Develop code for this purpose and teach how to work with such code;
  • Find the closest homologue of Mbp1 in MYSPE.

Outcomes:
After working through this unit you ...

  • can recommend alternatives for storing data, depending on the context and objectives;
  • can create, edit and validate JSON formatted data;
  • have practiced creating a relational database as a list of dataframes;
  • can query the database directly, and via cross-referencing tables;
  • have discovered the closest homologue to yeast Mbp1 in MYSPE;
  • have added key information about this protein to the database;

Deliverables:

  • Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
  • Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.
  • Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.
  • Your protein database: Write a database-generating script that loads a protein database from JSON files.

Prerequisites:
This unit builds on material covered in the following prerequisite units:


 



 



 


Evaluation

This learning unit can be evaluated for a maximum of 5 marks. If you want to submit the tasks for this unit for credit:

  1. Create a new page on the student Wiki as a subpage of your User Page.
  2. There are a number of tasks in which you are explicitly asked to submit code or other text for credit. Put all of these submissions on this one page.
  3. When you are done with everything, go to the Quercus Assignments page and open the first Learning Unit that you have not submitted yet. Paste the URL of your Wiki page into the form, and click on Submit Assignment.

Your link can be submitted only once and cannot be edited. But you may change your Wiki page at any time; however, only the last version before the due date will be marked. All later edits will be silently ignored.


 




 


Any software project requires modelling on many levels - data-flow models, logic models, user interaction models and more. But all of these ultimately rely on a data model that defines how the world is going to be represented in the computer for the project's purpose. The process of abstracting data entities and defining their relationships can (and should) take up a major part of the project definition, and often takes several iterations until you get it right. Whether your data can be completely described, consistently stored, and efficiently retrieved is determined in large part by your data model.

Databases can take many forms, from memories in your brain, to shoe-cartons under your bed, to software applications on your computer, or warehouse-sized data centres. Fundamentally, these all do the same thing: collect information and make it available for retrieval.

Let us consider collecting information on APSES-domain transcription factors in various fungi, with the goal of being able to compare the transcription factors. Let's specify this as follows:

Store data on APSES-domain proteins so that we can
  • cross reference the source databases;
  • study if they have the same features (e.g. domains);
  • and compare the features across species.

The underlying information can easily be retrieved for a protein from its RefSeq or UniProt entry.


Text files

A first attempt to organize the data might be simply to write it down as text:

name: Mbp1
refseq ID: NP_010227
uniprot ID: P39678
species: Saccharomyces cerevisiae
taxonomy ID: 4932
sequence:
MSNQIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKR
TRILEKEVLKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLF
DFTQTDGSASPPPAPKHHHASKVDRKKAIRSASTSAIMETKRNNKKAEEN
QFQSSKILGNPTAAPRKRGRPVGSTRGSRRKLGVNLQRSQSDMGFPRPAI
PNSSISTTQLPSIRSTMGPQSPTLGILEEERHDSRQQQPQQNNSAQFKEI
DLEDGLSSDVEPSQQLQQVFNQNTGFVPQQQSSLIQTQQTESMATSVSSS
PSLPTSPGDFADSNPFEERFPGGGTSPIISMIPRYPVTSRPQTSDINDKV
NKYLSKLVDYFISNEMKSNKSLPQVLLHPPPHSAPYIDAPIDPELHTAFH
WACSMGNLPIAEALYEAGTSIRSTNSQGQTPLMRSSLFHNSYTRRTFPRI
FQLLHETVFDIDSQSQTVIHHIVKRKSTTPSAVYYLDVVLSKIKDFSPQY
RIELLLNTQDKNGDTALHIASKNGDVVFFNTLVKMGALTTISNKEGLTAN
EIMNQQYEQMMIQNGTNQHVNSSNTDLNIHVNTNNIETKNDVNSMVIMSP
VSPSDYITYPSQIATNISRNIPNVVNSMKQMASIYNDLHEQHDNEIKSLQ
KTLKSISKTKIQVSLKTLEVLKESSKDENGEAQTNDDFEILSRLQEQNTK
KLRKRLIRYKRLIKQKLEYRQTVLLNKLIEDETQATTNNTVEKDNNTLER
LELAQELTMLQLQRKNKLSSLVKKFEDNAKIHKYRRIIREGTEMNIEEVD
SSLDVILQTLIANNNKNKGAEQIITISNANSHA
length: 833
KilA-N domain: 21-93
Ankyrin domains: 369-455, 505-549
...


... and save it all in one large text file and whenever you need to look something up, you just open the file, look for e.g. the name of the protein and read what's there. Or - for a more structured approach, you could put this into several files in a folder.[1] This is a perfectly valid approach and for some applications it might not be worth the effort to think more deeply about how to structure the data, and store it in a way that it is robust and scales easily to large datasets. Alas, small projects have a tendency to grow into large projects and if you work in this way, it's almost guaranteed that you will end up doing many things by hand that could easily be automated. Imagine asking questions like:

  • How many proteins do I have?
  • What's the sequence of the KilA-N domain?
  • What percentage of my proteins have an Ankyrin domain?
  • Or two ...?

Answering these questions "by hand" is possible, but tedious.
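To see why, consider the first question. With a flat text file, even "How many proteins do I have?" becomes a small parsing exercise. A minimal R sketch (the file name "proteinData.txt" and the abbreviated records are made up for illustration, following the format shown above):

```r
# Minimal sketch: querying a flat text file means parsing it line by line.
# First write a small sample file in the format shown above.
txt <- c("name: Mbp1",
         "refseq ID: NP_010227",
         "Ankyrin domains: 369-455, 505-549",
         "name: Swi4",
         "refseq ID: NP_011036",
         "Ankyrin domains: 516-662")
writeLines(txt, "proteinData.txt")

records <- readLines("proteinData.txt")

# every record begins with a "name:" line, so counting those counts proteins
nProteins <- sum(grepl("^name:", records))            # 2

# proteins with an Ankyrin annotation have an "Ankyrin domains:" line
nAnkyrin  <- sum(grepl("^Ankyrin domains:", records)) # 2
```

Each new question needs its own ad hoc pattern matching - which is exactly the tedium a structured datastore avoids.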

Spreadsheets

Data for three yeast APSES domain proteins in an Excel spreadsheet.

Many serious researchers keep their project data in spreadsheets. Often they use Excel, or an alternative like the free OpenOffice Calc, or Google Sheets, both of which are compatible with Excel and have some interesting advantages. Here, all your data is in one place, easy to edit. You can even do simple calculations - although you should never use Excel for statistics[2]. You could answer What percentage of my proteins have an Ankyrin domain? quite easily[3].

There are two major downsides to spreadsheets. For one, complex queries need programming. There is no way around this. You can program inside Excel with Visual Basic. But you might as well export your data so you can work on it with a "real" programming language. The other thing is that Excel does not scale very well. Once you have more than a hundred proteins in your spreadsheet, you can see how finding anything can become tedious.

However, just because Excel was built for business applications, and designed for use by office assistants, does not mean it is intrinsically unsuitable for our domain. It's important to be pragmatic, not dogmatic, when choosing tools: choose according to your real requirements. Sometimes "quick and dirty" is just fine, because quick.
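If the spreadsheet route fits your needs, moving between the spreadsheet and R is straightforward via CSV export. A sketch, assuming the sheet was saved as "proteins.csv" with a column per annotation (the example file is written here so the code is self-contained; in practice it would come from File > Save As... in Excel or Calc, and the rows are illustrative):

```r
# Sketch: read a CSV export of a spreadsheet into an R data frame.
writeLines(c("name,Ankyrin",
             "Mbp1,369-455; 505-549",
             "Swi4,516-662",
             "Xbp1,"),             # a protein without an Ankyrin annotation
           "proteins.csv")

proteins <- read.csv("proteins.csv", stringsAsFactors = FALSE)

# What percentage of my proteins have an Ankyrin domain?
pctAnk <- 100 * sum(proteins$Ankyrin != "") / nrow(proteins)
pctAnk    # 2 of 3 proteins
```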


 

R

R can keep complex data in data frames and lists. If we do data analysis with R, we have to load the data first. We can use any of the read.table() functions for structured data, read lines of raw text with readLines(), or slurp in entire files with scan(). Convenient packages exist to parse structured data like XML or JSON and import it. But we could also keep the data in an R object in the first place that we can read from disk, analyze, modify, and write back. In this case, R becomes our database engine.

# Sample construction of an R database table as a dataframe

# Data for the Mbp1 protein
proteins <- data.frame(
    name     = "Mbp1",
    refSeq   = "NP_010227",
    uniProt  = "P39678",
    species  = "Saccharomyces cerevisiae",
    taxId    = "4932",
    sequence = paste(
                    "MSNQIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKR",
                    "TRILEKEVLKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLF",
                    "DFTQTDGSASPPPAPKHHHASKVDRKKAIRSASTSAIMETKRNNKKAEEN",
                    "QFQSSKILGNPTAAPRKRGRPVGSTRGSRRKLGVNLQRSQSDMGFPRPAI",
                    "PNSSISTTQLPSIRSTMGPQSPTLGILEEERHDSRQQQPQQNNSAQFKEI",
                    "DLEDGLSSDVEPSQQLQQVFNQNTGFVPQQQSSLIQTQQTESMATSVSSS",
                    "PSLPTSPGDFADSNPFEERFPGGGTSPIISMIPRYPVTSRPQTSDINDKV",
                    "NKYLSKLVDYFISNEMKSNKSLPQVLLHPPPHSAPYIDAPIDPELHTAFH",
                    "WACSMGNLPIAEALYEAGTSIRSTNSQGQTPLMRSSLFHNSYTRRTFPRI",
                    "FQLLHETVFDIDSQSQTVIHHIVKRKSTTPSAVYYLDVVLSKIKDFSPQY",
                    "RIELLLNTQDKNGDTALHIASKNGDVVFFNTLVKMGALTTISNKEGLTAN",
                    "EIMNQQYEQMMIQNGTNQHVNSSNTDLNIHVNTNNIETKNDVNSMVIMSP",
                    "VSPSDYITYPSQIATNISRNIPNVVNSMKQMASIYNDLHEQHDNEIKSLQ",
                    "KTLKSISKTKIQVSLKTLEVLKESSKDENGEAQTNDDFEILSRLQEQNTK",
                    "KLRKRLIRYKRLIKQKLEYRQTVLLNKLIEDETQATTNNTVEKDNNTLER",
                    "LELAQELTMLQLQRKNKLSSLVKKFEDNAKIHKYRRIIREGTEMNIEEVD",
                    "SSLDVILQTLIANNNKNKGAEQIITISNANSHA",
                    sep=""),
    seqLen   = 833,
    KilAN    = "21-93",
    Ankyrin  = "369-455, 505-549")

# add data for the Swi4 protein
proteins <- rbind(proteins,
                  data.frame(
    name     = "Swi4",
    refSeq   = "NP_011036",
    uniProt  = "P25302",
    species  = "Saccharomyces cerevisiae",
    taxId    = "4932",
    sequence = paste(
                    "MPFDVLISNQKDNTNHQNITPISKSVLLAPHSNHPVIEIATYSETDVYEC",
                    "YIRGFETKIVMRRTKDDWINITQVFKIAQFSKTKRTKILEKESNDMQHEK",
                    "VQGGYGRFQGTWIPLDSAKFLVNKYEIIDPVVNSILTFQFDPNNPPPKRS",
                    "KNSILRKTSPGTKITSPSSYNKTPRKKNSSSSTSATTTAANKKGKKNASI",
                    "NQPNPSPLQNLVFQTPQQFQVNSSMNIMNNNDNHTTMNFNNDTRHNLINN",
                    "ISNNSNQSTIIQQQKSIHENSFNNNYSATQKPLQFFPIPTNLQNKNVALN",
                    "NPNNNDSNSYSHNIDNVINSSNNNNNGNNNNLIIVPDGPMQSQQQQQHHH",
                    "EYLTNNFNHSMMDSITNGNSKKRRKKLNQSNEQQFYNQQEKIQRHFKLMK",
                    "QPLLWQSFQNPNDHHNEYCDSNGSNNNNNTVASNGSSIEVFSSNENDNSM",
                    "NMSSRSMTPFSAGNTSSQNKLENKMTDQEYKQTILTILSSERSSDVDQAL",
                    "LATLYPAPKNFNINFEIDDQGHTPLHWATAMANIPLIKMLITLNANALQC",
                    "NKLGFNCITKSIFYNNCYKENAFDEIISILKICLITPDVNGRLPFHYLIE",
                    "LSVNKSKNPMIIKSYMDSIILSLGQQDYNLLKICLNYQDNIGNTPLHLSA",
                    "LNLNFEVYNRLVYLGASTDILNLDNESPASIMNKFNTPAGGSNSRNNNTK",
                    "ADRKLARNLPQKNYYQQQQQQQQPQNNVKIPKIIKTQHPDKEDSTADVNI",
                    "AKTDSEVNESQYLHSNQPNSTNMNTIMEDLSNINSFVTSSVIKDIKSTPS",
                    "KILENSPILYRRRSQSISDEKEKAKDNENQVEKKKDPLNSVKTAMPSLES",
                    "PSSLLPIQMSPLGKYSKPLSQQINKLNTKVSSLQRIMGEEIKNLDNEVVE",
                    "TESSISNNKKRLITIAHQIEDAFDSVSNKTPINSISDLQSRIKETSSKLN",
                    "SEKQNFIQSLEKSQALKLATIVQDEESKVDMNTNSSSHPEKQEDEEPIPK",
                    "STSETSSPKNTKADAKFSNTVQESYDVNETLRLATELTILQFKRRMTTLK",
                    "ISEAKSKINSSVKLDKYRNLIGITIENIDSKLDDIEKDLRANA",
                    sep=""),
    seqLen   = 1093,
    KilAN    = "56-122",
    Ankyrin  = "516-662")
    )

# how many proteins?
nrow(proteins)

#what are their names?
proteins[,"name"]

# how many do not have an Ankyrin domain?
sum(proteins[,"Ankyrin"] == "")

# save it to file
saveRDS(proteins, file="proteinData.rds")

# delete it from memory
rm(proteins)

# check...
proteins  # ... yes, it's gone

# read it back in:
proteins <- readRDS("proteinData.rds")

# did this work?
sum(proteins[ , "seqLen"])   # 1926 amino acids


# [END]
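The paragraph that introduced this section also mentioned JSON as an import format. A minimal sketch with the jsonlite package (assuming it is installed; the JSON snippet here is hand-made for illustration, not the course's actual source data):

```r
# Sketch: importing JSON-formatted protein data with the jsonlite package.
# install.packages("jsonlite")  # if not yet installed
library(jsonlite)

json <- '{
  "name": "Mbp1",
  "refSeq": "NP_010227",
  "uniProt": "P39678",
  "species": "Saccharomyces cerevisiae",
  "features": ["KilA-N", "Ankyrin", "Ankyrin"]
}'

protein <- fromJSON(json)   # parses into a named list

protein$name                # "Mbp1"
protein$features            # a character vector of feature names
```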


 

The third way to use R for data is to connect it to a "real" database:

  • a relational database like MySQL, MariaDB, or PostgreSQL;
  • an object/document database like MongoDB;
  • or even a graph-database like Neo4j.

R "drivers" are available for all of these. However, all of them require installing extra software on your computer: the actual database engine, which runs as an independent application. If you need a rock-solid database with guaranteed integrity, multi-user support, ACID transactional guarantees, industry standard performance, and scalability to even very large datasets, don't think of rolling your own solution. One of the above is the way to go.
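To illustrate the workflow, here is a sketch using R's DBI interface with RSQLite. SQLite is not one of the server systems listed above - it runs in-process, so nothing needs to be installed beyond the R packages - but the DBI calls are the same ones you would use with a MariaDB or PostgreSQL driver. This assumes the DBI and RSQLite packages are installed.

```r
# Sketch: talking to a relational database from R through DBI.
# An in-memory SQLite database stands in for a "real" server here.
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# create a table from a data frame ...
dbWriteTable(con, "proteins",
             data.frame(name   = c("Mbp1", "Swi4"),
                        seqLen = c(833, 1093)))

# ... and query it with SQL
res <- dbGetQuery(con, "SELECT name FROM proteins WHERE seqLen > 1000;")
res$name    # "Swi4"

dbDisconnect(con)
```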


 

MySQL and friends

A "Schema" for a table that stores data for APSES domain proteins. This is a screenshot of the free MySQL Workbench application.

MySQL is a free, open relational database that powers some of the largest corporations as well as some of the smallest laboratories. It is based on a client-server model: the database engine runs as a daemon in the background and waits for connection attempts. When a connection is established, the server process starts a communication session with the client. The client sends requests, and the server responds. You can work interactively, by running the client program /usr/local/mysql/bin/mysql (on Unix systems), or, from a program such as R, Python, or Perl, you use the appropriate method calls or functions - the driver - to establish the connection.

These types of databases use their own language to describe actions: SQL - which handles data definition, data manipulation, and data control.

Just for illustration, the Figure above shows a table for our APSES domain protein data, built as a table in the MySQL workbench application and presented as an Entity Relationship Diagram (ERD). There is only one entity though - the protein "table". The application can generate the actual code that implements this model on a SQL compliant database:


CREATE TABLE IF NOT EXISTS `mydb`.`proteins` (
  `name` VARCHAR(20) NULL,
  `refSeq` VARCHAR(20) NOT NULL,
  `uniProt` VARCHAR(20) NULL,
  `species` VARCHAR(45) NOT NULL COMMENT '	',
  `taxId` VARCHAR(10) NULL,
  `sequence` BLOB NULL,
  `seqLen` INT NULL,
  `KilA-N` VARCHAR(45) NULL,
  `Ankyrin` VARCHAR(45) NULL,
  PRIMARY KEY (`refSeq`))
ENGINE = InnoDB;


This looks at least as complicated as putting the model into R in the first place. Why then would we do this, if we need to load it into R for analysis anyway? There are several important reasons.

  • Scalability: these systems are built to work with very large datasets and optimized for performance. In theory R has very good performance with large data objects, but not so when the data becomes larger than what the computer can keep in memory all at once.
  • Concurrency: when several users need to access the data potentially at the same time, you must use a "real" database system. Handling problems of concurrent access is what they are made for.
  • ACID compliance. ACID describes four properties that make a database robust. They are crucial in situations in which you have only partial control over your system or its input, and they would be quite laborious to implement for your hand-built R data model:
    • Atomicity: Atomicity requires that each transaction is handled "indivisibly": it either succeeds fully, with all requested elements, or not at all.
    • Consistency: Consistency requires that any transaction will bring the database from one valid state to another. In particular any data-validation rules have to be enforced.
    • Isolation: Isolation ensures that any concurrent execution of transactions results in the exact same database state as if transactions would have been executed serially, one after the other.
    • Durability: The Durability requirement ensures that a committed transaction remains permanently committed, even in the event that the database crashes or later errors occur. You can think of this like an "autosave" function on every operation.

All the database systems I have mentioned above are ACID compliant[4].
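Atomicity is easy to see in action: statements grouped in a transaction either all take effect, or none do. A sketch, again using DBI with the in-process RSQLite driver for illustration (assuming both packages are installed):

```r
# Sketch: a rolled-back transaction leaves the database unchanged.
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "proteins", data.frame(name = "Mbp1", seqLen = 833))

dbBegin(con)                                            # start a transaction
dbExecute(con, "INSERT INTO proteins VALUES ('Swi4', 1093);")
dbRollback(con)                                         # undo the insert

n <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM proteins;")$n
n   # still 1: the rolled-back insert never became permanent

dbDisconnect(con)
```

Replacing dbRollback() with dbCommit() would make both rows permanent as a single, indivisible operation.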

Incidentally - RStudio has inbuilt database support via the Connections tab of the top-right pane. Read more about database connections directly from RStudio here.


 

Store Data

 

A protein datamodel

 

[Figure: ProteinDBschema.svg - ERD of the protein data model]

Entity-Relationship Diagram (ERD) for a protein data model that includes protein, taxonomy and feature annotations. Entity names are at the top of each box; attributes are listed below. If you think of an entity as an R data frame, or a spreadsheet table, the attributes are the column names and each specific instance of an entity fills one row (or "record"). All relationships link to unique primary keys and are thus 1 to (0, n). The diagram was drawn as a "Google slide": you can view it on the Web and, if you have a Google account, you can make a copy to use for your own purposes.


 

Implementing the Data Model in R

 

To implement the data model in R we will create the tables as data frames, and we will collect them in a list. We don't have to keep the tables in a list - we could also keep them as independent objects - but a list is neater and more flexible (we might want to have several of these), it reflects our intent for the model better, and it requires hardly any more typing.
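The idea can be sketched in a few lines: two of the model's tables as data frames collected in a list, and a lookup that cross-references them via the taxonomy ID. (Column names follow the ERD above; the values are illustrative only - the actual database is built by the course script.)

```r
# Sketch: a relational model as a list of data frames.
db <- list(
  protein  = data.frame(name   = c("Mbp1", "Swi4"),
                        refSeq = c("NP_010227", "NP_011036"),
                        taxId  = c(4932, 4932),
                        seqLen = c(833, 1093),
                        stringsAsFactors = FALSE),
  taxonomy = data.frame(taxId   = 4932,
                        species = "Saccharomyces cerevisiae",
                        stringsAsFactors = FALSE)
)

# Which species does Mbp1 come from? First fetch its taxId from the
# protein table, then look that ID up in the taxonomy table:
id <- db$protein$taxId[db$protein$name == "Mbp1"]
db$taxonomy$species[db$taxonomy$taxId == id]   # "Saccharomyces cerevisiae"
```

The cross-reference works exactly like a join between the two tables: the foreign key in one table is matched against the primary key of the other.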


 

Task:

 
  • Open RStudio and load the ABC-units R project. If you have loaded it before, choose File → Recent Projects → ABC-Units. If you have not loaded it before, follow the instructions in the RPR-Introduction unit.
  • Choose Tools → Version Control → Pull Branches to fetch the most recent version of the project from its GitHub repository with all changes and bug fixes included.
  • Type init() if requested.
  • Open the file BIN-Storing_data.R and follow the instructions.


 

Note: take care that you understand all of the code in the script. Evaluation in this course is cumulative and you may be asked to explain any part of code.


 


 


Further reading, links and resources

  • Database normalization (Wikipedia)
  • Overview of data model types (Lucidchart)

Notes

  1. Your operating system can help you keep the files organized. The "file system" is a database.
  2. For real: Excel is miserable and often wrong on statistics, and it makes horrible, ugly plots. See here and here for why Excel's problems are not merely cosmetic.
  3. At the bottom of the window there is a menu that says "sum = ..." by default. This provides simple calculations on the selected range. Set the choice to "count", select all Ankyrin domain entries, and the count shows you how many cells actually have a value.
  4. For a list of relational Database Management Systems, see here.


 


About ...
 
Author:

Boris Steipe <boris.steipe@utoronto.ca>

Created:

2017-08-05

Modified:

2020-10-07

Version:

1.2

Version history:

  • 1.2 Edit policy update
  • 1.1 Update to new marking scheme
  • 1.0.1 Add link to data model types overview
  • 1.0 First Live version
  • 0.1 First stub

This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.