Difference between revisions of "Relational database example 1"

From "A B C"
Jump to navigation Jump to search
m
Line 34: Line 34:
  
 
To prepare -
 
To prepare -
* read the section on [[Database_introduction]] and make sure you understand the concepts of an ERD (Entity relationship diagram) and a database schema.
+
* read the section on [[Data modelling]] and make sure you understand the concepts of an ERD (Entity relationship diagram) and a database schema.
 
* read the section on [[MySQL_installation_notes]], install MySQL on your computer and start the <tt>mysqld</tt> database server.  
 
* read the section on [[MySQL_installation_notes]], install MySQL on your computer and start the <tt>mysqld</tt> database server.  
  
Line 128: Line 128:
 
or
 
or
 
  SELECT * FROM inx WHERE (int_A LIKE 'mal%' OR int_B LIKE 'mal%');
 
  SELECT * FROM inx WHERE (int_A LIKE 'mal%' OR int_B LIKE 'mal%');
 
  
 
== Further reading ==
 
== Further reading ==

Revision as of 12:46, 27 September 2015

Relational database example 1: task and model


The contents of this page has recently been imported from an older version of this Wiki. This page may contain outdated information, information that is irrelevant for this Wiki, information that needs to be differently structured, outdated syntax, and/or broken links. Use with caution!


Summary ...


Related Pages


 

Introductory reading



A worked database example

In this example, we will create a datamodel for a protein-protein interaction network, create the database as a MySQL database, load data into the database and perform interactive queries and updates.

To prepare -

  • read the section on Data modelling and make sure you understand the concepts of an ERD (Entity relationship diagram) and a database schema.
  • read the section on MySQL_installation_notes, install MySQL on your computer and start the mysqld database server.


Task 1: the data

  • Download the results to your computer as a tab-delimited file, name it mpidbRaw.txt.
  • The file is a tab-delimited text file and we would like to extract only two columns from it:
    • The gene-name part of "Alias(es) interactor A"
    • The gene-name part of "Alias(es) interactor B"
    • The number of different "Interaction detection method(s)" with which this interaction has been detetcted, as a stand-in for a confidence score
  • Without further explanation, here is a Perl one-line script to extract the interactions from the raw data. Copy it and enter it on the commandline.
    • perl -e '<STDIN>;while(<STDIN>){s/uniprotkb://g;split("\t");if($_[4]&&$_[5]){print"$_[4]\t$_[5]\t",scalar(split(":",$_[13])),"\n"}}' < mpidbRaw.txt > EcoIntx.txt
  • Without further explanation, here is a Perl one-line script to extract the gene name aliases from the interactions. Copy it and enter it on the commandline.
    • perl -e 'while(<STDIN>){split("\t");$g{$_[0]}++;$g{$_[1]}++;}foreach(keys(%g)){print"$_\n"};' < EcoIntx.txt > EcoGenes.txt
  • Check your results - they should look something like this:
shell> head EcoIntx.txt
glnL    glnL    1
kdpD    kdpD    1
zraS    zraR    2
creC    creB    2
torR    torS    2
yedW    yedV    2
hnr     arcB    1
cheY    arcB    1
basS    basR    2
rstB    rstA    2
shell> head EcoGenes.txt
cvpA
recX
rfaS
dapE
emrY
ycjZ
papK2
ccmD
yhgF
murE
shell>

Task 2: the schema

  • draw up an ERD for an interaction database based on this data. Your datamodel should be able to store information on each gene and on each interaction. This requires two separate tables.
  • define a schema for your datamodel. Remember to define a primary key - don't rely on either gene names or interactions to be unique.


Task 3: create the database

  • Refer to the explanations and examples in the Database_introduction section and run mysql on the commandline (as root) to CREATE a database called ecoint and GRANT all privileges on that database to a user named ecodev; also GRANT FILE privileges to that user.

Tip: Do not name your interaction table "int". This is a reserved word that would need to be quoted to be used in this way. Otherwise it will result in an error. Better: use a different table name.


Task 4: create the tables

  • run mysql on the commandline as user ecodevand USE the ecoint database. Use the CREATE TABLE command to create the tables you have defined in your schema. Use the DESCRIBE command to check that the tables exist and have the intended attributes. For example, one output could look like this:
mysql> describe gene;
+-----------+-------------+------+-----+---------+----------------+
| Field     | Type        | Null | Key | Default | Extra          |
+-----------+-------------+------+-----+---------+----------------+
| Gene_ID   | int(11)     | NO   | PRI | NULL    | auto_increment | 
| Gene_Name | varchar(10) | NO   |     | NULL    |                | 
+-----------+-------------+------+-----+---------+----------------+
2 rows in set (0.01 sec)



Task 5: load the data

  • Use the LOAD DATA INFILE filename INTO table (column_1, column_2, ...); syntax to load the data you have prepared into the database. Remember to enter the full path for your file and to quote the filename. Specify only the datacolumns you need - the identifiers (with autoincrement) will be automatically set. If you get a privilege error at that point, you probably forgot to give FILE privileges to your account.


On some Linux distros the default file permissions for mysql are very strict and set globally. If your data file is in a directory mysql doesn't like (for ubuntu8 it seems to be anything in /home), the load data operation might return something like "ERROR 13 (HY000): Can't get stat of /path/file..." even if you are logged into mysql as root or a user with read/write access to the file. It probably involves this bug.

An easier work around is to save your data somewhere in /tmp which circumvents error 13.

Task 6: try some SELECT statements

  • Use SELECT statements to verify that your data has been loaded. For example you could try the following (or similar, depending on the column/table names you have actually used):
SELECT * FROM gene WHERE Gene_Name='fimC';

or

SELECT * FROM inx WHERE weight > 6;

or

SELECT * FROM inx WHERE (int_A LIKE 'mal%' OR int_B LIKE 'mal%');

Further reading


   

Further reading and resources