Difference between revisions of "BIO systems project"

From "A B C"
m
Line 7: Line 7:
 
 
 
 
  
This course gives you a broad overview of bioinformatics principles, but you should also strive to explore one aspect of the field more deeply.
+
This course gives you a broad overview of bioinformatics principles, but you should also strive to apply those principles towards a biological question.
  
'''For your term project I would like you to identify a defined biological "system" - a set of genes that collaborate towards a shared purpose.''' We start by looking  at biological processes, represented in the '''Gene Ontology''' (GO). From there we can find related processes, functions and cellular components.  The problem and reason why we need human intuition to work out a systems definition based on this kind of information is that there are more aspects to a system than just the actual function: genes that are responsible for substrate import, biosynthesis of cofactors, signalling, regulation, constructing scaffolds ''etc.'' may also be part of the system. On the other hand some genes participate in a central role in making the process possible, but they provide this service to many other systems as well and are actually parts of a distinct but collaborating system. Membrane transporters might be an obvious example.  
+
'''For your term project I would like you to define a biological "system" - a set of genes that collaborate towards a shared purpose.''' We start from a biological process, represented in the '''Gene Ontology''' (GO). From there we can use methods of function-annotation to identify related processes, functions and cellular components and the genes that are associated with them.  The goal of the project is, once such a list of genes is defined, to define the "system" in which those genes collaborate, and to sketch its architecture. This requires both "bottom up" procedures of gene discovery, and "top down" reasoning about ancillary functions such as substrate import, biosynthesis of cofactors, signalling, regulation, constructing scaffolds ''etc.'' that may not (yet) be represented in the list of genes, and to clearly identify, concepts such as ''purpose''; ''boundaries'' (i.e. which genes from the list are actually part of the system, and which ones are associated with the process, but should be considered to be outside of its boundaries, in a supporting role, a shared role, or simply part of distinct but collaborating system. Membrane transporters might be an obvious example.); ''interfaces'' and, the system's ''input'' and ''output''.
  
It is your task to manage this from the perspective of a biological expert and try to define inclusion/exclusion criteria as best as you can. While your "list of genes" is going to be interesting, compiling such lists can be automated. Thus the most valuable outcome of your project how you will address the task of defining your '''system boundaries'''.
+
It is your task to manage this from the perspective of a biological expert and try to define inclusion/exclusion criteria as best as you can. While your "list of genes" is going to be interesting, compiling such lists can be automated. Thus the most valuable outcome of your project how you will address the task of defining the conceptual aspects of the system and attempting to organize this into an architectural sketch.
  
 
In practice you should  
 
In practice you should  
* define the biological process you are interested in;
+
* choose a biological process you are interested in (I have provided a candidate list);
* collect all contributing genes as best you can, using a broad spectrum of literature comments and bioinformatics tools that we may have or have not covered in the course;
+
* collect all contributing genes<ref>I speak of ''genes'' here in a very informal sense, the system components may include genes, their encoded proteins, structural and regulatory RNA, metabolites, and even environmental signals.</ref> as best you can, using bioinformatics tools and literature annotations;
* develop unambiguous criteria for including or not including such genes in your system;
+
* develop unambiguous criteria for including or not including such genes in your system and annotating them;
* provide an annotated list of included genes, and ones that you have excluded; and
+
* list the conceptual roles in your system;
* carefully document your efforts and results: the datasources, what procedures have been applied, how the results been accessed, validated and interpreted...
+
* associate whatever genes you can with those roles and identify genes you were not able to associate with roles, and roles for which the associated genes are unknown;
 +
* carefully document your efforts and results: the datasources and literature, what procedures have been applied, how the results been accessed, validated and interpreted...
  
Ideally, your process would be defined at a level where the system that realizes it is comprised of some 20, 30 genes or so, not much more, to keep things manageable.
+
Ideally, your system would be defined at a level where the system that realizes it is comprised of some 20, components or so, not more, to keep things manageable.
 
   
 
   
  
Line 28: Line 29:
  
  
===Open topic===
+
===The project steps===
The function you choose is open. I have posted [[BIO_project_GO-term_table|a list of suggestions]]. However, you should ensure you don't choose the same process as someone else in the class.
 
  
  
===First stage: Choosing a suitable process (5 marks max.)===
 
  
To define a system, we will start from a biological process in the '''GO''' biological process ontology. I have excerpted a table of processes to get you started, explained the procedure in detail and worked it out in one example. You can find all of this here.
+
===First stage: process and genes(11 marks max.)===
  
*[[BIO_project_GO-term_table| '''Table of GO terms''' - choosing a process to define a system]]
+
To define a system, we will start from a biological process in the '''GO''' biological process ontology. I have excerpted a table of processes to get you started, and explained the procedure in detail. You can find the documentation and the table through the links below.
  
Note that you are not constrained to start from a process in that table. If you are determined to work on a different human system, you are welcome.
+
*[[BIO_project_GO-term_table|Notes on the table creation and recommendations how to use it.]
 +
*[http://steipe.biochemistry.utoronto.ca/abc/students/index.php/BCH441_2016_poject_GO_term_table '''Table of GO terms''' - use this to choose and "adopt" a process to define a system]
  
The page also links to an example page on my Student Wiki. The example page illustrates what I expect from you for full marks for this stage.
+
Note that you are not constrained to start from a process in that table. If you are determined to work on a different human system because you have particular knowledge about it, you may suggest this to me and perhaps we can add it to the table.
 +
 
 +
<!-- The page also links to a template page that is going to be useful to organize your efforts. I have also provided an example page on my student Wiki with a model solution for illustration. -->
  
  
Line 49: Line 51:
  
  
'''Keep your systems manageable.''' When considering how many genes are associated with a system, check the taxon section of the relevant GO terms' statistic on QuickGO. The number of genes involved in the process in humans is likely as large as the largest number for ANY species - although many of the human genes may not have been annotated for that process (yet). For example, if the mouse (mus musculus) has 20 annotated genes and humans have only two, that probably does not mean humans can achieve with only two genes that for which the mouse needs twenty. Part of the next stage will be to attempt "annotation transfer" between orthologues. You will need to consider the genes individually...
+
'''Keep your systems manageable.''' When considering how many genes are associated with a system, check the taxon section of the relevant GO terms' statistic on QuickGO. The number of genes involved in the process in humans is likely as large as the largest number for ANY species - although many of the human genes may not have been annotated for that process (yet). For example, if the mouse (mus musculus) has 20 annotated genes and humans have only two, that probably does not mean humans can achieve with only two genes that for which the mouse needs twenty. You will need to consider the genes individually...
  
 
'''Keep your systems simple.''' I would avoid choosing systems/processes that integrate sensory, nervous, hormonal and cellular components. This may become too complex. Narrowing it down, to a manageable "subsystem" is a valuable exercise in itself. Such a system may implement  
 
'''Keep your systems simple.''' I would avoid choosing systems/processes that integrate sensory, nervous, hormonal and cellular components. This may become too complex. Narrowing it down, to a manageable "subsystem" is a valuable exercise in itself. Such a system may implement  
Line 60: Line 62:
 
*mediating interactions with other systems,
 
*mediating interactions with other systems,
 
*or similar...
 
*or similar...
:<small>(I'm just throwing these terms out there but I think we probably need to work out a systems roles ontology (SyRO) for the next stage, to have some context against which we evaluate the individual genes' roles.)</small>
 
  
'''Spend some thought on naming your "system" well.''' For example a concept like ''immune response'' does not allude to '''why''' the system exists. I think naming the concept ''defense against pathogens'' captures this better.
 
  
:We actually have an interesting situation. It is common for science to ask '''how''' questions, not '''why''' questions, because the '''why''' questions are thought usually not to have a scientific answer, ''i.e.'' they are not well posed in the sense that an answer might not exist, might not be unique, or might not be verifiable as being an answer. But we have discussed that evolution works by selecting from (neutral) variation according to an organism's fitness function. This allows us to formulate an answer to a '''why''' question: a system exists '''because''' it improves the organism's fitness function<ref>Of course this is a simplification - a system might also exist because it is a vestige of evolutionary history. The textbook example we often consider for this case is the existence of whales' pelvic bones. Matters are not so simple however: as has been recently shown these may play a role in copulation ([http://www.ncbi.nlm.nih.gov/pubmed/25186496 PubMed]).</ref>. In general we have no way of quantifying the fitness function - it represents a very high-dimensional multi-parameter optimization problem. But what we '''can''' observe is the existence of purifying selection. This gives us a rigorous, testable, scientific perspective: a system exists '''because''' it does something which results in traces of selection.
+
'''Spend some thought on naming your "system" well.''' For example a concept like ''immune response'' does not allude to '''why''' the system exists. I think naming the concept ''defense against pathogens'' captures the purpose better and this will help you organize the components.
 +
 
 +
<!-- :We actually have an interesting situation. It is common for science to ask '''how''' questions, not '''why''' questions, because the '''why''' questions are thought usually not to have a scientific answer, ''i.e.'' they are not well posed in the sense that an answer might not exist, might not be unique, or might not be verifiable as being an answer. But we have discussed that evolution works by selecting from (neutral) variation according to an organism's fitness function. This allows us to formulate an answer to a '''why''' question: a system exists '''because''' it improves the organism's fitness function<ref>Of course this is a simplification - a system might also exist because it is a vestige of evolutionary history. The textbook example we often consider for this case is the existence of whales' pelvic bones. Matters are not so simple however: as has been recently shown these may play a role in copulation ([http://www.ncbi.nlm.nih.gov/pubmed/25186496 PubMed]).</ref>. In general we have no way of quantifying the fitness function - it represents a very high-dimensional multi-parameter optimization problem. But what we '''can''' observe is the existence of purifying selection. This gives us a rigorous, testable, scientific perspective: a system exists '''because''' it does something which results in traces of selection. -->
  
 
</div>
 
</div>
Line 104: Line 106:
 
&nbsp;
 
&nbsp;
  
===Second stage: Compiling a list of genes (12 marks max.)===
+
===Second stage: Compiling a list of genes (11 marks max.)===
  
 
The second stage of the project is for you to detail the roles that your system needs to work, and to associate genes with roles.  
 
The second stage of the project is for you to detail the roles that your system needs to work, and to associate genes with roles.  
Line 143: Line 145:
 
{{vspace}}
 
{{vspace}}
  
===Final stage: Documentation (9 marks max.)===
+
===Final stage: Documentation (4 marks max.)===
  
 
The documentation must fulfill '''two''' aspects.  
 
The documentation must fulfill '''two''' aspects.  
  
* First, your documentation must make your data and results '''reproducible'''. You need to specify the premises you started from and how you came up with them, and you need to specify the procedure through which you arrived at your conclusions. Put yourselves into the mind of a reviewer: are you providing enough information so that your (computational) steps can be reproduced? Are your source IDs specified? Your resources and programs? Have you made your R scripts available? The parameters for analysis?
+
* First, your documentation must make your data and results '''reproducible'''. You need to specify the premises you started from and how you came up with them, and you need to specify the procedure through which you arrived at your conclusions. Put yourselves into the mind of a reviewer: are you providing enough information so that your (computational) steps can be reproduced? Are your source IDs specified? Your resources and programs?  
  
 
* Second, your documentation must explain the rationale behind your procedure and conclusions. This is not so much ''what'' you did but ''why'' you did this, what was the logic behind a certain process or decision.
 
* Second, your documentation must explain the rationale behind your procedure and conclusions. This is not so much ''what'' you did but ''why'' you did this, what was the logic behind a certain process or decision.
Line 182: Line 184:
 
&nbsp;
 
&nbsp;
 
<div class="alert">
 
<div class="alert">
The '''function choice''' is due by the end of '''week 7'''.<br />
+
The project (like all class work) is due by the end of classes, December 6. 2016. If you need an extension you '''must''' contact me at least a day before the deadline. Please state briefly the requested duration of the extension. The extension request should not extend past the final exam date.
The '''compilation of the list of genes''' and '''documentation''' are due before the Exam. If you need time beyond that data, you must notify me before the exam.<br />
 
 
</div>
 
</div>
  
Line 190: Line 191:
  
 
===Late submissions===
 
===Late submissions===
The time of submission is recorded with your edits on the Wiki and can be identified in the '''View history''' tab of a page: I will consider the last edit before the submission deadline for marking. However, if you want me to consider a later edit instead (i.e. "late submission" with the appropriate penalties), send me an eMail to that effect. If you don't email me, your mark from an incomplete submission will stand.
+
The time of submission is recorded with your edits on the Wiki and can be identified in the '''View history''' tab of a page: I will consider the last edit before the submission deadline for marking. There will be no other "late deductions" applied.
 +
 
 +
<!-- However, if you want me to consider a later edit instead (i.e. "late submission" with the appropriate penalties), send me an eMail to that effect. If you don't email me, your mark from an incomplete submission will stand.
  
 
Please get your deliverables done early, I will be quite resistant to grant extensions for reasons that have to do with your normal, expected workload. If you want to, you can submit all phases of your project at any earlier date you choose - and get it done with. Be especially mindful of your other courses, and their midterm tests.  
 
Please get your deliverables done early, I will be quite resistant to grant extensions for reasons that have to do with your normal, expected workload. If you want to, you can submit all phases of your project at any earlier date you choose - and get it done with. Be especially mindful of your other courses, and their midterm tests.  
Line 202: Line 205:
 
* fourth day: 0.1
 
* fourth day: 0.1
 
* fifth day and later: 0  
 
* fifth day and later: 0  
 +
-->
  
 
&nbsp;
 
&nbsp;

Revision as of 23:44, 24 November 2016

Bioinformatics Project: Defining a System

   

This course gives you a broad overview of bioinformatics principles, but you should also strive to apply those principles towards a biological question.

For your term project I would like you to define a biological "system" - a set of genes that collaborate towards a shared purpose. We start from a biological process, represented in the Gene Ontology (GO). From there we can use methods of function annotation to identify related processes, functions and cellular components, and the genes that are associated with them. Once such a list of genes is defined, the goal of the project is to define the "system" in which those genes collaborate, and to sketch its architecture. This requires both "bottom up" procedures of gene discovery and "top down" reasoning about ancillary functions such as substrate import, biosynthesis of cofactors, signalling, regulation, constructing scaffolds etc. that may not (yet) be represented in the list of genes. It also requires you to clearly identify concepts such as the system's purpose, its boundaries, its interfaces, and its input and output. The boundaries determine which genes from the list are actually part of the system, and which ones are associated with the process but should be considered to be outside of it: in a supporting role, in a shared role, or simply as part of a distinct but collaborating system. Membrane transporters might be an obvious example.

It is your task to manage this from the perspective of a biological expert and try to define inclusion/exclusion criteria as best as you can. While your "list of genes" is going to be interesting, compiling such lists can be automated. Thus the most valuable outcome of your project is how you address the task of defining the conceptual aspects of the system and organizing them into an architectural sketch.

In practice you should

  • choose a biological process you are interested in (I have provided a candidate list);
  • collect all contributing genes[1] as best you can, using bioinformatics tools and literature annotations;
  • develop unambiguous criteria for including or not including such genes in your system and annotating them;
  • list the conceptual roles in your system;
  • associate whatever genes you can with those roles and identify genes you were not able to associate with roles, and roles for which the associated genes are unknown;
  • carefully document your efforts and results: the data sources and literature, what procedures have been applied, how the results have been accessed, validated and interpreted...
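The bookkeeping behind the "roles" steps in the list above can be sketched in a few lines of Python. This is only an illustration under assumptions: the gene symbols (GENE_A etc.) are placeholders, the role names are borrowed from the examples given further down this page, and the function name `coverage` is made up for this sketch.

```python
# Hypothetical role -> genes assignments. Gene symbols are placeholders,
# not real annotations.
role_to_genes = {
    "integrating input": {"GENE_A", "GENE_B"},
    "regulating the process": {"GENE_C"},
    "providing resources": set(),  # a role for which no gene is known yet
}

# All genes collected for the process, whether or not they have a role.
candidate_genes = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}


def coverage(role_to_genes, candidate_genes):
    """Return (genes without a role, roles without a gene), both sorted."""
    assigned = set().union(*role_to_genes.values())
    unassigned_genes = sorted(candidate_genes - assigned)
    empty_roles = sorted(role for role, genes in role_to_genes.items() if not genes)
    return unassigned_genes, empty_roles


unassigned_genes, empty_roles = coverage(role_to_genes, candidate_genes)
print("genes without a role:", unassigned_genes)
print("roles without a gene:", empty_roles)
```

Both lists are worth reporting: a gene without a role may belong to an adjacent system, and a role without a gene points to something that is not (yet) annotated.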

Ideally, your system would be defined at a level where it comprises some 20 components or so, not more, to keep things manageable.



The project steps

First stage: process and genes (11 marks max.)

To define a system, we will start from a biological process in the GO biological process ontology. I have excerpted a table of processes to get you started, and explained the procedure in detail. You can find the documentation and the table through the links below.

Note that you are not constrained to start from a process in that table. If you are determined to work on a different human system because you have particular knowledge about it, you may suggest this to me and perhaps we can add it to the table.


More notes ...


Keep your systems manageable. When considering how many genes are associated with a system, check the taxon section of the relevant GO term's statistics on QuickGO. The number of genes involved in the process in humans is likely as large as the largest number for ANY species - although many of the human genes may not have been annotated for that process (yet). For example, if the mouse (Mus musculus) has 20 annotated genes and humans have only two, that probably does not mean humans can achieve with only two genes what the mouse needs twenty for. You will need to consider the genes individually...
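As a rough illustration of this heuristic, the sketch below takes per-species annotation counts for one GO term and uses the largest count as a first estimate of the system's size. The counts are made up for the example (they are not QuickGO data), and the names `expected_size` and `MANAGEABLE` are assumptions of this sketch.

```python
# Made-up annotation counts per species for a single GO term.
annotated_genes = {
    "Homo sapiens": 2,
    "Mus musculus": 20,
    "Danio rerio": 7,
}

MANAGEABLE = 20  # rough size target taken from the project description


def expected_size(counts):
    """Return the best-annotated species and its gene count."""
    species = max(counts, key=counts.get)
    return species, counts[species]


species, estimate = expected_size(annotated_genes)
human = annotated_genes["Homo sapiens"]

print(f"largest annotation set: {species} ({estimate} genes)")
if human < estimate:
    print(f"human has only {human} annotated genes: probably under-annotated")
if estimate > MANAGEABLE:
    print("consider narrowing the process down to a subsystem")
```

The estimate is only a starting point. Whether an individual gene from another species has a human counterpart with the same role still has to be checked gene by gene.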

Keep your systems simple. I would avoid choosing systems/processes that integrate sensory, nervous, hormonal and cellular components. This may become too complex. Narrowing it down to a manageable "subsystem" is a valuable exercise in itself. Such a system may implement roles such as

  • integrating input,
  • transmitting input signals to their effectors,
  • regulating the process,
  • providing resources,
  • defining setpoints,
  • assembling or disassembling the system,
  • mediating interactions with other systems,
  • or similar...


Spend some thought on naming your "system" well. For example a concept like immune response does not allude to why the system exists. I think naming the concept defense against pathogens captures the purpose better and this will help you organize the components.





 

Second stage: Compiling a list of genes (11 marks max.)

The second stage of the project is for you to detail the roles that your system needs to work, and to associate genes with roles.

On one hand, you need to figure out how your system comes into existence, how it accepts substrates and/or information, how it transforms this input and how its output is generated. Consider that whatever is switched on, needs to be switched off again. And think clearly about the ultimate point of the system: why is it being selected for in the first place. The Systems Roles Ontology may help you, and if it does not match your needs for your system, contact me and we will improve the ontology.

On the other hand, you need to collect genes that contribute to those roles. All tools of bioinformatics are fair game for this: finding homologs, looking for information in PubMed, looking for similarity in GO, querying pathway databases, assessing protein-protein interactions etc. You will probably amass a significant number of genes. But then it becomes important to draw the line: which genes are at the centre of your system, and which genes should really be part of something else. As you make these decisions and shape the boundaries, you should maintain an "in" and "out" list: genes that you keep in the system, genes that you declare as being outside, and a note on why you made each decision. The note is the most important part. At first, the goal is to describe the system, but the ultimate goal is to abstract the decision-making process and automate it.
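One possible way to keep such an "in" and "out" list is sketched below: every decision carries its rationale, and the list can be written out as tab-separated text that is easy to copy for further analysis. The class name `Decision`, the function `to_tsv`, the gene symbols and the rationales are all hypothetical, chosen only for illustration.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Decision:
    gene: str        # gene symbol (placeholder values below)
    included: bool   # True = inside the system boundary
    rationale: str   # why the decision was made

    def __post_init__(self):
        # The note on "why" is the most important part, so refuse empty ones.
        if not self.rationale.strip():
            raise ValueError(f"{self.gene}: a decision needs a rationale")


def to_tsv(decisions):
    """Render the decisions as tab-separated text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["gene", "status", "rationale"])
    for d in decisions:
        writer.writerow([d.gene, "in" if d.included else "out", d.rationale])
    return buf.getvalue()


decisions = [
    Decision("GENE_A", True, "core step of the process"),
    Decision("GENE_T", False, "membrane transporter shared with many other systems"),
]
table = to_tsv(decisions)
print(table)
```

Requiring a rationale at construction time is a design choice of this sketch. It makes it hard to add a gene to either list without also recording the criterion that was applied.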

Just like defining how to tie a tie.


 


More notes ...


Review your system concepts. ...


Add genes ...


Compare against the role ontology. ...


Identify "adjacent" systems. ...


Define your system's boundaries. ...



 

Final stage: Documentation (4 marks max.)

The documentation must fulfill two aspects.

  • First, your documentation must make your data and results reproducible. You need to specify the premises you started from and how you came up with them, and you need to specify the procedure through which you arrived at your conclusions. Put yourselves into the mind of a reviewer: are you providing enough information so that your (computational) steps can be reproduced? Are your source IDs specified? Your resources and programs?
  • Second, your documentation must explain the rationale behind your procedure and conclusions. This is not so much what you did but why you did this, what was the logic behind a certain process or decision.
  • Form is important:
      • structure your project clearly, include a brief introduction and definitely include a meaningful conclusion;
      • avoid jargon;
      • make it easy to copy data for further analysis (no screenshots unless you are illustrating a Web-site or GUI);
      • write complete sentences;
      • do not plagiarize, but reference judiciously;
      • make sure your references are complete and take advantage of the <ref> ... </ref> tags and the {{#pmid:1234567}} template.
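To make the reproducibility requirement concrete, here is a minimal sketch of a provenance record that could be kept for each retrieval or analysis step. The field names, the function `provenance` and the example values (including the GO identifier, which is a placeholder) are assumptions of this sketch, not a required format.

```python
import json
from datetime import date


def provenance(source, identifier, tool, parameters, accessed=None):
    """Collect what a reviewer would need to repeat one step."""
    return {
        "source": source,          # database or publication
        "identifier": identifier,  # source ID, e.g. a GO term or accession
        "tool": tool,              # program or web interface that was used
        "parameters": parameters,  # settings that affect the result
        "accessed": (accessed or date.today()).isoformat(),
    }


record = provenance(
    source="QuickGO",
    identifier="GO:0000000",  # placeholder, not a real term of interest
    tool="web interface",
    parameters={"taxon": "Homo sapiens"},
    accessed=date(2016, 11, 24),
)
print(json.dumps(record, indent=2))
```

A list of such records, pasted as text into the project page, covers the "what" of the documentation. The "why" still has to be written as prose.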

Ask(!) if you are not sure about Wiki markup or formatting to achieve a particular layout.


Due dates

 

The project (like all class work) is due by the end of classes, December 6, 2016. If you need an extension you must contact me at least a day before the deadline. Please state briefly the requested duration of the extension. The extension request should not extend past the final exam date.


 

Late submissions

The time of submission is recorded with your edits on the Wiki and can be identified in the View history tab of a page: I will consider the last edit before the submission deadline for marking. There will be no other "late deductions" applied.


 

Resources

  1. I speak of genes here in a very informal sense: the system components may include genes, their encoded proteins, structural and regulatory RNA, metabolites, and even environmental signals.