Difference between revisions of "FND-Biocomputing setup"

From "A B C"
 
(6 intermediate revisions by the same user not shown)
Line 1: Line 1:
<div id="BIO">
+
<div id="ABC">
  <div class="b1">
+
<div style="padding:5px; border:1px solid #000000; background-color:#b3dbce; font-size:300%; font-weight:400; color: #000000; width:100%;">
 
Computer Setup for Biocomputing
 
Computer Setup for Biocomputing
  </div>
+
<div style="padding:5px; margin-top:20px; margin-bottom:10px; background-color:#b3dbce; font-size:30%; font-weight:200; color: #000000; ">
 
+
(Paths, folders and files; Course Folder; Setup for biocomputing: Xcode, R and RStudio, python, homebrew, TeX ...)
  {{Vspace}}
+
</div>
 
 
<div class="keywords">
 
<b>Keywords:</b>&nbsp;
 
Paths, folders and files; Course Folder; Setup for biocomputing: Xcode, R and RStudio, python, homebrew, TeX ...
 
 
</div>
 
</div>
  
{{Vspace}}
+
{{Smallvspace}}
  
  
__TOC__
+
<div style="padding:5px; border:1px solid #000000; background-color:#b3dbce33; font-size:85%;">
 
+
<div style="font-size:118%;">
{{Vspace}}
+
<b>Abstract:</b><br />
 
 
 
 
{{LIVE}}
 
 
 
{{Vspace}}
 
 
 
 
 
</div>
 
<div id="ABC-unit-framework">
 
== Abstract ==
 
 
<section begin=abstract />
 
<section begin=abstract />
<!-- included from "./components/FND-Biocomputing_setup.components.txt", section: "abstract" -->
+
It takes a bit of effort to turn your laptop into an effective tool for biocomputing tasks. You need consistent principles for organizing files and folders, and you need tools to create, install, and deploy software. This unit introduces those concepts.
Some considerations are required to turn your laptop into an effective tool for biocomputing tasks. This includes consistent principles for organizing files and folders, and availability of tools to create, install, and deploy software. This unit introduces those concepts.
 
 
<section end=abstract />
 
<section end=abstract />
 
+
</div>
{{Vspace}}
+
<!-- ============================ -->
 
+
<hr>
 
+
<table>
== This unit ... ==
+
<tr>
=== Objectives ===
+
<td style="padding:10px;">
<!-- included from "./components/FND-Biocomputing_setup.components.txt", section: "objectives" -->
+
<b>Objectives:</b><br />
 
This unit will ...
 
This unit will ...
 
* ... inform you about file- and folder names and paths;
 
* ... inform you about file- and folder names and paths;
 
* ... outline a basic set of software tools that are useful.
 
* ... outline a basic set of software tools that are useful.
 
+
</td>
{{Vspace}}
+
<td style="padding:10px;">
 
+
<b>Outcomes:</b><br />
 
 
=== Outcomes ===
 
<!-- included from "./components/FND-Biocomputing_setup.components.txt", section: "outcomes" -->
 
 
After working through this unit you ...
 
After working through this unit you ...
 
* ... can correctly identify (and write) file names with extensions, and file paths, on your computer;
 
* ... can correctly identify (and write) file names with extensions, and file paths, on your computer;
 
* ... have created a Course Folder for this course or workshop on your computer;
 
* ... have created a Course Folder for this course or workshop on your computer;
 
* ... are able to further configure your computer for biocomputing tasks.
 
* ... are able to further configure your computer for biocomputing tasks.
 +
</td>
 +
</tr>
 +
</table>
 +
<!-- ============================  -->
 +
<hr>
 +
<b>Deliverables:</b><br />
 +
<section begin=deliverables />
 +
<li><b>Time management</b>: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.</li>
 +
<li><b>Journal</b>: Document your progress in your [[FND-Journal|Course Journal]]. Some tasks may ask you to include specific items in your journal. Don't overlook these.</li>
 +
<li><b>Insights</b>: If you find something particularly noteworthy about this unit, make a note in your [[ABC-Insights|'''insights!''' page]].</li>
 +
<section end=deliverables />
 +
<!-- ============================  -->
 +
</div>
  
{{Vspace}}
+
{{Smallvspace}}
 +
 
 +
 
 +
 
 +
{{Smallvspace}}
  
  
=== Deliverables ===
+
__TOC__
<!-- included from "./components/FND-Biocomputing_setup.components.txt", section: "deliverables" -->
 
<!-- included from "./data/ABC-unit_components.txt", section: "deliverables-time_management" -->
 
*<b>Time management</b>: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
 
<!-- included from "./data/ABC-unit_components.txt", section: "deliverables-journal" -->
 
*<b>Journal</b>: Document your progress in your [[FND-Journal|Course Journal]]. Some tasks may ask you to include specific items in your journal. Don't overlook these.
 
<!-- included from "./data/ABC-unit_components.txt", section: "deliverables-insights" -->
 
*<b>Insights</b>: If you find something particularly noteworthy about this unit, make a note in your [[ABC-Insights|'''insights!''' page]].
 
  
 
{{Vspace}}
 
{{Vspace}}
  
  
</div>
+
=== Evaluation ===
<div id="BIO">
+
<b>Evaluation: NA</b><br />
 +
<div style="margin-left: 2rem;">This unit is not evaluated for course marks.</div>
 
== Contents ==
 
== Contents ==
<!-- included from "./components/FND-Biocomputing_setup.components.txt", section: "contents" -->
 
  
 
==Creating and consuming==
 
==Creating and consuming==
  
Over the last decade the central paradigm of how we work with computing devices has changed for a large fraction of our day to day activities. Our previous, open and effective data-centric model of computation - i.e. storing data and then transforming it through various interoperable tools - has largely been replaced by a landscape where the software ecosystems consists of apps that lock down the data, aim to own the user experience, and thus find ways to monetize it. We are now in an application-centric era of computing. That is in itself not suprising - after all, much of this functionality is free, and you must realize that if you use free tools, you are not the customer, you are the product. However, this development is <b>not</b> helpful for scientific computing and you have to make some effort to go beyond these constraints that are designed to make users convenient consumers of data and services.
+
You might not have noticed that the way we work day to day with computers has changed dramatically since 2010. This was a slow process, but previously our model of computation was open, and '''data-centric'''. We collected data, we stored it, we transformed it through various tools that were (hopefully) interoperable, using shared and open formats. But in our current landscape the software ecosystem consists of apps that lock down the data, that aim to own the user experience - and thus find ways to monetize it. Our systems are not designed to be open and transparent. The data-centric age of computing is past; we are now in an '''application-centric''' era of computing. That is not surprising - after all, much of this functionality is free, and you must realize that if you use free tools, you are not the customer, you are the product. However, this development is <b>not</b> helpful for scientific computing and therefore you have to make some effort to go beyond the constraints that you have accepted on your mobile phones, keeping your data in the cloud, having everything integrated under one banner - everything that is well designed to make users convenient consumers of data and services.
  
 
You need to become creators instead.
 
You need to become creators instead.
Line 88: Line 80:
 
Organizing files means giving them good names and placing them into folders where they are easy to find.
 
Organizing files means giving them good names and placing them into folders where they are easy to find.
  
A '''filename''' is a label that identifies a file. Often filenames have two parts: the actual name, and an extension. To specify a file on the computer's command line, or when working with '''R''', you need to specify its full name <u>including the extension</u>. Now, the problem is that you can switch off the display of extensions in Windows; I'm afraid this is actually done by default. This means you don't see what the file is <b>actually</b> called, lest you be frightened by a <code>.jpg</code> or a <code>.mp4</code> suffix to the name. But then all hell breaks lose when you are trying to do "real" work. Files can't be found, or worse, can be inadvertently overwritten. '''Never allow your operating system to hide file extensions from you.''' You must be able to see the full name<ref>RStudio is actually very helpful in this regard, since it always shows you the full name of your file in its file-pane, and it always also shows you the "hidden" files that your operating system does not show to you, lest they hurt our little brains.</ref>.
+
A '''filename''' is a label that identifies a file. Often filenames have two parts: the actual name, and an extension. To specify a file on the computer's command line, or when working with '''R''', you need to specify its full name <u>including the extension</u>. Now, the problem is that you can switch off the display of extensions in Windows; I'm afraid this is actually done by default. This means you don't see what the file is <b>actually</b> called, lest you be frightened by a <code>.jpg</code> or a <code>.mp4</code> suffix to the name. But then all hell breaks loose when you are trying to do "real" work. Files can't be found, or worse, can be inadvertently overwritten. '''Never allow your operating system to hide file extensions from you.''' You must be able to see the full name<ref>RStudio is actually very helpful in this regard, since it always shows you the full name of your file in its file-pane, and it also always shows you the "hidden" files that your operating system does not show to you (by default), lest they hurt our little brains. (Or, to be fair, to help prevent the brash from editing files they don't understand, or deleting files they don't recognize.)</ref>.
  
A '''path''' is the complete specification of where a file is located in the hierarchically organized directory tree of your computer. Paths are simply directories strung together into a long string, separated by a forward slash "<code>/</code>" (on Mac or Unix) or a backslash "<code>\</code>" on Windows.
+
A '''path''' is the complete specification of where a file is located in the hierarchically organized directory tree of your computer. Paths are simply directory names strung together into a long string, separated by a forward slash "<code>/</code>" (on Mac or Unix) or a backslash "<code>\</code>" on Windows.
  
 
;Folder name and path examples
 
;Folder name and path examples
Line 107: Line 99:
 
In my experience, it is better to organize file hierarchies <b>wide, not deep</b>. This means I aim to put more things in one folder rather than create elaborate directory structures. I need to look for stuff a lot, and looking more-or-less in the same folder keeps my files more visible. Files that are tucked away in sub-directories are harder to find. And to avoid having very, very, very many subdirectories in one place, you should consider adding a <code>99-Archive</code> folder (the <code>99-...</code> prefix keeps it sorted at the bottom of the directory listing), and move directories that you keep only for reference into there.
 
In my experience, it is better to organize file hierarchies <b>wide, not deep</b>. This means I aim to put more things in one folder rather than create elaborate directory structures. I need to look for stuff a lot, and looking more-or-less in the same folder keeps my files more visible. Files that are tucked away in sub-directories are harder to find. And to avoid having very, very, very many subdirectories in one place, you should consider adding a <code>99-Archive</code> folder (the <code>99-...</code> prefix keeps it sorted at the bottom of the directory listing), and move directories that you keep only for reference into there.
  
One more thing, one golden rule that you should make every effort to adhere to: <b>don't store the same contents in more than one place</b>. In the best case this is merely unnecessarily needlessly redundant, but in the common worst case the two copies will go out of sync. If you need to have a file in two different folder, keep it in only one folder and put an "alias" into the other. On the Mac, you select a file or folder and <code>&lt;option&gt;&lt;command&gt;&lt;drag&gt;</code> it to a new location to create an alias, or hit <code>&lt;command&gt;L</code>. On Windows - ???.
+
One more thing, a golden rule that you should make every effort to adhere to: <b>don't store the same contents in more than one place</b>. In the best case this is merely unnecessarily needlessly redundant, but in the common worst case the two copies will go out of sync. If you need to have a file in two different folders, keep the data in one folder and put an "alias" into the other. On the Mac, you select a file or folder and <code>&lt;option&gt;&lt;command&gt;&lt;drag&gt;</code> it to a new location to create an alias, or hit <code>&lt;command&gt;L</code>. On Windows - ??? <small>I think it's something you do in the "explorer" menu - but perhaps someone can educate me?</small>
  
 
{{Vspace}}
 
{{Vspace}}
Line 118: Line 110:
 
Create a folder (directory) on your computer in which to keep materials for this course (or workshop). Put it into the right place, and give it the right name:
 
Create a folder (directory) on your computer in which to keep materials for this course (or workshop). Put it into the right place, and give it the right name:
  
:'''The right place''' is directly in the <code>Documents</code> folder of your account.
+
:'''The right place''' is directly in the <code>Documents</code> folder of your account (or user directory).
  
:'''The right name''' is simply the <code>&lt;Coursecode&gt;</code> e.g. for a CBW workshop in 2016, you call the folder <code>CBW</code>, for a BCH441 course, the name should be <code>BCH441</code>, or you could use my generic <code>ABC</code> (A Bioinformatics Course). Keep it short, but specific.
+
:'''The right name''' is simply the <code>&lt;Coursecode&gt;</code> e.g. for a CBW workshop, you call the folder <code>CBW</code>, for a BCH441 course, the name should be <code>BCH441</code>, or you could use my generic <code>ABC</code> (A Bioinformatics Course). Keep it short, but specific.
  
 
<b>Do not use spaces, hyphens, or any other special characters in your filename</b> - we have encountered various problems with such filenames in the past.<ref>'''After''' the course, you can rename / move the directory to whatever, wherever you want, but during the course, we need your files in a predictable location to be able to troubleshoot problems.</ref>
 
<b>Do not use spaces, hyphens, or any other special characters in your filename</b> - we have encountered various problems with such filenames in the past.<ref>'''After''' the course, you can rename / move the directory to whatever, wherever you want, but during the course, we need your files in a predictable location to be able to troubleshoot problems.</ref>
Line 138: Line 130:
 
{{Smallvspace}}
 
{{Smallvspace}}
 
;A commandline interface
 
;A commandline interface
:While graphical user interfaces (GUI) are very helpful for interactive work, they (generally) can't be scripted and thus are an obstacle to high-throughput- and repetitive tasks. A commandline interface allows more expressive commands and easy scripting. For Mac and Linux users, your systems have terminal applications that make the underlying unix commands available. On the Mac, find "Terminal in your Applications/Utilities folder and put it in the dock. For windows users there is <code>cmd.exe</code> - but the command set is very different and that requires you to learn yet another language. The better solution is to instal {{WP|Cygwin}}, which creates a unix shell that interacts with the Windows operating system.
+
:While graphical user interfaces (GUI) are very helpful for interactive work, they (generally) can't be scripted and thus are an obstacle to high-throughput and repetitive tasks. A commandline interface allows more expressive commands and easy scripting<ref>But note that an R script that executes <code>system()</code> commands can substitute for a commandline script - and may save you from having to learn another language. Now: "real" programmers should not be intimidated by commandline scripts. And that's true. But checking results, handling error conditions, and testing the validity of your script may be significantly easier, more explicit, and more maintainable from a higher language. Just as a side note - there seems to be a current trend to use the unix <code>make</code> utility for organizing computational workflows. That's just ''wrong'' in so many ways ...</ref>. For Mac and Linux users, your systems have terminal applications that make the underlying unix commands available. On the Mac, find "Terminal" in your Applications/Utilities folder and put it in the dock. For Windows users there is <code>cmd.exe</code> - but the command set is very different and that requires you to learn yet another language. The better solution is to install {{WP|Cygwin}}, which creates a unix shell that interacts with the Windows operating system.
 
{{Smallvspace}}
 
{{Smallvspace}}
 
;A package manager
 
;A package manager
:Complex software uses other software, such as libraries for graphics, numerical methods, or security and  such libraries are "dependencies" of the code. Often dependencies need a specific version to work with a particular software. Especially when code needs to be compiled from source-code because it is not available as a pre-compiled bundel, dependencies need to be updated in very specific ways. Package managers to the rescue. These programs have validated recipes on how to install, maintain and update  software. On the mac, the go-to system is {{WP|Homebrew}}. On Linux, this depends on which windows version you are running. On Windows I have heard of "chocolatey" and "scoop"; can't give a recommendation though.
+
:Complex software uses other software, such as libraries for graphics, numerical methods, or security; such libraries are "dependencies" of the code. Often dependencies need a specific version to work with a particular piece of software. Especially when code needs to be compiled from source because it is not available as a pre-compiled bundle, dependencies need to be updated in very specific ways. Package managers to the rescue. These programs have validated recipes on how to install, maintain and update software. On the Mac, the go-to system is {{WP|Homebrew}}. I've had excellent experience with Homebrew - It. Just. Works. On Linux, your package manager depends on which flavour of Linux you are running - but you'll know, because all your installs are done through the manager anyway. On Windows I have heard of "chocolatey" and "scoop"; I can't give a recommendation though.
 
{{Smallvspace}}
 
{{Smallvspace}}
 
;A version control system
 
;A version control system
:Creating assets for reproducible research requires version control, i.e. the ability to document when what change was made, and to return to previous versions if necessary. The current go-to tool for this is {{WP|Git}}, which is especially useful since it interfaces with {{WP|GitHub}}
+
:Creating assets for reproducible research requires version control, i.e. the ability to document when what change was made, and to return to previous versions if necessary. The current go-to tool for this is {{WP|Git}}, which is especially useful since it interfaces with {{WP|GitHub}}. You need to install <code>git</code> for our course - details later.
 
{{Smallvspace}}
 
{{Smallvspace}}
 
;Programming languages
 
;Programming languages
Line 150: Line 142:
 
{{Smallvspace}}
 
{{Smallvspace}}
 
;LaTeX
 
;LaTeX
In engineering and computer science, you practically grow up with it. Conference abstracts are submitted in LaTex, papers are written in LaTex, most likely your thesis will be submitted in LaTex - it's everywhere. But in the life sciences you can go through your entire career without having heard of LaTex. {{WP|LaTeX|'''LaTex'''}} is a document preparation and layout system that is very powerful, very flexible - and, boy, does it have a learning curve. Fortunately you don't need to know LaTeX to use LaTeX, at least for basic tasks - there are now pretty decent WYSIWYG editors, and many applications have some wrapper function that uses LaTeX as its backend - for example to produce PDF documents. Some of the functions provided by R work that way. Thus, installing LaTeX on your system and is a good thing to have - even though you can get by without for most analysis tasks, you will sooner or later need it when it comes to publication quality plots. However, installation is not only platform dependent, but depends on the version of your OS, and sometimes the level - since most of the time LaTeX is used by other programs, the installers need to get the path just right, set some environment variables, etc. You should be fine with the instructions at [https://www.latex-project.org/get/ ''' the LaTeX project'''] but don't do this from home if you have a bandwidth cap - a full installation runs to 2.5 GB (though there are smaller basic installs available). Linux users are best served through one of the standard package managers, and on the Mac there is a homebrew "cask" but folklore has it that the direct installation directly is more robust<ref>See here: https://tex.stackexchange.com/questions/307483/setting-up-basictex-homebrew - but keep in mind that people who post on the TeX forum at stackexchange may have a different requirements profile than you do.</ref>.
+
In engineering and computer science, you practically grow up with it. Conference abstracts are submitted in LaTeX, papers are written in LaTeX, most likely your thesis will be submitted in LaTeX - it's everywhere. But in the life sciences you can go through your entire career without having heard of LaTeX. {{WP|LaTeX|'''LaTeX'''}} is a document preparation and layout system that is very powerful, very flexible - and, boy, does it have a learning curve. Fortunately you don't need to know LaTeX to use LaTeX, at least for basic tasks - there are now pretty decent WYSIWYG<ref>Graphical layout editors, as opposed to code editors: '''W'''hat '''Y'''ou '''S'''ee '''I'''s '''W'''hat '''Y'''ou '''G'''et.</ref> editors, and many applications have some wrapper function that uses LaTeX as its backend - for example to produce PDF documents. Some of the functions provided by '''R''' work that way. Thus, install LaTeX on your system; it's a good thing to have. Even though you can get by without it for most analysis tasks, you will sooner or later need it when it comes to publication quality plots. However, installation is not only platform dependent, but depends on the version of your OS, and sometimes the level - since most of the time LaTeX is used by other programs, the installers need to get the path just right, set some environment variables, etc. You should be fine with the instructions at [https://www.latex-project.org/get/ ''' the LaTeX project'''] but don't do this from home if you have a bandwidth cap - a full installation runs to 2.5 GB (though there are smaller basic installs available). 
Linux users are best served through one of the standard package managers, and on the Mac there is a homebrew "cask" but folklore has it that the direct installation is more robust<ref>See here: https://tex.stackexchange.com/questions/307483/setting-up-basictex-homebrew - however, keep in mind that people who post on the TeX forum at stackexchange may have a different requirements profile than you do.</ref>.
  
 
{{Vspace}}
 
{{Vspace}}
Line 162: Line 154:
 
===Windows===
 
===Windows===
  
I don't know about a good, current guide for setting up Windows computers for biocomputing. If you have specific experience and advice, let's collect it here.
+
I don't know of a good, current guide for setting up Windows computers for biocomputing. If you have specific experience and advice, please let me know so I can post it here.
  
  
Line 172: Line 164:
  
 
{{Vspace}}
 
{{Vspace}}
 
 
{{Vspace}}
 
 
 
== Further reading, links and resources ==
 
<!-- {{#pmid: 19957275}} -->
 
<!-- {{WWW|WWW_GMOD}} -->
 
<!-- <div class="reference-box">[http://www.ncbi.nlm.nih.gov]</div> -->
 
 
{{Vspace}}
 
 
  
 
== Notes ==
 
== Notes ==
<!-- included from "./components/FND-Biocomputing_setup.components.txt", section: "notes" -->
 
<!-- included from "./data/ABC-unit_components.txt", section: "notes" -->
 
 
<references />
 
<references />
  
 
{{Vspace}}
 
{{Vspace}}
  
 
</div>
 
<div id="ABC-unit-framework">
 
== Self-evaluation ==
 
<!-- included from "./components/FND-Biocomputing_setup.components.txt", section: "self-evaluation" -->
 
<!--
 
=== Question 1===
 
 
Question ...
 
 
<div class="toccolours mw-collapsible mw-collapsed" style="width:800px">
 
Answer ...
 
<div class="mw-collapsible-content">
 
Answer ...
 
 
</div>
 
  </div>
 
 
  {{Vspace}}
 
 
-->
 
 
{{Vspace}}
 
 
 
 
{{Vspace}}
 
 
 
<!-- included from "./data/ABC-unit_components.txt", section: "ABC-unit_ask" -->
 
 
----
 
 
{{Vspace}}
 
 
<b>If in doubt, ask!</b> If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.
 
 
----
 
 
{{Vspace}}
 
  
 
<div class="about">
 
<div class="about">
Line 241: Line 179:
 
:2017-08-05
 
:2017-08-05
 
<b>Modified:</b><br />
 
<b>Modified:</b><br />
:2017-09-09
+
:2018-05-02
 
<b>Version:</b><br />
 
<b>Version:</b><br />
:1.1
+
:1.1.1
 
<b>Version history:</b><br />
 
<b>Version history:</b><br />
 +
*1.1.1 Maintenance
 
*1.1 Add note on LaTeX
 
*1.1 Add note on LaTeX
 
*1.0 Completed to first live version
 
*1.0 Completed to first live version
 
*0.1 Material collected from previous assignments
 
*0.1 Material collected from previous assignments
 
</div>
 
</div>
[[Category:ABC-units]]
 
<!-- included from "./data/ABC-unit_components.txt", section: "ABC-unit_footer" -->
 
  
 
{{CC-BY}}
 
{{CC-BY}}
  
 +
[[Category:ABC-units]]
 +
{{UNIT}}
 +
{{LIVE}}
 
</div>
 
</div>
 
<!-- [END] -->
 
<!-- [END] -->

Latest revision as of 09:27, 25 September 2020

Computer Setup for Biocomputing

(Paths, folders and files; Course Folder; Setup for biocomputing: Xcode, R and RStudio, python, homebrew, TeX ...)


 


Abstract:

It takes a bit of effort to turn your laptop into an effective tool for biocomputing tasks. You need consistent principles for organizing files and folders, and you need tools to create, install, and deploy software. This unit introduces those concepts.


Objectives:
This unit will ...

  • ... inform you about file- and folder names and paths;
  • ... outline a basic set of software tools that are useful.

Outcomes:
After working through this unit you ...

  • ... can correctly identify (and write) file names with extensions, and file paths, on your computer;
  • ... have created a Course Folder for this course or workshop on your computer;
  • ... are able to further configure your computer for biocomputing tasks.

Deliverables:

  • Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
  • Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.
  • Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.




     



     


    Evaluation

    Evaluation: NA

    This unit is not evaluated for course marks.

    Contents

    Creating and consuming

    You might not have noticed that the way we work day to day with computers has changed dramatically since 2010. This was a slow process, but previously our model of computation was open, and data-centric. We collected data, we stored it, we transformed it through various tools that were (hopefully) interoperable, using shared and open formats. But in our current landscape the software ecosystem consists of apps that lock down the data, that aim to own the user experience - and thus find ways to monetize it. Our systems are not designed to be open and transparent. The data-centric age of computing is past; we are now in an application-centric era of computing. That is not surprising - after all, much of this functionality is free, and you must realize that if you use free tools, you are not the customer, you are the product. However, this development is not helpful for scientific computing and therefore you have to make some effort to go beyond the constraints that you have accepted on your mobile phones, keeping your data in the cloud, having everything integrated under one banner - everything that is well designed to make users convenient consumers of data and services.

    You need to become creators instead.

    You need to take control of how your computer is organized, and how you work with it.


     

    Paths, Folders and Files

    The first step of organizing your computer is to obtain a good awareness of files, folders (or directories) that contain them, and how files are identified. A file is some stored information that is uniquely identified in a catalogue called a "filesystem". Files can be data, like documents, music or movies, files can be applications (yes, computer programs are also data - a set of machine-level instructions), and your computer can also have subsystems that look like files, because one can read from them and write to them - but they actually handle the I/O (input / output) of your computer: things like keyboards and printers, or sockets for communicating with other computers.

    Organizing files means giving them good names and placing them into folders where they are easy to find.

    A filename is a label that identifies a file. Often filenames have two parts: the actual name, and an extension. To specify a file on the computer's command line, or when working with R, you need to specify its full name including the extension. Now, the problem is that you can switch off the display of extensions in Windows; I'm afraid this is actually done by default. This means you don't see what the file is actually called, lest you be frightened by a .jpg or a .mp4 suffix to the name. But then all hell breaks loose when you are trying to do "real" work. Files can't be found, or worse, can be inadvertently overwritten. Never allow your operating system to hide file extensions from you. You must be able to see the full name[1].

    A path is the complete specification of where a file is located in the hierarchically organized directory tree of your computer. Paths are simply directory names strung together into a long string, separated by a forward slash "/" (on Mac or Unix) or a backslash "\" on Windows.

    Folder name and path examples
    • /Users/Pierette/Documents/BCB420  ◁ Looking good on a Mac or a Linux system.
    • C:\Users\Pulcinella\Documents\CBW  ◁ Looking good on a Windows computer.

    The "top level directory" is the letter of the drive followed by ":\" on Windows computers, and a simple forward-slash "/" on Mac and Unix computers. All other directories are "sub-directories". Note that you can't tell from a directory listing alone whether e.g. "Users" is a directory or a file. The operating system will usually identify this with an icon, and R has different commands to differentiate the two[2].
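    To make the anatomy of a path concrete, here is a minimal sketch in Python (one of the languages in the setup list; the same idea applies in R). It takes the two example paths from the list above apart using the standard pathlib module. The file name myData-v04.2.csv is a hypothetical example and not part of the course materials.

```python
from pathlib import PurePosixPath, PureWindowsPath

# The two example paths from the text, each parsed with the matching path flavour.
# "Pure" paths only manipulate the string; they never touch the file system.
mac_path = PurePosixPath("/Users/Pierette/Documents/BCB420")
win_path = PureWindowsPath(r"C:\Users\Pulcinella\Documents\CBW")


def describe(path):
    """Return the pieces of a path that the text distinguishes."""
    return {
        "top_level": path.anchor,        # "/" on Mac/Unix, drive letter plus ":\" on Windows
        "directories": path.parts[1:],   # the names that are strung together by the separator
        "name": path.name,               # the last component, including any extension
        "extension": path.suffix,        # empty string if there is no extension
    }


mac_info = describe(mac_path)
win_info = describe(win_path)

# A hypothetical data file inside the course folder: the full name includes the extension.
data_file = mac_path / "myData-v04.2.csv"

print(mac_info)
print(win_info)
print(data_file.name, data_file.suffix)
```

    Note that nothing in the string itself says whether BCB420 is a folder or a file; to find out, you have to ask the file system, for example with Path.is_dir() on a path that actually exists on your computer.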



     

    It's really useful to get into a consistent habit of giving your files a meaningful name. The name should include something that tells you what the file contains, and something that tells you the date or version. I give versions major and minor numbers, and - knowing how much things always change - I write major version numbers with a leading zero, e.g. 04, so that they will be correctly sorted by name in a directory listing. The same goes for dates: always write YYYY-MM-DD to ensure proper sorting.
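    To see why the leading zero matters, here is a small Python sketch. The naming pattern (underscores as separators, a "v" before the version) and the helper versioned_name() are assumptions made for this illustration, not a prescribed convention.

```python
from datetime import date

def versioned_name(stem: str, major: int, minor: int, ext: str, day: date) -> str:
    """Build a file name that carries a date and a major.minor version."""
    # :02d pads the major version with a leading zero; isoformat() gives YYYY-MM-DD.
    return f"{stem}_{day.isoformat()}_v{major:02d}.{minor}.{ext}"

padded = [f"report_v{n:02d}.txt" for n in (10, 4, 9)]
unpadded = [f"report_v{n}.txt" for n in (10, 4, 9)]

print(sorted(padded))    # sorted by name, versions 04, 09, 10 come out in numerical order
print(sorted(unpadded))  # without padding, version 10 sorts before 4 and 9
print(versioned_name("alignment", 4, 1, "txt", date(2018, 5, 2)))
```

    Sorting by name compares character by character, which is why "10" lands before "4" unless the shorter number is padded.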

    In my experience, it is better to organize file hierarchies wide, not deep. This means I aim to put more things in one folder rather than create elaborate directory structures. I need to look for stuff a lot, and looking more-or-less in the same folder keeps my files more visible. Files that are tucked away in sub-directories are harder to find. And to avoid having very, very, very many subdirectories in one place, you should consider adding a 99-Archive folder (the 99-... prefix keeps it sorted at the bottom of the directory listing), and move directories that you keep only for reference into there.

    One more thing: a golden rule that you should make every effort to adhere to: don't store the same contents in more than one place. In the best case this is merely unnecessarily needlessly redundant, but in the common worse case the two copies will go out of sync. If you need to have a file in two different folders, keep the data in one folder and put an "alias" into the other. On the Mac, you select a file or folder and <option><command><drag> it to a new location to create an alias, or hit <command>L. On Windows - ??? I think it's something you do in the "explorer" menu - but perhaps someone can educate me?
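    The "one copy, many references" idea can also be sketched in code. The Python example below uses a symbolic link, which is a Unix mechanism related to, but not identical with, a Finder alias; on Windows, creating symbolic links may need Developer Mode or elevated privileges. The folder and file names are hypothetical.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    data_dir = root / "data"
    project_dir = root / "project"
    data_dir.mkdir()
    project_dir.mkdir()

    # The one real copy of the data.
    original = data_dir / "sequences.txt"
    original.write_text("ACGT\n")

    # A symbolic link in the second folder points at the original.
    link = project_dir / "sequences.txt"
    link.symlink_to(original)

    # Update the original; the link shows the new contents because no second copy exists.
    original.write_text("ACGTACGT\n")
    link_is_symlink = link.is_symlink()
    seen_through_link = link.read_text()
    print(link_is_symlink, repr(seen_through_link))
```

    Because there is only one file, there is nothing that could go out of sync.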


     

    Course Folder

    Files for this course (or workshop) should all be in one "Course Folder".

    Task:
    Create a folder (directory) on your computer in which to keep materials for this course (or workshop). Put it into the right place, and give it the right name:

    The right place is directly in the Documents folder of your account (or user directory).
    The right name is simply the <Coursecode>: e.g. for a CBW workshop, you call the folder CBW; for a BCH441 course, the name should be BCH441; or you could use my generic ABC (A Bioinformatics Course). Keep it short, but specific.

    Do not use spaces, hyphens, or any other special characters in your folder name - we have encountered various problems with such names in the past[3].


     

    We will refer to this folder as the Course Folder. (I use the words "folder" and "directory" synonymously and completely interchangeably.)
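    If you prefer to create the folder from code, the task above could look roughly like this in Python. This is a sketch under assumptions: course_folder() is a made-up helper, and "no special characters" is interpreted here as "ASCII letters and digits only", which is stricter than the text requires.

```python
import re
import tempfile
from pathlib import Path

# Assumption: "no special characters" is read here as ASCII letters and digits only.
VALID_NAME = re.compile(r"[A-Za-z0-9]+")

def course_folder(course_code: str, documents: Path) -> Path:
    """Create <documents>/<course_code> if needed and return its path."""
    if not VALID_NAME.fullmatch(course_code):
        raise ValueError(f"unsuitable folder name: {course_code!r}")
    folder = documents / course_code
    folder.mkdir(parents=True, exist_ok=True)
    return folder

# On your own machine you would pass Path.home() / "Documents";
# a temporary directory stands in for it here.
with tempfile.TemporaryDirectory() as tmp:
    created = course_folder("BCH441", Path(tmp))
    created_ok = created.is_dir() and created.name == "BCH441"
    print(created.name, created_ok)

# Names with a space or a hyphen are rejected before anything is created.
for bad in ("BCH 441", "BCH-441"):
    try:
        course_folder(bad, Path("."))
    except ValueError as err:
        print(err)
```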


     

    Biocomputing tools

     

    There are a number of tools you will commonly find on a professionally configured computer:

     
    A commandline interface
    While graphical user interfaces (GUI) are very helpful for interactive work, they (generally) can't be scripted and thus are an obstacle to high-throughput and repetitive tasks. A commandline interface allows more expressive commands and easy scripting[4]. For Mac and Linux users, your systems have terminal applications that make the underlying unix commands available. On the Mac, find "Terminal" in your Applications/Utilities folder and put it in the dock. For Windows users there is cmd.exe - but the command set is very different and that requires you to learn yet another language. The better solution is to install Cygwin, which creates a unix shell that interacts with the Windows operating system.
     
    A package manager
    Complex software uses other software, such as libraries for graphics, numerical methods, or security, and such libraries are "dependencies" of the code. Often dependencies need a specific version to work with a particular software. Especially when code needs to be compiled from source because it is not available as a pre-compiled bundle, dependencies need to be updated in very specific ways. Package managers to the rescue. These programs have validated recipes on how to install, maintain and update software. On the Mac, the go-to system is Homebrew. I've had excellent experience with Homebrew - It. Just. Works. On Linux, your package manager depends on which flavour of Linux you are running - but you'll know, because all your installs are done through the manager anyway. On Windows I have heard of "Chocolatey" and "Scoop"; I can't give a recommendation though.
     
    A version control system
    Creating assets for reproducible research requires version control, i.e. the ability to document when what change was made, and to return to previous versions if necessary. The current go-to tool for this is Git, which is especially useful since it interfaces with GitHub. You need to install git for our course - details later.
     
    Programming languages
    Besides R and RStudio, you need to be able to compile C and C++ code, you will need a working version of Java, and you should have a recent Python installation and an IDE to write code.
     
    LaTeX

    In engineering and computer science, you practically grow up with it. Conference abstracts are submitted in LaTeX, papers are written in LaTeX, most likely your thesis will be submitted in LaTeX - it's everywhere. But in the life sciences you can go through your entire career without having heard of LaTeX. LaTeX is a document preparation and layout system that is very powerful, very flexible - and, boy, does it have a learning curve. Fortunately you don't need to know LaTeX to use LaTeX, at least for basic tasks - there are now pretty decent WYSIWYG[5] editors, and many applications have some wrapper function that uses LaTeX as its backend - for example to produce PDF documents. Some of the functions provided by R work that way. Thus, install LaTeX on your system, it's a good thing to have - even though you can get by without it for most analysis tasks, you will sooner or later need it when it comes to publication quality plots. However, installation is not only platform dependent, but depends on the version of your OS, and sometimes the level - since most of the time LaTeX is used by other programs, the installers need to get the path just right, set some environment variables, etc. You should be fine with the instructions at the LaTeX project but don't do this from home if you have a bandwidth cap - a full installation runs to 2.5 GB (though there are smaller basic installs available). Linux users are best served through one of the standard package managers, and on the Mac there is a Homebrew "cask" but folklore has it that the direct installation is more robust[6].
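    A quick way to see which of the tools listed above are already available is to ask whether their commands can be found on your PATH. The following Python sketch does that with the standard library's shutil.which(). The command names in TOOLS are assumptions: they vary between systems (for example "python" vs. "python3", or "gcc" vs. "clang"), and a tool can be installed without being on the PATH.

```python
import shutil

# Assumed command names; adjust the list to your own setup.
TOOLS = ["git", "R", "gcc", "java", "python3", "pdflatex", "brew"]

def find_tools(names):
    """Map each command name to its full path, or None if it is not on the PATH."""
    return {name: shutil.which(name) for name in names}

report = find_tools(TOOLS)
for name, location in report.items():
    print(f"{name:10s} {location or 'not found'}")
```

    A "not found" here only means that the command is not visible from the current environment; it is a hint about what to install or configure next, not proof that the software is missing.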


     

    OS X

    A useful guide to configuring your system is posted here.


     

    Windows

    I don't know about a good, current guide for setting up Windows computers for biocomputing. If you have specific experience and advice, please let me know so I can post it here.


     

    Linux

    I mention Linux rarely because I find that people who work on a Linux platform already know what they are doing. Yay. Ask, if you have specific questions.


     

    Notes

    1. RStudio is actually very helpful in this regard, since it always shows you the full name of your file in its file-pane, and it always also shows you the "hidden" files that your operating system does not show to you (by default), lest they hurt our little brains. (Or, to be fair, to help prevent the brash from editing files they don't understand, or deleting files they don't recognize.)
    2. list.dirs() and list.files().
    3. After the course, you can rename / move the directory to whatever, wherever you want, but during the course, we need your files in a predictable location to be able to troubleshoot problems.
    4. But note that an R script that executes system() commands can substitute for a commandline script - and may save you from having to learn another language. Now: "real" programmers should not be intimidated by commandline scripts. And that's true. But checking results, handling error conditions, and testing the validity of your script may be significantly easier, more explicit, and more maintainable from a higher language. Just as a side note - there seems to be a current trend to use the unix make utility for organizing computational workflows. That's just wrong in so many ways ...
    5. Graphical layout editors, as opposed to code editors: What You See Is What You Get.
    6. See here: https://tex.stackexchange.com/questions/307483/setting-up-basictex-homebrew - however, keep in mind that people who post on the TeX forum at stackexchange may have a different requirements profile than you do.


     


    About ...
     
    Author:

    Boris Steipe <boris.steipe@utoronto.ca>

    Created:

    2017-08-05

    Modified:

    2018-05-02

    Version:

    1.1.1

    Version history:

    • 1.1.1 Maintenance
    • 1.1 Add note on LaTeX
    • 1.0 Completed to first live version
    • 0.1 Material collected from previous assignments

    This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.