FND-Biocomputing setup

Revision as of 18:38, 9 September 2017 by Boris (talk | contribs)

Computer Setup for Biocomputing


 

Keywords:  Paths, folders and files; Course Folder; Setup for biocomputing: Xcode, R and RStudio, python, homebrew, TeX ...


 



 


 


Abstract

Some considerations are required to turn your laptop into an effective tool for biocomputing tasks. This includes consistent principles for organizing files and folders, and availability of tools to create, install, and deploy software. This unit introduces those concepts.


 


This unit ...

Prerequisites

This unit has no prerequisites.


 


Objectives

This unit will ...

  • ... inform you about file- and folder names and paths;
  • ... outline a basic set of software tools that are useful.


 


Outcomes

After working through this unit you ...

  • ... can correctly identify (and write) file names with extensions, and file paths, on your computer;
  • ... have created a Course Folder for this course or workshop on your computer;
  • ... are able to further configure your computer for biocomputing tasks.


 


Deliverables

  • Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
  • Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.
  • Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.


 


Evaluation

Evaluation: NA

This unit is not evaluated for course marks.


 


Contents

Creating and consuming

Over the last decade the central paradigm of how we work with computing devices has changed for a large fraction of our day-to-day activities. Our previous, open and effective data-centric model of computation - i.e. storing data and then transforming it through various interoperable tools - has largely been replaced by a landscape in which the software ecosystem consists of apps that lock down the data, aim to own the user experience, and thus find ways to monetize it. We are now in an application-centric era of computing. That is in itself not surprising - after all, much of this functionality is free, and you must realize that if you use free tools, you are not the customer, you are the product. However, this development is not helpful for scientific computing, and you have to make some effort to go beyond these constraints, which are designed to make users convenient consumers of data and services.

You need to become creators instead.

You need to take control of how your computer is organized, and how you work with it.


 

Paths, Folders and Files

The first step of organizing your computer is to obtain a good awareness of files, of the folders (or directories) that contain them, and of how files are identified. A file is some stored information that is uniquely identified in a catalogue called a "filesystem". Files can be data, like documents, music or movies; files can be applications (yes, computer programs are also data - a set of machine-level instructions); and your computer can also have subsystems that look like files, because one can read from them and write to them - but they actually handle the I/O (input / output) of your computer: things like keyboards and printers, or sockets for communicating with other computers.

Organizing files means giving them good names and placing them into folders where they are easy to find.

A filename is a label that identifies a file. Often filenames have two parts: the actual name, and an extension. To specify a file on the computer's command line, or when working with R, you need to specify its full name including the extension. Now, the problem is that you can switch off the display of extensions in Windows; I'm afraid this is actually done by default. This means you don't see what the file is actually called, lest you be frightened by a .jpg or a .mp4 suffix to the name. But then all hell breaks loose when you are trying to do "real" work. Files can't be found, or worse, can be inadvertently overwritten. Never allow your operating system to hide file extensions from you. You must be able to see the full name[1].
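To make the two parts of a filename concrete, here is a minimal Python sketch that splits a name into its actual name and its extension. The helper name split_filename and the example filenames are made up for illustration; only the standard pathlib module is used.

```python
from pathlib import PurePath


def split_filename(filename: str) -> tuple[str, str]:
    """Return (name, extension) for a filename.

    The extension includes its leading dot and is an empty string
    if the filename has no extension at all.
    """
    p = PurePath(filename)
    return p.stem, p.suffix


# Hypothetical example filenames, chosen only for illustration.
for name in ["holiday.jpg", "lecture.mp4", "README", "archive.tar.gz"]:
    stem, ext = split_filename(name)
    print(f"{name!r}: name={stem!r}, extension={ext!r}")
```

Note that only the last suffix counts as the extension here: for "archive.tar.gz" the name part is "archive.tar" and the extension is ".gz".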

A path is the complete specification of where a file is located in the hierarchically organized directory tree of your computer. Paths are simply directories strung together into a long string, separated by a forward slash "/" (on Mac or Unix) or a backslash "\" on Windows.

Folder name and path examples
  • /Users/Pierette/Documents/BCB420  ◁ Looking good on a Mac or a Linux system.
  • C:\Users\Pulcinella\Documents\CBW  ◁ Looking good on a Windows computer.
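The two example paths can be taken apart programmatically. The following sketch uses Python's standard pathlib module to show the directories that are strung together in each path; it only parses the strings and does not require the folders to exist on your computer.

```python
from pathlib import PurePosixPath, PureWindowsPath

# The two example paths from the list above. "Pure" paths are only
# parsed as text, so this runs on any operating system.
mac_path = PurePosixPath("/Users/Pierette/Documents/BCB420")
win_path = PureWindowsPath(r"C:\Users\Pulcinella\Documents\CBW")

for p in (mac_path, win_path):
    print("path:      ", p)
    print("components:", p.parts)   # first component is the top level
    print("last part: ", p.name)    # the folder (or file) the path ends in
    print("parent:    ", p.parent)  # everything above the last part
    print()
```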

The "top level directory" is the letter of the drive followed by ":\" on Windows computers, and a simple forward-slash "/" on Mac and Unix computers. All other directories are "sub-directories". Note that you can't tell from a directory listing alone whether e.g. "Users" is a directory or a file. The operating system will usually identify this with an icon, and R has different commands to differentiate the two[2].
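The same distinction between directories and files can be made in Python. This is a small sketch, not a prescribed method: the helper name describe is hypothetical, and the example entries are created in a temporary folder that is removed again afterwards.

```python
import tempfile
from pathlib import Path


def describe(path: Path) -> str:
    """Say whether a path is a directory, a file, or does not exist."""
    if path.is_dir():
        return "directory"
    if path.is_file():
        return "file"
    return "missing"


# Build a throw-away folder with one sub-directory and one file,
# then ask the filesystem what each entry actually is.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "Users").mkdir()
    (base / "notes.txt").write_text("hello\n")
    for entry in sorted(base.iterdir()):
        print(entry.name, "->", describe(entry))
```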



 

It's really useful to get into a consistent habit of giving your files a meaningful name. The name should include something that tells you what the file contains, and something that tells you the date or version. I give versions major and minor numbers, and - knowing how much things always change - I write major version numbers with a leading zero, e.g. 04, so that they will be correctly sorted by name in a directory listing. The same goes for dates: always write YYYY-MM-DD to ensure proper sorting.
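The effect of leading zeros and of YYYY-MM-DD dates on sorting is easy to check. The sketch below is only an illustration: the filenames and the helper versioned_name, including its particular layout of label, date and version, are assumptions and not a naming scheme this unit prescribes.

```python
from datetime import date


def versioned_name(label: str, day: date, major: int, minor: int, ext: str) -> str:
    """Build a filename with an ISO date and a zero-padded major version."""
    return f"{label}_{day.isoformat()}_v{major:02d}.{minor}{ext}"


# Without a leading zero, version 10 sorts between versions 1 and 2.
unpadded = sorted(f"analysis_v{n}.R" for n in (1, 2, 10))
print(unpadded)  # ['analysis_v1.R', 'analysis_v10.R', 'analysis_v2.R']

# With a leading zero, sorting by name is also sorting by version.
padded = sorted(f"analysis_v{n:02d}.R" for n in (1, 2, 10))
print(padded)    # ['analysis_v01.R', 'analysis_v02.R', 'analysis_v10.R']

# YYYY-MM-DD strings sort alphabetically in chronological order.
days = [date(2017, 9, 9), date(2016, 12, 1), date(2017, 8, 5)]
iso_sorted = sorted(d.isoformat() for d in days)
print(iso_sorted)  # ['2016-12-01', '2017-08-05', '2017-09-09']

print(versioned_name("alignment", date(2017, 9, 9), 4, 1, ".txt"))
```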

In my experience, it is better to organize file hierarchies wide, not deep. This means I aim to put more things in one folder rather than create elaborate directory structures. I need to look for stuff a lot, and looking more-or-less in the same folder keeps my files more visible. Files that are tucked away in sub-directories are harder to find. And to avoid having very, very, very many subdirectories in one place, you should consider adding a 99-Archive folder (the 99-... prefix keeps it sorted at the bottom of the directory listing), and moving directories that you keep only for reference into there.

One more thing, one golden rule that you should make every effort to adhere to: don't store the same contents in more than one place. In the best case this is merely unnecessarily needlessly redundant, but in the common worst case the two copies will go out of sync. If you need to have a file in two different folders, keep it in only one folder and put an "alias" into the other. On the Mac, you select a file or folder and <option><command><drag> it to a new location to create an alias, or hit <command>L. On Windows - ???.
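If you suspect that the same contents already live in more than one place, a script can find the copies for you. The following is a minimal sketch under stated assumptions: the function name find_duplicates is hypothetical, files count as duplicates when their SHA-256 digests match, and each file is read into memory in one piece, which is fine for documents but not for very large data files.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicates(folder: Path) -> list[list[Path]]:
    """Group files below `folder` that have byte-for-byte identical contents.

    Returns a list of groups; each group lists two or more paths whose
    contents share the same SHA-256 digest.
    """
    by_digest = defaultdict(list)
    for path in sorted(folder.rglob("*")):
        # Skip directories, and skip symbolic links so that a link to a
        # file is not reported as a second copy of it.
        if path.is_file() and not path.is_symlink():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest[digest].append(path)
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

For example, find_duplicates(Path.home() / "Documents") would report groups of identical files in your Documents folder; it only reports them and never deletes anything.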


 

Course Folder

Files for this course (or workshop) should all be in one "Course Folder".

Task:
Create a folder (directory) on your computer in which to keep materials for this course (or workshop). Put it into the right place, and give it the right name:

The right place is directly in the Documents folder of your account.
The right name is simply the <Coursecode>, e.g. for a CBW workshop in 2016 you call the folder CBW; for a BCH441 course the name should be BCH441; or you could use my generic ABC (A Bioinformatics Course). Keep it short, but specific.

Do not use spaces, hyphens, or any other special characters in your filename - we have encountered various problems with such filenames in the past[3].


 

We will refer to this folder as the Course Folder. (I use the words "folder" and "directory" synonymously and completely interchangeably.)
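You can create the Course Folder by hand in your file browser, but the task can also be scripted. The sketch below is one possible way to do it, not a required step: the function names are hypothetical, and "no special characters" is interpreted here as "letters and digits only", which is an assumption that is stricter than the rule above requires.

```python
import re
from pathlib import Path
from typing import Optional


def is_valid_course_name(name: str) -> bool:
    """Accept only letters and digits: no spaces, hyphens or other characters."""
    return re.fullmatch(r"[A-Za-z0-9]+", name) is not None


def make_course_folder(course_code: str, documents: Optional[Path] = None) -> Path:
    """Create <Documents>/<course_code> if it does not exist yet; return its path.

    `documents` defaults to the Documents folder in your home directory.
    The Documents folder itself must already exist.
    """
    if not is_valid_course_name(course_code):
        raise ValueError(f"Not a suitable Course Folder name: {course_code!r}")
    if documents is None:
        documents = Path.home() / "Documents"
    folder = documents / course_code
    folder.mkdir(exist_ok=True)  # does nothing if the folder is already there
    return folder
```

Calling make_course_folder("BCH441") would then create the folder BCH441 directly in your Documents folder and return its full path.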


 

Biocomputing tools

There are a number of tools you will commonly find on a professionally configured computer:

A commandline interface
While graphical user interfaces (GUIs) are very helpful for interactive work, they (generally) can't be scripted and thus are an obstacle to high-throughput and repetitive tasks. A commandline interface allows more expressive commands and easy scripting. For Mac and Linux users, your systems have terminal applications that make the underlying Unix commands available. On the Mac, find "Terminal" in your Applications/Utilities folder and put it in the dock. For Windows users there is cmd.exe - but the command set is very different, and that requires you to learn yet another language. The better solution is to install Cygwin, which creates a Unix shell that interacts with the Windows operating system.
A package manager
Complex software uses other software, such as libraries for graphics, numerical methods, or security, and such libraries are "dependencies" of the code. Often a particular piece of software needs a specific version of a dependency to work. Especially when code needs to be compiled from source code because it is not available as a pre-compiled bundle, dependencies need to be updated in very specific ways. Package managers to the rescue: these programs have validated recipes for installing, maintaining and updating software. On the Mac, the go-to system is Homebrew. On Linux, this depends on which distribution you are running. On Windows I have heard of "Chocolatey" and "Scoop"; I can't give a recommendation though.
A version control system
Creating assets for reproducible research requires version control, i.e. the ability to document when what change was made, and to return to previous versions if necessary. The current go-to tool for this is Git, which is especially useful since it interfaces with GitHub.
Programming languages
Besides R and RStudio, you need to be able to compile C- and C++ code, you will need a working version of Java, and you should have a recent Python installation and an IDE to write code.
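A quick way to see which of these tools are already available from your command line is to look for them on the search path. The sketch below does this with Python's standard shutil.which function. The command names in TOOLS are assumptions about what the executables are typically called; they may differ on your system, and a tool that is "not found" may simply be installed under another name.

```python
import shutil
from typing import Dict, Iterable, Optional

# Typical command names; these are assumptions and may differ per system.
TOOLS = ["git", "R", "python3", "java", "cc", "c++", "brew"]


def find_tools(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each command name to its full path, or to None if it is not found."""
    return {name: shutil.which(name) for name in names}


for tool, location in find_tools(TOOLS).items():
    print(f"{tool:10s} {location if location else 'not found'}")
```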


 

OS X

A useful guide to configuring your system is posted here.


 

Windows

I don't know about a good, current guide for setting up Windows computers for biocomputing. If you have specific experience and advice, let's collect it here.


 

Linux

I mention Linux rarely because I find that people who work on a Linux platform already know what they are doing. Yay. Ask, if you have specific questions.



 


Further reading, links and resources

 


Notes

  1. RStudio is actually very helpful in this regard, since it always shows you the full name of your file in its file-pane, and it always also shows you the "hidden" files that your operating system does not show to you, lest they hurt our little brains.
  2. list.dirs() and list.files().
  3. After the course, you can rename / move the directory to whatever, wherever you want, but during the course, we need your files in a predictable location to be able to troubleshoot problems.


 


Self-evaluation

 



 




 

If in doubt, ask! If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.



 

About ...
 
Author:

Boris Steipe <boris.steipe@utoronto.ca>

Created:

2017-08-05

Modified:

2017-09-09

Version:

1.0

Version history:

  • 1.0 Completed to first live version
  • 0.1 Material collected from previous assignments

This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License.