FND-Biocomputing setup

From "A B C"
Revision as of 07:15, 17 September 2020 by Boris

Computer Setup for Biocomputing

(Paths, folders and files; Course Folder; Setup for biocomputing: Xcode, R and RStudio, python, homebrew, TeX ...)


 


Abstract:

It takes a bit of effort to turn your laptop into an effective tool for biocomputing tasks. You need consistent principles for organizing files and folders, and you need tools to create, install, and deploy software. This unit introduces those concepts.


Objectives:
This unit will ...

  • ... inform you about file- and folder names and paths;
  • ... outline a basic set of software tools that are useful.

Outcomes:
After working through this unit you ...

  • ... can correctly identify (and write) file names with extensions, and file paths, on your computer;
  • ... have created a Course Folder for this course or workshop on your computer;
  • ... are able to further configure your computer for biocomputing tasks.

Deliverables:

  • Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
  • Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.
  • Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.


    Creating and consuming

    You might not have noticed that the way we work with computers day to day has changed dramatically after 2010. The change was gradual, but our earlier model of computation was open and data-centric: we collected data, stored it, and transformed it with various tools that were (hopefully) interoperable, using shared and open formats. In the current landscape, the software ecosystem consists of apps that lock down the data and aim to own the user experience - and thus find ways to monetize it. Our systems are not designed to be open and transparent. The data-centric age of computing is past; we are now in an application-centric era. That is not surprising - after all, much of this functionality is free, and you must realize that if you use free tools, you are not the customer, you are the product. However, this development is not helpful for scientific computing, and therefore you have to make some effort to go beyond the constraints that you have accepted on your mobile phones: keeping your data in the cloud, having everything integrated under one banner - everything that is well designed to make users convenient consumers of data and services.

    You need to become creators instead.

    You need to take control of how your computer is organized, and how you work with it.


     

    Paths, Folders and Files

    The first step of organizing your computer is to obtain a good awareness of files, folders (or directories) that contain them, and how files are identified. A file is some stored information that is uniquely identified in a catalogue called a "filesystem". Files can be data, like documents, music or movies, files can be applications (yes, computer programs are also data - a set of machine-level instructions), and your computer can also have subsystems that look like files, because one can read from them and write to them - but they actually handle the I/O (input / output) of your computer: things like keyboards and printers, or sockets for communicating with other computers.

    Organizing files means giving them good names and placing them into folders where they are easy to find.

    A filename is a label that identifies a file. Often filenames have two parts: the actual name, and an extension. To specify a file on the computer's command line, or when working with R, you need to specify its full name including the extension. Now, the problem is that you can switch off the display of extensions in Windows; I'm afraid this is actually done by default. This means you don't see what the file is actually called, lest you be frightened by a .jpg or a .mp4 suffix to the name. But then all hell breaks loose when you are trying to do "real" work. Files can't be found, or worse, can be inadvertently overwritten. Never allow your operating system to hide file extensions from you. You must be able to see the full name[1].
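    To make the distinction between name and extension concrete, here is a minimal Python sketch (Python is one of the tools this unit asks you to install). The filenames and the helper split_name() are invented for illustration. Note that only the last suffix counts as the extension.

```python
from pathlib import PurePath

# Hypothetical filenames, invented for this example.
names = ["holiday.jpg", "lecture.mp4", "README", "genome.fasta.gz"]


def split_name(filename):
    """Return (name, extension); the extension is '' if there is none."""
    p = PurePath(filename)
    return p.stem, p.suffix


for n in names:
    name, extension = split_name(n)
    print(f"{n!r}: name={name!r}, extension={extension!r}")
```

    A file without an extension, like README, simply has an empty extension; the full name is still what you must type.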

    A path is the complete specification of where a file is located in the hierarchically organized directory tree of your computer. Paths are simply directory names strung together into a long string, separated by a forward slash "/" (on Mac or Unix) or a backslash "\" on Windows.

    Folder name and path examples
    • /Users/Pierette/Documents/BCB420  ◁ Looking good on a Mac or a Linux system.
    • C:\Users\Pulcinella\Documents\CBW  ◁ Looking good on a Windows computer.

    The "top level directory" is the letter of the drive followed by ":\" on Windows computers, and a simple forward-slash "/" on Mac and Unix computers. All other directories are "sub-directories". Note that you can't tell from a directory listing alone whether e.g. "Users" is a directory or a file. The operating system will usually identify this with an icon, and R has different commands to differentiate the two[2].
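    As a rough illustration of how a path breaks down into a top level directory and sub-directories, the following Python sketch takes apart the two example paths above with the standard pathlib module. The helper kind() is a name made up for this example; it shows one way to ask whether something is a directory or a file.

```python
from pathlib import Path, PurePosixPath, PureWindowsPath

# The two example paths from the text. "Pure" paths are only parsed;
# they do not need to exist on the computer running this code.
mac_path = PurePosixPath("/Users/Pierette/Documents/BCB420")
win_path = PureWindowsPath(r"C:\Users\Pulcinella\Documents\CBW")

for p in (mac_path, win_path):
    print("top level directory:", p.anchor)
    print("components:         ", p.parts)
    print("last component:     ", p.name)


def kind(path):
    """Report whether a path on this computer is a directory or a file."""
    p = Path(path)
    if p.is_dir():
        return "directory"
    if p.is_file():
        return "file"
    return "other or missing"


# The current working directory always exists, so it is a safe test case.
print(kind(Path.cwd()))
```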



     

    It's really useful to get into a consistent habit of giving your files a meaningful name. The name should include something that tells you what the file contains, and something that tells you the date or version. I give versions major and minor numbers, and - knowing how much things always change - I write major version numbers with a leading zero, e.g. 04, so that they will be correctly sorted by name in a directory listing. The same goes for dates: always write YYYY-MM-DD to ensure proper sorting.
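    The effect of leading zeros on sorting is easy to check. The sketch below uses invented filenames, and the helper versioned_name(), with its underscore separators, is only one possible naming scheme, not a rule of this course.

```python
from datetime import date

# Hypothetical filenames: version numbers without and with a leading zero.
unpadded = ["notes_v2.txt", "notes_v10.txt", "notes_v1.txt"]
padded = ["notes_v02.txt", "notes_v10.txt", "notes_v01.txt"]

# Names are sorted character by character, so "10" comes before "2".
print(sorted(unpadded))
# With a leading zero, alphabetical order and numerical order agree.
print(sorted(padded))


def versioned_name(base, major, when, extension):
    """Build a name like 'alignment_v04_2020-09-17.txt' (illustrative scheme)."""
    return f"{base}_v{major:02d}_{when.isoformat()}{extension}"


print(versioned_name("alignment", 4, date(2020, 9, 17), ".txt"))
```

    Dates written as YYYY-MM-DD sort correctly for the same reason: the most significant part comes first and every part has a fixed width.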

    In my experience, it is better to organize file hierarchies wide, not deep. This means I aim to put more things in one folder rather than create elaborate directory structures. I need to look for stuff a lot, and looking more-or-less in the same folder keeps my files more visible. Files that are tucked away in sub-directories are harder to find. And to avoid having very, very, very many subdirectories in one place, you should consider adding a 99-Archive folder (the 99-... prefix keeps it sorted at the bottom of the directory listing), and move directories that you keep only for reference into there.

    One more thing: a golden rule that you should make every effort to adhere to: don't store the same contents in more than one place. In the best case this is merely unnecessarily needlessly redundant, but in the common worse case the two copies will go out of sync. If you need to have a file in two different folders, keep the data in one folder and put an "alias" into the other. On the Mac, you select a file or folder and <option><command><drag> it to a new location to create an alias, or hit <command>L. On Windows, the equivalent is a "shortcut": right-click the file or folder in the File Explorer, choose "Create shortcut", and move the shortcut to the new location.
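    The same idea can be sketched in code with a symbolic link. This is a related mechanism and not identical to a Finder alias or a Windows shortcut, and on Windows creating symbolic links may require elevated privileges or Developer Mode. The folder and file names, and the helper link_instead_of_copy(), are invented for this example, which runs in a temporary directory so that it does not touch your own files.

```python
import tempfile
from pathlib import Path


def link_instead_of_copy(original, link_location):
    """Create a symbolic link at link_location that points to original."""
    link = Path(link_location)
    link.symlink_to(Path(original).resolve())
    return link


with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    report_dir = Path(tmp) / "report"
    data_dir.mkdir()
    report_dir.mkdir()

    # The contents live in exactly one place ...
    original = data_dir / "sequences.txt"
    original.write_text("ACGT\n")

    # ... and the second folder only holds a pointer to them.
    link = link_instead_of_copy(original, report_dir / "sequences.txt")

    # A later change to the original is visible through the link,
    # so the two locations cannot go out of sync.
    original.write_text("ACGTTT\n")
    print(link.read_text())
```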


     

    Course Folder

    Files for this course (or workshop) should all be in one "Course Folder".

    Task:
    Create a folder (directory) on your computer in which to keep materials for this course (or workshop). Put it into the right place, and give it the right name:

    The right place is directly in the Documents folder of your account (or user directory).
    The right name is simply the <Coursecode> e.g. for a CBW workshop, you call the folder CBW, for a BCH441 course, the name should be BCH441, or you could use my generic ABC (A Bioinformatics Course). Keep it short, but specific.

    Do not use spaces, hyphens, or any other special characters in the folder name - we have encountered various problems with such names in the past[3].
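    If you prefer to create the folder from code, the task above could look roughly like the following Python sketch. The function names are invented for this example. Reading "no special characters" as "ASCII letters and digits only" is an interpretation on my part, and the location of the Documents folder is assumed to be directly inside your home directory.

```python
import re
from pathlib import Path


def is_valid_course_code(name):
    """True if name has only ASCII letters and digits (an interpretation
    of 'no spaces, hyphens, or other special characters')."""
    return re.fullmatch(r"[A-Za-z0-9]+", name) is not None


def make_course_folder(code, documents=None):
    """Create <documents>/<code> and return its path.

    documents defaults to ~/Documents (an assumption about your system).
    The Documents folder itself must already exist.
    """
    if not is_valid_course_code(code):
        raise ValueError(f"Not a suitable Course Folder name: {code!r}")
    if documents is None:
        documents = Path.home() / "Documents"
    folder = Path(documents) / code
    folder.mkdir(exist_ok=True)  # no error if the folder is already there
    return folder


print(is_valid_course_code("BCH441"))     # a name like the ones suggested above
print(is_valid_course_code("My Course"))  # contains a space
```

    Calling make_course_folder("BCH441") would then create the Course Folder in your Documents folder; the sketch does not call it, so that running the example does not change anything on your computer.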


     

    We will refer to this folder as the Course Folder. (I use the words "folder" and "directory" synonymously and completely interchangeably.)


     

    Biocomputing tools

     

    There are a number of tools you will commonly find on a professionally configured computer:

     
    A commandline interface
    While graphical user interfaces (GUI) are very helpful for interactive work, they (generally) can't be scripted and thus are an obstacle to high-throughput and repetitive tasks. A commandline interface allows more expressive commands and easy scripting[4]. For Mac and Linux users, your systems have terminal applications that make the underlying unix commands available. On the Mac, find "Terminal" in your Applications/Utilities folder and put it in the dock. For Windows users there is cmd.exe - but the command set is very different and that requires you to learn yet another language. The better solution is to install Cygwin, which creates a unix shell that interacts with the Windows operating system.
     
    A package manager
    Complex software uses other software, such as libraries for graphics, numerical methods, or security, and such libraries are "dependencies" of the code. Often dependencies need a specific version to work with a particular software. Especially when code needs to be compiled from source because it is not available as a pre-compiled bundle, dependencies need to be updated in very specific ways. Package managers to the rescue. These programs have validated recipes on how to install, maintain and update software. On the Mac, the go-to system is Homebrew. I've had excellent experience with Homebrew - It. Just. Works. On Linux, your package manager depends on which flavour of Linux you are running - but you'll know, because all your installs are done through the manager anyway. On Windows I have heard of "Chocolatey" and "Scoop"; I can't give a recommendation though.
     
    A version control system
    Creating assets for reproducible research requires version control, i.e. the ability to document when what change was made, and to return to previous versions if necessary. The current go-to tool for this is Git, which is especially useful since it interfaces with GitHub. You need to install git for our course - details later.
     
    Programming languages
    Besides R and RStudio, you need to be able to compile C- and C++ code, you will need a working version of Java, and you should have a recent Python installation and an IDE to write code.
     
    LaTeX

    In engineering and computer science, you practically grow up with it. Conference abstracts are submitted in LaTeX, papers are written in LaTeX, most likely your thesis will be submitted in LaTeX - it's everywhere. But in the life sciences you can go through your entire career without having heard of LaTeX. LaTeX is a document preparation and layout system that is very powerful, very flexible - and, boy, does it have a learning curve. Fortunately you don't need to know LaTeX to use LaTeX, at least for basic tasks: there are now pretty decent WYSIWYG[5] editors, and many applications have some wrapper function that uses LaTeX as its backend, for example to produce PDF documents. Some of the functions provided by R work that way. Thus, install LaTeX on your system; it's a good thing to have. Even though you can get by without it for most analysis tasks, you will sooner or later need it when it comes to publication-quality plots. However, installation is not only platform dependent, but also depends on the version of your OS, and sometimes the level: since most of the time LaTeX is used by other programs, the installers need to get the path just right, set some environment variables, etc. You should be fine with the instructions at the LaTeX project, but don't do this from home if you have a bandwidth cap - a full installation runs to 2.5 GB (though there are smaller basic installs available). Linux users are best served through one of the standard package managers, and on the Mac there is a Homebrew "cask", but folklore has it that the direct installation is more robust[6].
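    To get a quick overview of which of these tools are already available on your computer, a small Python sketch like the following can help. The command names in TOOLS are assumptions: they differ between platforms and installations (e.g. the C compiler may be called gcc or clang rather than cc), so a "not found" only means that this particular command is not on your search path.

```python
import shutil

# Labels and command names are assumptions made for this example;
# adjust them to your platform.
TOOLS = {
    "Git": "git",
    "R": "R",
    "Python": "python3",
    "Java": "java",
    "C compiler": "cc",
    "LaTeX": "pdflatex",
    "Homebrew (Mac only)": "brew",
}


def find_tools(tools):
    """Map each label to the full path of its command, or None if absent."""
    return {label: shutil.which(command) for label, command in tools.items()}


for label, location in find_tools(TOOLS).items():
    print(f"{label:22s} {location or 'not found on PATH'}")
```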


     

    OS X

    A useful guide to configuring your system is posted here.


     

    Windows

    I don't know about a good, current guide for setting up Windows computers for biocomputing. If you have specific experience and advice, please let me know so I can post it here.


     

    Linux

    I mention Linux rarely because I find that people who work on a Linux platform already know what they are doing. Yay. Ask, if you have specific questions.


     

    Self-evaluation

    Notes

    1. RStudio is actually very helpful in this regard, since it always shows you the full name of your file in its file-pane, and it always also shows you the "hidden" files that your operating system does not show to you (by default), lest they hurt our little brains. (Or, to be fair, to help prevent the brash from editing files they don't understand, or deleting files they don't recognize.)
    2. list.dirs() and list.files().
    3. After the course, you can rename / move the directory to whatever, wherever you want, but during the course, we need your files in a predictable location to be able to troubleshoot problems.
    4. But note that an R script that executes system() commands can substitute for a commandline script - and may save you from having to learn another language. Now: "real" programmers should not be intimidated by commandline scripts. And that's true. But checking results, handling error conditions, and testing the validity of your script may be significantly easier, more explicit, and more maintainable from a higher-level language. Just as a side note - there seems to be a current trend to use the unix make utility for organizing computational workflows. That's just wrong in so many ways ...
    5. Graphical layout editors, as opposed to code editors: What You See Is What You Get.
    6. See here: https://tex.stackexchange.com/questions/307483/setting-up-basictex-homebrew - however, keep in mind that people who post on the TeX forum at stackexchange may have a different requirements profile than you do.


     


    About ...
     
    Author:

    Boris Steipe <boris.steipe@utoronto.ca>

    Created:

    2017-08-05

    Modified:

    2018-05-02

    Version:

    1.1.1

    Version history:

    • 1.1.1 Maintenance
    • 1.1 Add note on LaTeX
    • 1.0 Completed to first live version
    • 0.1 Material collected from previous assignments

    This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.