Expected Preparations:
Keywords: Paths; folders and files; Course Folder; Setup for biocomputing: Xcode; R and RStudio; python; homebrew; TeX …
Objectives:
This unit will …
Outcomes:
After working through this unit you …
Deliverables:
Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don’t overlook these.
Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.
Evaluation: NA: This unit is not evaluated for course marks.
It takes a bit of effort to turn your laptop into an effective tool for biocomputing tasks. You need consistent principles for organizing files and folders, and you need tools to create, install, and deploy software. This unit introduces those concepts.
You might not have noticed that the way we work day to day with computers has changed dramatically since 2010. This was a slow process, but previously our model of computation was open and data-centric. We collected data, we stored it, we transformed it through various tools that were (hopefully) interoperable, using shared and open formats. But in our current landscape the software ecosystem consists of apps that lock down the data, that aim to own the user experience - and thus find ways to monetize it. Our systems are not designed to be open and transparent. The data-centric age of computing is past; we are now in an application-centric era of computing. That is not surprising - after all, much of this functionality is free, and you must realize that if you use free tools, you are not the customer, you are the product. However, this development is not helpful for scientific computing and therefore you have to make some effort to go beyond the constraints that you have accepted on your mobile phones: keeping your data in the cloud, having everything integrated under one banner - everything that is well designed to make users convenient consumers of data and services.
You need to become creators instead.
You need to take control of how your computer is organized, and how you work with it.
The first step of organizing your computer is to obtain a good awareness of files, of the folders (or directories) that contain them, and of how files are identified. A file is some stored information that is uniquely identified in a catalogue called a “filesystem”. Files can be data, like documents, music or movies; files can be applications (yes, computer programs are also data - a set of machine-level instructions); and your computer can also have subsystems that look like files, because one can read from them and write to them - but they actually handle the I/O (input / output) of your computer: things like keyboards and printers, or sockets for communicating with other computers.
Organizing files means giving them good names and placing them into folders where they are easy to find.
A filename is a label that identifies a file. Often filenames have two parts: the actual name, and an extension. To specify a file on the computer’s command line, or when working with R, you need to specify its full name including the extension. Now, the problem is that you can switch off the display of extensions in Windows; I’m afraid this is actually done by default. This means you don’t see what the file is actually called, lest you be frightened by a .jpg or a .mp4 suffix to the name. But then all hell breaks loose when you are trying to do “real” work. Files can’t be found, or worse, can be inadvertently overwritten. Never allow your operating system to hide file extensions from you. You must be able to see the full name[1].
A path is the complete specification of where a file is located in the hierarchically organized directory tree of your computer. Paths are simply directory names strung together into a long string, separated by a forward slash “/” (on Mac or Unix) or a backslash “\” on Windows.
For example …
/Users/Pierette/Documents/BCB420
Looking good on a Mac or a Linux system.
C:\Users\Pulcinella\Documents\CBW
Looking good on a Windows computer.
The “top level directory” is the letter of the drive followed by “:\” on Windows computers, and a simple forward-slash “/” on Mac and Unix computers. All other directories are “sub-directories”. Note that you can’t tell from a directory listing alone whether e.g. “Users” is a directory or a file. The operating system will usually identify this with an icon, and R has different commands to differentiate the two[2].
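To make the anatomy of a path concrete, here is a minimal sketch that takes the two example paths apart with Python's standard pathlib module. Python is used only for illustration here, and the helper name describe_path is hypothetical. The "pure" path classes only parse the string, so nothing needs to exist on disk.

```python
from pathlib import PurePosixPath, PureWindowsPath


def describe_path(path):
    """Split a path into its top level directory, sub-directories and last element."""
    return {
        "root": path.anchor,                    # "/" on Mac/Unix, "C:\" on Windows
        "directories": list(path.parts[1:-1]),  # everything between root and last element
        "last": path.name,                      # could be a file OR a directory
    }


mac_path = PurePosixPath("/Users/Pierette/Documents/BCB420")
win_path = PureWindowsPath(r"C:\Users\Pulcinella\Documents\CBW")

print(describe_path(mac_path))
print(describe_path(win_path))
```

The string alone cannot say whether the last element is a file or a directory. For a path that really exists on your computer, pathlib's Path.is_dir() and Path.is_file() answer that question by asking the filesystem.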
It’s really useful to get into a consistent habit of giving your files a meaningful name. The name should include something that tells you what the file contains, and something that tells you the date or version. I give versions major and minor numbers, and - knowing how much things always change - I write major version numbers with a leading zero, e.g. 04, so that they will be correctly sorted by name in a directory listing. The same goes for dates: always write YYYY-MM-DD to ensure proper sorting.
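The effect of leading zeros and YYYY-MM-DD dates on sorting can be sketched in a few lines of Python. The naming pattern and the function versioned_name below are assumptions made up for this illustration, not a convention prescribed by the course; any pattern works as long as it is applied consistently.

```python
from datetime import date


def versioned_name(stem, major, minor, day, extension):
    """Build '<stem>_v<MM>.<m>_<YYYY-MM-DD>.<ext>' with a zero-padded major version."""
    return f"{stem}_v{major:02d}.{minor}_{day.isoformat()}.{extension}"


# Hypothetical file names, deliberately listed out of order
names = [
    versioned_name("alignment", 10, 0, date(2022, 10, 3), "txt"),
    versioned_name("alignment", 4, 1, date(2022, 9, 21), "txt"),
    versioned_name("alignment", 4, 0, date(2022, 9, 14), "txt"),
]

# Plain alphabetical sorting now matches version order: 04.0, 04.1, 10.0
for name in sorted(names):
    print(name)

# Without the leading zero, "v10" sorts before "v4"
print(sorted(["v4", "v10"]))    # ['v10', 'v4']
```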
In my experience, it is better to organize file hierarchies wide, not deep. This means I aim to put more things in one folder rather than create elaborate directory structures. I need to look for stuff a lot, and looking more-or-less in the same folder keeps my files more visible. Files that are tucked away in sub-directories are harder to find. And to avoid having very, very, very many subdirectories in one place, you should consider adding a 99-Archive folder (the 99-… prefix keeps it sorted at the bottom of the directory listing), and move directories that you keep only for reference into there.
One more thing: a golden rule that you should make every effort to adhere to: don’t store the same contents in more than one place. In the best case this is merely unnecessarily needlessly redundant, but in the common worse case the two copies will go out of sync. If you need to have a file in two different folders, keep the data in one folder and put an “alias” into the other. On the Mac, you select a file or folder and <option><command><drag> it to a new location to create an alias, or hit <command>L. On Windows - ??? I think it’s something you do in the “explorer” menu - but perhaps someone can educate me?
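The "one copy, many references" idea can be illustrated with a symbolic link, which is a filesystem-level relative of an alias but not the same thing as a Finder alias or a Windows shortcut. The sketch below is an assumption-laden illustration: all folder and file names are invented, everything happens in a temporary scratch directory, and on Windows creating symbolic links may require extra privileges.

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as scratch:
    # Two hypothetical folders that both need access to the same file
    data_dir = Path(scratch) / "data"
    report_dir = Path(scratch) / "report"
    data_dir.mkdir()
    report_dir.mkdir()

    # The data lives in exactly one place ...
    original = data_dir / "counts.txt"
    original.write_text("gene\tcount\n")

    # ... and the other folder only holds a link that points to it
    link = report_dir / "counts.txt"
    os.symlink(original, link)

    # Changing the original is immediately visible through the link
    original.write_text("gene\tcount\ngeneA\t42\n")
    link_is_symlink = link.is_symlink()
    same_contents = link.read_text() == original.read_text()

print(link_is_symlink, same_contents)
```

Because there is only one set of contents, there is nothing that can go out of sync.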
Files for this course (or workshop) should all be in one “Course Folder”.
Task…
Create a folder (directory) on your computer in which to keep materials for this course (or workshop). Put it into the right place, and give it the right name:
The right place is directly in the Documents folder of your account (or user directory).
The right name is simply the <Coursecode>, e.g. for a CBW workshop, you call the folder CBW, for a BCH441 course, the name should be BCH441, or you could use my generic ABC (A Bioinformatics Course). Keep it short, but specific.
Do not use spaces, hyphens, or any other special characters in your folder name - we have encountered various problems with such filenames in the past[3].
We will refer to this folder as the Course Folder. (I use the words “folder” and “directory” synonymously and completely interchangeably.)
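The task above can also be sketched in code. The function course_folder is hypothetical, and its rule "letters and digits only" is an assumption that is slightly stricter than the text requires. The sketch only computes the location; the line that would actually create the folder is left commented out.

```python
import re
from pathlib import Path


def course_folder(course_code, home=None):
    """Return <home>/Documents/<course_code>, rejecting names with special characters."""
    # Assumption: allow letters and digits only (no spaces, hyphens, or other symbols)
    if not re.fullmatch(r"[A-Za-z0-9]+", course_code):
        raise ValueError(f"Use only letters and digits in the name, not {course_code!r}")
    home = Path.home() if home is None else Path(home)
    return home / "Documents" / course_code


folder = course_folder("ABC")
print(folder)
# folder.mkdir(parents=True, exist_ok=True)   # uncomment to actually create it
```

A name such as "my course" or "BCH-441" raises a ValueError instead of silently producing a folder that may cause trouble later.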
There are a number of tools you will commonly find on a professionally configured computer:
cmd.exe - but the command set is very different and that requires you to learn yet another language. The better solution is to install Cygwin, which creates a unix shell that interacts with the Windows operating system.
A package manager
A version control system
git for our course - details later.
Programming languages
LaTeX
In engineering and computer science, you practically grow up with it. Conference abstracts are submitted in LaTeX, papers are written in LaTeX, most likely your thesis will be submitted in LaTeX - it’s everywhere. But in the life sciences you can go through your entire career without having heard of LaTeX. LaTeX is a document preparation and layout system that is very powerful, very flexible - and, boy, does it have a learning curve. Fortunately you don’t need to know LaTeX to use LaTeX, at least for basic tasks - there are now pretty decent WYSIWYG[5] editors, and many applications have some wrapper function that uses LaTeX as its backend - for example to produce PDF documents. Some of the functions provided by R work that way.
Thus, install LaTeX on your system, it’s a good thing to have - even though you can get by without it for most analysis tasks, you will sooner or later need it when it comes to publication quality plots. However, installation is not only platform dependent, but depends on the version of your OS, and sometimes the level - since most of the time LaTeX is used by other programs, the installers need to get the path just right, set some environment variables, etc. You should be fine with the instructions at the LaTeX project, but don’t do this from home if you have a bandwidth cap - a full installation runs to 2.5 GB (though there are smaller basic installs available). Linux users are best served through one of the standard package managers, and on the Mac there is a homebrew “cask” but folklore has it that the direct installation is more robust[6].
A useful guide to configuring your system is posted here. As far as I can tell it is still basically sane.
I don’t know about a good, current guide for setting up Windows computers for biocomputing. If you have specific experience and advice, please let me know so I can post it here.
I mention Linux rarely because I find that people who work on a Linux platform already know what they are doing. Yay. Ask, if you have specific questions.
If in doubt, ask! If anything about this content is not clear to you, do not proceed but ask for clarification. If you have ideas about how to make this material better, let’s hear them. We are aiming to compile a list of FAQs for all learning units, and your contributions will count towards your participation marks.
Installing Python
On recent Macs, installing python 2.7.16 does not make a functional pip available and you will get a “pip not found” error. Install python3 instead.
(1) $ brew install python3
(2) Use $ pip3 install <package> instead of pip install <package>.
(3) However, the version of pip3 that gets automatically installed with python3 is out of date and will fail to add packages until it is updated. To get the update code, try installing numpy. Once it fails, the last section of the error will include code to update pip3.
(via Kezia Dick, MGY441-2022)
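Before installing anything, it can help to check which of these commands your shell can actually find. The following is a minimal sketch using only Python's standard library; the function name find_tools and the particular list of commands are assumptions chosen for illustration.

```python
import shutil
import sys


def find_tools(names):
    """Map each command name to its full path on PATH, or None if it is not found."""
    return {name: shutil.which(name) for name in names}


print("Running Python", sys.version.split()[0])
for tool, location in find_tools(["python3", "pip3", "brew"]).items():
    print(f"{tool:8s} {location or 'not found'}")
```

If pip3 is reported as not found even though python3 is present, running pip as a module ($ python3 -m pip install <package>) ties the installation to exactly that interpreter, and $ python3 -m pip install --upgrade pip is the usual form of the update command; treat the message printed on your own machine as authoritative.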
[END]
[1] RStudio is actually very helpful in this regard, since it always shows you the full name of your file in its file-pane, and it always also shows you the “hidden” files that your operating system does not show to you (by default), lest they hurt our little brains. (Or, to be fair, to help prevent the brash from editing files they don’t understand, or deleting files they don’t recognize.)
[2] list.dirs() and list.files().
[3] After the course, you can rename / move the directory to whatever, wherever you want, but during the course, we need your files in a predictable location to be able to troubleshoot problems.
[4] But note that an R script that executes system() commands can substitute for a commandline script - and may save you from having to learn another language. Now: “real” programmers should not be intimidated by commandline scripts. And that’s true. But checking results, handling error conditions, and testing the validity of your script may be significantly easier, more explicit, and more maintainable from a higher language. Just as a side note - there seems to be a current trend to use the unix make utility for organizing computational workflows. That’s just wrong in so many ways …
[5] Graphical layout editors, as opposed to code editors: What You See Is What You Get.
[6] See here: https://tex.stackexchange.com/questions/307483/setting-up-basictex-homebrew - however, keep in mind that people who post on the TeX forum at stackexchange may have a different requirements profile than you do.