FND-STA-Probability

Probability

(Event probabilities in closed form, by total enumeration, and from simulation.)

Abstract:

Probability of an event as its status in the space of possibilities - with R exercises.

Objectives:
This unit will ...

... introduce the concept of probability by emphasizing knowledge of the space of possible outcomes;
... discuss how outcomes can be evaluated in closed form, from enumerating a probability table, or by simulation.

Outcomes:
After working through this unit you ...

... can construct a set of outcomes from first principles;
... can construct a set of outcomes from enumeration;
... can construct a set of outcomes from simulation.

Deliverables:

Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.

Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.

Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.

Prerequisites:
You need the following preparation before beginning this unit. If you are not familiar with this material from courses you took previously, you need to prepare yourself from other information sources:

Probability: event, probability, hypothesis and significance.

This unit builds on material covered in the following prerequisite units:

RPR-Introduction (Introduction to R)

Let me give you an example: I roll a fair^[1], six-sided die. You see five dots on top. I ask you: what was the probability of throwing a five? You answer: 1/6, because you remember from high school that that is the correct answer. But then I say: now look closely at the die. You pick it up and you see five dots on one side. The next side also has five dots, the third side too, as does the fourth, fifth and last. At that moment you understand that you know nothing about the probability of an event if you do not know the possible alternative outcomes.

Now, this is a problem. Because in general, we have no way of knowing all possible outcomes in biology.

The way we address this is to estimate the possible outcomes, either with a mathematical expression (closed form), or by enumeration, or simulation.

Closed form

Closed form means applying a formula to characterize the space of outcomes. If we have an honest die, we can state that a possible outcome is any number from the sequence (1, 2, 3, 4, 5, 6). If your experiment finds three genes in yeast, and they are all three known to be transcription factors, then you can calculate from the fact that there are 256 yeast transcription factors^[2] among the 6091 protein-coding ORFs^[3] that you can choose three genes in ~ 6091³ ways from the whole, but the number of possible outcomes in which three will be transcription factors is only ~ 256³ and that number is thirteenthousand times smaller. If you meet a nice person at Robart's library you can estimate that there is a one-in-seven chance that you were born on the same day of the week, one in threehundredsixtyfive chance that you have the same birthday (if you weren't born on a 29 of February). In general we use the "binomial coefficients" formula to calculate the number of outcomes of choosing k specific elements from n possible choices.

Often we are asking about joint probabilities that can be simply calculated from the product of individual probabilities. If the probability of randomly guessing the correct answer on a multiple choice question with five option is 1/5, then the probability of guessing this question right, plus another four-option question, is 1/5 x 1/4. If the probability of flipping a fair coin with the Queen facing up is 1/2, then the probability of doing this ten times in a row is (1/2)¹⁰ or approximately one-in-a-thousand, which is actually tractable odds if you are very patient (or very bored).

# Here is R code that tabulates runs of "Queen" and "Loon" on a $1 coin flip.
# We assume that there is a one-in-a-million chance that the coin will stand
# on edge instead.

outcomes <- c("Queen", "Loon", "Edge")

pEdge <- 1/1e6

results <- sample(outcomes,                     # vector of possible outcomes
                  10000,                        # number of trials
                  replace = TRUE,               # sample with replacement
                  prob = c(0.5 * (1 - pEdge),   # target probabilities
                            0.5 * (1 - pEdge),  # ...
                            pEdge))             # ...

table(results)

# Now calculate how long the runs are: we use run-length encoding - rle()
results[1:10]
rle(results[1:10])

# rle() returns the lengths of "runs" of the same value it
# encounters in the vector we pass to it
table(rle(results))

# values
# lengths Loon Queen
# 1  1266  1304
# 2   608   601
# 3   327   323
# 4   171   164
# 5    75    66
# 6    40    29
# 7    17    13
# 8    10    10
# 9     3     7
# 10    4     4
# 11    1     1
# 12    0     1
# 14    0     1
# 15    1     0

# I haven't used set.seed() before sample(), so your results will be similar,
# but not identical. I got 4 runs of "Loon" in the 10,000 trials, and 4 runs
#  of "Queen" - which is close to the 10 runs of the same face that we expect.

Here is a view on a similar problem, solvable in closed form. What is the probability that two persons in the same room will have the same birthday? (Assuming that all days of a 365 day-year are equally likely, which is not entirely true, and neglecting the possibility that someone is born on February 29.) If there is only one person in the room, the probability of another person sharing the same birthday is 0, because no one else is there. If there are 366 people, at least one pair must share a birthday, because there are more people than days in the year. But how does this play out in the general case?

Problems of the type "at-least-one positive outcome among many" are often better expressed by considering the chance of the event not happening - i.e. no-one sharing a birthday.

If we consider a pair of people, the probability that they do not share a birthday is 364/365. If there are two pairs, the probability of the two pairs not sharing a birthday is 364/365 x 364/365. If there are nP pairs of people, the probability that none of them shares a birthday is (364/365)^nP. How many pairs are there if there are n people in the room? That's just (n*(n-1))/2 if we don't allow pairs with oneself, and count each pair (a, b) and (b, a) only once^[4].

Let's plot this.

N <- 150
pBirthday <- numeric(N)

for (i in 1:N) {
  pBirthday[i] <- 1 - ((364/365)^((i * (i-1)) / 2))
}

plot(1:N, pBirthday,
     type = "l",
     col = "firebrick",
     xlab = "number of people in the room",
     cex.lab = "0.7",
     ylab = "Probability of at least one shared birthday",
     main = "Birthday Problem"
)

As you see, once you have more than 23 people in a room, the probability of a shared birthday is better than 50%.

Task:
Asume there are 59 students in the class. What is the probability of at least one pair of them sharing a birthday?

Did you get 0.9909? That's what I get.

Task:
If a restriction-endonuclease site is six bases long, how many such sites would you expect in a 3MB long bacterial chromosome with a GC-content of 50%?

Enumeration

These are simple examples, but our minds are curiously weak at reasoning about probabilities. Here is a famous example. In a TV game-show, contestants were led to three doors. Behind one door was the main prize - a car. Behind the two other doors was a goat. Since most people prefer cars to goats^[5], the goal of the contestant was to guess the door with the car. So they made a guess and chose a door. Then it gets interesting. The game show host opens one of the two doors that were not chosen - and there is a goat (he knows where the car is and always picks a door with a goat). And then he says: you can stay with your original choice, or you can change your choice now...

What should you do to maximize the chance of winning: stay, or switch? And what is the final probability of winning? There are three ways to think about it:

A: The probability to pick the right door is 1/3. Nothing that happens afterwards can change that probability. It doesn't matter if I stay or switch, I will win with 1/3 probability.
B: After one door is eliminated, it's actually like a new game with a probability of 1/2 to win. It doesn't matter if I stay or switch, I will win with 1/2 probability - even though my chance of winning was only 1/3 before the door was opened.
C: After one door is eliminated, I learn something about my first choice. I can use this information to infer something about where the car is not. Therefore I should switch. Then my chance of winning are 2/3.

The problem is: all three interpretations sound plausible. But they arrive at entirely different conclusions. And only one can be correct.

Task:
Think about this problem. Which solution do you prefer?

This can lead to a lot of controversy, and the argument for the correct solution is not intuitive to most people, so most people will get this wrong. But there is an rigorous way to solve this and you should pursue this strategy whenever questions of probability and choice come up: enumerate the possible outcomes with an outcomes table, or a tree diagram. Then count how many outcomes were favourable, and how mant were not.

Switching

I choose the right door. One of the wrong doors is opened. I switch to the other wrong door. I loose.
I choose one wrong door. The other wrong door is opened. I switch to the right door. I win.
I choose the other wrong door. The first wrong door is opened. I switch to the right door. I win.

Staying

I choose the right door. One of the wrong doors is opened. I stay with my right door. I win.
I choose one wrong door. The other wrong door is opened. I stay with my wrong door. I loose.
I choose the other wrong door. The first wrong door is opened. I stay with my wrong door. I loose.

This enumeration tells us: the better strategy is to switch, and the winning odds are 2/3. Total enumeration of the possible outcomes solves this right away. But getting this right is also not trivial: you need to recognize a symmetry in the problem: it does not matter to us which wrong door the host chooses when our first choice was the right door. This does not create an extra possibility on the outcome tree. If we don't realize that, we could have written the table like below, i.e. treating the host's choices as distinct:

I choose the right door. The first wrong door is opened. I switch to the second wrong door. I loose.
I choose the right door. The second wrong door is opened. I switch to the first wrong door. I loose.
I choose the first wrong door. The second wrong door is opened. I switch to the right door. I win.
I choose the second wrong door. The first wrong door is opened. I switch to the right door. I win.

... and this has us calculate the chance of winning-by-switching to be 2/4, which is the wrong answer. You realize that this is a bit subtle. There has to be a more explicit way, one that does not rely on our intuition about probability and our insight into symmetries at all. And there is such a strategy ...

Simulation

Often problems are too complex for a solution in closed form, and too large to enumerate them. Then simulation is a solution. Here is an example. You investigate genes in a particular biological assay and find that among your top ten genes, six have a particular function annotated to them. Then you might be tempted to use the argument in our discussion of closed form probabilities above to come up with the space of possible outcomes that characterizes your observation. But there's a catch. We made an unstated assumption previously, that the events were independent. But independence of observations is often a very poor assumption in biology. You always have to ask: if I made one kind of observation, is the next observation of the same kind equally likely, or more or less likely. Since biomolecules have all manners of interactions, coexpression patterns, and evolutionary relationships that lead to similar sequence, structure and function, the assumption of equal likelihood and therefore independence of events usually does not hold up. But if you can make a resonable assumption about how observations of one kind change possible other outcomes, then it is easy to set up a simulation of outcomes, and evaluate how your actual observation measures up against the possibilities^[6]

Simulation also provides the authoritative answer to the Monty Hall problem. The simulation of the game is absolutely straightforward, needs no assumptions, and there is no need to figure out and account for symmetric cases. All we need to do is to properly translate the specification into code. Thus simulation is much more resistant to mistakes.

# The "Monty Hall" problem

N <- 10000 # number of trials

nWinsStay     <- 0   # Counters to keep track of how often we win
nWinsSwitch   <- 0   # under each strategy.

for (i in 1:N) { # Simulate the game N times

  doors <- numeric(3)       # Initialize three prize doors with the number 0.
  iPrize <- sample(1:3, 1)  # Randomly place the prize ...
  doors[iPrize] <- 1        # ... (the number 1).

  iChoice <- sample(1:3, 1)  # Contestant randomly chooses one door
                             # from (1, 2, 3).

  # Determine which door the host opens
  if (iChoice == iPrize) { # If the contestant has chosen the prize door...

    remainingDoors <- (1:3)[-iPrize]   # ... the host has the two other doors
    iOpen <- sample(remainingDoors, 1) # from which to pick the one door
                                       # he will open.
  } else {
    iOpen <- (1:3)[-c(iChoice, iPrize)] # Else, there is only one wrong door
                                        # he can open: the one that was not
                                        # chosen and does not contain the prize.
  }

  # To "switch" , we pick the door that is not our first choice, and not open,
  # and we add the number behind that door (0, or 1) to our total.
  iFinal <- (1:3)[-c(iChoice, iOpen)]
  nWinsSwitch <- nWinsSwitch + doors[iFinal]

  # To "stay", we add the number behind the door of first choice to our total.
  nWinsStay <- nWinsStay + doors[iChoice]

} # End of simulation

cat(sprintf("Played: %i, stay won %i (%3.1f%%), switch won  %i (%3.1f%%).\n",
            N,
            nWinsStay, (100 * nWinsStay) / N,
            nWinsSwitch, (100 * nWinsSwitch) / N))

Task:
Try this. Are you convinced now? Or can you find an error in the translation of specification to code? (If there is no error, the result must be correct.)

Notes

↑ "Fair" mains: each side has the same likelihood of lying on top after a throw.

↑

de Boer & Hughes (2012) YeTFaSCo: a database of evaluated yeast transcription factor sequence specificities. Nucleic Acids Res 40:D169-79. (pmid: 22102575)

[ PubMed ] [ DOI ] Abstract

↑

Lin et al. (2013) Re-annotation of protein-coding genes in the genome of saccharomyces cerevisiae based on support vector machines. PLoS ONE 8:e64477. (pmid: 23874379)

[ PubMed ] [ DOI ] Abstract

↑ Note that this problem is often phrased differently, calculating first two people sharing a birthday, then how many among three, how many among four etc. In general: pN = 1-(365!/(365^n*(365-n)!)). That probability turn out a bit higher, since there can be multiple shared birthdays in the same room.
↑ https://xkcd.com/1282/
↑ To use the ratio of events r in N observations to evaluate the significance of r, we define the "empirical p-value" as (r+1)/(N+1). (cf. North et al. (2002) Am J Hum Genet 71(2):439-441)..

About ...

Author:

Boris Steipe <boris.steipe@utoronto.ca>

Created:

2017-08-05

Modified:

2020-09-22

Version:

1.1

Version history:

1.1 2020 Maintenance
1.0 First live version
0.1 First stub

This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.

[1] "Fair" mains: each side has the same likelihood of lying on top after a throw.

[2] 
de Boer & Hughes (2012) YeTFaSCo: a database of evaluated yeast transcription factor sequence specificities. Nucleic Acids Res 40:D169-79. (pmid: 22102575)

[ PubMed ] [ DOI ] Abstract
The yeast Saccharomyces cerevisiae is a prevalent system for the analysis of transcriptional networks. As a result, multiple DNA-binding sequence specificities (motifs) have been derived for most yeast transcription factors (TFs). However, motifs from different studies are often inconsistent with each other, making subsequent analyses complicated and confusing. Here, we have created YeTFaSCo (The Yeast Transcription Factor Specificity Compendium, http://yetfasco.ccbr.utoronto.ca/), an extensive collection of S. cerevisiae TF specificities. YeTFaSCo differs from related databases by being more comprehensive (including 1709 motifs for 256 proteins or protein complexes), and by evaluating the motifs using multiple objective quality metrics. The metrics include correlation between motif matches and ChIP-chip data, gene expression patterns, and GO terms, as well as motif agreement between different studies. YeTFaSCo also features an index of 'expert-curated' motifs, each associated with a confidence assessment. In addition, the database website features tools for motif analysis, including a sequence scanning function and precomputed genome-browser tracks of motif occurrences across the entire yeast genome. Users can also search the database for motifs that are similar to a query motif.

[3] 
Lin et al. (2013) Re-annotation of protein-coding genes in the genome of saccharomyces cerevisiae based on support vector machines. PLoS ONE 8:e64477. (pmid: 23874379)

[ PubMed ] [ DOI ] Abstract
The annotation of the well-studied organism, Saccharomyces cerevisiae, has been improving over the past decade while there are unresolved debates over the amount of biologically significant open reading frames (ORFs) in yeast genome. We revisited the total count of protein-coding genes in S. cerevisiae S288c genome using a theoretical approach by combining the Support Vector Machine (SVM) method with six widely used measurements of sequence statistical features. The accuracy of our method is over 99.5% in 10-fold cross-validation. Based on the annotation data in Saccharomyces Genome Database (SGD), we studied the coding capacity of all 1744 ORFs which lack experimental results and suggested that the overall number of chromosomal ORFs encoding proteins in yeast should be 6091 by removing 488 spurious ORFs. The importance of the present work lies in at least two aspects. First, cross-validation and retrospective examination showed the fidelity of our method in recognizing ORFs that likely encode proteins. Second, we have provided a web service that can be accessed at http://cobi.uestc.edu.cn/services/yeast/, which enables the prediction of protein-coding ORFs of the genus Saccharomyces with a high accuracy.

[4] Note that this problem is often phrased differently, calculating first two people sharing a birthday, then how many among three, how many among four etc. In general: pN = 1-(365!/(365^n*(365-n)!)). That probability turn out a bit higher, since there can be multiple shared birthdays in the same room.

[5] ttps://xkcd.com/1282/

[6] To use the ratio of events r in N observations to evaluate the significance of r, we define the "empirical p-value" as (r+1)/(N+1). (cf. North et al. (2002) Am J Hum Genet 71(2):439-441)..

[1]

[2]

[3]

[4]

[5]

[6]

FND-STA-Probability

Contents

Evaluation

Contents

Closed form

Enumeration

Simulation

Notes

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Sections

Tools