“Misunderstanding of probability may be the greatest of all impediments to scientific literacy.”
— Stephen Jay Gould
Intro to probability
Probability and randomness are placeholders for incomplete knowledge.
After I shuffled a deck of cards, you might consider the identity of the top card to be “random”.
But is it really?
If you knew the starting positions of the cards and had a good HD video of my shuffling, you could surely work out the final positions of the cards, and which one is on top.
Likewise for rolling a die. If we know everything about the starting position, how it was thrown, the texture of the surface, humidity, etc., could we predict what it would roll?
For mutually exclusive events,
The probability of any two events co-occurring is zero
The probability of one event or another event occurring is the sum of the two respective probabilities.
The probability of any one event not occurring is the sum of the probabilities of the remaining events (provided the events together cover the whole sample space).
Example: A randomly selected single digit can be either odd (Event \(O\)) or even (Event \(E\)).
The events \(O\) and \(E\) are mutually exclusive because a number cannot be both odd and even.
The sample space is \(\{0,1,2,3,4,5,6,7,8,9\}\).
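A quick check in R, treating each digit as equally likely (the proportion of a logical vector gives the corresponding probability):

digits <- 0:9
odd  <- digits %% 2 == 1
even <- digits %% 2 == 0
mean(odd)          # P(O) = 0.5
mean(even)         # P(E) = 0.5
mean(odd & even)   # P(O and E) = 0, mutually exclusive
mean(odd | even)   # P(O or E) = 1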
If events \(A\) and \(B\) are statistically independent, then \(P(A \text{ and } B) = P(A) \times P(B)\).
Conditional probability
\(P(A|B)\) is the probability of event \(A\) occurring given that event \(B\) has occurred.
For example, the probability of a card you’ve drawn being a 5, given that it is a spade.
The sample space is reduced to that where \(B\) (e.g. the card is a spade) has occurred.
We say that two events (\(A\) and \(B\)) are independent if \(P(A | B) = P(A)\) and \(P(B | A) = P(B)\).
Observing event \(A\) doesn’t make event \(B\) any more or less likely, and vice versa.
For any two events \(A\) and \(B\), \(P(A \text{ and } B) = P(A|B) \times P(B)\) and \(P(A \text{ and } B) = P(B|A) \times P(A)\).
Two systems for categorising blood are the Rh system (Rh+ or Rh–) and the K system (K+ or K–).
For any person, their blood type in any one system
is independent of their blood type in any other.
For Europeans in New Zealand,
about 81% are Rh+ and about 8% are K+.
Because the two systems are independent, joint probabilities are found by multiplying the marginal probabilities.
If a European New Zealander is chosen at random, what is the probability that they are (Rh+ and K+) or (Rh– and K–)?
Suppose that a murder victim has a bloodstain on him with type (Rh– and K+), presumably from the assailant. What is the probability that a randomly selected person matches this type?
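Both answers come from multiplying the marginal probabilities; a quick computation in R using the figures above:

p_rh_pos <- 0.81; p_k_pos <- 0.08
p_rh_neg <- 1 - p_rh_pos               # 0.19
p_k_neg  <- 1 - p_k_pos                # 0.92
p_rh_pos * p_k_pos + p_rh_neg * p_k_neg   # P((Rh+ and K+) or (Rh- and K-)) = 0.2396
p_rh_neg * p_k_pos                        # P(Rh- and K+), matching the bloodstain = 0.0152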
\[P(A\mid B)=\frac{P(B\mid A)\,P(A)}{P(B)}, \qquad \text{provided } P(B)>0\]
\(P(A\mid B)\) and \(P(B\mid A)\) are conditional probabilities.
\(P(A)\) and \(P(B)\) are marginal or prior probabilities.
Let \(D\) be the event of a person having the Disease and \(H\) be the event of a person being Healthy (i.e., not having the disease). The outcome of a test for the disease can be either positive \((T_+)\) or negative \((T_-)\).
Consider the following conditional probabilities: the sensitivity of the test, \(P(T_+ \mid D)\), and its specificity, \(P(T_- \mid H)\).
Say the following were true: \(P(D) = 0.03\) (so \(P(H) = 0.97\)), \(P(T_+ \mid D) = 0.98\), and \(P(T_- \mid H) = 0.95\) (so \(P(T_+ \mid H) = 0.05\)).
We can use Bayes' rule to answer the following questions:
It can be useful to visualise the probabilities of the four possible states using a tree diagram.
Rules of the Probability Tree
Within each level, all branches are mutually exclusive events.
The tree covers all possibilities (i.e., the entire sample space).
We multiply as we move along branches.
We add when we move across branches.
|   | T+     | T-     | Total |
|---|--------|--------|-------|
| D | 0.0294 | 0.0006 | 0.03  |
| H | 0.0485 | 0.9215 | 0.97  |
What proportion of the overall population will test positive vs negative?
The overall proportion of positive tests will be given by:
\[ \begin{aligned} P(T_{+}) &= P(T_{+} \& D) + P(T_{+} \& H) \\ &= P(T_{+} \mid D)P(D) + P(T_{+} \mid H)P(H) \\ &= 0.98 \times 0.03 + 0.05 \times 0.97 \\ &= 0.0779 \end{aligned} \] The overall proportion of negative tests will be given by:
\[ \begin{aligned} P(T_{-}) &= 1 - P(T_{+}) \\ &= 0.9221 \end{aligned} \]
Complete table of probabilities:
|       | T+     | T-     | Total |
|-------|--------|--------|-------|
| D     | 0.0294 | 0.0006 | 0.03  |
| H     | 0.0485 | 0.9215 | 0.97  |
| Total | 0.0779 | 0.9221 | 1     |
What are the implications of a positive or negative test result?
According to Bayes rule, the probability of a random person having the disease given they’ve tested positive is given by:
\[ \begin{aligned} P(D\mid T_{+}) &= \frac {P(T_{+}\mid D)P(D)} {P(T_{+})} \\ &= \frac{0.98 \times 0.03} {0.0779} \\ &= 0.3774 \end{aligned} \]
According to Bayes rule, the probability of a random person not having the disease given they’ve tested negative is given by:
\[ \begin{aligned} P(H \mid T_{-}) &= \frac {P(T_{-} \mid H)P(H)} {P(T_{-})} \\ &= \frac{0.95 \times 0.97} {0.9221} \\ &= 0.9993 \end{aligned} \]
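These calculations are easy to reproduce in R:

p_D     <- 0.03                 # prevalence, P(D)
p_H     <- 1 - p_D
p_pos_D <- 0.98                 # sensitivity, P(T+ | D)
p_neg_H <- 0.95                 # specificity, P(T- | H)
p_pos_H <- 1 - p_neg_H
p_pos <- p_pos_D * p_D + p_pos_H * p_H   # P(T+) = 0.0779
p_neg <- 1 - p_pos                       # P(T-) = 0.9221
p_pos_D * p_D / p_pos    # P(D | T+) = 0.3774, the positive predictive value
p_neg_H * p_H / p_neg    # P(H | T-) = 0.9993, the negative predictive value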
The positive predictive value of the test is poor—only 38% of the subjects who tested positive will have the disease.
The negative predictive value is better—if a random subject tests negative, they’re very unlikely to have the disease.
Discrete probability distributions
Consider the number of eggs \((X)\) in an Adelie penguin’s nest. The values range from \(1\) to \(5\), each with a certain probability (or relative frequency) of occurrence.
The population mean \(\mu_X\) is simply the sum of each outcome multiplied by its probability.
\[\mu_X = E(X)= \sum xP(X=x)=\sum xP(x)\]
In R,
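For example, with illustrative probabilities for each clutch size (the actual values would come from the penguin data):

x  <- 1:5                                 # possible numbers of eggs
p  <- c(0.10, 0.35, 0.30, 0.15, 0.10)     # illustrative probabilities, must sum to 1
mu <- sum(x * p)                          # population mean, E(X)
mu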
The population variance is given by
\[Var(X)= \sigma_X^2=\sum (x-\mu_X)^2 P(x)\]
The population SD is simply the square-root of the variance.
In R,
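Continuing with the same x and p as above:

mu     <- sum(x * p)            # population mean, as before
sigma2 <- sum((x - mu)^2 * p)   # population variance
sqrt(sigma2)                    # population SD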
Consider a variable that has two possible outcomes
(say success and failure, with 50% probability each).
This can be described as a “Bernoulli” random variable.
A “Binomial” is just a collection of Bernoulli trials.
Let \(X\) be the number of heads when two coins are tossed.
The count of the number of successes \(X\) out of a fixed total of
\(n\) independent trials follows the binomial distribution.
That is, \(X \sim Bin(n, p)\), where \(p\) is the probability of a success.
The binomial probability function \(P(X=x)\) or \(P(x)\)
is given by \[P(x)={n \choose x}p^{x}(1-p)^{n-x}\]
For \(n=10\), \(p=0.3\), the binomial probabilities,
\(P(x)\) for \(x=0,1,2, \dots, 10\), are plotted to the right.
If each of 10 basketball shots succeeded with probability 0.3, this describes the probability of your total score out of 10.
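A minimal sketch of how these probabilities can be computed and plotted with dbinom():

library(ggplot2)
d <- data.frame(x  = 0:10,
                px = dbinom(0:10, size = 10, prob = 0.3))
ggplot(d, aes(x, px)) +
  geom_col() +
  labs(y = "P(x)")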
A microbiologist plates out certain bacteria on a plate, and picks out 10 colonies. She knows that the probability of successfully creating a recombinant is 0.15.
What is the probability that if she mixes all 10 colonies in a growth medium with penicillin, something (anything) will grow?
In other words:
If \(X \sim Bin(n = 10, p = 0.15)\), what is \(P(x > 0)\)?
Note \(P(x > 0)=1-P(x = 0)\). So in R, compute this as follows:
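1 - dbinom(0, size = 10, prob = 0.15)   # P(X > 0) = 1 - P(X = 0)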
or
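1 - pbinom(0, size = 10, prob = 0.15)   # same result, via the cumulative probability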
The code pbinom(k,size=n,prob=p)
gives the cumulative probabilities up to and including the quantile \(k\).
The cumulative distribution function (CDF) for a binomial random variable is:
\[P(X\leq k)=\sum _{x=0}^{k}{n \choose x}p^{x}(1-p)^{n-x}\]
The mean and variance of a binomial random variable are given by
\[\mu_X=np~~~~ \sigma^2_X=np(1-p)\]
In the last example, the expected number of recombinant colonies is
\[\mu_X=np=10 \times 0.15=1.5\]
with standard deviation
\[\sigma_X=\sqrt {np(1-p)}=1.129159\]
The Poisson distribution is used to obtain the probabilities of counts of relatively rare events that occur independently in space or time.
Some Examples:
The number of snails in a quadrat \((1~m^2)\)
Fish counts in a visual transect (25 m × 5 m)
Bacterial colonies in 2 litres of milk
The random variable \(X\), the number of occurrences (count), often follows the Poisson distribution whose probability function is given by
\[\Pr(x)= \frac{\lambda^x e^{-\lambda}}{x!}~~~ x=0,1,2,\dots, \infty\]
The parameter \(\lambda\) is the mean which is also equal to the variance.
\[\mu_X=\lambda~~~~ \sigma^2_X=\lambda\]
Main assumptions:
The events occur at a constant average rate of \(\lambda\) per unit time or space.
Occurrences are independent of one another, and no two events happen at exactly the same point in time or space.
Consider the number of changes that accumulate along a
stretch of a neutrally evolving gene over a given period of time.
This is a Poisson random variable with a
population mean of \(\lambda=kt\), where
\(k\) is the number of mutations per generation, and
\(t\) is the time in generations that has elapsed.
Assume that \(k = 1\times10^{-4}\) and \(t = 500\).
For \(\lambda=kt=0.05\), the Poisson probabilities are shown in the following plot.
What is the probability that at least one mutation has occurred over this period?
\(P(x > 0)=1-P(x = 0)\) is found in R as follows:
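1 - dpois(0, lambda = 0.05)   # P(X > 0) = 1 - e^(-0.05), about 0.0488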
Continuous probability distributions
A discrete random variable takes only isolated values (e.g., whole numbers) on the real line. In other words, there is an inherent discontinuity between the values a discrete random variable can take.
If a random variable, \(X\), can take any value (i.e., not just integers) in some interval of the real line, it is called a continuous random variable.
E.g., height, weight, length, percentage protein
For a discrete random variable \(X\), the associated probabilities \(P(X=x)\) are also just points or masses, and hence the probability function \(P(x)\) is also called the probability mass function (PMF).
For continuous random variables, probabilities can be computed for the variable falling in an interval, such as \(5\) to \(15\), but the probability of it taking any single fixed value, such as exactly \(10\), is zero.
The Probability Density Function (PDF) gives the relative likelihood of any particular value.
For example, consider a random proportion \((X)\) between \(0\) and \(1\). Here \(X\) follows a (standard) continuous uniform distribution whose (probability) density function \(f(x)\) is defined as follows:
\[f(x)=\begin{cases}{1}~~~\mathrm {for} \ 0\leq x\leq 1,\\[9pt]0~~~\mathrm {for} \ x<0\ \mathrm {or} \ x>1\end{cases}\] This constant density function is the simple one in the graph to the right.
Continuous probability distributions
The density is the relative likelihood of any value of \(x\); that is, the height of the Probability Density Function (PDF). Say, the leaves of a particular tree had mean length 20 cm, SD 2.
The black line is the PDF, or \(f(x)\). The orange area underneath the whole PDF is 1.
library(tibble)    # tibble()
library(ggplot2)   # plotting; arrow() comes from grid and is re-exported by ggplot2

d <- tibble(x = seq(13, 27, by = 0.01),     # grid of leaf lengths
            Density = dnorm(x, 20, 2))      # N(20, 2) density at each length
p <- ggplot(d) + aes(x, Density) +
  geom_hline(yintercept = 0) +
  geom_area(colour = 1,                     # shade the full area under the PDF (= 1)
            fill = "darkorange",
            size = 1.1, alpha = .6)
p +
  annotate(geom = "path",                   # mark the density at x = 19.3
           x = c(19.3, 19.3, 13),
           y = c(0, rep(dnorm(19.3, 20, 2), 2)),
           arrow = arrow(),
           colour = "dodgerblue4", size = 1.1)
The density at 19.3 is \(f(19.3) = 0.1876\).
The area under the curve to the left of the value 19.3 is given by the Cumulative Density Function (CDF), or \(F(x)\). It gives the probability that x < 19.3; \(F(19.3) = 0.3632\).
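Both values come straight from dnorm() and pnorm():

dnorm(19.3, mean = 20, sd = 2)   # density, f(19.3) = 0.1876
pnorm(19.3, mean = 20, sd = 2)   # CDF, F(19.3) = P(X < 19.3) = 0.3632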
The cumulative distribution function (CDF), \(F(x)\), gives the left-tail area, i.e. the probability up to \(x\). This probability is found as
\[F_{X}(x)=\int _{-\infty }^{x}f_{X}(t)\,dt\] The relationship between the density function \(f(x)\) and the distribution function \(F(x)\) is given by the Fundamental Theorem of Calculus.
\[f(x)={dF(x) \over dx}\]
The total area under the PDF curve is \(1\). The probability of obtaining a value between two points (\(a\) and \(b\)) is the area under the PDF curve between those two points. This probability is given by \(F(b)-F(a)\).
For the uniform distribution \(U(0,1)\), \(f(x)=1\) for \(0\leq x\leq 1\). So
\[F_{X}(x)=\int _{0}^{x} 1\,dt = x, \qquad 0\leq x\leq 1\]
For example, the probability of a randomly drawn fraction from the interval \([0,1]\) to fall below \(x=0.5\) is 50%.
The probability of a random fraction falling between \(a=0.2\) and \(b=0.8\) is
\[F(b)-F(a)=0.8-0.2=0.6\]
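The same results from punif() in R:

punif(0.5)                # P(X < 0.5) = 0.5
punif(0.8) - punif(0.2)   # P(0.2 < X < 0.8) = 0.6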
Continuous probability distributions
The Gaussian or Normal Distribution is parameterised in terms of the mean \(\mu\) and the variance \(\sigma ^{2}\) and its Probability Density Function (PDF) is given by
\[f(x)={\frac {1}{\sigma {\sqrt {2\pi }}}}e^{-{\frac {1}{2}}\left({\frac {x-\mu }{\sigma }}\right)^{2}}\] A Standard Normal Distribution has mean \(\mu=0\) and standard deviation \(\sigma=1\). It has a simpler PDF:
\[f(z)={\frac {1}{ {\sqrt {2\pi }}}}e^{-{\frac {1}{2}}z^{2}}\] If \(X \sim N(\mu, \sigma)\), you can convert the \(X\) values into \(Z\)-scores by subtracting the mean \(\mu\) and dividing by the standard deviation \(\sigma\).
\[Z={\frac {X-\mu }{\sigma }}\]
We often deal with the standard normal because the symmetric bell shape of the normal distribution remains the same for all \(\mu\) and \(\sigma\).
library(tibble)
library(ggplot2)
library(patchwork)   # for stacking the two plots with /

# standard normal PDF f(x) and CDF F(x) over a grid of z values
dfn <- tibble(x = seq(-4, 4, length = 1000),
              `f(x)` = dnorm(x),
              `F(x)` = pnorm(x))
p1 <- ggplot(dfn) + aes(x = x, y = `f(x)`) + geom_line() +
  geom_vline(xintercept = 0) +
  labs(title = "Standard Normal Density",
       x = "standard normal deviate, z")
p2 <- ggplot(dfn) + aes(x = x, y = `F(x)`) + geom_line() +
  geom_vline(xintercept = 0) +
  labs(title = "Standard Normal CDF",
       x = "standard normal deviate, z")
p1 / p2
Continuous probability distributions
The weight of an individual of Amphibola crenata, a marine snail,
is normally distributed with a mean of \(40g\) and variance of \(20g^2\).
What is the probability of getting a snail that weighs between \(35g\) and \(50g\)?
What is the probability of getting a snail that weighs below \(35g\) or over \(50g\)?
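These probabilities can be found with pnorm(), remembering that R expects the standard deviation (here \(\sqrt{20}\)), not the variance:

p_between <- pnorm(50, mean = 40, sd = sqrt(20)) -
             pnorm(35, mean = 40, sd = sqrt(20))
p_between       # P(35 < X < 50), about 0.86
1 - p_between   # P(X < 35 or X > 50)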
Under standard normal, the areas under the PDF curve are shown below for various situations.
Continuous probability distributions
For small samples, the shape might be difficult to judge.
library(ggplot2)
library(patchwork)

set.seed(1234)
# a small sample (n = 50) from N(80, 12)
dfm <- data.frame(
  x = rnorm(50,
            mean = 80,
            sd = 12)
)
# histogram on the density scale, with the true normal density overlaid
p1 <- ggplot(dfm) +
  geom_histogram(
    aes(x = x, y = after_stat(density)),
    colour = 1
  ) +
  stat_function(
    fun = dnorm,
    args = list(mean = 80, sd = 12),
    geom = "line"
  ) +
  xlim(min(dfm), max(dfm))
# boxplot of the same sample, aligned underneath
p2 <- ggplot(dfm) + aes(x) +
  geom_boxplot() +
  xlim(min(dfm), max(dfm)) +
  theme_void()
p1 / p2 + plot_layout(heights = c(5, 1))
Skewed continuous probability distributions
In a normal quantile plot, the quantiles of the sample are plotted against the theoretical quantiles of the fitted normal distribution.
The points should roughly lie on a straight line
We can also compare the empirical and theoretical CDFs.
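For a numeric sample (hypothetical data below), both checks can be drawn with ggplot2:

library(ggplot2)
set.seed(1)
d <- data.frame(x = rnorm(50, mean = 20, sd = 2))   # hypothetical sample

# normal quantile (Q-Q) plot: points should lie close to the line
ggplot(d, aes(sample = x)) + geom_qq() + geom_qq_line()

# empirical CDF against the fitted normal CDF
ggplot(d, aes(x)) +
  stat_ecdf() +
  stat_function(fun = pnorm,
                args = list(mean = mean(d$x), sd = sd(d$x)),
                linetype = 2)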
TV viewing time data
Skewed continuous probability distributions
Remember, theoretical distributions aren’t real—they’re just models—but they can be useful. Keep your purpose in mind.
Choose the simplest distribution that provides an adequate fit.
Data may be best served by a mixture of two or more distributions rather than a single distribution.