“A statistical analysis, properly conducted, is a delicate dissection of uncertainties, a surgery of suppositions.”
— M.J. Moroney
The term statistical inference means that we are using properties of a sample to make statements about the population from which the sample was drawn.
For example, say you wanted to know the mean length of the leaves on a tree (\(\mu\)). You wouldn’t want to (nor need to) measure every single leaf! You would take a random sample of leaves and measure their lengths (\(x_i\)), calculate the sample mean (\(\bar x\)), and use \(\bar x\) as an estimate of \(\mu\).
Consider a population \(X\) with mean \(\mu\),
variance \(\text{Var}(X)=\sigma^2\), and
standard deviation \(\sigma\).
We can estimate \(\mu\) by:
Importantly:
The variability of the sample means from sample to sample, \(\text{Var}(\bar X)\), depends on two things:
The variability of the sample estimate of a parameter is usually expressed as a standard error (SE), which is simply the theoretical standard deviation of \(\bar X\) from sample to sample.
The equation is surprisingly simple!
\(\text{Var}(\bar X) = \text{Var}(X)/n\)
\(\text{SE}(\bar X) = \text{SD}(\bar X) = \sigma/\sqrt n\)
In 1846, a Belgian scholar Adolphe Quetelet published an analysis of the chest sizes of a full population of 5,738 Scottish soldiers.
The distribution of the measurements (in inches) in his database has a mean of \(\mu\) = 39.83 inches, a standard deviation of \(\sigma\) = 2.05 inches, and is well approximated by a normal distribution.
Every soldier was measured so we can treat this as the population, and \(\mu\) = 39.83 and \(\sigma\) = 2.05 as population parameters. Let’s take some samples of size \(n\) = 6 from this population.
Every soldier was measured so we can treat this as the population, and \(\mu\) = 39.83 and \(\sigma\) = 2.05 as population parameters. Let’s take some samples of size \(n\) = 6 from this population.
# Put ten samples in a tibble
q_samples <- tibble( sample = as_factor(1:10) ) |>
mutate(
values = map(sample, ~ sample(qd_long, size = 6))
) |>
unnest()
head(q_samples, 15)
# A tibble: 15 × 2
sample values
<fct> <int>
1 1 36
2 1 41
3 1 42
4 1 41
5 1 38
6 1 38
7 2 41
8 2 40
9 2 42
10 2 36
11 2 39
12 2 41
13 3 38
14 3 38
15 3 41
Every soldier was measured so we can treat this as the population, and \(\mu\) = 39.83 and \(\sigma\) = 2.05 as population parameters. Let’s take some samples of size \(n\) = 6 from this population.
# Make a function to do it all and plot
plot_samples <- function(dat = qd_long,
no_samples = 12,
sample_size = 6) {
# Put samples in a tibble
q_samples <- tibble(
sample = as_factor(1:no_samples),
values = map(sample, ~ sample(dat, size = sample_size))
) |> unnest()
# Calculate means of each sample
sample_means <- q_samples |> group_by(sample) |>
summarise(mean = mean(values))
ggplot() +
xlim(min(dat), max(dat)) +
geom_vline(xintercept = mean(dat), alpha = .4) +
geom_jitter(
data = q_samples,
mapping = aes(y = sample, x = values),
width = 0, height = 0.1, alpha = .8
) +
geom_point(
data = sample_means,
mapping = aes(y = sample, x = mean),
shape = 15, size = 3,
colour = "dark orange"
) +
ggtitle(paste("Sample size =", sample_size))
}
Let’s compare the distribution of means for different sample sizes across the samples.
The larger the \(n\), the less the sample means vary.
If is distributed as a normal random variable with mean \(\mu\) and variance \(\sigma^2\),
\[X \sim \text{Normal}(\mu, \sigma)\]
then sample means of size \(n\) are distributed as
\[\bar X \sim \text{Normal}(\mu_\bar{X} = \mu, \sigma_\bar{X} = \sigma/\sqrt{n})\] \(\text{SE}(\bar{X}) = \sigma_\bar{X}= \sigma/\sqrt{n})\) is known as the standard error of the sample mean.
More generally, a standard error is the standard deviation of an estimated parameter over \(\infty\) theoretical samples.
According to the Central Limit Theorem (CLT), raw \(X\) values don’t have to be normally distributed for the sample means to be normally distributed (for any decent sample size)!
Question:
If Quetelet’s chest circumferences are distributed as \(X \sim \text{N}(\mu=39.83, \sigma=2.05)\), then how variable are the means of samples of size \(n\) = 6?
Answer:
\[ \begin{aligned} SE(\bar{X}_{n=6})&=\sigma/\sqrt{n} \\ &=2.05/\sqrt{6} \\ &=0.8369 \end{aligned} \] Sample means of size \(n\) = 6 would be distributed as
\(\bar{X}_{n=6} \sim \text{N}(\mu=39.83, \sigma=0.8369)\)
Another common way of expressing uncertainty when making statistical inferences is to present:
A confidence interval gives an indication of the sampling error.
A 95% confidence interval is constructed so that intervals from 95% of samples will contain the true population parameter.
In reality the interval either contains the true value or not, so you must not interpret a confidence interval as “there’s a 95% probability that the interval contains the true parameter”.
We can say:
Because we know the population parameters for the Quetelet dataset, we can calculate where 95% of sample means will lie for any particular sample size.
Recall that, for a normal distribution, 95% of values lie between \(\mu \pm 1.96\times\sigma\)
(i.e., \(\mu - 1.96\times\sigma\) and \(\mu + 1.96\times\sigma\)).
It follows that 95% of means of samples of size \(n\) will lie within \(\mu \pm 1.96 \times \sigma/\sqrt{n}\).
For example, for samples of size 6 from the Quetelet dataset, 95% of means will lie within \(39.83 \pm 1.96\times 2.05 / \sqrt 6\),
so \(\{ 37.02 , 42.64\}\).
It’s all very well deriving the distribution of sample means when we know the population parameters (\(\mu\) and \(\sigma\)) but in most cases we only have our one sample.
We don’t know any of the population parameters. We have to estimate the population mean (usually denoted \(\hat\mu\) or \(\bar x\)) and estimate of the population standard deviation (usually denoted \(\hat\sigma\) or \(s\)) from the sample data.
This additional uncertainty complicates things a bit. In fact, we can’t even use the normal distribution any more!
Introducing William Sealy Gosset, who described the t distribution after working with random numbers.
Gosset was a chemist and mathematician who worked for the Guinness brewery in Dublin.
Guinness forbade anyone from publishing under their own names so that competing breweries wouldn’t know what they were up to, so he published his discovery in 1908 under the penname “Student”. Student’s identity was only revealed when he died. It is therefore often called “Student’s t distribution”.
The t distribution is like the standard normal. It is bell-shaped and symmetric with mean = 0, but it has fatter tails (greater uncertainty) due to the fact that we do not know \(\sigma\); we have to estimate it.
The t distribution is not just a single distribution. It is really an entire series of distributions which is indexed by something called the “degrees of freedom” (or \(df\)).
For a sample of size \(n\), if
then the variable:
\[ T = \frac{\bar X - \mu} {s/\sqrt{n}} \]
is distributed as a \(t\) distribution with \((n – 1)\) degrees of freedom.
The process of using sample data to try and make useful statements about an unknown parameter, \(\theta\), is called statistical inference.
A confidence interval for the true value of a parameter is often obtained by:
\[ \hat \theta \pm t \times \text{SE}(\hat \theta) \] where \(\hat \theta\) is the sample estimate of \(\theta\) and \(t\) is a quantile from the \(t\) distribution with the appropriate degrees of freedom.
For example, to get the \(t\) score for a 95% confidence interval with 9 degrees of freedom:
The piece that is being added and subtracted, \(t \times \text{SE}(\hat \theta)\), is often called the margin of error.
We can calculate a 95% confidence interval for a sample mean with the following information:
A sample of size \(n\) = 6 from Quetelet data
This method can be used when the estimator \(\hat\theta\) is approximately normally distributed and
\[ \frac{\hat\theta - \theta} {\text{SE}(\hat \theta)} \]
has approximately a Student’s t distribution.
This paves the way for hypothesis testing for specific values of \(\theta\). Many of the methods you will learn in this course are based on this general rule.
Say a farmer has 9 paddocks of alfalfa. He’s hired a new manager, and wants to know if, on average, this year’s crop is different to last year’s. He measures the yield for each paddock in year 1 and year 2, and calculates the difference.
year1 <- c(0.8, 1.3, 1.7, 1.7, 1.8, 2.0, 2.0, 2.0, 2.2)
year2 <- c(0.7, 1.4, 1.8, 1.8, 2.0, 2.0, 2.1, 2.1, 2.2)
diff <- year2 - year1
( xbar <- mean(diff) ) ; ( s <- sd(diff) )
[1] 0.06666667
[1] 0.08660254
The mean difference is 0.067. On average, yields were 0.067 greater than last year.
Is that convincing different from zero?
How likely is such a difference to have arisen just by chance, and really, if we had a million paddocks, there would be no difference?
What is the probability of seeing a difference of 0.067 or more in our dataset if the true mean were zero?
These questions can be addressed with a \(p\)-value from a hypothesis test.
The hypothesis of interest (called the “alternative” hypothesis) is:
\(\ \ \ \ \ H_A: \mu \ne 0\), that is, the mean difference in yield \(\mu\) between the two years is not zero.
The null hypothesis (the case if the alternative hypothesis is wrong) is:
\(\ \ \ \ \ H_0: \mu = 0\), that is, the mean difference is zero.
We can test the null hypothesis using a t statistic \(t_0 = \frac{\bar x - \mu_0} {\text{SE}(\bar x)}\), where
\(\ \ \ \ \ \mu_0 = 0\) is the hypothesised value of the mean and
\(\ \ \ \ \ \text{SE}(\bar x) = s/\sqrt{n}\) is the (estimated) standard error of the sample mean.
The p-value is \(\text{Pr}(|t_{df=8}| > t_0)\)
The hypothesis of interest (called the “alternative” hypothesis) is:
\(\ \ \ \ \ H_A: \mu \ne 0\), that is, the mean difference in yield \(\mu\) between the two years is not zero.
The null hypothesis (the case if the alternative hypothesis is wrong) is:
\(\ \ \ \ \ H_0: \mu = 0\), that is, the mean difference is zero.
This can be done more easily:
One Sample t-test
data: diff
t = 2.3094, df = 8, p-value = 0.04974
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
9.806126e-05 1.332353e-01
sample estimates:
mean of x
0.06666667
The \(p\)-value of 0.05 means that, if the null hypothesis were true, only 5% of sample means would be as or more extreme than the observed value of 0.067. It is the area in orange in the graph below.
We can therefore reject the null hypothesis at the conventional 5% level, and conclude that, on average, yields were indeed higher this year.
This is an example of a paired t test. They two samples (year 1 and year 2) are not independent of one another because we have the same paddocks in both years. So, we take the differences, treat them like a single sample, and do a one-sample t test for the mean difference being zero.
\(p\)-values are random variables too.
https://shiny.massey.ac.nz/anhsmith/demos/demo.p.is.rv/
In reality, the null hypothesis is either true or false.
The outcome of a test is either we reject the null hypothesis, or we fail to reject the null hypothesis (note, we never “confirm” or “accept” the null hypothesis – absence of evidence is not evidence of absence!).
This gives us four possibilities:
\(H_0\) is true | \(H_0\) is false | |
---|---|---|
Reject \(H_0\) | Type I error | Correct result |
Do not reject \(H_0\) | Correct result | Type II error |
With probabilities usually represented by:
\(H_0\) is true | \(H_0\) is false | |
---|---|---|
Reject \(H_0\) | \(\alpha\) | \(1-\beta\) = “power” |
Do not reject \(H_0\) | \(1 - \alpha\) | \(\beta\) |
The probability of making a Type I error, rejecting \(H_0\) when \(H_0\) is true, is set by the experimenter a priori (beforehand) as the significance level, \(\alpha\).
The probability of retaining (not rejecting) \(H_0\) when \(H_0\) is false is usually represented by \(\beta\) (“beta”).
The power of a hypothesis test is the probability of rejecting the null hypothesis if it is indeed false. Basically, “if we’re right, what is the probability that we’ll be able to show it?”
The power of the \(t\) test can be evaluated using R if you assert the effect size \(\delta\):
The bigger the sample size, the bigger more power we have to detect an effect.
Often, experimenters might aim to have a power of 80% or more, so they will most likely reject the null if it is false.
The term sampling distribution means the distribution of the computed statistic such as the sample mean when sampling is repeated many times.
For a normal population,
Student’s \(t\) distribution is the sampling distribution of the mean (after rescaling).
\(\chi^2\) distribution is the sampling distribution of the sample variance \(S^2\).
\((n-1)S^2/\sigma^2\) follows \(\chi^2\) distribution.
\(F\) distribution is ratio of two \(\chi^2\) distributions.
\(t\) distribution is symmetric but \(\chi^2\) and \(F\) distributions are right skewed. - For large samples, they become normal - For \(n>30\), the skew will diminish
For the three sampling distributions, the sample size \(n\) becomes the proxy parameter, called the degrees of freedom (df).
The two-sample t-test is a common statistical test. You have two populations, \(X_1\) and \(X_2\), with means \(\mu_1\) and \(\mu_2\) and variances \(\sigma^2_1\) and \(\sigma^2_2\), respectively.
We wish to test whether the two population means are equal.
Null hypothesis: \(H_0:\mu=\mu_1=\mu_2\)
Two-sided alternative hypothesis: \(H_1:\mu_1 \neq \mu_2\)
The basic t-test assumes equal variances; that is, \(\sigma^2_1 = \sigma^2_2\).
If we make this assumption and it is false, the test may give the wrong conclusion!
Under the assumption of equal variances \(\sigma^2_1 = \sigma^2_2\), we can perform a pooled-sample t-test.
We calculate the estimated pooled variance as:
\[s_{p}^{2} = w_{1} s_{1}^{2} +w_{2} s_{2}^{2}\]
where the weights are \(w_{1} =\frac{n_{1}-1}{n_{1} +n_{2}-2}\) and \(w_{2} =\frac{n_{2}-1}{n_{1} +n_{2}-2}\),
and \(n_1\) and \(n_2\) are the sample sizes for samples 1 and 2, respectively.
For the pooled case, the \(df\) for the \(t\)-test is simply
\[df = n_{1}+n_{2}-2\]
This is often called the “Welch test”.
We do not need to calculated the pooled variance \(s^2_p\);
we simply use \(s^2_1\) and \(s^2_2\) as they are.
For the unpooled case, the \(df\) for the test is smaller: \[df=\frac{\left(\frac{s_{1}^{2}}{n_{1}} +\frac{s_{2}^{2} }{n_{2}} \right)^{2} }{\frac{1}{n_{1} -1} \left(\frac{s_{1}^{2}}{n_{1}}\right)^{2} +\frac{1}{n_{2} -1} \left(\frac{s_{2}^{2}}{n_{2} } \right)^{2}}\] The \(df\) doesn’t have to be an integer in this case.
We can test the null hypothesis that the variances are equal using either a Bartlett’s test or Levene’s test. Levene’s is generally favourable over Bartlett’s.
tv = read_csv("https://www.massey.ac.nz/~anhsmith/data/tv.csv")
car::leveneTest(TELETIME~factor(SEX), data=tv)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 1 3.1789 0.08149 .
44
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Highish p-value means there’s no strong evidence against the null hypothesis that the variances are equal. Therefore, we might feel happy to assume equal variances and use a pooled variance estimate.
t-test assuming equal variances:
Two Sample t-test
data: TELETIME by factor(SEX)
t = -0.7249, df = 44, p-value = 0.4723
alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
95 percent confidence interval:
-461.3476 217.2606
sample estimates:
mean in group 1 mean in group 2
1668.261 1790.304
t-test not assuming equal variances (Welch test):
Welch Two Sample t-test
data: TELETIME by factor(SEX)
t = -0.7249, df = 40.653, p-value = 0.4727
alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
95 percent confidence interval:
-462.1384 218.0514
sample estimates:
mean in group 1 mean in group 2
1668.261 1790.304
https://shiny.massey.ac.nz/anhsmith/demos/demo.2sample.t.test/
https://shiny.massey.ac.nz/anhsmith/demos/explore.2sample.t-test/
https://shiny.massey.ac.nz/anhsmith/demos/explore.paired.t-test/
For more on non-parametric tests, see Study Guide.
Testing the null hypothesis that a proportion \(p=0.5\)
Alternative hypothesis: \(p \ne 0.5\)
Say a survey of 1000 beached whales had 450 females.
Is the population 50:50?
An exact version of the test
Exact binomial test
data: c(450, 550)
number of successes = 450, number of trials = 1000, p-value = 0.001731
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.4188517 0.4814435
sample estimates:
probability of success
0.45
To test for a proportion other than 0.5
Exact binomial test
data: c(450, 550)
number of successes = 450, number of trials = 1000, p-value < 2.2e-16
alternative hypothesis: true probability of success is not equal to 0.75
95 percent confidence interval:
0.4188517 0.4814435
sample estimates:
probability of success
0.45
Data on smokers in four group of patients
(from Fleiss, 1981, Statistical methods for rates and proportions).
4-sample test for equality of proportions without continuity correction
data: smokers out of patients
X-squared = 12.6, df = 3, p-value = 0.005585
alternative hypothesis: two.sided
sample estimates:
prop 1 prop 2 prop 3 prop 4
0.9651163 0.9677419 0.9485294 0.8536585
More on chi-squared tests next week.
Testing the hypothesis that sample data came from a normally distributed population. \(H_0: X \sim N(\mu, \sigma)\).
Example: a sample of \(n\) = 50 from N(100,1).
The large p-values mean that there’s no evidence of non-normality.
Like all tests, they can give the wrong answer. Try with set.seed(1234)
.
It can be useful to transform skewed data to make the distribution more like a ‘normal’ if that is required for a particular analysis.
A linear transformation \(Y^*= a+bY\) only changes the scale or centre, not the shape of the distribution.
Transformations can help to improve symmetry, normality, and stabilise the variance.
Right skewed distribution is made roughly symmetric using a log transformation for no. of vehicles variable (rangitikei.* dataset)
POWER | Formula | Name | Result |
---|---|---|---|
3 | \(x^3\) | cube | stretches large values |
2 | \(x^2\) | square | stretches large values |
1 | \(x\) | raw | No change |
1/2 | \(\sqrt{x}\) | square root | squashes large values |
0 | \(\log{x}\) | logarithm | squashes large values |
-1/2 | \(\frac{-1}{\sqrt{x}}\) | reciprocal root | squashes large values |
-1 | \(\frac{-1}{x}\) | reciprocal | squashes large values |
This is a normalising transformation based on the Box-Cox power parameter, \(\lambda\).
R gives a point estimate of the Box-Cox power & a confidence interval.
Usually, we are concerned about whether the residuals from a model are normally distributed – put the model in the lm
statement.
Sometimes, no “good” value of \(\lambda\) can be found.
Obtain the confidence interval using transformed data and then back transform the limits (to keep the original scale)
For hypothesis test, apply the same transformation on the value hypothesised under the null
Explore https://shiny.massey.ac.nz/anhsmith/demos/explore.transformations/ app (not currently working)
Brain weights of Animals
data are right-skewed.
data(Animals, package = "MASS")
bind_cols(
# confidence intervals on raw data
`raw x` = Animals |>
pull(brain) |>
t.test() |>
pluck("conf.int"),
# confidence intervals on log-transformed data
`log x` = Animals |>
pull(brain) |>
log() |>
t.test() |>
pluck("conf.int"),
# confidence intervals on log-transformed data
# and then back-transformed
`log x, CIs back-transformed` = Animals |>
pull(brain) |>
log() |>
t.test() |>
pluck("conf.int") |>
exp()
) |>
kable()
raw x | log x | log x, CIs back-transformed |
---|---|---|
56.88993 | 3.495101 | 32.95362 |
1092.15293 | 5.355790 | 211.83131 |
Non-parametric tests are light on assumptions, and can be used for highly asymmetric data (as an alternative to using transformations).
Many non-parametric methods rely on replacing the observed data by their ranks.
Rank the \(X\) and \(Y\) variables, and then obtain usual Pearson correlation coefficient.
The plot shows non-parametric Spearman in the the upper triangle and parametric Pearson in the bottom triangle.
A non-parametric alternative to the one-sample t-test
\(H_0: \eta=\eta_0\) where \(\eta\) (Greek letter ‘eta’) is the population median
Based on based on ranking \((|Y-\eta_0|)\), where the ranks for data with \(Y<\eta_0\) are compared to the ranks for data with \(Y>\eta_0\)
For two group comparison, pool the two group responses and then rank the pooled data
Ranks for the first group are compared to the ranks for the second group
The null hypothesis is that the two group medians are the same: \(H_0: \eta_1=\eta_2\).
Wilcoxon rank sum test with continuity correction
data: rangitikei$people by rangitikei$time
W = 30, p-value = 0.007711
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
-88.99996 -10.00005
sample estimates:
difference in location
-36.46835
Welch Two Sample t-test
data: rangitikei$people by rangitikei$time
t = -3.1677, df = 30.523, p-value = 0.003478
alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
95 percent confidence interval:
-102.28710 -22.13049
sample estimates:
mean in group 1 mean in group 2
22.71429 84.92308
A permutation (or randomisation) test has the following steps:
One sample hypothesis test example follows:
1-sample Permutation Test
data: tv$TELETIME
T = 79547, p-value = 2.842e-14
alternative hypothesis: true mu is not equal to 0
For small samples, this approach is not powerful.
2-sample Permutation Test
data: TELETIME by SEX
T = 38370, p-value = 0.471
alternative hypothesis: true mu is not equal to 0
Also using a linear model fit (cover later)
[1] "Settings: unique SS : numeric variables centered"
Call:
lmp(formula = TELETIME ~ SEX, data = tv)
Residuals:
Min 1Q Median 3Q Max
-1178.261 -510.793 -7.283 402.989 1130.739
Coefficients:
Estimate Iter Pr(Prob)
SEX 122 51 0.98
Residual standard error: 570.9 on 44 degrees of freedom
Multiple R-Squared: 0.0118, Adjusted R-squared: -0.01066
F-statistic: 0.5255 on 1 and 44 DF, p-value: 0.4723
Read the study guide example for bootstrap tests (not examined)
Basic methods of inference include:
Inference is relatively easy for normally distributed populations.
Student’s t-tests include:
The t-test is generally robust for non-normal populations (especially for large samples).
Power transformations, such as square-root, log, or Box-Cox aim to reduce skewness.
Non-parametric tests can be used for non-normal data, but they are usually less powerful than parametric tests.