Data come from recording observations of the world. E.g.,
Each time you collect one of these observations, it is likely to be different.
We call these random variables (often denoted with an upper case letter, \(X\)), because the particular value that a single observation or measurement will take is uncertain. The value varies across observations.
We collect data from the world to get information about patterns and processes.
Most datasets contain a subset, a sample, of a much bigger population of interest.
We may conduct a survey to collect a sample of data from different places, times, people, or organisms. We would rarely survey all of them.
We might conduct an experiment where we take a sample of elements (people, organisms, objects) and apply some treatment in a lab (e.g., drug, temperature, exercise regime, or other treatment) to study its effects.
If every element of the population of interest is represented in the dataset, we are not dealing with a sample; we call this a census.
Measuring Devices or Instruments
Measurement Error
Indirect measures
a non-sampling error
Selection stage: an element may be selected but not found
Collection stage: it may not be possible to take a measurement
Documentation stage
Call-backs reduce non-response
TARGET POPULATION is the population under study.
FRAME operationalises data collection from the target population, e.g., a listing of the elements in the population.
ACTUAL POPULATION is the resulting set of elements on which usable data have been collected.
A sample is a subset of the population.
Datasets usually only contain a sample from the population; rarely do we have the entire population of data!
Why sample?
“You don’t have to eat the whole ox to know that the meat is tough”
– Samuel Johnson (1709-1784)
Statistical inference is the process of using information from sample data to make conclusions about the population.
For example, we want to know \(\mu\), the mean length of fish in a population. So, we collect a sample of fish, measure their lengths, calculate the sample mean \(\bar{x}\), and use \(\bar{x}\) as an estimate of \(\mu\). This is statistical inference.
The sample mean \(\bar{x}\) depends on which particular fish we happened to get in our sample.
Therefore, the sample mean \(\bar{x}\) itself is a random variable.
If we were to take 1000 different samples, we’d get 1000 different means.
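The idea that the sample mean is itself a random variable can be checked with a small simulation. This is a minimal sketch, not part of the original notes: the population of fish lengths is invented (normal, mean 30 cm, SD 5 cm), and the sample size of 25 and the helper `sample_mean` are arbitrary, hypothetical choices.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical population of 10,000 fish lengths (cm); the values are invented.
population = [rng.gauss(30.0, 5.0) for _ in range(10_000)]
mu = statistics.mean(population)  # the population mean we pretend not to know


def sample_mean(n):
    """Draw a simple random sample of n fish and return its mean length."""
    return statistics.mean(rng.sample(population, n))


# Repeat the whole sampling process 1000 times.
means = [sample_mean(25) for _ in range(1000)]

print(f"population mean:                 {mu:.2f}")
print(f"mean of the 1000 sample means:   {statistics.mean(means):.2f}")
print(f"spread (SD) of the sample means: {statistics.stdev(means):.2f}")
```

Each of the 1000 values in `means` comes from a different set of fish, so the values differ from one another while clustering around the population mean.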
A method used to estimate a population parameter \(\theta\) is called an estimator; the resulting estimate is denoted \(\hat{\theta}\). An estimator includes the study design, methods of data collection, and mathematical operations.
Sampling variance is the sample-to-sample variation in an estimator.
Bias is when our estimator doesn’t get it right on average. That is, the average of the estimates over infinitely many samples is not centred on the population parameter: \(\text{Mean}(\hat{\theta}) \neq \theta\).
An estimator can have high/low sampling variance and high/low bias.
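Bias and sampling variance can both be approximated by repeating an estimator many times on a known population. The sketch below is illustrative only: the population is invented, the function `simulate` is a hypothetical helper, and the source of bias is assumed to be a miscalibrated ruler that adds a constant 2 cm to every measurement.

```python
import random
import statistics

rng = random.Random(1)

# Hypothetical population of fish lengths (cm); the values are invented.
population = [rng.gauss(30.0, 5.0) for _ in range(5_000)]
theta = statistics.mean(population)  # population parameter


def simulate(n, offset=0.0, reps=2000):
    """Approximate the bias and sampling variance of the sample mean.

    n      -- size of each simple random sample
    offset -- constant measurement error (cm) added to every length,
              standing in for a hypothetical miscalibrated ruler
    reps   -- number of repeated samples (a finite stand-in for "infinitely many")
    """
    estimates = [
        statistics.mean([x + offset for x in rng.sample(population, n)])
        for _ in range(reps)
    ]
    bias = statistics.mean(estimates) - theta
    variance = statistics.variance(estimates)
    return bias, variance


results = {
    "n=5, good ruler": simulate(5),
    "n=100, good ruler": simulate(100),
    "n=5, ruler reads 2 cm long": simulate(5, offset=2.0),
    "n=100, ruler reads 2 cm long": simulate(100, offset=2.0),
}
for label, (bias, variance) in results.items():
    print(f"{label:30s} bias = {bias:+.2f}  sampling variance = {variance:.2f}")
```

In this setup the sample size controls the sampling variance and the ruler controls the bias, so the four rows cover the four high/low combinations.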
We want our sample to be representative of (and have similar properties to) the population. The most straightforward way to do this is through randomisation.
We randomise the selection of objects for our sample to avoid bias. If we (consciously or subconsciously) tended to choose the largest fish for our sample, we’d get an upwardly biased estimate of the lengths.
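The effect of favouring larger fish can be sketched by comparing random selection with a selection rule whose probability grows with length. Everything here is an assumption made for illustration: the population is invented, and "tending to choose the largest fish" is modelled as selection probability proportional to length.

```python
import random
import statistics

rng = random.Random(7)

# Hypothetical population of fish lengths (cm); the values are invented.
population = [rng.gauss(30.0, 5.0) for _ in range(5_000)]
mu = statistics.mean(population)


def random_sample_mean(n):
    """Every fish has the same chance of selection."""
    return statistics.mean(rng.sample(population, n))


def size_biased_sample_mean(n):
    """Longer fish are more likely to be picked (weight = length).

    random.choices samples with replacement, which is a reasonable
    approximation for a small sample from a large population.
    """
    return statistics.mean(rng.choices(population, weights=population, k=n))


reps = 1000
random_means = [random_sample_mean(20) for _ in range(reps)]
biased_means = [size_biased_sample_mean(20) for _ in range(reps)]

print(f"population mean:              {mu:.2f}")
print(f"average estimate, randomised: {statistics.mean(random_means):.2f}")
print(f"average estimate, size-biased: {statistics.mean(biased_means):.2f}")
```

Averaged over many samples, the randomised estimates centre on the population mean, while the size-biased estimates sit above it.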
Simple random sampling (SRS), an equal probability of selection method (EPSEM), is the gold standard of random sampling.
Random selection of elements
SRS is easy to handle; it is suitable even with a poor sampling frame
SRS can be costly to implement
SRS estimates are more variable than some alternatives
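A minimal sketch of drawing a simple random sample from a frame, using Python's standard library. The frame of 500 ID numbers and the sample size of 20 are hypothetical.

```python
import random

rng = random.Random(2024)

# Hypothetical sampling frame: ID numbers for 500 elements.
frame = list(range(1, 501))

n = 20
# random.sample draws without replacement, so no element appears twice
# and every element has the same chance of being included.
srs = rng.sample(frame, n)

inclusion_probability = n / len(frame)
print(sorted(srs))
print(f"each element is included with probability {inclusion_probability:.2f}")
```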
Suitable for heterogeneous populations
Population is divided into relatively homogeneous groups called strata and a random sample is taken from each stratum.
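Stratified random sampling can be sketched as follows. This is an illustration under stated assumptions: the three habitats, their sizes, and the fish lengths are invented; proportional allocation is one of several possible allocation rules; and the strata are deliberately made very different from each other, so the gain over SRS is larger than would be typical in practice.

```python
import random
import statistics

rng = random.Random(3)

# Hypothetical population: fish lengths (cm) in three habitats (the strata).
strata = {
    "stream": [rng.gauss(20.0, 2.0) for _ in range(2_000)],
    "lake": [rng.gauss(30.0, 2.0) for _ in range(5_000)],
    "estuary": [rng.gauss(40.0, 2.0) for _ in range(3_000)],
}
pooled = [x for values in strata.values() for x in values]
N = len(pooled)
mu = statistics.mean(pooled)


def stratified_mean(n):
    """Proportional allocation: stratum h contributes about n * N_h / N elements."""
    estimate = 0.0
    for values in strata.values():
        n_h = round(n * len(values) / N)
        stratum_mean = statistics.mean(rng.sample(values, n_h))
        estimate += (len(values) / N) * stratum_mean  # weight by stratum share
    return estimate


def srs_mean(n):
    """Simple random sample from the whole population, ignoring strata."""
    return statistics.mean(rng.sample(pooled, n))


reps = 1000
strs_estimates = [stratified_mean(50) for _ in range(reps)]
srs_estimates = [srs_mean(50) for _ in range(reps)]

print(f"population mean:        {mu:.2f}")
print(f"STRS sampling variance: {statistics.variance(strs_estimates):.3f}")
print(f"SRS sampling variance:  {statistics.variance(srs_estimates):.3f}")
```

Because each stratum is relatively homogeneous, the stratified estimates vary less from sample to sample than SRS estimates of the same total size.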
Sampling Approaches
Advantages of STRS
A convenient method of sampling
Population is composed of clusters (groups)
Select certain clusters (randomly) and collect measurements from a random selection of the elements within the chosen clusters
Larger variance than SRS!
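The two-step selection described above, and the reason its variance tends to exceed that of SRS, can be sketched with a simulation. The ponds, their sizes, and the fish lengths are invented, and the choice of 5 clusters with 10 fish each is arbitrary.

```python
import random
import statistics

rng = random.Random(11)

# Hypothetical population: 100 ponds (clusters) of 50 fish each.
# Fish in the same pond share a pond-level mean, so they resemble each other.
clusters = []
for _ in range(100):
    pond_mean = rng.gauss(30.0, 3.0)
    clusters.append([rng.gauss(pond_mean, 4.0) for _ in range(50)])
pooled = [x for pond in clusters for x in pond]


def cluster_sample_mean(n_clusters, n_per_cluster):
    """Randomly select clusters, then randomly select elements within each."""
    chosen = rng.sample(clusters, n_clusters)
    sample = [x for pond in chosen for x in rng.sample(pond, n_per_cluster)]
    return statistics.mean(sample)


def srs_mean(n):
    """Simple random sample of the same total size, ignoring ponds."""
    return statistics.mean(rng.sample(pooled, n))


reps = 1000
cluster_estimates = [cluster_sample_mean(5, 10) for _ in range(reps)]
srs_estimates = [srs_mean(50) for _ in range(reps)]

var_cluster = statistics.variance(cluster_estimates)
var_srs = statistics.variance(srs_estimates)
print(f"cluster sampling variance: {var_cluster:.2f}")
print(f"SRS sampling variance:     {var_srs:.2f}")
```

Both designs measure 50 fish, but the cluster sample only sees 5 ponds, so pond-to-pond differences inflate its sampling variance.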
Select every \(k^{th}\) element!
Random start within the first block of elements.
Convenient, and the sample is usually representative of the population
Variance of estimates is generally greater than that of SRS
Inefficient or inappropriate if a cycle or trend is present
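A sketch of systematic sampling and of what can go wrong when the frame contains a cycle. The frame is hypothetical: 700 consecutive daily values with an artificial 7-day cycle. The helper `systematic_sample` is an illustrative name, not a library function.

```python
import math
import random
import statistics

rng = random.Random(5)


def systematic_sample(frame, k):
    """Take every k-th element after a random start within the first k elements."""
    start = rng.randrange(k)
    return frame[start::k]


# Hypothetical frame: 700 daily values with a 7-day cycle around 100.
frame = [100 + 20 * math.sin(2 * math.pi * day / 7) for day in range(700)]
mu = statistics.mean(frame)

# k = 7 lines up with the cycle: each sample only ever sees one day of the week.
means_k7 = [statistics.mean(systematic_sample(frame, 7)) for _ in range(200)]
# k = 10 does not line up with the cycle, so every day of the week is covered.
means_k10 = [statistics.mean(systematic_sample(frame, 10)) for _ in range(200)]

print(f"population mean:         {mu:.2f}")
print(f"SD of estimates, k = 7:  {statistics.stdev(means_k7):.2f}")
print(f"SD of estimates, k = 10: {statistics.stdev(means_k10):.2f}")
```

When the sampling interval matches the period of the cycle, the estimate depends almost entirely on the random start, which is why systematic sampling is a poor choice for such frames.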
Probability proportional to size (PPS)
Multistage
Non-probability sampling methods
Non-probability samples are often treated as random, requiring the assumption that the sample is representative. The validity of this assumption should be carefully considered.
Sample Design | Design Effect (\(d\)) | Effective Sample Size (\(\frac{n}{d}\)) |
---|---|---|
SRS | 1.00 | \(n\) |
STRS | 0.80 to 0.90 | \(\frac{n}{0.9}\) to \(\frac{n}{0.8}\) |
Cluster | 1.02 to 1.26 | \(\frac{n}{1.26}\) to \(\frac{n}{1.02}\) |
SyRS | 1.05 | \(\frac{n}{1.05}\) |
Quota | 2 | \(\frac{n}{2}\) |
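
The effective sample sizes in the table follow from dividing the actual sample size by the design effect. A minimal sketch of that arithmetic, using the design effects listed in the table and a hypothetical actual sample size of \(n = 1000\):

```python
# Design effects from the table; ranges are stored as (low, high).
design_effects = {
    "SRS": (1.00, 1.00),
    "STRS": (0.80, 0.90),
    "Cluster": (1.02, 1.26),
    "SyRS": (1.05, 1.05),
    "Quota": (2.00, 2.00),
}


def effective_sample_size(n, d):
    """Size of the simple random sample that would give the same precision."""
    return n / d


n = 1000  # hypothetical actual sample size
for design, (d_low, d_high) in design_effects.items():
    # A larger design effect means a smaller effective sample size.
    smallest = effective_sample_size(n, d_high)
    largest = effective_sample_size(n, d_low)
    print(f"{design:8s} effective sample size: {smallest:.0f} to {largest:.0f}")
```

A design effect below 1 (as for STRS) means the sample is worth more than an SRS of the same size; a design effect above 1 means it is worth less.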
Issues to address
Bias occurs due to
A sample may have the same biases as a census along with sampling errors