Code
library(tidyverse)
library(vcdExtra)
data(Toxaemia)
|>
Toxaemia ggplot() +
aes(x=smoke, y=Freq, fill=hyper) +
geom_bar(stat='identity')
Toxaemia
This dataset is from the vcdExtra
package. Two signs of toxaemia, an abnormal condition during pregnancy characterized by high blood pressure (hypertension) and high levels of protein in the urine. If untreated, both the mother and baby are at risk of complications or death. The dataset Toxaemia
represents 13384 expectant mothers in Bradford, England in their first pregnancy, who were also classified according to social class and the number of cigarettes smoked per day.
The dataset is a 5 x 3 x 2 x 2 contingency table, with 60 observations on the following 5 variables:
class
- Social class of mother, a factor with levels: 1, 2, 3, 4, 5
smoke
- Cigarettes smoked per day during pregnancy, a factor with levels: 0, 1-19, 20+
hyper
- Hypertension level, a factor with levels: Low, High
urea
- Protein urea level, a factor with levels: Low, High
Freq
- frequency in each cell, a numeric vector
Obtain relevant graphical displays for this dataset.
Bar charts-
Mosaic type charts
The full dataset is a 5 x 3 x 2 x 2 contingency table, with 60 observations on the following 5 variables. For this question we will focus on two categorical variables from this dataset, hyper
and urea
. This forms a 2 x 2 contingency table since these variables each have two levels.
High Low
High 665 2715
Low 589 9415
Two signs of toxaemia, are high blood pressure (hypertension) and high levels of protein in the urine. We want to ask if in our sample of expectant mothers in Bradford, England, is high blood pressure related to high protein levels? If these two variables are associated this may indicate the presence of toxaemia in the sample, if they are independent toxaemia may not be present.
We can test this question using a Chi-squared test.
The null hypothesis of the chi-squared these is that the two variables are independent and the alternative hypothesis is that the two variables are not independent.
Our null hypothesis is that Hypertension level and the Protein urea level in expectant mothers in Bradford, England are independent.
Our alternative hypothesis that Hypertension level and the Protein urea level in expectant mothers in Bradford, England are not independent.
Set our alpha = 0.05
Pearson's Chi-squared test with Yates' continuity correction
data: tox_display
X-squared = 563.9, df = 1, p-value < 2.2e-16
Since our p-value is less than our alpha level we reject the null hypothesis and conclude that the two variables (hyper & urea) are not independent. We found evidence of an association between hypertension levels and protein in urine levels in our sample of expectant mothers in in Bradford, England.
We can see the expected counts
The genetic information of an organism is stored in its Deoxyribonucleic acid (DNA). DNA is a double stranded helix made up of four different nucleotides. These nucleotides differ in which of the four bases Adenine (A), Guanine (G), Cytosine (C), or Thymine (T) they contain. Nucleotides combine to form amino acids which are the building blocks of proteins. Simply put, three nucleotides form an amino acid and the specific order of a combination dictates what amino acid is formed. A simple pattern that we may want to detect in a DNA sequence is that of the nucleotide at position i+1
based on the nucleotide at position i
. The nucleotide positional data collected by a researcher in a particular case is given in the following table:
i\(i+1) |
A | C | G | T |
---|---|---|---|---|
A | 622 | 316 | 328 | 536 |
C | 428 | 262 | 204 | 306 |
G | 354 | 294 | 174 | 266 |
T | 396 | 330 | 382 | 648 |
Perform a test of association and then obtain the symmetric plot.
A C G T
A 554.8409 370.5104 335.3705 541.2781
C 369.4834 246.7328 223.3322 360.4516
G 334.9983 223.7044 202.4879 326.8094
T 540.6774 361.0523 326.8094 527.4608
Pearson's Chi-squared test
data: tabledata
X-squared = 153.21, df = 9, p-value < 2.2e-16
Pearson's Chi-squared test with simulated p-value (based on 2000
replicates)
data: tabledata
X-squared = 153.21, df = NA, p-value = 0.0004998
First canonical correlation(s): 0.1443355
Row scores:
A C G T
-0.1921802 -0.8894387 -1.0334109 1.4453224
Column scores:
A C G T
-1.1304512 -0.6952989 0.8139424 1.1304056
**Results of the Correspondence Analysis (CA)**
The row variable has 4 categories; the column variable has 4 categories
The chi square of independence between the two variables is equal to 153.2146 (p-value = 1.902013e-28 ).
*The results are available in the following objects:
name description
1 "$eig" "eigenvalues"
2 "$col" "results for the columns"
3 "$col$coord" "coord. for the columns"
4 "$col$cos2" "cos2 for the columns"
5 "$col$contrib" "contributions of the columns"
6 "$row" "results for the rows"
7 "$row$coord" "coord. for the rows"
8 "$row$cos2" "cos2 for the rows"
9 "$row$contrib" "contributions of the rows"
10 "$call" "summary called parameters"
11 "$call$marge.col" "weights of the columns"
12 "$call$marge.row" "weights of the rows"
---
title: "Chapter 5 Workshop"
---
```{r, echo=FALSE, message=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE, comment=NA, fig.show="hide")
```
# Dataset `Toxaemia`
This dataset is from the `vcdExtra` package. Two signs of *toxaemia*, an abnormal condition during pregnancy characterized by high blood pressure (hypertension) and high levels of protein in the urine. If untreated, both the mother and baby are at risk of complications or death. The dataset `Toxaemia` represents 13384 expectant mothers in Bradford, England in their first pregnancy, who were also classified according to social class and the number of cigarettes smoked per day.
The dataset is a 5 x 3 x 2 x 2 contingency table, with 60 observations on the following 5 variables:
`class` - Social class of mother, a factor with levels: 1, 2, 3, 4, 5
`smoke` - Cigarettes smoked per day during pregnancy, a factor with levels: 0, 1-19, 20+
`hyper` - Hypertension level, a factor with levels: Low, High
`urea` - Protein urea level, a factor with levels: Low, High
`Freq` - frequency in each cell, a numeric vector
# Exercise 5.1 {-}
Obtain relevant graphical displays for this dataset.
Bar charts-
```{r}
library(tidyverse)
library(vcdExtra)
data(Toxaemia)
Toxaemia |>
ggplot() +
aes(x=smoke, y=Freq, fill=hyper) +
geom_bar(stat='identity')
```
```{r}
Toxaemia |>
ggplot() +
aes(x=smoke, y=Freq, fill=hyper) +
geom_bar(stat='identity',
position = "dodge"
)
```
```{r}
Toxaemia |>
ggplot() +
aes(x=smoke, y=Freq, fill=hyper) +
geom_bar(stat ='identity',
position = "dodge") +
facet_grid(urea ~ ., scales = "free")
```
Mosaic type charts
```{r, fig.show='hide', results='hide'}
tab.data <- xtabs(Freq ~ smoke + hyper + urea, data=Toxaemia)
plot(tab.data)
mosaic(tab.data, shade=TRUE, legend=TRUE)
assoc(tab.data, shade=TRUE)
strucplot(tab.data)
sieve(tab.data)
```
The full dataset is a 5 x 3 x 2 x 2 contingency table, with 60 observations on the following 5 variables.
For this question we will focus on two categorical variables from this dataset, `hyper` and `urea`. This forms a 2 x 2 contingency table since these variables each have two levels.
```{r}
# subset the data
tox_2 <- Toxaemia |>
dplyr::select(hyper, urea, Freq)
```
```{r}
# the tidyverse way
tox_display <- tox_2 |>
pivot_wider(names_from = urea,
values_from = Freq,
values_fn = sum) |>
column_to_rownames( var = "hyper") # make values of hyper column row names
tox_display
# xtabs()
```
Two signs of *toxaemia*, are high blood pressure (hypertension) and high levels of protein in the urine. We want to ask if in our sample of expectant mothers in Bradford, England, is high blood pressure related to high protein levels? If these two variables are associated this may indicate the presence of toxaemia in the sample, if they are independent toxaemia may not be present.
We can test this question using a Chi-squared test.
The null hypothesis of the chi-squared these is that the two variables are independent and the alternative hypothesis is that the two variables are not independent.
Our null hypothesis is that Hypertension level and the Protein urea level in expectant mothers in Bradford, England are independent.
Our alternative hypothesis that Hypertension level and the Protein urea level in expectant mothers in Bradford, England are *not* independent.
Set our alpha = 0.05
```{r}
chisq.test(tox_display)
```
Since our p-value is less than our alpha level we reject the null hypothesis and conclude that the two variables (hyper & urea) are not independent. We found evidence of an association between hypertension levels and protein in urine levels in our sample of expectant mothers in in Bradford, England.
We can see the expected counts
```{r}
chisq.test(tox_display)$expected
# compared to our observed
tox_display
# total counts 13384
```
# Exercise 5.2 {-}
The genetic information of an organism is stored in its Deoxyribonucleic acid (DNA). DNA is a double stranded helix made up of four different nucleotides. These nucleotides differ in which of the four bases Adenine (A), Guanine (G), Cytosine (C), or Thymine (T) they contain. Nucleotides combine to form amino acids which are the building blocks of proteins. Simply put, three nucleotides form an amino acid and the specific order of a combination dictates what amino acid is formed. A simple pattern that we may want to detect in a DNA sequence is that of the nucleotide at position `i+1` based on the nucleotide at position `i`. The nucleotide positional data collected by a researcher in a particular case is given in the following table:
| `i\(i+1)` | A | C | G | T |
|-----------|-----|-----|-----|-----|
| A | 622 | 316 | 328 | 536 |
| C | 428 | 262 | 204 | 306 |
| G | 354 | 294 | 174 | 266 |
| T | 396 | 330 | 382 | 648 |
Perform a test of association and then obtain the symmetric plot.
```{r}
tabledata <- data.frame(
A = c(622, 428, 354, 396),
C = c(316, 262, 294, 330),
G = c(328, 204, 174, 382),
T = c(536, 306, 266, 648),
row.names = c("A", "C", "G", "T")
)
```
```{r}
chisq.test(tabledata)$exp
chisq.test(tabledata)
chisq.test(tabledata, simulate.p.value = T)
```
```{r}
# if there is an association we can examine patterns
library(MASS)
corresp(tabledata)
```
```{r}
plot(corresp(tabledata, nf=2))
abline(v=0)
abline(h=0)
```
```{r}
#or
library(FactoMineR)
CA(tabledata)
```
```{r, echo=FALSE}
options(warn=0)
```
- More R code examples are [here](../exercises/Chap5more.R)