Dataset Toxaemia

This dataset is from the vcdExtra package. Two signs of toxaemia, an abnormal condition during pregnancy characterized by high blood pressure (hypertension) and high levels of protein in the urine. If untreated, both the mother and baby are at risk of complications or death. The dataset Toxaemia represents 13384 expectant mothers in Bradford, England in their first pregnancy, who were also classified according to social class and the number of cigarettes smoked per day.

The dataset is a 5 x 3 x 2 x 2 contingency table, with 60 observations on the following 5 variables:

class - Social class of mother, a factor with levels: 1, 2, 3, 4, 5

smoke - Cigarettes smoked per day during pregnancy, a factor with levels: 0, 1-19, 20+

hyper - Hypertension level, a factor with levels: Low, High

urea - Protein urea level, a factor with levels: Low, High

Freq - frequency in each cell, a numeric vector

Exercise 5.1

Obtain relevant graphical displays for this dataset.

Bar charts-

Code
library(tidyverse)

library(vcdExtra)
data(Toxaemia)

Toxaemia |> 
  ggplot() + 
  aes(x=smoke, y=Freq, fill=hyper) + 
  geom_bar(stat='identity')
Code
Toxaemia |> 
  ggplot() + 
  aes(x=smoke, y=Freq, fill=hyper) + 
  geom_bar(stat='identity', 
           position = "dodge"
           )
Code
Toxaemia |> 
  ggplot() + 
  aes(x=smoke, y=Freq, fill=hyper) + 
  geom_bar(stat ='identity', 
           position = "dodge") + 
  facet_grid(urea ~ ., scales = "free")

Mosaic type charts

Code
tab.data <- xtabs(Freq ~ smoke + hyper + urea, data=Toxaemia)

plot(tab.data)
Code
mosaic(tab.data, shade=TRUE, legend=TRUE)
Code
assoc(tab.data, shade=TRUE) 
Code
strucplot(tab.data)
Code
sieve(tab.data)

The full dataset is a 5 x 3 x 2 x 2 contingency table, with 60 observations on the following 5 variables. For this question we will focus on two categorical variables from this dataset, hyper and urea. This forms a 2 x 2 contingency table since these variables each have two levels.

Code
# subset the data
tox_2 <- Toxaemia |> 
  dplyr::select(hyper, urea, Freq)
Code
# the tidyverse way
tox_display <- tox_2 |> 
  pivot_wider(names_from = urea, 
              values_from = Freq,
              values_fn = sum) |>
  column_to_rownames( var = "hyper") # make values of hyper column row names

tox_display
     High  Low
High  665 2715
Low   589 9415
Code
# xtabs() 

Two signs of toxaemia, are high blood pressure (hypertension) and high levels of protein in the urine. We want to ask if in our sample of expectant mothers in Bradford, England, is high blood pressure related to high protein levels? If these two variables are associated this may indicate the presence of toxaemia in the sample, if they are independent toxaemia may not be present.

We can test this question using a Chi-squared test.

The null hypothesis of the chi-squared these is that the two variables are independent and the alternative hypothesis is that the two variables are not independent.

Our null hypothesis is that Hypertension level and the Protein urea level in expectant mothers in Bradford, England are independent.

Our alternative hypothesis that Hypertension level and the Protein urea level in expectant mothers in Bradford, England are not independent.

Set our alpha = 0.05

Code
chisq.test(tox_display)

    Pearson's Chi-squared test with Yates' continuity correction

data:  tox_display
X-squared = 563.9, df = 1, p-value < 2.2e-16

Since our p-value is less than our alpha level we reject the null hypothesis and conclude that the two variables (hyper & urea) are not independent. We found evidence of an association between hypertension levels and protein in urine levels in our sample of expectant mothers in in Bradford, England.

We can see the expected counts

Code
chisq.test(tox_display)$expected
         High      Low
High 316.6856 3063.314
Low  937.3144 9066.686
Code
# compared to our observed
tox_display
     High  Low
High  665 2715
Low   589 9415
Code
# total counts 13384

Exercise 5.2

The genetic information of an organism is stored in its Deoxyribonucleic acid (DNA). DNA is a double stranded helix made up of four different nucleotides. These nucleotides differ in which of the four bases Adenine (A), Guanine (G), Cytosine (C), or Thymine (T) they contain. Nucleotides combine to form amino acids which are the building blocks of proteins. Simply put, three nucleotides form an amino acid and the specific order of a combination dictates what amino acid is formed. A simple pattern that we may want to detect in a DNA sequence is that of the nucleotide at position i+1 based on the nucleotide at position i. The nucleotide positional data collected by a researcher in a particular case is given in the following table:

i\(i+1) A C G T
A 622 316 328 536
C 428 262 204 306
G 354 294 174 266
T 396 330 382 648

Perform a test of association and then obtain the symmetric plot.

Code
tabledata <- data.frame(
  A = c(622, 428, 354, 396),
  C = c(316, 262, 294, 330),
  G = c(328, 204, 174, 382),
  T = c(536, 306, 266, 648), 
  row.names = c("A", "C", "G", "T")
  )
Code
chisq.test(tabledata)$exp
         A        C        G        T
A 554.8409 370.5104 335.3705 541.2781
C 369.4834 246.7328 223.3322 360.4516
G 334.9983 223.7044 202.4879 326.8094
T 540.6774 361.0523 326.8094 527.4608
Code
chisq.test(tabledata)

    Pearson's Chi-squared test

data:  tabledata
X-squared = 153.21, df = 9, p-value < 2.2e-16
Code
chisq.test(tabledata, simulate.p.value = T)

    Pearson's Chi-squared test with simulated p-value (based on 2000
    replicates)

data:  tabledata
X-squared = 153.21, df = NA, p-value = 0.0004998
Code
# if there is an association we can examine patterns 
library(MASS)
corresp(tabledata)
First canonical correlation(s): 0.1443355 

 Row scores:
         A          C          G          T 
-0.1921802 -0.8894387 -1.0334109  1.4453224 

 Column scores:
         A          C          G          T 
-1.1304512 -0.6952989  0.8139424  1.1304056 
Code
plot(corresp(tabledata, nf=2))
abline(v=0)
abline(h=0)
Code
#or
library(FactoMineR)
CA(tabledata)
**Results of the Correspondence Analysis (CA)**
The row variable has  4  categories; the column variable has 4 categories
The chi square of independence between the two variables is equal to 153.2146 (p-value =  1.902013e-28 ).
*The results are available in the following objects:

   name              description                   
1  "$eig"            "eigenvalues"                 
2  "$col"            "results for the columns"     
3  "$col$coord"      "coord. for the columns"      
4  "$col$cos2"       "cos2 for the columns"        
5  "$col$contrib"    "contributions of the columns"
6  "$row"            "results for the rows"        
7  "$row$coord"      "coord. for the rows"         
8  "$row$cos2"       "cos2 for the rows"           
9  "$row$contrib"    "contributions of the rows"   
10 "$call"           "summary called parameters"   
11 "$call$marge.col" "weights of the columns"      
12 "$call$marge.row" "weights of the rows"         
  • More R code examples are here