Chapter 2:
Exploratory Data Analysis (EDA)

Two modes of data analysis

Hypothesis-generating

  • “Exploratory analysis”
  • Aim is to explore data to discover new patterns
  • Results must not be presented as formal tests of a priori hypotheses
  • Testing a hypothesis using the same data that gave rise to the hypothesis is circular reasoning

Hypothesis-testing

  • “Confirmatory analysis”
  • Aim is to evaluate evidence for specific a priori hypotheses
  • The hypotheses and ideas were conceived of before the data were observed
  • Can be used for formal scientific inference

Presenting hypothesis-generating analyses as hypothesis-testing analyses (i.e., pretending the hypotheses were conceived prior to the analysis) is scientifically dishonest, and a major contributor to the replication crisis in science.

Plots for categorical data

Bar graphs

  • Show the frequency of each category (level) in categorical variables
  • The height of each bar is proportional to the frequency
  • Can be “stacked” or “clustered”

Tea data

Data from 300 individuals’ tea-drinking habits (18 questions), perceptions (12 questions), and personal details (4 questions).

data(tea, package = "FactoMineR")
glimpse(tea)
Rows: 300
Columns: 36
$ breakfast        <fct> breakfast, breakfast, Not.breakfast, Not.breakfast, b…
$ tea.time         <fct> Not.tea time, Not.tea time, tea time, Not.tea time, N…
$ evening          <fct> Not.evening, Not.evening, evening, Not.evening, eveni…
$ lunch            <fct> Not.lunch, Not.lunch, Not.lunch, Not.lunch, Not.lunch…
$ dinner           <fct> Not.dinner, Not.dinner, dinner, dinner, Not.dinner, d…
$ always           <fct> Not.always, Not.always, Not.always, Not.always, alway…
$ home             <fct> home, home, home, home, home, home, home, home, home,…
$ work             <fct> Not.work, Not.work, work, Not.work, Not.work, Not.wor…
$ tearoom          <fct> Not.tearoom, Not.tearoom, Not.tearoom, Not.tearoom, N…
$ friends          <fct> Not.friends, Not.friends, friends, Not.friends, Not.f…
$ resto            <fct> Not.resto, Not.resto, resto, Not.resto, Not.resto, No…
$ pub              <fct> Not.pub, Not.pub, Not.pub, Not.pub, Not.pub, Not.pub,…
$ Tea              <fct> black, black, Earl Grey, Earl Grey, Earl Grey, Earl G…
$ How              <fct> alone, milk, alone, alone, alone, alone, alone, milk,…
$ sugar            <fct> sugar, No.sugar, No.sugar, sugar, No.sugar, No.sugar,…
$ how              <fct> tea bag, tea bag, tea bag, tea bag, tea bag, tea bag,…
$ where            <fct> chain store, chain store, chain store, chain store, c…
$ price            <fct> p_unknown, p_variable, p_variable, p_variable, p_vari…
$ age              <int> 39, 45, 47, 23, 48, 21, 37, 36, 40, 37, 32, 31, 56, 6…
$ sex              <fct> M, F, F, M, M, M, M, F, M, M, M, M, M, M, M, M, M, F,…
$ SPC              <fct> middle, middle, other worker, student, employee, stud…
$ Sport            <fct> sportsman, sportsman, sportsman, Not.sportsman, sport…
$ age_Q            <fct> 35-44, 45-59, 45-59, 15-24, 45-59, 15-24, 35-44, 35-4…
$ frequency        <fct> 1/day, 1/day, +2/day, 1/day, +2/day, 1/day, 3 to 6/we…
$ escape.exoticism <fct> Not.escape-exoticism, escape-exoticism, Not.escape-ex…
$ spirituality     <fct> Not.spirituality, Not.spirituality, Not.spirituality,…
$ healthy          <fct> healthy, healthy, healthy, healthy, Not.healthy, heal…
$ diuretic         <fct> Not.diuretic, diuretic, diuretic, Not.diuretic, diure…
$ friendliness     <fct> Not.friendliness, Not.friendliness, friendliness, Not…
$ iron.absorption  <fct> Not.iron absorption, Not.iron absorption, Not.iron ab…
$ feminine         <fct> Not.feminine, Not.feminine, Not.feminine, Not.feminin…
$ sophisticated    <fct> Not.sophisticated, Not.sophisticated, Not.sophisticat…
$ slimming         <fct> No.slimming, No.slimming, No.slimming, No.slimming, N…
$ exciting         <fct> No.exciting, exciting, No.exciting, No.exciting, No.e…
$ relaxing         <fct> No.relaxing, No.relaxing, relaxing, relaxing, relaxin…
$ effect.on.health <fct> No.effect on health, No.effect on health, No.effect o…

Bar charts — one variable

ggplot(tea) +
  geom_bar(aes(x = price)) + 
  ggtitle("Bar chart")

Bar charts — two variables

ggplot(tea) +
  geom_bar(
    aes(x = price, fill = where)
    ) + 
  ggtitle("Stacked bar chart")

ggplot(tea) +
  geom_bar(
    aes(x = price, fill = where), 
    position = "dodge"
    ) +
  ggtitle("Clustered bar chart")

Bar charts - flipped

ggplot(tea) +
  geom_bar(
    aes(x = price, fill = where)
    ) + 
  ggtitle("Stacked bar chart") +
  coord_flip()

ggplot(tea) +
  geom_bar(
    aes(x = price, fill = where), 
    position = "dodge"
    ) +
  ggtitle("Clustered bar chart") +
  coord_flip()

Pie charts (yeah nah)

ggplot(tea) +
  aes(x = "", fill = price) +
  geom_bar() +
  coord_polar("y") + 
  xlab("") + ylab("")

ggplot(tea) +
  aes(x = price) +
  geom_bar() +
  coord_flip()

Pie charts (yeah nah)

ggplot(tea) +
  aes(x = "", fill = price) +
  geom_bar() +
  coord_polar("y") + 
  xlab("") + ylab("")

  • Pie charts are popular but not usually the best way to show proportional data
  • Requires comparison of angles or areas of different shapes
  • Bar charts are almost always better

https://shiny.massey.ac.nz/anhsmith/demos/explore.counts.of.factors/

One-dimensional graphs

Dotplots and strip charts display one-dimensional data (grouped/ungrouped) and are useful to discover gaps and outliers.

Often used to display experimental design data; not great for very small datasets (<20)

data(Animals, package = "MASS")

ggplot(Animals) +
  aes(x = brain) + 
  geom_dotplot() + 
  scale_y_continuous(NULL, breaks = NULL) +
  ggtitle("Dotplot")

One-dimensional graphs

Dotplots and strip charts display one-dimensional data (grouped/ungrouped) and are useful to discover gaps and outliers.

Often used to display experimental design data; not great for very small datasets (<20)

data(Animals, package = "MASS")

Animals |> 
  mutate(
    Animal = fct_reorder(
      rownames(Animals), 
      brain )
    ) |> 
  ggplot() +
  aes( y = Animal, 
       x = brain
       ) + 
  geom_point() + 
  ylab("Animal") + 
  ggtitle("Strip chart")

Histograms

Divide the data range into “bins”, count the occurrences in each bin, and make a bar chart.

Y-axis can show raw counts, relative frequencies, or densities

set.seed(1234); dfm <- data.frame(X = rnorm(50, 100))

p1 <- ggplot(dfm, aes(X)) + geom_histogram(bins = 20) + ylab("count") + ggtitle("Frequency histogram", "Heights of the bars sum to n")
p2 <- ggplot(dfm) + aes(x = X, y = after_stat(count/sum(count))) + geom_histogram(bins = 20) + ylab("relative frequency") +
  ggtitle("Relative frequency histogram", "Heights sum to 1")
p3 <- ggplot(dfm) + aes(x = X, y = after_stat(density)) + geom_histogram(bins = 20) + 
  ggtitle("Density histogram","Heights x widths sum to 1")

library(patchwork); p1+p2+p3

Frequency polygon & kernel density plots

Histogram
a coarse approximation of the density

ggplot(vital) + aes(Life_female) + 
  geom_histogram(bins = 12) +
  geom_freqpoly(bins = 12)

Kernel density
a smooth approximation of the density

ggplot(vital) + aes(Life_female) +
  geom_histogram(bins = 12, aes(y = after_stat(density))) + 
  geom_density()

Kernel density estimation (KDE)

Summary statistics for EDA

Five-number summary

Minimum, lower hinge, median, upper hinge and maximum

set.seed(1234)
my.data <- rnorm(50, 100)
fivenum(my.data)
[1]  97.65430  99.00566  99.46477  99.98486 102.41584
summary(my.data)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  97.65   99.01   99.46   99.55   99.96  102.42 

Boxplots

  • Graphical display of 5-number summary
  • Can show several groups of data on the same graph

Letter-Value table

An extension of the boxplot, suitable for large data sets

Shows:

  • M: Median
  • F: Fourths
  • E: Eighths
  • D: Sixteenths
  • C: Thirty-seconds
  Depth    Lower     Upper       Mid   Spread
M  25.5 99.92736  99.92736  99.92736 0.000000
F  13.0 99.43952 100.70136 100.07044 1.261832
E   7.0 98.93218 101.20796 100.07007 2.275786
D   4.0 98.73494 101.55871 100.14682 2.823770
C   2.5 98.52396 101.75099 100.13747 3.227034
B   1.5 98.17334 101.97793 100.07564 3.804590
A   1.0 98.03338 102.16896 100.10117 4.135573

Hofmann, H; Wickham, H.; Kafadar, K. 2017. “Letter-Value Plots: Boxplots for Large Data.” Journal of Computational and Graphical Statistics 26 (3): 469–77. https://doi.org/10.1080/10618600.2017.1305277.

Letter-Value Plot

An extension of the boxplot, suitable for large data sets

Cumulative frequency graphs

  • Show the left tail area
  • Useful to obtain the quantiles (deciles, percentiles, quartiles etc)
set.seed(123)

d <- data.frame(
  x = rnorm(50, 100)
  )

ggplot(d) + 
  aes(x) + 
  stat_ecdf()

Shiny apps

Lots of examples are available

  • In the study guide and workshops for this course (though not all of them are working currently)
  • On the web

https://shiny.massey.ac.nz/anhsmith/demos/explore.univariate.graphs/

https://shiny.massey.ac.nz/anhsmith/demos/get.univariate.plots/

Quantile-Quantile (Q-Q) plot

Q-Q plots compare the distributions of two data sets by plotting their quantiles against each other.

vital <- read.table(
  "https://www.massey.ac.nz/~anhsmith/data/vital.txt", 
  header=TRUE, sep=",")

quants <- seq(0, 1, 0.05)

vital |> 
  summarise(
    Female = quantile(Life_female, quants),
    Male = quantile(Life_male, quants)
  ) |> 
  ggplot() +
  aes(x = Female, y = Male) +
  geom_point() + 
  geom_abline(slope=1, intercept=0) +
  coord_fixed() +
  ggtitle(
    "Quantiles of life expectancy",
    subtitle = "are lower for males vs females"
    )

Some Q-Q Plot patterns

  • Case a: Quantiles of Y (mean/median etc) are higher than those of X

  • Case b: Spread or SD of Y > spread or SD of X

  • Case c: X and Y follow different distributions

    • R function: qqplot().

Bivariate relationships

A scatter plot shows the relationship between two quantitative variables. It can highlight linear or non-linear relationships, gaps/subgroups, outliers, etc. A lowess smoother or 2D density can help show the relationship.

p1 <- ggplot(horsehearts) +
  aes(x = EXTSYS, y = WEIGHT) +
  geom_point() + ggtitle("Scatterplot")

p1 

Bivariate relationships

A scatter plot shows the relationship between two quantitative variables. It can highlight linear or non-linear relationships, gaps/subgroups, outliers, etc. A lowess smoother or 2D density can help show the relationship.

p1 <- ggplot(horsehearts) +
  aes(x = EXTSYS, y = WEIGHT) +
  geom_point() + ggtitle("Scatterplot")

p1 + 
  geom_smooth(span = 0.8, se = FALSE) + 
  ggtitle("Scatterplot with lowess smoother")

Bivariate relationships

A scatter plot shows the relationship between two quantitative variables. It can highlight linear or non-linear relationships, gaps/subgroups, outliers, etc. A lowess smoother or 2D density can help show the relationship.

p1 <- ggplot(horsehearts) +
  aes(x = EXTSYS, y = WEIGHT) +
  geom_point() + ggtitle("Scatterplot")

p1 + 
  geom_density_2d() +
  ggtitle("Scatterplot with 2D density")

Marginal Plot

Shows both bivariate relationships and univariate (marginal) distributions

p1 <- ggplot(rangitikei) +
  aes(x = people, y = vehicle) + 
  geom_point() + theme_bw()

library(ggExtra)
ggMarginal(p1, type="boxplot")

Shiny apps

Explore (though many of these are currently not working)

https://shiny.massey.ac.nz/anhsmith/demos/explore.bivariate.plots/

https://shiny.massey.ac.nz/anhsmith/demos/get.bivariate.plots/

https://shiny.massey.ac.nz/anhsmith/demos/explore.facet.wrapped.plots/

https://shiny.massey.ac.nz/anhsmith/demos/get.facet.wrapped.plots/

https://shiny.massey.ac.nz/anhsmith/demos/explore.facet.grid.plots/

Pairs plot / scatterplot matrix

library(GGally)
ggpairs(pinetree[,-1])

Pairs plot with a grouping variable

library(GGally)
ggpairs(pinetree[,-1], 
        aes(colour = pinetree$Area))

Correlation coefficients

The Pearson correlation coefficient measures the linear association between two variables.

Correlation Matrix

  • To show all pairwise correlation coefficients
  • Useful to explore the inter-relationship between variables
library(psych)
corr.test(pinetree[,-1])
Call:corr.test(x = pinetree[, -1])
Correlation matrix 
        Top Third Second First
Top    1.00  0.92   0.96  0.97
Third  0.92  1.00   0.95  0.91
Second 0.96  0.95   1.00  0.97
First  0.97  0.91   0.97  1.00
Sample Size 
[1] 60
Probability values (Entries above the diagonal are adjusted for multiple tests.) 
       Top Third Second First
Top      0     0      0     0
Third    0     0      0     0
Second   0     0      0     0
First    0     0      0     0

 To see confidence intervals of the correlations, print with the short=FALSE option

Correlation Plots

library(corrplot)
corrplot(
  cor(pinetree[,-1]),  
  type = "upper", 
  method="number"
  )

Network plots

library(corrr)
pinetree[,-1] |> 
  correlate() |> 
  network_plot(min_cor=0.2)

3-D Plots

A bubble plot, shows the third (fourth) variable as point size (colour).

p1 <- ggplot(pinetree) +
  aes(x = First, 
      y = Second,
      size = Third) + 
  geom_point() +
  ggtitle("Bubble plot")

p1

3-D Plots

A bubble plot, shows the third (fourth) variable as point size (colour).

p1 <- ggplot(pinetree) +
  aes(x = First, 
      y = Second,
      size = Third) + 
  geom_point() +
  ggtitle("Bubble plot")

p1 + aes(colour = Area)

3-D plots are far more useful if you can rotate them

Package plot3D

library("plot3D")

scatter3D(
  x = pinetree$First, 
  y = pinetree$Second, 
  z = pinetree$Top, 
  phi = 0, bty = "g", 
  ticktype ="detailed"
  )

3-D plots are far more useful if you can rotate them

Package plotly

library(plotly)

plot_ly(
  pinetree, 
  x = ~First, 
  y = ~Second, 
  z = ~Top
  ) |> 
  add_markers()

Contour plots

  • 3D plots are difficult to interpret than 2D plots in general

  • Contour plots are another way of looking three variables in two dimensions

library(plotly)
plot_ly(type = 'contour', 
        x=pinetree$First, 
        y=pinetree$Second, 
        z=pinetree$Top)

Conditioning plots

Conditioning Plots (Coplots) show two variables at different ranges of third variable

coplot(Top ~ First | Second*Area, 
       data = pinetree)

Conditioning plots

Conditioning Plots (Coplots) show two variables at different ranges of third variable

# install.packages("remotes")
# remotes::install_github("mpru/ggcleveland")
library(ggcleveland)
gg_coplot(
  pinetree, 
  x = First, 
  y = Top, 
  faceting = Second, 
  number_bins = 6, 
  overlap = 3/4
  )

More R graphs

Build plots in a single layout (R packages patchwork or gridExtra)

p1 <- ggplot(testmarks) +
  aes(y = English, x = Maths) + 
  geom_point()

p2 <- p1 + 
  stat_density_2d(
    geom = "raster",
    aes(fill = after_stat(density)),
    contour = FALSE) + 
  scale_fill_viridis_c() + 
  guides(fill=FALSE)

library(patchwork)
p1 / p2

Time series data

  • A Time Series is an ordered sequence of observations of a variable(s) (often) made at equally spaced time points.

  • Time series Components of variation

    • Trend - representing long term positive (upward) or negative (downward) movement
    • Seasonal - a periodic behaviour happening within a block (say Christmas time) of a given time period (say in a calendar year) but this periodic behaviour will repeat fairly regularly over time (say year after year)
    • Error (Residual)

Time series example

notes = read.table("../data/20dollar.txt", header=TRUE, sep="")

library(forecast)

# value in millions; turn into time series (ts) object
NZnotes20 <- ts(notes$value / 1000, start=1968, frequency=1)

autoplot(
  NZnotes20, 
  xlab="Year", 
  ylab="Value of $20 notes (millions)"
  ) + geom_point()

Autocorrelation function (ACF)

Autocorrelation captures the extent to which neighbouring data values are similar to each other.

The \(k ^ \text{th}\) order ACF or the autocorrelation between \(x_t\) and \(x_{t-k}\) is

\[\frac{\text{Covariance}(x_t, x_{t-k})}{\text{SD}(x_t)\text{SD}(x_{t-k})} = \frac{\text{Covariance}(x_t, x_{t-k})}{\text{Variance}(x_t)}\]

Autocorrelation function (ACF) Plot

The significance of autocorrelations may be judged from the 95% confidence interval band

ggAcf(NZnotes20)

Autocorrelations decay to zero ($20 notes positively depend on the values of $20 notes held in the immediate past rather than too distant past)

PACF (Partial Autocorrelation Function)

A type of correlation after removing the effect of earlier lags

ggPacf(NZnotes20)

Time series trend types

Requires a (parametric) model to fit the trend (covered later)

Non-parametric fits can also be made

Seasonality

Simple scatter plot of the response variable against time may reveal seasonality directly

# Data from:
# https://www.stats.govt.nz/indicators/uv-intensity/

uv = read.table("../data/uv.txt",
                header=TRUE, sep="")

uv <- ts(uv$erythemal.uv, 
         start=c(1990,1), 
         frequency=12)

autoplot(uv, 
         xlab="time", 
         ylab="Erythemal UV")

Sub-series plots

Seasonality is easily seen graphically when grouping variables are used

p1 <- ggseasonplot(uv)
p2 <- ggsubseriesplot(uv)
library(patchwork); p1 + p2

Time series decomposition

  • Additive model
    \(X_t\) = Trend + Seasonal + Error
    (where \(X_t\) is an observation at time \(t\))

  • Multiplicative model
    \(X_t\) = Trend \(\times\) Seasonal + Error
    (trend and seasonal components are not independent)

  • Detrending means removing the trend from the series, making it easier to see the seasonality.

  • Deseasoning means removing the seasonality from the series, making it easier to see the trend.

uv |> decompose(type="additive") |> autoplot() + ggtitle("")

Learning EDA

  • The best way to learn EDA is to try many approaches and find which are informative and which are not.

  • Chatfield (1995) on tackling statistical problems:

    • Do not attempt to analyse the data until you understand what is being measured and why. Find out whether there is prior information such as are there any likely effects.
    • Find out how the data were collected.
    • Look at the structure of the data.
    • The data then need to be carefully examined in an exploratory way before attempting a more sophisticated analysis.
    • Use common sense, and be honest!

Summary

  • Size

    • For small datasets, we cannot be too confident in any patterns we see. More likely for patterns to occur ‘by chance’.
    • Some displays are more affected by sample size than others
  • Shape

    • In can be interesting to display the overall shape of distribution.
    • Are there gaps and/or many peaks (modes)?
    • Is the distribution symmetrical? Is the distribution normal?
  • Outliers

    • Boxplots & scatterplots can reveal outliers
    • More influential than points in the middle
  • Graphs should be simple and informative; certainly not misleading!