Paul Apivat

Posted on Nov 18, 2020 • Edited on Nov 20, 2020

Going beyond summary statistics

#datascience #analytics #analyst #rstats

Datasaurus Introduction

note: This post uses R, Rmarkdown, and packages in the Tidyverse and other R packages.

I recently came across the Datasaurus dataset by Alberto Cairo on #TidyTuesday and wanted to create a series of charts illustrating the lessons associated with this dataset, primarily to: never trust summary statistics alone.

First, some context. Here's Alberto's original tweet from years ago when he created this dataset:

This tweet alone doesn't communicate why we shouldn't trust summary statistics alone, so let's unpack this. First we'll load the various packages and data we'll use.

Load Packages

library(tidyverse)
library(ggcorrplot)
library(ggridges)

Load Data

note : datasaurus and datasaurus_dozen are identical. The former is provided via #TidyTuesday, the latter from this research paper discussing more advanced concepts beyond the scope of this document (i.e., simulated annealing).

You'll also note that datasaurus_dozen and datasaurus_wide are the same data, organized differently. The former in long format and the latter, in wide format - see here for details.

For the most part, we'll use datasaurus_dozen throughout this document. We'll use datasaurus_wide when we get to the correlation section.

datasaurus <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-10-13/datasaurus.csv')
datasaurus_dozen <- read_tsv('DatasaurusDozen.tsv')
datasaurus_wide <- read_tsv('DatasaurusDozen-wide.tsv')

Summary Statistics

First, we'll note that if we just look at summary statistics (i.e., mean and standard deviation), we might conclude that these variables are all the same. Moreover, within each variable, x and y values have very similarly low correlations at ranging from -0.06 to -0.07.

datasaurus_dozen %>%
    group_by(dataset) %>%
    summarize(
        x_mean = mean(x),
        x_sd = sd(x),
        y_mean = mean(y),
        y_sd = sd(y),
        corr = cor(x,y)
    )

There are 13 variables, each with X- and Y- axes. Here are the summary statistics. You'll note that all 13 variables have the same mean and standard deviation for their X and Y values.

Boxplots

You could use boxplots to show slight variation in the distribution and median values of these 13 variables. However, the mean values, indicated with the red circles, are identical.

datasaurus_dozen %>%
    ggplot(aes(x = dataset, y = x, fill = dataset)) +
    geom_boxplot(alpha = 0.6) +
    stat_summary(fun = mean, geom = "point", shape = 20, size = 6, color = "red", fill = "red") +
    scale_fill_brewer(palette = "Set3") +
    theme_classic() +
    theme(legend.position = 'none') +
    labs(
        y = '13 variables',
        x = 'X-values',
        title = "Boxplots: Slight differences in the distribution and median values (X-axis)",
        subtitle = "Identical mean values"
    )

Here's the same plot for y values:

datasaurus_dozen %>%
    ggplot(aes(x = dataset, y = y, fill = dataset)) +
    geom_boxplot(alpha = 0.6) +
    stat_summary(fun = mean, geom = "point", shape = 20, size = 6, color = "red", fill = "red") +
    scale_fill_brewer(palette = "Paired") +
    theme_classic() +
    theme(legend.position = 'none') +
    labs(
        y = '13 variables',
        x = 'Y-values',
        title = "Boxplots: Slight differences in the distribution and median values (Y-axis)",
        subtitle = "Identical mean values"
    )

Ridgeline Plot

We can begin to get a sense for how these variables are different if we plot the distribution in different ways. The ridgeline plot begins to reveal aspects of the data that were hidden before.

We can begin to see that certain variables have markedly different distribution shapes (i.e., v_lines, dots, x_shape, wide_lines), while having the same mean value.

datasaurus_dozen %>%
    ggplot(aes(x = x, y = dataset, fill = dataset)) +
    geom_density_ridges_gradient(scale = 3, quantile_lines = T, quantile_fun = mean) +
    scale_fill_manual(values = c('#a6cee3', '#1f78b4', '#b2df8a', '#33a02c', '#fb9a99', '#e31a1c', '#fdbf6f', '#ff7f00', '#cab2d6', '#6a3d9a', '#ffff99', '#b15928', 'grey')) +
    theme_classic() +
    theme(legend.position = 'none') +
    labs(
        x = "X-values",
        y = "13 variables",
        title = "Ridgeline Plot: More variation in the distribution (X-axis)",
        subtitle = "Identical mean values"
    )

For y values, high_lines, dots, circle and star have obviously different distributions from the rest. Again, the mean values are identical across variables.

datasaurus_dozen %>%
    ggplot(aes(x = y, y = dataset, fill = dataset)) +
    geom_density_ridges_gradient(scale = 3, quantile_lines = T, quantile_fun = mean) +
    scale_fill_manual(values = c('#a6cee3', '#1f78b4', '#b2df8a', '#33a02c', '#fb9a99', '#e31a1c', '#fdbf6f', '#ff7f00', '#cab2d6', '#6a3d9a', '#ffff99', '#b15928', 'grey')) +
    theme_classic() +
    theme(legend.position = 'none') +
    labs(
        x = "Y-values",
        y = "13 variables",
        title = "Ridgeline Plot: More variation in the distribution (Y-axis)",
        subtitle = "Identical mean values"
    )

Correlations

If you skip visualizing the distribution and central tendencies and go straight to seeing how the variables correlate with each other, you could also miss some fundamental differences in the data.

In particular, the x and y values across all 13 variables are highly correlated. With just knowledge of the summary statistics, one could be led to believe that these variables are highly similar.

Below is the code to create an (abbreviated) correlation matrix.

library(ggcorrplot)

# X-values
# selecting rows 2-143 
# turning all values from character to numeric
datasaurus_wide_x <- datasaurus_wide %>%
    slice(2:143) %>%
    select(away, bullseye, circle, dino, dots, h_lines, high_lines, slant_down, slant_up, star, v_lines, wide_lines, x_shape) %>%
    mutate_if(is.character, as.numeric)

# Y-values
# selecting rows 2-143 
# turning all values from character to numeric
datasaurus_wide_y <- datasaurus_wide %>%
    slice(2:143) %>%
    select(away_1, bullseye_1, circle_1, dino_1, dots_1, h_lines_1, high_lines_1, slant_down_1, slant_up_1, star_1, v_lines_1, wide_lines_1, x_shape_1) %>%
    mutate_if(is.character, as.numeric)


# correlation matrix for X values
corr_x <- round(cor(datasaurus_wide_x), 1)

# correlation matrix for Y values
corr_y <- round(cor(datasaurus_wide_y), 1)

head(corr_x[, 1:6])

Visualizing the correlation matrix

Here is a correlation between the x-values between all 13 variables. You can see that all variables, aside from away, are highly correlated with each other.

# correlation between X-values
ggcorrplot(corr_x, hc.order = TRUE, 
           type="lower", 
           outline.color = "white",
           ggtheme = ggplot2::theme_gray,
           colors = c("#d8b365", "#f5f5f5", "#5ab4ac"),
           lab = TRUE)

Here is a correlation between the 'y-values' between all 13 variables. Again, aside from away, all the variables are highly correlated with each other.

# correlation between Y-values
ggcorrplot(corr_y, hc.order = TRUE, 
           type="lower", 
           outline.color = "white",
           ggtheme = ggplot2::theme_gray,
           colors = c("#ef8a62", "#f7f7f7", "#67a9cf"),
           lab = TRUE)

Facets

At this point, the boxplots show us variables with similar median and identical mean; the ridgelines begin to show us that some variables have different distributions. And the correlation matrix suggests the variables are more similar than not.

To really see their differences, we'll need to use facet_wrap.

Here we'll use facet_wrap to examine the histogram for x and y values of all 13 variables. We started to see the differences in distribution between variables from the ridgeline plots, but overlapping histograms provide another perspective.

# facet histogram (both-values)
datasaurus_dozen %>%
    group_by(dataset) %>%
    ggplot() +
    geom_histogram(aes(x=x, fill='red'), alpha = 0.5, bins = 30) +
    geom_histogram(aes(x=y, fill='green'), alpha = 0.5, bins = 30) +
    facet_wrap(~dataset) +
    scale_fill_discrete(labels = c('y', 'x')) +
    theme_classic() +
    labs(
        fill = 'Axes',
        x = '',
        y = 'Count',
        title = 'Faceted Histogram: x- and y-values'
    )

Scatter Plot

However, if there's one thing this dataset is trying to communicate its that there's no substitute for plotting the actual data points. No amount of summary statistics, central tendency or distribution is going to replace plotting actually data points.

Once we create the scatter plot with geom_point, we see the big reveal with this dataset. That despite the similarities in central measures, for the most part similar distributions and high correlations, the 13 variables are wildly different from each other.

datasaurus_dozen %>%
    group_by(dataset) %>%
    ggplot(aes(x=x, y=y, color=dataset)) +
    geom_point(alpha = 0.5) +
    facet_wrap(~dataset) +
    scale_color_manual(values = c('#a6cee3', '#1f78b4', '#b2df8a', '#33a02c', '#fb9a99', '#e31a1c', '#fdbf6f', '#ff7f00', '#cab2d6', '#6a3d9a', '#ffff99', '#b15928', 'grey')) +
    theme_classic() +
    theme(legend.position = "none") +
    labs(
        x = 'X-axis',
        y = 'Y-axis',
        title = 'Faceted Scatter Plot'
    )

There are other less common alternatives to the scatter plot.

Geom Density 2D

While not as clear as the scatter plot, plotting the contours of a 2D density estimate does show how very different the variables are from each other, despite similar summary statistics.

# contours of a 2D Density estimate
datasaurus_dozen %>%
    ggplot(aes(x=x, y=y)) +
    geom_density_2d() +
    theme_classic() +
    facet_wrap(~dataset) +
    labs(
        x = 'X-axis',
        y = 'Y-axis',
        title = 'Contours of a 2D density estimate'
    )

This is a slight variation using stat_density_2d:

# stat density 2d
datasaurus_dozen %>%
    ggplot(aes(x=x, y=y)) +
    stat_density_2d(aes(fill=y), geom = "polygon", colour = 'white') +
    theme_classic() +
    facet_wrap(~dataset) +
    labs(
        x = 'X-axis',
        y = 'Y-axis',
        title = 'Stat Density 2D estimate'
    )

Using the density_2d plots are quite effective in showing how different the variables are and serve as a nice alternative to the more familiar scatter plot.

Hopefully this vignette illustrates the importance of never trusting summary statistics (alone). Moreover, when visualizing, we should go beyond simply visualizing the data's distribution or central tendency, but plotting the actually data points.

For more content on data science, machine learning, R, Python, SQL and more, find me on Twitter.

DEV Community