Anna Zubova

Deconstructing the Box and Whisker Plot

When trying to understand what a set of data looks like, there are plenty of options as to how to visualize it. It is important to pick the ones that serve the specific question we want to ask.

A histogram is usually the first choice when visualizing data and making a preliminary analysis of a distribution. A box and whisker plot (often referred to as box plot), however, can be used on its own or as an additional tool in data analysis.

A box plot uses five important descriptive statistics of a distribution: the median, the lower quartile, the upper quartile, and the minimum and maximum values. It quickly gives us a sense of what the data looks like and lets us compare different groups of data in one simple plot.

Here is an example of a basic box plot:

Basic box plot

Limitations

It is important to understand that these five statistics should not be the only measures used to describe a distribution. For roughly symmetric data without outliers, the mean and standard deviation are usually the more informative summary. However, when the distribution is highly skewed or contains outliers, a box plot is a very useful tool for checking the shape, spread and variability of the data.

Box plots are great at showing whether the data is symmetric, but they do not show what kind of symmetry it is. For example, two sets of data can look exactly the same as box plots even though one has frequencies that vary strongly across its range and the other is uniformly distributed. A box plot is not the right tool to check for those features. For that reason, box plots are best used in combination with other visualization methods such as a histogram.
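To make this concrete, here is a small sketch with two made-up datasets (the values are invented for illustration and have nothing to do with the happiness data used later). Both share the same five-number summary, so their box plots would look identical, while their histogram counts differ sharply. It assumes NumPy is installed.

```python
import numpy as np

# Two illustrative datasets with 101 values each (invented for this example).
uniform = np.arange(101)  # evenly spread from 0 to 100
clustered = np.array([0] + [25] * 49 + [50] + [75] * 49 + [100])  # two tight clusters


def five_number_summary(values):
    """Return minimum, Q1, median, Q3 and maximum."""
    return np.percentile(values, [0, 25, 50, 75, 100])


summary_uniform = five_number_summary(uniform)
summary_clustered = five_number_summary(clustered)

# Histogram counts over four equal-width bins reveal the difference in shape.
counts_uniform, _ = np.histogram(uniform, bins=4, range=(0, 100))
counts_clustered, _ = np.histogram(clustered, bins=4, range=(0, 100))

print("five-number summaries:", summary_uniform, summary_clustered)
print("histogram counts:", counts_uniform, counts_clustered)
```

The summaries match, but the histogram of the second dataset piles almost everything into two bins, which is exactly the kind of feature a box plot hides.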

Histogram vs box plot

Visualizing outliers with box plots

One of the main purposes of the box plot is to quickly visualize outliers to see if it is necessary to remove them for further analysis. But to actually understand what is considered an outlier, let’s look at the following representation of the box plot and the probability density function (PDF) of a normal distribution.

Source: https://commons.wikimedia.org/wiki/File:Boxplot_vs_PDF.svg

The whiskers are bounded by the limits beyond which data points are considered outliers. The lower limit is the 1st quartile minus 1.5 times the interquartile range (IQR); the upper limit is the 3rd quartile plus 1.5 times the IQR. Each whisker extends to the most extreme data point that still lies within its limit, and any points beyond the limits are drawn individually as outliers.
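The 1.5 × IQR rule can be sketched in a few lines of NumPy. The data values below are invented for illustration, and the variable names are my own rather than anything from Matplotlib.

```python
import numpy as np

# Illustrative values, invented for this example.
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Limits beyond which a point counts as an outlier.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Points outside the limits are flagged as outliers.
outliers = data[(data < lower_limit) | (data > upper_limit)]

# The whiskers stop at the most extreme points that are still inside the limits.
inside = data[(data >= lower_limit) & (data <= upper_limit)]
whisker_low, whisker_high = inside.min(), inside.max()

print("limits:", lower_limit, upper_limit)
print("outliers:", outliers)
print("whiskers:", whisker_low, whisker_high)
```

Here only the value 50 falls outside the limits, so the upper whisker ends at 9 rather than at the computed limit itself.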

In a normal distribution, the range between these two limits covers about 99.3% of all the data; that is, only about 0.7% of the data is classified as outliers.
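The 99.3% figure can be checked with Python's standard library alone. This is a quick sketch for a standard normal distribution; the same share holds for any mean and standard deviation, because the limits scale with the data.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# Quartiles of the standard normal distribution.
q1 = std_normal.inv_cdf(0.25)
q3 = std_normal.inv_cdf(0.75)
iqr = q3 - q1

# Outlier limits according to the 1.5 * IQR rule.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Probability mass that lies between the two limits.
coverage = std_normal.cdf(upper_limit) - std_normal.cdf(lower_limit)

print(f"limits: {lower_limit:.3f} to {upper_limit:.3f}")
print(f"share of data inside the limits: {coverage:.2%}")
```

The limits land at roughly 2.7 standard deviations on either side of the mean.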

Comparing data

Another very important use of box plots is to compare data from different groups. Plotting several box plots next to each other gives us a quick first sense of whether the groups are similar.

The things we have to look for are:

  • Whether the boxes overlap.
    If the boxes do not overlap at all, it is a strong visual hint that the groups are different.

  • Whether each median lies within the range of the other box.
    If it does not, it is likely that the groups are different.

  • Ranges of the boxes.
    It is helpful to evaluate the comparative range of the boxes to see how much difference there is in the spread of the data.

  • Skewness.
    As skewness is easily observed from the box plots, it can be useful to compare this parameter between two plots.

This preliminary visual analysis can help understand if two groups we are looking at are similar and if we need to apply some other techniques to further measure how different they are.

Let’s look at the data from the World Happiness Report from Kaggle. First, let’s look at the happiness scores from 2017 and 2016.

Happiness scores

The groups look very similar, since the medians are located at almost the same level. The spread of the second plot is slightly wider than that of the first one.

However, if we compare health and freedom scores, the box plots will show more differences.

Health and freedom scores, 2017

We can actually extract the values of the statistics calculated by the box plot. The object returned when the plot is created is a dictionary that holds the drawn lines, and the values can be read back from them. To see what keys it has, we can run bp.keys(). For example, to extract the medians, we can use the following code:

# Get the median of each box.
# bp is the box plot object (the dictionary returned when the plot is created).
medians = []
for line in bp['medians']:
    # get_data() returns (x values, y values); the median line is horizontal,
    # so its first y value is the median.
    medians.append(line.get_data()[1][0])

medians is now equal to [0.6060415506362921, 0.43745428323745705]

To get the lower and upper edges of the boxes, we can use the code below. For each line in bp['boxes'], we take the second element returned by get_data(), which holds the y-axis values of the box outline. From those we select the first and fourth elements (indices 0 and 3), which are the lower and upper y-axis values of the box. This indexing applies to the default box style, without notches:

# Get the lower and upper edges (Q1 and Q3) of each box.
boxes = []
for box in bp['boxes']:
    y = box.get_data()[1]  # y values of the box outline: [Q1, Q1, Q3, Q3, Q1]
    boxes.append(y[0])     # lower edge (Q1)
    boxes.append(y[3])     # upper edge (Q3)

boxes now contains the list [0.36986629664897896, 0.723007529973984, 0.3036771714687345, 0.5165613889694209]
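For readers who want to try the extraction end to end, here is a self-contained sketch. It uses synthetic data as a stand-in for the happiness scores (the group names, means and sizes are my own assumptions), and it cross-checks the values read from the plot against NumPy's percentiles.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for two score columns (not the real happiness data).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.6, scale=0.2, size=150)
group_b = rng.normal(loc=0.4, scale=0.15, size=150)

fig, ax = plt.subplots()
bp = ax.boxplot([group_a, group_b])  # default style: no notches, boxes drawn as lines

print(sorted(bp.keys()))

# Medians: first y value of each horizontal median line.
medians = [line.get_ydata()[0] for line in bp['medians']]

# Box edges: (Q1, Q3) pairs read from the outline of each box.
boxes = [(box.get_ydata()[0], box.get_ydata()[3]) for box in bp['boxes']]

# Cross-check against values computed directly with NumPy.
expected_medians = [np.median(group_a), np.median(group_b)]
expected_boxes = [tuple(np.percentile(g, [25, 75])) for g in (group_a, group_b)]

print(medians, expected_medians)
print(boxes, expected_boxes)
plt.close(fig)
```

Storing the box edges as (Q1, Q3) pairs rather than one flat list is a small design choice that keeps the values of each group together.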

So, the range of the first box (where 50% of the data is located) lies between 0.37 and approximately 0.72, with 0.61 as median value. The second box plot has a range of 0.30 to 0.52 with median value at 0.44.
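The visual checks listed under "Comparing data" can also be applied to these extracted numbers. The sketch below does so for the two boxes; the helper functions and the "median position" indicator are illustrative names of my own, and the result is a rough guide rather than a statistical test.

```python
# Values reported above for the two box plots.
first = {"q1": 0.36986629664897896, "median": 0.6060415506362921, "q3": 0.723007529973984}
second = {"q1": 0.3036771714687345, "median": 0.43745428323745705, "q3": 0.5165613889694209}


def boxes_overlap(a, b):
    """True if the Q1..Q3 ranges of the two boxes share any values."""
    return a["q1"] <= b["q3"] and b["q1"] <= a["q3"]


def median_inside_box(a, b):
    """True if the median of a lies within the Q1..Q3 range of b."""
    return b["q1"] <= a["median"] <= b["q3"]


def median_position(a):
    """Where the median sits inside its box: 0.5 is centered, above 0.5 is nearer Q3."""
    return (a["median"] - a["q1"]) / (a["q3"] - a["q1"])


overlap = boxes_overlap(first, second)
first_in_second = median_inside_box(first, second)
second_in_first = median_inside_box(second, first)
iqr_first = first["q3"] - first["q1"]
iqr_second = second["q3"] - second["q1"]

print("boxes overlap:", overlap)
print("first median inside second box:", first_in_second)
print("second median inside first box:", second_in_first)
print("box ranges (IQR):", round(iqr_first, 2), round(iqr_second, 2))
print("median positions:", round(median_position(first), 2), round(median_position(second), 2))
```

The boxes overlap, but the median of the first box lies above the second box entirely, which matches the visual impression that the two score distributions differ.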

Notched box plot

One interesting feature of box plots that is often overlooked is the notch parameter, which draws a notch around the median to show its confidence interval. By default, the confidence level is 95%. This option is especially useful for comparing groups measured on the same scale: if the notches of two boxes do not overlap, the medians are likely to differ.

Notched box plots can be used together with another parameter of Matplotlib’s box plot, bootstrap. By default it is set to None. If it is set to an integer, that number indicates how many times bootstrapping should be performed to determine the confidence intervals.
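Here is a minimal sketch of both options on synthetic data (the sample itself is an assumption made for illustration). It draws a notched plot and then reads the notch limits through matplotlib.cbook.boxplot_stats, which reports them under the keys cilo and cihi. To my understanding, the non-bootstrapped notch is computed as the median ± 1.57 × IQR / √n, so the sketch prints that value next to it for comparison.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from matplotlib import cbook

# Synthetic sample, invented for this example.
rng = np.random.default_rng(1)
sample = rng.normal(loc=0.5, scale=0.1, size=200)

# Notched box plot whose notch is estimated with 1000 bootstrap resamples.
fig, ax = plt.subplots()
ax.boxplot(sample, notch=True, bootstrap=1000)
plt.close(fig)

# The same statistics without drawing anything.
stats_default = cbook.boxplot_stats(sample)[0]               # formula-based notch
stats_boot = cbook.boxplot_stats(sample, bootstrap=1000)[0]  # bootstrapped notch

# Assumed formula for the default notch, shown for comparison only.
half_width = 1.57 * stats_default["iqr"] / np.sqrt(len(sample))
manual_notch = (stats_default["med"] - half_width, stats_default["med"] + half_width)

print("default notch:     ", stats_default["cilo"], stats_default["cihi"])
print("manual estimate:   ", manual_notch)
print("bootstrapped notch:", stats_boot["cilo"], stats_boot["cihi"])
```

The bootstrapped limits vary slightly from run to run because the resampling is random.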

Other useful options

There are some other parameters that can be useful when creating a box plot with the Matplotlib library.

sym: determines the look of flier (outlier) points. Setting it to an empty string tells Matplotlib not to show outliers.

whis: changes the reach of the whiskers. By default this parameter is equal to 1.5, so the lower and upper limits of the whiskers are Q1 - 1.5*IQR and Q3 + 1.5*IQR respectively. If whis is set to the string 'range', the whiskers reach to the minimum and maximum values. Newer Matplotlib releases no longer accept 'range'; there, whis=(0, 100) gives the same result.

vert: accepts a boolean value. By default it is set to True, but if set to False, the box plot will be drawn horizontally.

positions: accepts an array-like value. By default it is range(1, N+1), where N is the number of box plots. If set to (1, 1), two box plots will overlap.

widths: sets the width of each box.

labels: sets labels for each box plot.
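A short sketch that combines several of these options on synthetic data follows. The data and the chosen values are assumptions for illustration. It uses whis=(0, 100) rather than 'range' so that it also runs on recent Matplotlib versions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Two synthetic groups, invented for this example.
rng = np.random.default_rng(2)
groups = [rng.normal(0, 1, 100), rng.normal(0.5, 2, 100)]

fig, ax = plt.subplots()
bp = ax.boxplot(
    groups,
    sym="",            # do not draw outlier markers
    whis=(0, 100),     # whiskers reach the minimum and maximum values
    positions=[1, 3],  # x positions of the two boxes
    widths=0.5,        # width of each box
)

# Each box has two caps; read the y values of the caps of the first box.
first_box_caps = sorted(cap.get_ydata()[0] for cap in bp["caps"][:2])

# The median line of the second box is centered on its position.
second_box_center = np.mean(bp["medians"][1].get_xdata())

print(first_box_caps, groups[0].min(), groups[0].max())
print(second_box_center)
plt.close(fig)
```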

References and further reading

Visualizations from this blog post can be found in my GitHub profile.
