In this explainer, we will learn how to compare the distributions of two data sets using box plots.
Box plots, which are sometimes called box-and-whisker plots, can be a good way to visualize differences among groups that have been measured on the same variable.
Before we look at comparing data sets using box plots, however, let us remind ourselves of the key elements of a box plot.
Key Elements: Box Plots
A box plot is constructed using the minimum and maximum values, the lower and upper quartiles (Q1 and Q3), and the median of a data set.
- The horizontal axis covers all possible data values.
- The box part of a box-and-whisker plot covers the middle 50% of the values in the data set. This 50% of the data lies between Q1 and Q3, and the width of this interval is the interquartile range, IQR = Q3 − Q1.
- The whiskers each cover 25% of the data values.
- The lower whisker covers all the data values from the minimum value up to Q1, that is, the lowest 25% of data values.
- The upper whisker covers all the data values between Q3 and the maximum value, that is, the highest 25% of data values.
- The median sits within the box and represents the center of the data. 50% of the data values lie above the median and 50% lie below the median.
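The five values described above can be computed directly from a data set. The following is a minimal sketch in Python, assuming the "inclusive" quartile convention of the standard library's statistics.quantiles function; other quartile conventions can give slightly different values for Q1 and Q3. The function name five_number_summary and the scores list are hypothetical and used only for illustration.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(values)
    # With n=4, statistics.quantiles returns the three quartile cut points.
    # method="inclusive" is an assumption; other conventions exist.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]

# Hypothetical test scores, invented for this illustration.
scores = [52, 60, 64, 68, 70, 75, 78, 81, 90]

minimum, q1, median, q3, maximum = five_number_summary(scores)
iqr = q3 - q1  # width of the box: the interquartile range

print("min:", minimum, "Q1:", q1, "median:", median, "Q3:", q3, "max:", maximum)
print("IQR:", iqr)
```

Here the box would run from Q1 to Q3, the line inside the box would sit at the median, and the whiskers would extend to the minimum and maximum.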
Outliers, or extreme values in a data set, are usually indicated on a box-and-whisker plot by the “star” symbol. If there are one or more outliers in a data set, for the purpose of drawing a box-and-whisker plot, we take the minimum and maximum to be the minimum and maximum values of the data set excluding the outliers.
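The text above does not say how outliers are identified. One common convention, assumed here purely for illustration, treats any value more than 1.5 × IQR below Q1 or above Q3 as an outlier. The sketch below applies that assumed rule and then takes the whisker ends from the remaining values; the function name split_outliers, the parameter k, and the data are all hypothetical.

```python
import statistics

def split_outliers(values, k=1.5):
    """Split values into (inliers, outliers) using the assumed k * IQR rule."""
    ordered = sorted(values)
    # Quartiles use the "inclusive" convention (an assumption).
    q1, _, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    inliers = [v for v in ordered if low_fence <= v <= high_fence]
    outliers = [v for v in ordered if v < low_fence or v > high_fence]
    return inliers, outliers

# Hypothetical data with one extreme value, invented for this illustration.
data = [52, 60, 64, 68, 70, 75, 78, 81, 130]

inliers, outliers = split_outliers(data)
# The whiskers end at the smallest and largest values that are not outliers.
whisker_min, whisker_max = inliers[0], inliers[-1]

print("outliers:", outliers)
print("whiskers run from", whisker_min, "to", whisker_max)
```

With this data, the extreme value would be drawn as a separate symbol, and the upper whisker would stop at the largest remaining value rather than at the overall maximum.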
Let us look at some examples of how to use two box plots to compare two data sets on the same variable.
Example 1: Comparing Data Sets on the Same Measurement Using Box Plots
It is thought that taking math tests in the morning results in higher grades than if tests are taken in the afternoon. Do the data represented in the box plots below confirm this hypothesis or not?
In the box plot for tests taken in the morning, the median score (the vertical line inside the box) sits above the axis approximately at the 87.5 mark, whereas the median for the scores of tests taken in the afternoon sits above about 75 on the axis.
Since the median score for morning tests is higher than that for afternoon tests, we can conclude that, on average, morning math test scores are higher, so the data support the hypothesis.
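The same comparison can be made numerically when the raw scores are available. The following is a minimal sketch; the two lists are hypothetical values invented so that their medians match the approximate readings of 87.5 and 75 taken from the box plots. They are not the data behind the plots, and the function name higher_median_group is also hypothetical.

```python
import statistics

def higher_median_group(groups):
    """Return the name of the group with the largest median, plus all medians."""
    medians = {name: statistics.median(values) for name, values in groups.items()}
    best = max(medians, key=medians.get)
    return best, medians

# Hypothetical scores, chosen only to reproduce the medians read from the plots.
groups = {
    "morning": [70, 80, 85, 90, 95, 100],
    "afternoon": [55, 65, 70, 80, 85, 95],
}

best, medians = higher_median_group(groups)
print(medians)
print("higher median:", best)
```

Comparing medians in this way mirrors reading the position of the line inside each box against the axis.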
Box plots are often but not always drawn with the variable values on the horizontal axis. Our next example has the variable values on the vertical axis.
Example 2: Comparing Internet Searches for Cats and Dogs Using Box Plots
It has long been thought that cats are the most popular internet pet. To test this theory, data were collected by a well-known search engine on the number of searches for cat-related videos and dog-related videos each month over a five-year period. The results are represented in the box plots below.
Use the information illustrated in the box plots to determine whether or not cats are indeed the premier internet pet.
To decide whether or not cats are more popular than dogs in internet searches, we want to know which pet has the higher number of searches on average. The measure of average (or central tendency) that we have available from the box plots is the median. This is shown by the line inside each box. In this case, the median lines are horizontal, since the variable values are on the vertical axis.