In this explainer, we will learn how to compare two data set distributions using box plots.
Box plots, which are sometimes called box-and-whisker plots, can be a good way to visualize differences among groups that have been measured on the same variable.
Before we look at comparing data sets using box plots, however, let us remind ourselves of the key elements of a box plot.
Key Elements: Box Plots
A box plot is constructed using the minimum and maximum values, the lower and upper quartiles (Q1 and Q3), and the median of a data set.
- The horizontal axis covers all possible data values.
- The box part of a box-and-whisker plot covers the middle 50% of the values in the data set. This 50% of the data lies between Q1 and Q3 and its range is the interquartile range, IQR = Q3 – Q1.
- The whiskers each cover 25% of the data values.
- The lower whisker covers all the data values from the minimum value up to Q1, that is, the lowest 25% of data values.
- The upper whisker covers all the data values between Q3 and the maximum value, that is, the highest 25% of data values.
- The median sits within the box and represents the center of the data. 50% of the data values lie above the median and 50% lie below the median.
Outliers, or extreme values in a data set, are usually indicated on a box-and-whisker plot by the “star” symbol. If there are one or more outliers in a data set, then for the purpose of drawing a box-and-whisker plot, we take the minimum and maximum to be the minimum and maximum values of the data set excluding the outliers.
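To make the five values behind a box plot concrete, here is a minimal Python sketch that computes them for a small data set. The scores are hypothetical, invented only for this illustration, and the helper name `five_number_summary` is our own. The sketch assumes the "inclusive" quartile convention from Python's `statistics` module; other conventions can give slightly different values for Q1 and Q3.

```python
from statistics import median, quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    # Assumption: method="inclusive" treats the list as the whole data set.
    # Other quartile conventions may give slightly different Q1 and Q3.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, median(values), q3, max(values)

# Hypothetical test scores, invented only for this illustration.
scores = [52, 60, 64, 68, 70, 75, 78, 81, 90]

low, q1, med, q3, high = five_number_summary(scores)
iqr = q3 - q1  # width of the box: the spread of the middle 50% of the data

print("minimum:", low, "Q1:", q1, "median:", med, "Q3:", q3, "maximum:", high)
print("interquartile range:", iqr)
```

The whiskers would then run from the minimum to Q1 and from Q3 to the maximum, with the box between Q1 and Q3 and the median line inside it.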
Let us look at some examples of how to use two box plots to compare two data sets on the same variable.
Example 1: Comparing Data Sets on the Same Measurement Using Box Plots
It is thought that taking math tests in the morning results in higher grades than if tests are taken in the afternoon. Do the data represented in the box plots below confirm this hypothesis or not?
In the box plot for tests taken in the morning, the median score (the vertical line inside the box) sits above the axis approximately at the 87.5 mark, whereas the median for the scores of tests taken in the afternoon sits above about 75 on the axis.
We can therefore conclude that since the median score for morning tests is higher than that for afternoon tests, on average, morning math test scores are higher.
Box plots are often but not always drawn with the variable values on the horizontal axis. Our next example has the variable values on the vertical axis.
Example 2: Comparing Internet Searches for Cats and Dogs Using Box Plots
It has long been thought that cats are the most popular internet pet. To test this theory, data were collected by a well-known search engine on the number of searches for cat-related videos and dog-related videos each month over a five-year period. The results are represented in the box plots below.
Use the information illustrated in the box plots to determine whether or not cats are indeed the premier internet pet.
To decide whether or not cats are more popular than dogs in internet searches, we want to know, on average, which pet has the highest number of searches. The measure of average (or central tendency) that we have available from the box plots is the median. This is shown by the line inside each box. In this case, the median lines are horizontal, since the variable values are on the vertical axis.
By reading across from the median lines to the vertical axis, we can find the values of the medians for cat and dog searches.
The median number of searches for cats per month appears to be approximately 15.5 million, whereas for dogs it is approximately 19 million. Since, on average, the number of searches for dogs per month was higher, we can conclude that the data disproves the theory! In fact, in online searches for videos of them, dogs are on average the more popular pet, not cats.
Example 3: Comparing Track Length Using Box Plots
The box plots below represent data collected on the length of recorded tracks from the top 100 rap and the top 100 heavy metal music charts.
- On average, which genre of music has longer tracks?
- Compare the significance of the track length 4.40 minutes for the two music genres.
The measure of average used in a box plot is the median, which is the vertical bar inside the box.
The median track length for rap tracks is 4.00 minutes, whereas the median for heavy metal tracks is 4.80 minutes. Since the median for heavy metal tracks is higher than that for rap tracks, we can conclude that heavy metal tracks are on average longer than rap tracks.
For rap music tracks, 4.40 minutes is where the right-hand vertical bar of the box sits. This represents the value of Q3, the third quartile for rap.
This means that 75% of rap tracks are less than 4.40 minutes long and only 25% are longer than 4.40 minutes.
For heavy metal tracks, however, 4.40 minutes represents Q1, the first quartile.
This means that 25% of heavy metal tracks are shorter than 4.40 minutes long and 75% of the tracks are longer than 4.40 minutes.
Hence, while 75% of rap tracks are shorter than 4.40 minutes, 75% of heavy metal tracks are longer than 4.40 minutes.
In our next two examples, we compare the distributions of two data sets on the same variable using box plots.
Example 4: Comparing Lion Miles Using Box Plots
Using special tracker collars, the number of miles the Namibian lions Ramy and Mona traveled each night was recorded over a month. The data is represented in the box plots below.
Referring to the box plots, compare the number of miles Ramy and Mona traveled at night during the month.
We can see that Ramy and Mona traveled similar ranges of miles at night. Ramy traveled between 2 and 17 miles a night. Mona traveled between 2 and 15 miles a night.
Since Ramy’s box in the plot is wider than Mona’s, we can say that Ramy’s distances varied further from his middle traveling distance than Mona’s did. The width of the box is the interquartile range (Q3–Q1), which represents 50% of the data. Interpreting this from our plots, on half the nights, Ramy traveled between 4 and 10 miles, compared to Mona, who traveled between 9 and 12 miles.
Mona’s median distance of 11 miles was 6 miles higher than Ramy’s, which was 5 miles. So, we can say that, on average, Mona traveled further than Ramy.
If we look within each box, notice that the part of Ramy’s box to the left of the median is narrower than the part to the right of the median. Also note that his left whisker is shorter than his right whisker (i.e., the whiskers attached to the box, not Ramy’s actual whiskers!). Both of these features tell us that Ramy’s shorter distances, those under his median of 5 miles, which he traveled 50% of the time, were packed into a narrow range. For Mona, the reverse is true: her longer distances, those over her median of 11 miles, which she traveled 50% of the time, were packed into a narrow range.
Let us look at Ramy’s box plot separately, in more detail.
We can see that 50% of Ramy’s distances were concentrated between 2 miles (the minimum) and 5 miles (the median) and that half of those were concentrated between 4 miles (Q1) and 5 miles (the median). The other 50% of Ramy’s distances were over a bigger spread of values: 25% between 5 (median) and 10 (Q3) miles and 25% between 10 (Q3) and 17 (the maximum) miles.
Similarly, for Mona (as shown in the box plot below), 25% of the time her distances were concentrated between 11 (her median) and 12 (Q3) miles. And 25% were between 12 (Q3) and 15 miles, her maximum distance.
The lower half of her distances were spread over a wider range of distances: 25% between 2 (her minimum) and 9 miles (Q1) and 25% between 9 (Q1) miles and her median, 11 miles.
In statistical terms, the concentration of half of Ramy’s data in a small range of lower values means that Ramy’s distances were right (or positively) skewed. And since half of Mona’s data were concentrated in a narrow but quite high range, Mona’s distances were left (or negatively) skewed. A third possibility is symmetry: if, in a box plot, the box is split evenly on either side of the median and the whiskers are approximately the same length, the data is symmetric.
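The way we read skew from a box plot can be sketched as a short Python function. This is only a rough heuristic based on comparing the two halves of the box and the two whiskers, not a formal statistical test, and the function name `describe_shape` is our own. The values for Ramy and Mona are the ones read from the box plots in this example.

```python
def describe_shape(minimum, q1, median, q3, maximum):
    """Give a rough shape label from a five-number summary (heuristic only)."""
    left_box, right_box = median - q1, q3 - median
    left_whisker, right_whisker = q1 - minimum, maximum - q3

    # Longer right-hand box part and right whisker: data stretch out to the right.
    if right_box > left_box and right_whisker > left_whisker:
        return "right (positively) skewed"
    # Longer left-hand box part and left whisker: data stretch out to the left.
    if left_box > right_box and left_whisker > right_whisker:
        return "left (negatively) skewed"
    # Exact equality is used for simplicity; on a real plot we would
    # accept "approximately equal" instead.
    if left_box == right_box and left_whisker == right_whisker:
        return "symmetric"
    return "mixed"  # box and whiskers disagree, so the heuristic gives no clear label

# (minimum, Q1, median, Q3, maximum) as read from the box plots, in miles.
ramy = (2, 4, 5, 10, 17)
mona = (2, 9, 11, 12, 15)

print("Ramy:", describe_shape(*ramy))  # Ramy: right (positively) skewed
print("Mona:", describe_shape(*mona))  # Mona: left (negatively) skewed
```

When the box and the whiskers point in different directions, as can happen with real data, the sketch returns "mixed" rather than forcing a label.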
We have compared three different statistical measures in this example: the shape, average and spread of the data. When looking at whether or not a data set is skewed (or symmetric), we are looking at the shape of the data. When considering the median, we are talking about the “average” and when looking at the range and interquartile range, we are considering the spread of the data.
Example 5: Comparing WNBA and NBA Scores Using Box Plots
The box plots below represent the WNBA and NBA winning scores over a particular season.
Compare the distributions for WNBA and NBA winning scores.
In comparing the distributions, we will look at the “shape,” “average,” and “spread” of each data set and compare. We start with the “average,” and the first thing to note is that the median WNBA score is around 85 and the median NBA score is somewhat higher, at approximately 114. In general, then, we can say that the NBA winning scores are higher than those for the WNBA.
The next thing we look at is the spread of the two data sets, using the range and the interquartile range (IQR). WNBA scores range from a low of 70 to a high of 110, so their range is 110 – 70 = 40, whereas the lowest NBA score is 100 and the highest, approximately, is 128, with a calculated range of 128 – 100 = 28. Hence, overall, the WNBA results have a wider spread than those for the NBA.
Similarly, we can compare the interquartile ranges, which are given by IQR = Q3 – Q1.
The WNBA interquartile range, that is, the spread of the middle 50% of the WNBA data, is 20, which is twice that of the middle 50% of the NBA data, where the IQR is 10. So the WNBA winning scores are spread much more widely about the center than the NBA winning scores. Another way of saying this is that the NBA scores are clustered closer around the middle value than the WNBA scores.
We can also note that neither the WNBA nor the NBA winning scores appear to have any outliers.
One final thing to consider is the “shape” of the data. We are looking for whether the data are approximately symmetric about their center or the distribution is skewed in a particular direction. We can tell this to some extent by looking at both the whiskers and the boxes in the box plots.
For the WNBA data, the left whisker is shorter than the right. This indicates that the data is a little right (or positively) skewed. That is, the lower scores are more concentrated within a narrower range of values than those in the higher range. The box itself is symmetric on either side of the median, however, so the data is not heavily skewed. For the NBA data, the whiskers are approximately equal and there is only a small lack of symmetry on either side of the median inside the box, indicating that the data is more or less symmetric about the center.
In conclusion, the WNBA winning scores and those for the NBA differ in shape, average, and spread. While the WNBA winning scores are, on average, lower than those of the NBA, the WNBA winning scores have a much larger spread around the center. Also, the WNBA scores are slightly positively skewed, more concentrated at the lower end of the scores, whereas the NBA winning scores are approximately symmetric about their center.
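The “average” and “spread” parts of this comparison can be reproduced with a few lines of Python. This is a minimal sketch using the approximate values read from the box plots in this example; the dictionary layout and the helper names `data_range` and `compare` are our own choices for illustration, and ties between the two data sets are not handled.

```python
# Approximate values read from the box plots in this example.
wnba = {"minimum": 70, "median": 85, "maximum": 110, "iqr": 20}
nba = {"minimum": 100, "median": 114, "maximum": 128, "iqr": 10}

def data_range(summary):
    """Range = maximum value minus minimum value."""
    return summary["maximum"] - summary["minimum"]

def compare(name_a, a, name_b, b):
    """Report which data set has the higher median, wider range and wider IQR."""
    return {
        "higher_median": name_a if a["median"] > b["median"] else name_b,
        "wider_range": name_a if data_range(a) > data_range(b) else name_b,
        "wider_iqr": name_a if a["iqr"] > b["iqr"] else name_b,
    }

print("WNBA range:", data_range(wnba))  # WNBA range: 40
print("NBA range:", data_range(nba))    # NBA range: 28
print(compare("WNBA", wnba, "NBA", nba))
```

The result agrees with the conclusion above: the NBA has the higher median, while the WNBA has both the wider range and the wider interquartile range.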
Let us now summarize the key points associated with comparing data sets using box plots.
When comparing the distributions of two data sets on the same measurement using box plots, we can compare the “shape,” “average,” and “spread” of the data sets.
- Shape: The shape of a data set refers to whether it is symmetric or skewed. If a data set is distributed symmetrically about the center, the box should be approximately evenly split by the median and the whiskers should be of approximately equal length. If a data set is skewed (i.e., more concentrated at one end than the other), then one of the whiskers will be longer than the other and the box will not be evenly split by the median.
- If a data set is right or positively skewed, then the right whisker will be longer than the left and the part of the box to the right of the median will be longer than the part to the left.
- If a data set is left or negatively skewed, then the left whisker will be longer than the right and the part of the box to the left of the median will be longer than the part to the right.
- Average: In a box plot, the measure of average used is the median. This is represented by the vertical line inside the box. Comparing the medians of two data sets, we can determine in which data set the values are “on average” higher or lower than in the other, or if there is no difference on average.
- Spread: The spread of the data sets can be compared by looking at the ranges (maximum value – minimum value) and the interquartile ranges (Q3–Q1).
- The range tells us the overall spread of each data set.
- The interquartile range tells us the spread of the middle 50% of the data, that is, how far the middle 50% of the data deviates from the center.
Box plots can be used to compare multiple data sets where the values are measurements of the same variable. Also, box plots can be drawn vertically as opposed to horizontally, as illustrated below.