Video: Comparing Two Distributions Using Box Plots

In this video, we will learn how to compare two data set distributions using box plots.

Video Transcript

In this lesson, we’re gonna learn how to compare the distributions of two data sets using box plots. Box plots, which are sometimes called box-and-whisker plots, are a good way to visualize differences among groups that have been measured on the same variable. However, before we start comparing our data sets using box plots, let’s remind ourselves of the key elements of a box plot.

So, as I said, what we’re gonna do is have a look at the key elements of a box plot. So we’ve got a sketch here. And we’re looking first at the horizontal axis, which covers all the possible data values. So next, what we’re looking at is the box section of our box plot. And this is constructed using the lower and upper quartiles, so Q one and Q three, and the median of our data set. And the box part of the box-and-whisker plot covers the middle 50 percent of the values in the data set. This 50 percent of the data lies between Q one and Q three, so our lower and upper quartiles, and the width of the box is the interquartile range or IQR. And we find this by subtracting Q one from Q three.
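As a minimal sketch of how these key values could be computed from raw data, the Python below uses the standard library’s statistics module. The scores are hypothetical, invented purely for illustration, and the "inclusive" quartile method is just one of several conventions, so other tools may give slightly different quartiles for the same data.

```python
from statistics import median, quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    # quantiles(..., n=4) returns the three cut points Q1, Q2 and Q3.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, median(values), q3, max(values)

# Hypothetical test scores, invented for illustration only.
scores = [55, 60, 65, 70, 75, 80, 85, 90, 95]

minimum, q1, med, q3, maximum = five_number_summary(scores)
iqr = q3 - q1  # interquartile range: the width of the box

print(minimum, q1, med, q3, maximum, iqr)
```

The box would run from q1 to q3 with a line at med, and the whiskers would reach out to minimum and maximum.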

And then we have our whiskers. And these each cover 25 percent of the data values. So the lower whisker covers all the data values from the minimum value up to Q one. So that is the lowest 25 percent of the data values, a quarter of our data values. Then we’ve got the upper whisker, which covers all the data values between Q three and the maximum value. And that is the highest 25 percent of our data values, so the highest quarter.

So then we take a look at the median. And what this does is sits within the box and represents the center of the data. So 50 percent of the data values lie above the median, and 50 percent of our data values lie below the median.

So then, finally, what we’ll take a look at is our outliers or extreme values. And these are usually indicated by a star symbol. And what we find is that if there are one or more outliers in a data set, for the purposes of drawing our box-and-whisker plot, we take the minimum and maximum to be the minimum and maximum values of the data, excluding these outliers.
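The lesson does not say which rule decides whether a value counts as an outlier. One common convention, assumed here, is to flag anything more than 1.5 times the IQR below Q one or above Q three. The sketch below applies that assumed rule to hypothetical data and then takes the whisker ends from the values that remain.

```python
from statistics import quantiles

def split_outliers(values, k=1.5):
    """Separate values into (non_outliers, outliers) using fences at k * IQR.

    The 1.5 * IQR fence is a common convention assumed for this sketch;
    it is not a rule stated in the lesson.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    outside = [v for v in values if v < low_fence or v > high_fence]
    return inside, outside

# Hypothetical data with one extreme value, invented for illustration.
data = [4, 5, 6, 7, 8, 9, 10, 11, 30]
inside, outliers = split_outliers(data)

# The whiskers end at the most extreme values that are not outliers.
whisker_min, whisker_max = min(inside), max(inside)

print(outliers, whisker_min, whisker_max)
```

Here the value 30 would be drawn as a star, and the upper whisker would stop at 11 rather than stretching out to 30.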

So great, we’ve had a look at the key elements of a box plot. So now let’s take a look at some examples of how to use two box plots to compare two data sets on the same variable.

It is thought that taking math tests in the morning results in higher grades than taking math tests in the afternoon. Do the data represented in the box plots below confirm this hypothesis or not?

Well, if we take a look at the box plot for the tests taken in the morning first, we can see that the median is at approximately 87.5, whereas if we look at the scores for the tests taken in the afternoon, we can see that the median sits at about 75 on the axis. We can therefore conclude that, since the median score for morning tests is higher than that for afternoon tests, morning math test scores are higher on average. So therefore, in answer to the question, what we can say is that yes, the data support the hypothesis that taking math tests in the morning results in higher marks than taking math tests in the afternoon.

So what we’ve done here is taken a look at box plots that are drawn with variable values on the horizontal axis. However, this isn’t always the case. Our next example has the variable values on the vertical axis.

It has long been thought that cats are the most popular Internet pet. To test this theory, data were collected by a well-known search engine on the number of searches for cat-related videos and dog-related videos each month over a five-year period. The results are represented in the box plots below. Use the information illustrated in the box plots to determine whether or not cats are indeed the premier Internet pet.

So in order to decide whether or not cats are more popular than dogs in Internet searches, we want to know on average which pet has the higher number of searches. So the measure of average, or we can also call it the central tendency, that we have available from the box plots is the median. And this is shown by the line inside each of our boxes. And we can see that, in this example, the median lines are horizontal since the variable values are on the vertical axis.

And what we can see by reading across from the median line to the vertical axis is the value of the medians. We can see that the median for cats is approximately 15.5. And we need to remember that this is millions of searches. Whereas for dogs, we can see that it’s approximately 19 million searches. Since, on average, the number of searches for dogs per month was higher, we can conclude that the data does not support the theory. Because, in fact, in online searches for videos of them, dogs are on average the more popular pet, not cats.
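Both examples so far come down to the same step: read off the two medians and see which is larger. A minimal sketch of that comparison, using the approximate medians read from the plots above (the function name is just an illustrative choice):

```python
def compare_medians(name_a, median_a, name_b, median_b):
    """Return the name of the group with the higher median, or None for a tie."""
    if median_a > median_b:
        return name_a
    if median_b > median_a:
        return name_b
    return None

# Approximate medians read from the box plots in the two examples.
higher_scores = compare_medians("morning", 87.5, "afternoon", 75)
more_searches = compare_medians("cats", 15.5, "dogs", 19)  # millions of searches

print(higher_scores, more_searches)
```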

Okay, great, we’ve now looked at a couple of examples that compare data sets. Well, what we will look at now is a question that will do the same but it will also look at individual values and their significance.

The box plots below represent data collected on the length of recorded tracks from the top 100 rap and the top 100 heavy metal music charts. On average, which genre of music has longer tracks? And then compare the significance of the track length 4.40 minutes for the two music genres.

Well, if we take a look at part one first, the measure of average used in a box plot is the median, which is the vertical bar inside the box. Well, if we read down from our medians to the axis, we can see that the median track length for rap tracks is 4.00 minutes, whereas the median for heavy metal tracks is 4.80 minutes. And since the median for heavy metal tracks is higher than that for rap tracks, we can conclude that heavy metal tracks are on average longer than rap tracks.

Although it’s not part of this question, we could however also look at the spread of the track lengths. And if we did that, we would see that the range, which is the maximum value minus the minimum value, of rap tracks is in fact greater. So it has a greater spread of values. And the IQR or interquartile range, which is Q three minus Q one, is also greater for rap tracks. So we can see that, on both measures, the spread of track lengths is greater for rap tracks than for heavy metal tracks.

Okay, so now let’s take a look at the second part of the question. Well, for the second part of the question, what we want to do is concentrate on the value 4.40 minutes. So if we take a look at what this means for rap tracks, well 4.40 minutes is where the right-hand vertical bar of the box sits. And what this represents is the value of Q three, our upper quartile, so the third quartile for rap. And what this means is that 75 percent of rap tracks are less than 4.40 minutes long and only 25 percent are in fact longer than 4.40 minutes.

However, if we look at heavy metal tracks, 4.40 minutes represents Q one, so the lower quartile. And what this means is that 25 percent of heavy metal tracks are in fact shorter than 4.40 minutes and 75 percent of them are longer than 4.40 minutes. So, hence, what we can say about the significance of 4.40 minutes is that 75 percent of rap tracks are shorter than 4.40 minutes. However, 75 percent of heavy metal tracks are in fact longer than 4.40 minutes.
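The reasoning here relies on the fact that each quartile marker splits the data in a fixed proportion. As a small sketch, the lookup below turns a marker (Q one, the median, or Q three) into the percentages on either side of it. The names PERCENT_BELOW and describe are illustrative, not part of the lesson.

```python
# Share of the data lying below each marker on a box plot.
PERCENT_BELOW = {"q1": 25, "median": 50, "q3": 75}

def describe(genre, marker, minutes):
    """Return a sentence stating how much of the data lies on each side of a marker."""
    below = PERCENT_BELOW[marker]
    return (f"{below}% of {genre} tracks are shorter than {minutes:.2f} minutes "
            f"and {100 - below}% are longer")

# 4.40 minutes is Q3 for rap but Q1 for heavy metal.
rap = describe("rap", "q3", 4.40)
metal = describe("heavy metal", "q1", 4.40)

print(rap)
print(metal)
```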

Okay, great, so what we’ve looked at so far is how we compare data sets. But in our next example, what we’re gonna do is compare the distributions of two data sets on the same variable using box plots.

Using special tracker collars, the number of miles the Namibian lions Mason and Charlotte traveled each night was recorded over a month. The data is represented in the box plots below. Referring to the box plots, compare the number of miles Mason and Charlotte traveled at night during the month.

Now, the first thing we’re gonna do is mark on key values onto our box plots. So what we’re gonna do is represent all of Mason’s values in blue and Charlotte’s in pink. And what we can see first is all of our minimum and maximum values. And we can see that Mason and Charlotte traveled similar ranges of miles at night. Mason traveled between two and 17 miles a night, and Charlotte traveled between two and 15 miles a night.

So now what we’ve added is the other key values. So we’ve added our upper and lower quartiles and our medians. And so we can see that since Mason’s box in the plot is wider than Charlotte’s, we can say that Mason’s distances varied further from the middle traveling distance than Charlotte’s did. And this width of our box is the interquartile range, which is Q three minus Q one.

And what this represents is the middle 50 percent of the data. And if we use this to work out the interquartile range for Mason, we can see that it’s gonna be 10 minus four, cause it’s Q three minus Q one, which is gonna be six, whereas the interquartile range for Charlotte is 12 minus nine, which is equal to three. And we can use these when we make our final conclusions.
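The arithmetic for the two measures of spread can be checked with a short sketch. It uses only the values read from the lions’ box plots above; the function name spread is an illustrative choice.

```python
def spread(minimum, q1, q3, maximum):
    """Return (range, IQR) from the ends of the whiskers and the ends of the box."""
    return maximum - minimum, q3 - q1

# Values read from the box plots for the two lions (miles per night).
mason_range, mason_iqr = spread(minimum=2, q1=4, q3=10, maximum=17)
charlotte_range, charlotte_iqr = spread(minimum=2, q1=9, q3=12, maximum=15)

print(mason_range, mason_iqr)          # overall spread and middle-50% spread
print(charlotte_range, charlotte_iqr)
```

The ranges come out fairly similar (15 and 13 miles), while Mason’s IQR of six is twice Charlotte’s IQR of three, which is why the IQR is the more informative comparison here.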

So another bit of analysis we can do is take a look at the averages, which in this case are our medians. And we can see that Charlotte’s median distance is 11 miles, whereas Mason’s was five miles. So therefore, we can say that, on average, Charlotte traveled further than Mason.

So now if we look within each box, first of all taking a look at Mason’s, we can see that the box to the left of Mason’s median is narrower than the box to the right of the median. And we also note that his left whisker is shorter than his right whisker. So therefore, what both these features tell us is that Mason traveled shorter distances, so less than five miles, 50 percent of the time, whereas Charlotte traveled longer distances, and that is more than 11 miles, 50 percent of the time.

So if we take a look at Mason’s box plots separately in more detail, we can see that 50 percent of Mason’s distances were concentrated between two miles, which was the minimum, and five miles, which was the median. And that half of those were concentrated between four miles, which was the lower quartile, and five miles, the median. The other 50 percent of Mason’s distances were over a bigger spread of values. There were 25 percent between five, the median, and 10, the upper quartile, and 25 percent between 10, which again like we said was the upper quartile, and 17, the maximum number of miles.

And if we do the same for Charlotte, what we can see is that 25 percent of the time her distances were concentrated between her median, which was 11, and her upper quartile, which was 12. And 25 percent were between her upper quartile of 12 and 15 miles, which was her maximum distance. The lower half of her distances were spread over a wider range of distances because we’ve got 25 percent between her minimum, two, and her lower quartile, which was nine miles. And 25 percent between her lower quartile of nine miles and her median of 11 miles.

So what this means in statistical terms is that the concentration of half of Mason’s data in a small range of lower values means that Mason’s distances were right or positively skewed. Well, if we consider Charlotte’s data, well half of her data were concentrated in a narrow but quite high range. So therefore, Charlotte’s distances were left or negatively skewed.

So not shown in our particular example, but a third possibility is that if, in a box plot, the box is split evenly on either side of the median and the whiskers are approximately the same length, we can say that the data is symmetric. So what we’ve done here is we’ve compared three different statistical measures in the example. We’ve looked at the shape, the spread, and the average.
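The three shapes can be told apart by comparing the two halves of the box and the two whiskers. The sketch below is one way to do that from the five key values. The tolerance parameter, which decides how unequal two lengths may be and still count as approximately the same, is an assumption of this sketch, and the symmetric data set is hypothetical.

```python
def describe_shape(minimum, q1, median, q3, maximum, tolerance=0):
    """Classify a box plot as right skewed, left skewed, or symmetric.

    The tolerance is an assumption of this sketch, not a value from the lesson.
    """
    box_diff = (q3 - median) - (median - q1)        # right half of box minus left half
    whisker_diff = (maximum - q3) - (q1 - minimum)  # right whisker minus left whisker
    if box_diff > tolerance and whisker_diff > tolerance:
        return "right skewed"
    if box_diff < -tolerance and whisker_diff < -tolerance:
        return "left skewed"
    if abs(box_diff) <= tolerance and abs(whisker_diff) <= tolerance:
        return "symmetric"
    return "mixed"  # box and whiskers disagree, so there is no clear call

mason = describe_shape(2, 4, 5, 10, 17)
charlotte = describe_shape(2, 9, 11, 12, 15)
even = describe_shape(0, 2, 4, 6, 8)  # hypothetical symmetric data set

print(mason, charlotte, even)
```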

So now what we’re gonna do is formulate this into an answer to the question. So what we can say in conclusion is that, on average, Charlotte traveled further than Mason. That’s because her median was 11, whereas his was five. And we use the median as our average measure. Next, if we look at Mason’s data, this has a greater spread as its IQR, so its interquartile range, is six, compared to Charlotte’s, which is an interquartile range of three. And finally, Mason’s data is positively skewed, whereas Charlotte’s data is negatively skewed. And what this is, this is a comparison between the shapes.

So what we’ve done in the earlier examples is compare data sets on average. And then in this example, we’ve compared the distributions of two data sets. So now what we’re gonna take a look at is the key points of the lesson.

So the first thing we’re gonna summarize is the key parts of a box-and-whisker plot. So, first of all, at the end of our whiskers, we have our minimum value and our maximum data values. And it’s worth noting that, on box plots, you might see a line here as I’ve shown. Or you might see them without the vertical lines at either end of the whiskers. Okay, so great, this is the minimum and maximum values.

So now if we take a look at the box part itself, what we have at either end are the quartiles. So at the bottom end, we have the lower quartile or Q one. And at the upper end, we have the upper quartile or Q three. And then the line we have inside the box is our median. And then sometimes in some data sets, we’ll have values or a value that’s way outside of the other values. And this is usually represented by a star, and this is called an outlier.

Okay, great, so now what we’re gonna take a look at is how our data is spread across our box plot. Well, we can see that the data is spread in four sections. 25 percent of the data is from the minimum to Q one. The next 25 percent of our data values are from Q one, or lower quartile, to the median. Then there’s another 25 percent of the data from the median to Q three. And then, finally, our last 25 percent of our data set is from Q three to the maximum value. So therefore, what we can summarize is that the box part of our box plot represents 50 percent of our data set.

And finally, it’s also worth noting that from the minimum value to the median is 50 percent of our data set. And from the median to our maximum value is the other 50 percent of our data set. And how each half is spread out is something that we look at if we’re comparing the distributions of two data sets.

So now if we were looking to compare distributions using box plots, the first thing we might look at is the average. And in a box plot, the measure of average we use is the median. So comparing the medians of the two data sets, we can determine in which data set the values are on average higher or lower than the other, or if there is no difference on average.

So next, what we can do is take a look at the spread. And the spread of the data sets can be compared by looking at the range, which is our maximum value minus our minimum value, and our interquartile range, which is Q three minus Q one. So first of all, the range tells us the overall spread of each data set. And the interquartile range tells us the spread of the middle 50 percent of the data, that is, how far the middle 50 percent of the data deviates from the center.

Well, finally, the last thing we might look at if we’re comparing the distributions of data sets is the shape. And the shape of a data set refers to whether or not it’s symmetric or skewed. So what exactly does this mean? Well, it means that if a data set is distributed symmetrically about the center, the box should be approximately evenly split by the median. And the whiskers should be approximately of equal length. If a data set is skewed, i.e., more concentrated at one end than the other, then one of the whiskers will be longer than the other. And the box will not be evenly split by the median.

And as we can see in our example, if a data set is right or positively skewed, then the right whisker will be longer than the left. And the part of the box to the right of the median is longer than the part to the left, because the median is further over to the left in our box. However, as in this little sketch that we’ve drawn here, if a data set is left or negatively skewed, then the left whisker will be longer than the right. And the part of the box to the left of the median is longer than the part to the right.
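Since the sketches from the video are not reproduced in this transcript, a rough text drawing can stand in for them. The function below is an illustrative sketch only: it assumes whole-number values and uses one character per unit on the axis, with "[" and "]" for the quartiles, "M" for the median, and "|" for the whisker ends. It is applied to the two lions’ values from the earlier example.

```python
def ascii_box_plot(minimum, q1, median, q3, maximum, axis_max):
    """Draw a one-line box plot, one character per axis unit (whole numbers only)."""
    row = [" "] * (axis_max + 1)
    for x in range(minimum, maximum + 1):
        row[x] = "-"  # whiskers
    for x in range(q1, q3 + 1):
        row[x] = "="  # box
    row[minimum] = "|"
    row[maximum] = "|"
    row[q1] = "["
    row[q3] = "]"
    row[median] = "M"
    return "".join(row)

mason_plot = ascii_box_plot(2, 4, 5, 10, 17, axis_max=17)
charlotte_plot = ascii_box_plot(2, 9, 11, 12, 15, axis_max=17)

print(mason_plot)      # long right whisker and right half of box: right skewed
print(charlotte_plot)  # long left whisker and left half of box: left skewed
```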
