In this lesson, we’re gonna learn
how to compare the distributions of two data sets using box plots. Box plots, which are
sometimes called box-and-whisker plots, are a good way to visualize differences
among groups that’ve been measured on the same variable. However, what we want to do before
we start comparing our data sets using box plots is to remind ourselves of the key
elements of a box plot.
So, as I said, what we’re gonna do
is have a look at the key elements of a box plot. So we’ve got a sketch here. And we’re looking first at the
horizontal axis, which covers all the possible data values. So next, what we’re looking at is
the box section of our box plot. And this is constructed
using the lower and upper quartiles, so Q one and Q three, and the median of our data
set. And the box part of the
box-and-whisker plot covers the middle 50 percent of the values in the data set. This 50 percent of the data lies
between Q one and Q three, so our lower and upper quartile, and its range is the
interquartile range or IQR. And we find this by subtracting Q
one from Q three.
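So, to make those pieces concrete, here’s a minimal sketch in Python of how the values behind a box plot could be worked out. The data list and the function name are made up for illustration, and the quartile method used here is just one of several conventions, so other tools might give slightly different values for Q one and Q three.

```python
from statistics import median, quantiles

def five_number_summary(values):
    """Return the minimum, Q1, median, Q3 and maximum of a list of numbers."""
    ordered = sorted(values)
    # method="inclusive" is one common quartile convention (an assumption here);
    # other conventions can give slightly different Q1 and Q3.
    q1, _, q3 = quantiles(ordered, n=4, method="inclusive")
    return {
        "min": ordered[0],
        "q1": q1,
        "median": median(ordered),
        "q3": q3,
        "max": ordered[-1],
    }

# Hypothetical data, not taken from the lesson.
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9]
summary = five_number_summary(scores)
iqr = summary["q3"] - summary["q1"]  # interquartile range: Q3 minus Q1
print(summary)
print("IQR:", iqr)
```

With this made-up list, the box would run from three to seven with the median line at five, so the interquartile range is four.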
And then we have our whiskers. And these each cover 25 percent of
the data values. So the lower whisker covers all the
data values from the minimum value up to Q one. So what that is is the lowest 25
percent of the data values. So it’s a quarter of our data
values. Then we’ve got the upper whisker,
which covers all the data values between Q three and the maximum value. And what that is is the highest 25
percent of our data values. So this is the highest quarter of
our data values.
So then we take a look at the
median. And what this does is sits within
the box and represents the center of the data. So 50 percent of the data values
lie above the median, and 50 percent of our data values lie below the median.
So then, finally, what we’ll take a
look at is our outliers or extreme values. And these are usually indicated
by a star symbol. And what we find is that if there’s
one or more outliers in a data set, for the purposes of drawing our box-and-whisker
plot, we take the minimum and maximum to be the minimum and maximum values of the data,
excluding these outliers.
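Now, the lesson doesn’t say how we decide which values count as outliers. One common convention, and this is an assumption rather than something from the lesson, is to flag any value more than 1.5 times the IQR below Q one or above Q three. Here’s a rough sketch of that rule in Python, using made-up data and a hypothetical function name, showing how the whisker ends are then taken from the values that are left.

```python
from statistics import quantiles

def whisker_ends_and_outliers(values, k=1.5):
    """Return (lower whisker end, upper whisker end, outliers).

    Assumes the common 1.5 * IQR rule; the lesson itself does not
    specify how outliers are identified.
    """
    ordered = sorted(values)
    q1, _, q3 = quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    inside = [v for v in ordered if low_fence <= v <= high_fence]
    outliers = [v for v in ordered if v < low_fence or v > high_fence]
    # The whiskers stop at the smallest and largest values that are not outliers.
    return inside[0], inside[-1], outliers

# Hypothetical data, not taken from the lesson.
data = [2, 4, 4, 5, 6, 7, 8, 9, 30]
low, high, outliers = whisker_ends_and_outliers(data)
print("whiskers run from", low, "to", high)
print("outliers:", outliers)
```

So in this made-up data set, the value 30 would be drawn as a star, and the upper whisker would stop at nine rather than stretching all the way out to 30.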
So great, we’ve had a look at the
key elements of a box plot. So now let’s take a look at some
examples of how to use two box plots to compare two data sets on the same variable.
It is thought that taking math
tests in the morning results in higher grades than taking math tests in the
afternoon. Do the data represented in the box
plots below confirm this hypothesis or not?
Well, if we take a look at the box
plot for the test taken in the morning first, we can see that the median is at
approximately 87.5, whereas if we look at the scores for the tests taken in the
afternoon, we can see that the median sits at about 75 on the axis. We can therefore conclude that
since the median score for morning tests is higher than that for afternoon tests, on
average, morning math test scores are higher. So therefore, in answer to the
question, what we can say is that yes, the data do support the hypothesis that taking math tests in the morning results in
higher grades than taking math tests in the afternoon.
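The comparison we’ve just made can be sketched as a tiny Python function. The function name is hypothetical, and the two medians are the approximate values we read off the box plots, so treat this as an illustration of the reasoning rather than a formal test.

```python
def compare_medians(label_a, median_a, label_b, median_b):
    """Say which group has the higher median, if either."""
    if median_a > median_b:
        return f"{label_a} is higher on average"
    if median_b > median_a:
        return f"{label_b} is higher on average"
    return "no difference on average"

# Approximate medians read from the box plots in this example.
result = compare_medians("morning", 87.5, "afternoon", 75)
print(result)
```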
So what we’ve done here is taken a
look at box plots that are drawn with variable values on the horizontal axis. However, this isn’t always the
case. Our next example has the variable
values on the vertical axis.
It has long been thought that cats
are the most popular Internet pet. To test this theory, data were
collected by a well-known search engine on the number of searches for cat-related
videos and dog-related videos each month over a five-year period. The results are represented in the
box plots below. Use the information illustrated in
the box plots to determine whether or not cats are indeed the premier Internet pet.
So in order to decide whether or
not cats are more popular than dogs in Internet searches, we want to know on average
which pet has the highest number of searches. So the measure of average, or we
can also call it the central tendency, that we have available from the box plots is
the median. And this is shown by the line
inside each of our boxes. And we can see that, in this case,
so our example here, the median lines are horizontal since the variable values are
on the vertical axis.
And what we can see by reading
across from the median line to the vertical axis is the value of the medians. We can see that the median for cats
is approximately 15.5. And we need to remember that this
is millions of searches. Whereas for dogs, we can see that
it’s approximately 19 million searches. Since, on average, the number of
searches for dogs per month was higher, we can conclude that the data disproves the
theory. In fact, in online
searches for videos, dogs are on average the more popular pet, not cats.
Okay, great, we’ve now looked at a
couple of examples that compare data sets. Well, what we will look at now is a
question that will do the same, but it will also look at individual values and their significance.
The box plots below represent data
collected on the length of recorded tracks from the top 100 rap and the top 100
heavy metal music charts. On average, which genre of music
has longer tracks? And then compare the significance
of the track length 4.40 minutes for the two music genres.
Well, if we take a look at part one
first, the measure of average used in a box plot is the median, which is the
vertical bar inside the box. Well, if we read down from our
medians to the axis, we can see that the median track length for rap tracks is 4.00
minutes, whereas the median for heavy metal tracks is 4.80 minutes. And since the median for heavy
metal tracks is higher than that for rap tracks, we can conclude that heavy metal
tracks are on average longer than rap tracks.
Although it’s not part of this
question, we could however also look at the spread of the track lengths. And if we did that, we would see
that the range, which is the maximum value minus the minimum value, of rap tracks is
in fact greater. So it has a greater spread of
values. And the IQR, or interquartile
range, which is Q three minus Q one, is also greater for rap tracks. So we can see actually that the
spread of the track lengths of rap tracks is greater than that of heavy metal tracks.
Okay, so now let’s take a look at
the second part of the question. Well, for the second part of the
question, what we want to do is concentrate on the value 4.40 minutes. So if we take a look at what this
means for rap tracks, well 4.40 minutes is where the right-hand vertical bar of the
box sits. And what this represents is the
value of Q three, our upper quartile, so the third quartile for rap. And what this means is that 75
percent of rap tracks are less than 4.40 minutes long and only 25 percent are in
fact longer than 4.40 minutes.
However, if we look at heavy metal
tracks, 4.40 minutes represents Q one, so the lower quartile. And what this means is that 25
percent of heavy metal tracks are in fact shorter than 4.40 minutes and 75
percent of them are longer than 4.40 minutes. So, hence, what we can say about
the significance of 4.40 minutes is that 75 percent of rap tracks are shorter than
4.40 minutes. However, 75 percent of heavy metal
tracks are in fact longer than 4.40 minutes.
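So the idea here is that where a value sits on the box tells us roughly what share of the data lies below it. Here’s a small sketch of that lookup in Python. The function and dictionary names are made up, and only the quartiles and medians actually stated in this example are included.

```python
# Approximate share of the data lying below each landmark of a box plot.
SHARE_BELOW = {"q1": 0.25, "median": 0.50, "q3": 0.75}

def share_below(landmarks, value):
    """Return the share of data below `value` if it matches a known landmark."""
    for name, position in landmarks.items():
        if position == value:
            return SHARE_BELOW[name]
    return None  # value does not sit on a landmark we know about

# Only the landmarks stated in the example are listed here.
rap = {"median": 4.00, "q3": 4.40}
heavy_metal = {"q1": 4.40, "median": 4.80}

rap_share = share_below(rap, 4.40)
metal_share = share_below(heavy_metal, 4.40)
print(f"rap tracks shorter than 4.40 minutes: {rap_share:.0%}")
print(f"heavy metal tracks shorter than 4.40 minutes: {metal_share:.0%}")
```

The same track length gives 75 percent for rap and 25 percent for heavy metal, which is exactly the contrast we’ve just described.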
Okay, great, so what we’ve looked
at so far is how we compare data sets. But in our next example, what we’re
gonna do is compare the distributions of two data sets on the same variable using box plots.
Using special tracker collars, the
number of miles the Namibian lions Mason and Charlotte traveled each night was
recorded over a month. The data is represented in the box
plots below. Referring to the box plots, compare
the number of miles Mason and Charlotte traveled at night during the month.
Now, the first thing we’re gonna do
is mark the key values on our box plots. So what we’re gonna do is represent
all of Mason’s values in blue and Charlotte’s in pink. And what we can see first is all of
our minimum and maximum values. And we can see that Mason and
Charlotte traveled similar ranges of miles at night. Mason traveled between two and 17
miles a night, and Charlotte traveled between two and 15 miles a night.
So now what we’ve added is the
other key values. So we’ve added our upper and lower
quartiles and our medians. And so we can see that since
Mason’s box in the plot is wider than Charlotte’s, we can say that Mason’s distances
varied further from the middle traveling distance than Charlotte’s did. And this width of our box is the
interquartile range, which is Q three minus Q one.
And what this represents is the
middle 50 percent of the data. And if we use this to work out the
interquartile range for Mason, we can see that it’s gonna be 10 minus four, cause
it’s Q three minus Q one, which is gonna be six, whereas the interquartile range for
Charlotte is 12 minus nine, which is equal to three. And we can use these when we make
our final conclusions.
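Just to check that arithmetic, here’s a short Python sketch that works out the range and the interquartile range for both lions. The values are the ones we marked on the box plots; the function and variable names are made up for illustration.

```python
# Values marked on the box plots in this example (miles per night).
mason = {"min": 2, "q1": 4, "q3": 10, "max": 17}
charlotte = {"min": 2, "q1": 9, "q3": 12, "max": 15}

def spread(summary):
    """Return the range and interquartile range for one box plot."""
    return {
        "range": summary["max"] - summary["min"],  # maximum minus minimum
        "iqr": summary["q3"] - summary["q1"],      # Q3 minus Q1
    }

mason_spread = spread(mason)
charlotte_spread = spread(charlotte)
print("Mason:", mason_spread)
print("Charlotte:", charlotte_spread)
```

So Mason has a range of 15 and an IQR of six, and Charlotte has a range of 13 and an IQR of three.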
So another bit of analysis we can
do is take a look at the averages, which in this case are our medians. And we can see that Charlotte’s
median distance is 11 miles, whereas Mason’s was five miles. So therefore, we can say that, on
average, Charlotte traveled further than Mason.
So now if we look within each box,
first of all taking a look at Mason’s, we can see that the part of the box to the left of
Mason’s median is narrower than the part to the right of the median. And we also note that his left
whisker is shorter than his right whisker. So therefore, what both these
features tell us is that Mason traveled shorter distances, so less than five miles,
50 percent of the time, whereas Charlotte traveled longer distances, and that is
more than 11 miles, 50 percent of the time.
So if we take a look at Mason’s box
plot separately in more detail, we can see that 50 percent of Mason’s distances
were concentrated between two miles, which was the minimum, and five miles, which
was the median. And that half of those were
concentrated between four miles, which was the lower quartile, and five miles, the
median. The other 50 percent of Mason’s
distances were over a bigger spread of values. There were 25 percent between five,
the median, and 10, the upper quartile, and 25 percent between 10, which again like
we said was the upper quartile, and 17, the maximum number of miles.
And if we do the same for
Charlotte, what we can see is that 25 percent of the time her distances were
concentrated between her median, which was 11, and her upper quartile, which was
12. And 25 percent were between her
upper quartile of 12 and 15 miles, which was her maximum distance. The lower half of her distances
were spread over a wider range of distances because we’ve got 25 percent between her
minimum, two, and her lower quartile, which was nine miles. And 25 percent between her lower
quartile of nine miles and her median of 11 miles.
So what this means in statistical
terms is that the concentration of half of Mason’s data in a small range of lower
values means that Mason’s distances were right or positively skewed. Well, if we consider Charlotte’s
data, well half of her data were concentrated in a narrow but quite high range. So therefore, Charlotte’s distances
were left or negatively skewed.
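The way we’ve just read the skew off each box plot can be sketched as a simple rule in Python. This is only a rough rule of thumb based on comparing the two halves of the box and the two whiskers, not a formal measure of skewness, and the function name is made up for illustration.

```python
def skew_direction(s):
    """Judge skew from a five-number summary by comparing box halves and whiskers."""
    left_box = s["median"] - s["q1"]
    right_box = s["q3"] - s["median"]
    left_whisker = s["q1"] - s["min"]
    right_whisker = s["max"] - s["q3"]
    if right_box > left_box and right_whisker > left_whisker:
        return "right (positively) skewed"
    if left_box > right_box and left_whisker > right_whisker:
        return "left (negatively) skewed"
    return "not clearly skewed by this rule"

# Values read from the box plots in this example (miles per night).
mason = {"min": 2, "q1": 4, "median": 5, "q3": 10, "max": 17}
charlotte = {"min": 2, "q1": 9, "median": 11, "q3": 12, "max": 15}

mason_shape = skew_direction(mason)
charlotte_shape = skew_direction(charlotte)
print("Mason:", mason_shape)
print("Charlotte:", charlotte_shape)
```

For Mason, the right half of the box and the right whisker are both longer, so the rule gives right skewed. For Charlotte, it’s the left half and the left whisker that are longer, so it gives left skewed.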
So not shown in our particular
example, but a third possibility is that if, in a box plot, the box is split evenly
on either side of the median and the whiskers are approximately the same length, we
can say that the data is symmetric. So what we’ve done here is we’ve
compared three different statistical measures in the example. We’ve looked at the shape, the
spread, and the average.
So now what we’re gonna do is
formulate this into an answer to the question. So what we can say in conclusion is
that, on average, Charlotte traveled further than Mason. That’s because her median was 11,
whereas his was five. And we use the median as our
average measure. Next, if we look at Mason’s data,
this has a greater spread as its IQR, so its interquartile range, is six, compared
to Charlotte’s, which is an interquartile range of three. And finally, Mason’s data is
positively skewed, whereas Charlotte’s data is negatively skewed. And what this is, this is a
comparison between the shapes.
So what we’ve done in the examples
so far is compare data sets. And then in this example, we’ve
looked at the distributions of two data sets and compared these. So now what we’re gonna take a look
at is the key points of the lesson.
So the first thing we’re gonna
summarize is the key parts of a box-and-whisker plot. So, first of all, at the end of our
whiskers, we have our minimum and maximum data values. And it’s worth noting that, on box
plots, you might see a line here as I’ve shown. Or you might see them without the
vertical lines at either end of the whiskers. Okay, so great, these are the minimum
and maximum values.
So now if we take a look at the box
part itself, what we have at either end are the quartiles. So at the bottom end, we have the
lower quartile or Q one. And at the upper end, we have the
upper quartile or Q three. And then the line we have inside
the box is our median. And then sometimes in some data
sets, we’ll have values or a value that’s way outside of the other values. And this is usually represented by
a star, and this is called an outlier.
Okay, great, so now what we’re
gonna take a look at is how our data is spread across our box plot. Well, we can see that the data is
spread in four sections. 25 percent of the data is from the
minimum to Q one. The next 25 percent of our data
values are from Q one, or lower quartile, to the median. Then there’s another 25 percent of
the data from the median to Q three. And then, finally, our last 25
percent of our data set is from Q three to the maximum value. So therefore, what we can summarize
is that the box part of our box plot represents 50 percent of our data set.
And finally, it’s also worth noting
that if we take a look below the median, so
from the minimum to the median, that is 50 percent of our data set. And from the median to our maximum
value is the next 50 percent of our data set. And the distribution of this is
something that we compare if we’re comparing the distribution of two data sets.
So now if we were looking to
compare distributions using box plots, the first thing we might look at is the
average. And in a box plot, the measure of
average we use is the median. So comparing the medians of the two
data sets, we can determine in which data set the values are on average higher or
lower than the other, or if there is no difference on average.
So next, what we can do is take a
look at the spread. And the spread of the data sets can
be compared by looking at the range, which is our maximum value minus our minimum
value, and our interquartile range, which is Q three minus Q one. So first of all, the range tells us
the overall spread of each data set. And the interquartile range tells
us the spread of the middle 50 percent of the data, that is, how far the middle 50
percent of the data deviates from the center.
Well, finally, the last thing we
might look at if we’re comparing the distributions of data sets is the shape. And the shape of a data set refers
to whether or not it’s symmetric or skewed. So what exactly does this mean? Well, it means that if a data set
is distributed symmetrically about the center, the box should be approximately
evenly split by the median. And the whiskers should be
approximately of equal length. If a data set is skewed, i.e., more
concentrated at one end than the other, then one of the whiskers will be longer than
the other. And the box will not be evenly
split by the median.
And as we can see in our example,
if a data set is right or positively skewed, then the right whisker will be longer
than the left. And the part of the box on the right side of the median is
longer than the part on the left, because the median is further over to the left in our
box. However, as in this little sketch
that we’ve drawn here, if a data set is left or negatively skewed, then the left
whisker will be longer than the right. And the part of the box on the left side of the median is
longer than the part on the right.