Video Transcript
In this video, we will learn how to
identify outliers from a data set. We’ll first look at how we would do
that from a graph and then consider how we would calculate that mathematically.
Sometimes in a data set, there are
data points whose values are much bigger or much smaller than the main group of
data. And we call these data points
outliers or extreme values. Let’s consider the graph below. Most of the data points fall
between 15 and 60. And that means this 120 data point
would be considered an outlier because it is substantially larger than the rest of
the data points.
Sometimes an outlier is a genuine
data point. For example, there are people who
are genuinely much taller than the average height of a human. And it’s important for us to
consider outliers when analyzing a data set, since extreme values can lead us to
false conclusions about our data set.
For example, suppose you are an
airline seat designer. To design the passenger seats, you
need to know the average height of an adult person. If you use the average of the
heights in the image above, the height of the very tall person would make the
overall mean larger than it should be. That means the seat you design
would be larger than necessary, and your boss would not be happy, since that would mean
fewer seats, fewer passengers, and less profit.
Again, this outlier is a genuine
data point. But there are cases when we’re
analyzing a data set that we need to remove extreme values so we don’t get false
conclusions. While some outliers are genuine
data points, sometimes outliers indicate to us an error or misrepresentation. And it’s good to check and see if
there’s been an error in recording the data at these points. In our next example, we’ll see how
we might spot a potential outlier from different graphs.
The table below shows the number of
messages exchanged on the smart phones of 14 students over a single month. The data has also been plotted on a
dot plot. Are there any outliers in this data
set? If so, specify the value or values
of these outliers.
If we’re just given a table of
data, the outlier isn’t always apparent. One benefit of a graph like this is
we’re able to see the data points in relation to each other. We can very quickly see that we
have one data point that is far away from most of the others. The data points here primarily fall
between 2800 and 5500 with only one point falling outside this range. On the graph, we can identify this
as the data point 9754. In this data, there is one outlier,
and it’s 9754.
This question is only asking us to
identify the outlier. We need to note that if we were
analyzing this data, we would have to decide whether this point should be included
in our analysis or not. We might ask ourselves something
like: Is it feasible for a student to exchange 9754 messages in a month or has an
error been made in the recording of this data? Once we answered some questions
like that, we would be able to make a final decision about whether or not to include
this value in the final analysis. Either way, it’s right to label it
as an outlier.
In our next example, we’ll consider
a different kind of diagram.
Which of the statements is correct
for the distribution represented by the diagram? (A) The distribution is
symmetric. (B) The distribution has an outlier
at six. (C) The distribution has a gap from
21 to 29. (D) The distribution has a cluster
from seven to 20. Or (E) the distribution has a peak
at 22.
To find which of these is correct,
we’ll consider them one at a time. First, the distribution is
symmetric. To find out if this distribution is
symmetric, you can sketch a line over the distribution. This image is not symmetric. The six is stretching out the
image. And so, we can’t say that the
distribution is symmetric.
Now, let’s consider statement (B)
the distribution has an outlier at six. To consider a six as an outlier, we
want to look at the spread of the data points. All of the other data points fall
between 21 and 29. The distance from 6 to 21 is
15. We can say that the data point six
is quite far removed from the rest of the data. And therefore, it is a true
statement to say that this distribution has an outlier at six. But we want to go ahead and check
the other three statements.
The distribution has a gap from 21
to 29. Since the majority of the points
fall between 21 and 29, there is no gap there. Option (D) the distribution has a
cluster from seven to 20. There are, in fact, no data points
from seven to 20, which means there can’t be a cluster there. And finally, option (E) the
distribution has a peak at 22. If we look closely at 22, there’s
only one data point there. Looking across our diagram, we see
that the peak happens at 26, which means option (E) is not correct, leaving us with
only one true statement. The distribution has an outlier at
six.
Here’s another data set to
consider.
The data in the table below is the
average recorded speed in miles per hour of the first serve of the top 10 tennis
players in the world. For part (1), calculate the mean
first serve speed in miles per hour. For part (2), calculate the mean again, ignoring the 1025 miles per hour data point. For part (3), by comparing the means
you found in the first two parts of the question, make a conclusion about the
validity of the 1025 miles per hour data point.
Beginning in part (1), we need to
calculate the mean speed, the average speed for these 10 players. To calculate the mean speed, we’ll
need to sum all of the speeds together and divide that value by the number of
players. This means we’ll add up all 10 of
the values from the table and then divide by 10. When we do that, we get 2107 over
10, which then becomes 210.7 miles per hour. If we consider all 10 data points
in this table and average them, we come up with an average of 210.7 miles per
hour. That’s part (1).
In part (2), we want to do the
same thing, but we want to ignore this 1025 data point. By ignoring that data point, we now
are only averaging nine of the players together. When we sum the remaining nine, we
get 1082 over nine. When we divide that out, we get
120.2 miles per hour, rounded to one decimal place.
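The two means can be checked with a short Python sketch. It uses only the totals quoted in the transcript (2107 for all 10 speeds, and the 1025 data point), since the table itself is not reproduced here. The variable names are our own.

```python
# Totals quoted in the transcript; the individual speeds are in the table.
total_all = 2107   # sum of all 10 first serve speeds (mph)
suspect = 1025     # the data point being questioned (mph)

# Part (1): mean of all 10 values.
mean_all = total_all / 10

# Part (2): mean of the remaining 9 values after removing the suspect point.
mean_without = (total_all - suspect) / 9

print(round(mean_all, 1))      # 210.7
print(round(mean_without, 1))  # 120.2
```

Removing a single value moves the mean by about 90 miles per hour, which is what makes the 1025 data point stand out.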
For part (3), we’ll need to compare
these two values. When we included 1025, we found the
average speed was 210.7 miles per hour. But when we look at the table, 210.7
is significantly higher than the other nine values, and the 1025 itself is
still substantially far from it. So, no value in the table
is close to this mean. 210.7 is not a good
representation of the typical first serve speed.
If we think about the second
average where we ignored the 1025, we found an average of 120.2. Looking at our table, four of the
values are below 120.2, and five of the values are above 120.2. But all nine of these values are
very near 120.2. And so, we could say that 120.2 is
a much fairer representation of the average. But it’s also worth considering, at
this point, how did this value of 1025 come into our table?
Is it realistic to think that the
fastest tennis server in the world is 10 times faster than anyone else in the
world? At minimum, we say that 1025 is an
outlier, but a reasonable conclusion is that it must in fact be an error in the
table. Our summary of part (3) can say
1025 is an outlier for this data set and is likely an error.
In our past three examples, we’ve
looked at graphs and tables of data to judge whether or not there were
outliers. But we also can confirm by
calculation whether or not a point is an outlier. Let’s look at how we would do that
now.
To do these calculations, we’ll
need the interquartile range and the lower and upper quartiles. So, first, let’s remind ourselves
of what these are.
The interquartile range, or IQR, of
a data set is a measure of how the data values are spread around the center of the
data set. The first or lower quartile, 𝑄
one, marks the center of the lowest half of the data set. So, 25 percent of the data sits
below 𝑄 one, and 75 percent of the data sits above 𝑄 one. The second quartile, 𝑄 two, is the
median and marks the middle of the data set. 50 percent of the data falls below
𝑄 two, and 50 percent of the data falls above 𝑄 two.
And the third or upper quartile, 𝑄
three, marks the center of the top half of the data set. 75 percent of the data lies below
𝑄 three and 25 percent above it. And the interquartile range is
equal to the upper quartile minus the lower quartile, 𝑄 three minus 𝑄 one. It measures the spread of the
middle 50 percent of the data.
Using this information, we can find
out how to identify outliers in a data set. To identify outliers by
calculation, a data point is considered an outlier if it is either greater than
quartile three plus 1.5 times the interquartile range. Or if it is less than quartile one
minus 1.5 times the interquartile range. Sometimes this is called the 1.5
times IQR rule. So, let’s see an example of a data
set where we can use this 1.5-times-IQR rule.
The numbers of matches won by 12
teams in the national league are 11, five, six, six, nine, 10, 19, 14, 11, nine,
nine, and six. Is it true or false that 19 is an
outlier of the data?
To identify whether 19 is an
outlier or not, we’ll need the interquartile range. And to do that, we’ll have to
identify quartile one and quartile three. This means our first step is to put
the data in order of size. Now, we have our 12 data points in
size order.
We know that the median will come
in the middle of these 12 data points and that the median is quartile two. 𝑄 one is the middle of the lower
half of the data. Since there are six data points
below the median, 𝑄 one will be located between the third and fourth. And similarly, 𝑄 three is the
middle of the upper half of the data. There are six points above quartile
two. And that means 𝑄 three will be
located in the middle of those. It will be between the ninth and
10th value.
Because the third and fourth
values are both six, quartile one is six. And because the ninth and 10th
values are both 11, quartile three is equal to 11. The interquartile range equals 𝑄
three minus 𝑄 one. For us, that’s 11 minus six. And so, we have an IQR of five. To find out if 19 is, in fact, an
outlier, we’ll use the 1.5 times IQR rule. This rule tells us that a value is
an outlier if it’s greater than 𝑄 three plus 1.5 times the IQR or less than 𝑄 one
minus 1.5 times the IQR.
Since we’re looking at a data point
that’s above 𝑄 three, we’ll look for the greater-than option. And that means we want to know is
19 greater than the quartile three plus 1.5 times the interquartile range? The IQR is five. 𝑄 three is 11. 1.5 times five is 7.5, plus 11
equals 18.5. 19 is greater than 18.5. And so, we can say it’s a true
statement that 19 is an outlier of this data set.
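The same steps can be sketched in Python. This is a minimal sketch that assumes the quartile method used in this example, where 𝑄 one and 𝑄 three are the medians of the lower and upper halves of the ordered data. Other quartile conventions exist and can give slightly different values. The function names are our own.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]    # values below the median
    upper = data[-half:]   # values above the median
    return median(lower), median(upper)

def find_outliers(values):
    """Return the values outside the 1.5-times-IQR bounds."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    low_bound = q1 - 1.5 * iqr
    high_bound = q3 + 1.5 * iqr
    return [v for v in values if v < low_bound or v > high_bound]

wins = [11, 5, 6, 6, 9, 10, 19, 14, 11, 9, 9, 6]
print(quartiles(wins))      # (6.0, 11.0)
print(find_outliers(wins))  # [19]
```

The only value outside the bounds of negative 1.5 and 18.5 is 19, which agrees with the answer above.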
We’ll now look at one final
example.
The table shows the heights in
meters of the tallest buildings in a city. If there are any outliers in the
data, find their values.
Because we’re just given a table of
data and we want to find out if there are any outliers, we can use the 1.5-times-IQR
rule. An outlier 𝑥 would be less than 𝑄
one minus 1.5 times the IQR, or the outlier would be greater than 𝑄 three plus
1.5 times IQR.
Our first step here is to calculate
the interquartile range and find these boundaries. And to do that, the first thing we
do is put the data in size order. We also know that each quartile is
25 percent of the data. That would be one-fourth of the
data. Since we have 12 building heights,
we can divide 12 by four, which is three. And that means our first quartile
will occur after the third data point, our second quartile after the sixth data
point, and our third quartile after the ninth data point.
Since quartile one is between the
third and fourth data point, we need to average the third and fourth data points to
find its value. 𝑄 one equals 561 plus 607 divided
by two, which is 584. We need to do the same thing for 𝑄
three. We average the ninth and 10th data
points, 714 plus 725 divided by two, which is 719.5. The interquartile range is 𝑄 three
minus 𝑄 one, for us, 719.5 minus 584, which equals 135.5.
Let’s make a list of what we
know. 𝑄 one is 584. 𝑄 three is 719.5. And our IQR is 135.5. We’re now ready to go back and use
these rules to calculate the upper and lower bounds for outliers. The lower bound is
𝑄 one minus 1.5 times the IQR, so an outlier 𝑥 on the low end must be less than that value. And the upper bound is
𝑄 three plus 1.5 times the IQR, so an outlier 𝑥 on the high end must be greater than that value. Let’s plug in the values we have,
𝑄 one, 584, and the IQR, 135.5. 1.5 times 135.5 is 203.25, and 584 minus 203.25 gives us
380.75.
And that means in order for there
to be an outlier on the low end, it needs to be less than 380.75. Our smallest data point is 502. And that means we don’t have an
outlier on the lower end. We’ll check the upper end. Plug in the values for 𝑄 three and
the IQR. And we find that the upper bound
for outliers is 922.75. In order for there to be an outlier
on the upper end, it would need to be greater than 922.75. Our largest data point is 901,
which is less than this value. And since none of our data values
are less than the lower bound for the outliers or greater than the upper bound for
outliers, there are no outliers in this data set.
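As a check on the arithmetic, here is a short Python sketch that computes both bounds from the quartiles found above. It uses only the smallest and largest heights quoted in the transcript, since every other height lies between them. The function name `bounds` and its parameter `k` are our own.

```python
def bounds(q1, q3, k=1.5):
    """Return the lower and upper outlier bounds for the k-times-IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

low, high = bounds(584, 719.5)
print(low, high)  # 380.75 922.75

# Smallest and largest building heights from the table (meters).
smallest, largest = 502, 901

# If neither extreme is outside the bounds, no other value can be.
print(smallest < low or largest > high)  # False
```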
We can summarize with a few key
points. An outlier or extreme value in a
data set is a data point whose value is either much smaller or much larger than the
majority of the data set. Mathematically, we calculate
outliers with the 1.5-times-the-IQR rule. A data point is classified as an
outlier if it is either less than 𝑄 one minus 1.5 times the IQR or greater than 𝑄
three plus 1.5 times the IQR. And finally, potential outliers can
be identified using a graph of the data set.