Video: Outliers of a Data Set

In this video, we will learn how to identify outliers from a data set.


Video Transcript

In this video, we will learn how to identify outliers from a data set. We’ll first look at how we would do that from a graph and then consider how we would calculate that mathematically.

Sometimes in a data set, there are data points whose values are much bigger or much smaller than the main group of data. And we call these data points outliers or extreme values. Let’s consider the graph below. Most of the data points fall between 15 and 60. And that means this 120 data point would be considered an outlier because it is substantially larger than the rest of the data points.

Sometimes an outlier is a genuine data point. For example, there are people who are genuinely much taller than the average height of a human. And it’s important for us to consider outliers when analyzing a data set, since extreme values can lead us to false conclusions about our data set.

For example, suppose you are an airline seat designer. To design the passenger seats, you need to know the average height of an adult person. If you use the average of the heights in the image above, the height of the very tall person would make the overall mean larger than it should be. That means the seat you design would be larger than necessary, and your boss would not be happy, as there would be fewer seats, meaning fewer passengers, meaning less profit.

Again, this outlier is a genuine data point. But there are cases when we’re analyzing a data set that we need to remove extreme values so we don’t get false conclusions. While some outliers are genuine data points, sometimes outliers indicate to us an error or misrepresentation. And it’s good to check and see if there’s been an error in recording the data at these points. In our next example, we’ll see how we might spot a potential outlier from different graphs.

The table below shows the number of messages exchanged on the smart phones of 14 students over a single month. The data has also been plotted on a dot plot. Are there any outliers in this data set? If so, specify the value or values of these outliers.

If we’re just given a table of data, the outlier isn’t always apparent. One benefit of a graph like this is we’re able to see the data points in relation to each other. We can very quickly see that we have one data point that is far away from most of the others. The data points here primarily fall between 2800 and 5500 with only one point falling outside this range. On the graph, we can identify this as the data point 9754. In this data, there is one outlier, and it’s 9754.

This question is only asking us to identify the outlier. We need to note that if we were analyzing this data, we would have to decide whether this point should be included in our analysis or not. We might ask ourselves something like: Is it feasible for a student to exchange 9754 messages in a month or has an error been made in the recording of this data? Once we answered some questions like that, we would be able to make a final decision about whether or not to include this value in the final analysis. Either way, it’s right to label it as an outlier.

In our next example, we’ll consider a different kind of diagram.

Which of the statements is correct for the distribution represented by the diagram? (A) The distribution is symmetric. (B) The distribution has an outlier at six. (C) The distribution has a gap from 21 to 29. (D) The distribution has a cluster from seven to 20. Or (E) the distribution has a peak at 22.

To find which of these is correct, we’ll consider them one at a time. First, the distribution is symmetric. To find out if this distribution is symmetric, you can sketch a line over the distribution. This image is not symmetric. The six is stretching out the image. And so, we can’t say that the distribution is symmetric.

Now, let’s consider statement (B) the distribution has an outlier at six. To consider a six as an outlier, we want to look at the spread of the data points. All of the other data points fall between 21 and 29. The distance from 6 to 21 is 15. We can say that the data point six is quite far removed from the rest of the data. And therefore, it is a true statement to say that this distribution has an outlier at six. But we want to go ahead and check the other three statements.

The distribution has a gap from 21 to 29. Since the majority of the points fall between 21 and 29, there is no gap there. Option (D) the distribution has a cluster from seven to 20. There are, in fact, no data points from seven to 20, which means there can’t be a cluster there. And finally, option (E) the distribution has a peak at 22. If we look closely at 22, there’s only one data point there. Looking across our diagram, we see that the peak happens at 26, which means option (E) is not correct, leaving us with only one true statement. The distribution has an outlier at six.

Here’s another data set to consider.

The data in the table below is the average recorded speed in miles per hour of the first serve of the top 10 tennis players in the world. For part (1), calculate the mean first serve speed in miles per hour. For part (2), calculate the mean first serve speed again, this time ignoring the 1025 miles per hour data point. For part (3), by comparing the means you found in the first two parts of the question, make a conclusion about the validity of the 1025 miles per hour data point.

Beginning in part (1), we need to calculate the mean speed, the average speed for these 10 players. To calculate the mean speed, we’ll need to sum all of the speeds together and divide that value by the number of players. This means we’ll add up all 10 of the values from the table and then divide by 10. When we do that, we get 2107 over 10, which then becomes 210.7 miles per hour. If we consider all 10 data points in this table and average them, we come up with an average of 210.7 miles per hour. That’s part (1).

In part (2), we want to do the same thing, but we want to ignore this 1025 data point. By ignoring that data point, we now are only averaging nine of the players together. When we sum the remaining nine, we get 1082 over nine. When we divide that out, we get 120.2 miles per hour, rounded to one decimal place.

For part (3), we’ll need to compare these two values. When we included 1025, we found the average speed was 210.7 miles per hour. But when we look at the table, 210.7 is significantly higher than nine of the 10 values. The only value above it is 1025, which is itself substantially far from 210.7. So, 210.7 is not a good representation of the average of these speeds.

If we think about the second average where we ignored the 1025, we found an average of 120.2. Looking at our table, four of the values are below 120.2, and five of the values are above 120.2. But all nine of these values are very near 120.2. And so, we could say that 120.2 is a much fairer representation of the average. But it’s also worth considering, at this point, how did this value of 1025 come into our table?

Is it realistic to think that the fastest tennis server in the world is 10 times faster than anyone else in the world? At minimum, we say that 1025 is an outlier, but a reasonable conclusion is that it must in fact be an error in the table. Our summary of part (3) can say 1025 is an outlier for this data set and is likely an error.
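If we wanted to check these two means with a few lines of Python, a minimal sketch might look like this. The table itself isn’t reproduced in this transcript, so the sketch works only from the totals quoted above, 2107 for all 10 players and 1025 for the suspect value. The variable names are just illustrative.

```python
# Totals quoted in the worked example; the individual speeds are not listed here.
total_all = 2107      # sum of all 10 recorded speeds (mph)
suspect_value = 1025  # the data point under question (mph)
n_players = 10

# Mean of all 10 values, then the mean of the remaining 9.
mean_with = total_all / n_players
mean_without = (total_all - suspect_value) / (n_players - 1)

print(f"Mean with the 1025 value:    {mean_with:.1f} mph")
print(f"Mean without the 1025 value: {mean_without:.1f} mph")
```

Removing a single value moves the mean by about 90 miles per hour, which is the kind of shift that suggests the value deserves a closer look.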

In our past three examples, we’ve solved by looking at data on a graph to find out whether or not there were outliers. But we also can confirm by calculation whether or not a point is an outlier. Let’s look at how we would do that now.

To do these calculations, we’ll need the interquartile range and the lower and upper quartiles. So, first, let’s remind ourselves of what these are.

The interquartile range, or IQR, of a data set is a measure of how the data values are spread around the center of the data set. The first or lower quartile, 𝑄 one, marks the center of the lowest half of the data set. So, 25 percent of the data sits below 𝑄 one, and 75 percent of the data sits above 𝑄 one. The second quartile, 𝑄 two, is the median and marks the middle of the data set. 50 percent of the data falls below 𝑄 two, and 50 percent of the data falls above 𝑄 two.

And the third or upper quartile, 𝑄 three, marks the center of the top half of the data set. 75 percent of the data lies below 𝑄 three and 25 percent above it. And the interquartile range is equal to the upper quartile minus the lower quartile, 𝑄 three minus 𝑄 one. It represents a measure of the middle 50 percent of the data.

Using this information, we can find out how to identify outliers in a data set. To identify outliers by calculation, a data point is considered an outlier if it is either greater than quartile three plus 1.5 times the interquartile range or less than quartile one minus 1.5 times the interquartile range. Sometimes this is called the 1.5-times-IQR rule. So, let’s see an example of a data set where we can use this 1.5-times-IQR rule.
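Before the example, here is the rule written as a small Python sketch. The function names `outlier_bounds` and `is_outlier` are our own, and the quartile values passed in at the bottom are hypothetical, chosen only to show how the rule behaves. It assumes the quartiles have already been found.

```python
def outlier_bounds(q1, q3, k=1.5):
    """Return the (lower, upper) cut-offs for the k-times-IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def is_outlier(x, q1, q3, k=1.5):
    """True if x is strictly below the lower cut-off or strictly above the upper one."""
    lower, upper = outlier_bounds(q1, q3, k)
    return x < lower or x > upper


# Hypothetical quartiles, used only to illustrate the rule.
lower, upper = outlier_bounds(q1=10, q3=20)
print(lower, upper)
print(is_outlier(40, q1=10, q3=20))
print(is_outlier(30, q1=10, q3=20))
```

Note that a value sitting exactly on a cut-off is not counted as an outlier here, since the rule says greater than or less than.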

The numbers of matches won by 12 teams in the national league are 11, five, six, six, nine, 10, 19, 14, 11, nine, nine, and six. Is it true or false that 19 is an outlier of the data?

To identify whether 19 is an outlier or not, we’ll need the interquartile range. And to do that, we’ll have to identify quartile one and quartile three. This means our first step is to put the data in order of size. In size order, our 12 data points are five, six, six, six, nine, nine, nine, 10, 11, 11, 14, and 19.

We know that the median will come in the middle of these 12 data points and that the median is quartile two. 𝑄 one is the middle of the lower half of the data. Since there are six data points below the median, 𝑄 one will be located between the third and fourth. And similarly, 𝑄 three is the middle of the upper half of the data. There are six points above quartile two. And that means 𝑄 three will be located in the middle of those. It will be between the ninth and 10th value.

Because the third and fourth values are both six, quartile one is six. And because the ninth and 10th values are both 11, quartile three is equal to 11. The interquartile range equals 𝑄 three minus 𝑄 one. For us, that’s 11 minus six. And so, we have an IQR of five. To find out if 19 is, in fact, an outlier, we’ll use the 1.5-times-IQR rule. This rule tells us that a value is an outlier if it’s greater than 𝑄 three plus 1.5 times the IQR or less than 𝑄 one minus 1.5 times the IQR.

Since we’re looking at a data point that’s above 𝑄 three, we’ll look for the greater-than option. And that means we want to know is 19 greater than the quartile three plus 1.5 times the interquartile range? The IQR is five. 𝑄 three is 11. 1.5 times five is 7.5, plus 11 equals 18.5. 19 is greater than 18.5. And so, we can say it’s a true statement that 19 is an outlier of this data set.
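The same steps can be sketched in Python using the 12 values from the question. This is only one way to do it. It finds each quartile as the median of one half of the ordered data, as we did above, and it assumes an even number of values so that the two halves split cleanly. Other quartile conventions exist and can give slightly different results on other data sets.

```python
from statistics import median

# Matches won by the 12 teams, as given in the question.
wins = [11, 5, 6, 6, 9, 10, 19, 14, 11, 9, 9, 6]

ordered = sorted(wins)
half = len(ordered) // 2  # assumes an even number of values

# Q1 is the median of the lower half, Q3 the median of the upper half.
q1 = median(ordered[:half])
q3 = median(ordered[half:])
iqr = q3 - q1
upper_bound = q3 + 1.5 * iqr

print(ordered)
print(q1, q3, iqr, upper_bound)
print(19 > upper_bound)
```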

We’ll now look at one final example.

The table shows the heights in meters of the tallest buildings in a city. If there are any outliers in the data, find their values.

Because we’re just given a table of data and we want to find out if there are any outliers, we can use the 1.5-times-IQR rule. An outlier 𝑥 would be less than 𝑄 one minus 1.5 times the IQR, or it would be greater than 𝑄 three plus 1.5 times the IQR.

Our first step here is to calculate the interquartile range and find these boundaries. And to do that, the first thing we do is put the data in size order. We also know that each quartile is 25 percent of the data. That would be one-fourth of the data. Since we have 12 building heights, we can divide 12 by four, which is three. And that means our first quartile will occur after the third data point, our second quartile after the sixth data point, and our third quartile after the ninth data point.

Since quartile one is between the third and fourth data point, we need to average the third and fourth data points to find its value. 𝑄 one equals 561 plus 607 divided by two, which is 584. We need to do the same thing for 𝑄 three. We average the ninth and 10th data points, 714 plus 725 divided by two, which is 719.5. The interquartile range is 𝑄 three minus 𝑄 one, for us, 719.5 minus 584, which equals 135.5.

Let’s make a list of what we know. 𝑄 one is 584. 𝑄 three is 719.5. And our IQR is 135.5. We’re now ready to go back and use these rules to calculate the upper and lower bounds for outliers. An outlier on the low end will be a value 𝑥 less than 𝑄 one minus 1.5 times the IQR. And an outlier on the high end will be a value 𝑥 greater than 𝑄 three plus 1.5 times the IQR. For the lower bound, let’s plug in the values we have, 𝑄 one, 584, and the IQR, 135.5. When we calculate 584 minus 1.5 times 135.5, we get 380.75.

And that means in order for there to be an outlier on the low end, it needs to be less than 380.75. Our smallest data point is 502. And that means we don’t have an outlier on the lower end. We’ll check the upper end. Plug in the values for 𝑄 three and the IQR. And we find that the upper bound for outliers is 922.75. In order for there to be an outlier on the upper end, it would need to be greater than 922.75. Our largest data point is 901, which is less than this value. And since none of our data values are less than the lower bound for the outliers or greater than the upper bound for outliers, there are no outliers in this data set.
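As a quick check of this arithmetic, here is a short Python sketch. The full table isn’t reproduced in this transcript, so it uses only the quartiles and the smallest and largest heights quoted above. Checking just those two extremes is enough, because if they fall inside the bounds, every other value does too.

```python
# Values taken from the worked example; the full table is not reproduced here.
q1, q3 = 584, 719.5
smallest, largest = 502, 901

iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Only values strictly outside the bounds count as outliers.
has_low_outlier = smallest < lower_bound
has_high_outlier = largest > upper_bound

print(iqr, lower_bound, upper_bound)
print(has_low_outlier, has_high_outlier)
```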

We can summarize with a few key points. An outlier or extreme value in a data set is a data point whose value is either much smaller or much larger than the majority of the data set. Mathematically, we calculate outliers with the 1.5-times-the-IQR rule. A data point is classified as an outlier if it is either less than 𝑄 one minus 1.5 times the IQR or greater than 𝑄 three plus 1.5 times the IQR. And finally, potential outliers can be identified using a graph of the data set.
