In this explainer, we will learn how to identify outliers from a data set.
Sometimes in a data set there are data points whose values are much bigger or much smaller than the main group of data. Such data points are called “outliers” or “extreme values.” In the graph below, most of the data has values between about 15 and 50. The data point with the value near 100 is an outlier, since its value is substantially larger than the rest of the data points.
An outlier can be a genuine data point; for example, there are people who are much taller than the average height of a human. But an outlier may also be a misrepresentation or an error.
It is important to consider outliers when analyzing a data set, since extreme values can lead us to false conclusions about our data set. For example, suppose you are an airplane seat designer. To design the passenger seats, you need to know the mean (average) height of an adult person.
If you were to use the heights of all the people in the picture above to calculate the mean height, the height of the very tall person would make the overall mean larger than it should be. This would give you a false impression of the mean height. Your seats would then be larger than necessary, and your boss would not be happy: there would be fewer seats, meaning fewer passengers and less profit!
The next two examples show how we might spot a potential outlier from different graphs of data.
Example 1: Spotting a Potential Outlier from a Graph
The table below shows the number of messages exchanged on the smart phones of 14 students over a single month. The data have also been plotted in a dot plot.
Are there any outliers in this data set? If so, specify the value or values of these outliers.
If we consider the graph, we can see that one data point at the far right end is a long way from the rest of the data. The value appears to be fairly close to 10 000, and looking in the table, we can see that there is one data point with the value 9 754.
Given that the rest of the data appear to have values less than 5 500, we can conclude that 9 754 is an outlier.
If we were analyzing this data, we would have to consider whether this point should be included in our analysis or not. We might ask ourselves, “Is it feasible for a student to exchange 9 754 messages in a month, or has an error been made in recording the data?”
Example 2: Checking for Outliers in a Graph
Which of the statements is correct for the distribution represented by the diagram?
- (A) The distribution has an outlier at 6.
- (B) The distribution is symmetric.
- (C) The distribution has a gap from 21 to 29.
- (D) The distribution has a cluster from 7 to 20.
- (E) The distribution has a peak at 22.
To find which of the statements is correct, we will consider them one by one.
(A) This is the statement: The distribution has an outlier at 6. Looking at the graph, there is a single data point marked with a cross above the value “6.”
The next value on the axis with a cross above it, signifying a data point, is 21; between the data point at 6 and this one, there are no data points.
Given that the rest of the data clusters between 21 and 29, we can say that the data point at 6 is quite far removed from the rest of the data. Hence, the distribution does have an outlier at 6 and answer (A) is correct.
(B) This is the statement: The distribution is symmetric. If we trace out the distribution, we can see that, with the outlier at 6, the shape of the distribution is not symmetric.
Answer (B) is then incorrect. In fact, the distribution is stretched by the data point at 6.
Note that if we were to remove this data point, all of our remaining data would be between 21 and 29.
Tracing the distribution of this new data set, without the point at 6, we can see that it is fairly symmetric, or at least more so than the original data set. It is clear then that the outlier at 6 distorts the distribution of the original data set.
(C) This is the statement: The distribution has a gap from 21 to 29. This is clearly untrue as in fact the majority of the data lie between 21 and 29.
(D) This is the statement: The distribution has a cluster from 7 to 20. This is clearly untrue. There is a cluster of data points, but the cluster is from 21 to 29.
(E) This is the statement: The distribution has a peak at 22. This statement is incorrect. There is a single data point with the value 22 and this does not represent a peak. The value with the most data points above it on the graph is 26, with 3 data points valued at 26.
Tracing out the curve of the distribution, we can see that above 26 is the highest point on the curve. We can therefore say that there is a peak at 26, but not at 22.
In conclusion, answer (A) is the only correct answer: The distribution has an outlier at 6.
Let us look at an example to see how an outlier can affect calculations in a data set.
Example 3: How an Outlier Can Affect a Data Set
The data in the table below is the average recorded speed, in miles per hour, of the first serve of the top 10 tennis players in the world.
- Calculate the mean 1st serve speed in miles per hour.
- Recalculate the mean 1st serve speed without the data point 1 025 mph.
- By comparing the means you found in the first two parts of the question, make a conclusion about the validity of the 1 025 mph data point.
To calculate the mean 1st serve speed, we need to add up all the speeds and divide by the number of players whose speeds are in the data set. The sum of all the speeds is $2\,107$. Hence, we have
$$\text{mean} = \frac{2\,107}{10} = 210.7.$$
The mean 1st serve speed for the top ten tennis players is therefore 210.7 miles per hour.
To recalculate the mean without the value of 1 025 miles per hour, we have the new sum $2\,107 - 1\,025 = 1\,082$. We now only have 9 values, so our new mean speed is
$$\text{mean} = \frac{1\,082}{9} \approx 120.2.$$
With the value 1 025 mph included in our calculation, we found a mean 1st serve speed of 210.7 mph. Apart from the recorded speed of 1 025 miles per hour, this mean is substantially higher than any of the other speeds recorded. Since the mean is a measure of the center of the data, a mean this high does not make sense as it is not close in value to any of the data.
Also, considering the recorded speed of 1 025 mph, it does not seem possible for even one of the top players in the world to have an average first serve speed of 1 025 mph! We must conclude that this value is an outlier and must in fact be an error. Calculating the mean without this value, we find the mean 1st serve speed is 120.2 mph, which makes much more sense, since its value is close to the majority of the data.
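The comparison of the two means can be checked with a few lines of Python. This is only a minimal sketch: the individual speeds from the table are not used, the total of 2 107 mph is an assumption inferred from the stated mean of 210.7 mph over 10 players, and the variable names are illustrative.

```python
# Total of all 10 recorded speeds in mph.
# Assumption: inferred from the stated mean (210.7 mph x 10 players),
# not read from the table itself.
total_speed = 2107
player_count = 10
suspect_value = 1025  # the data point whose validity is in question

# Mean with every value included.
mean_with = total_speed / player_count

# Mean after removing the suspect value: one fewer data point.
mean_without = (total_speed - suspect_value) / (player_count - 1)

print(f"Mean with the 1 025 mph value:    {mean_with:.1f} mph")
print(f"Mean without the 1 025 mph value: {mean_without:.1f} mph")
print(f"Shift caused by a single value:   {mean_with - mean_without:.1f} mph")
```

A single value that moves the mean by roughly 90 mph is a strong hint that the value deserves a closer look.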
We can often see, just by looking at the data or a graph of a data set, whether or not there are outliers, but we can also confirm by calculation whether or not a point is an outlier. We will need the interquartile range (IQR) and the lower and upper quartiles (Q1 and Q3) in our calculations so let us first remind ourselves of what these are.
The Quartiles and Interquartile Range of a Data Set
The interquartile range (or IQR) of a data set is a measure of how the data values are spread around the center of the data set.
- The first (or lower) quartile (Q1) marks the center of the lowest half of a data set. So, 25% of the data sit below Q1, and 75% of the data sit above Q1.
- The second quartile (Q2), which is the median, marks the middle of a data set. So, 50% of the data set is below the median and 50% is above the median.
- The third (or upper) quartile (Q3) marks the center of the top half of a data set. So, 75% of the data set is below Q3 and 25% is above it.
- The interquartile range, IQR, is given by $\text{IQR} = \text{Q3} - \text{Q1}$. The interquartile range is a measure of the spread of the middle 50% of the data.
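One common way to find these values is to order the data, split it into a lower half and an upper half, and take the median of each half. The Python sketch below assumes that convention. Other conventions exist and can give slightly different quartiles, and for an odd number of values this sketch leaves the median out of both halves, which is also an assumption. The function name `quartiles` and the sample data are illustrative only.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two data points")

    half = n // 2
    lower_half = ordered[:half]
    # For an odd count, skip the middle value so it is in neither half.
    upper_half = ordered[half:] if n % 2 == 0 else ordered[half + 1:]

    return median(lower_half), median(ordered), median(upper_half)


# Illustrative data, not taken from the examples in this explainer.
sample = [1, 2, 3, 4, 5, 6, 7, 8]
q1, q2, q3 = quartiles(sample)
iqr = q3 - q1
print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}, IQR = {iqr}")
```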
How To: Identifying Outliers by Calculation
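To identify outliers in a data set, we use the interquartile range (IQR) and the lower and upper quartiles, Q1 and Q3. A data point is an outlier if it is either
$$\text{less than } \text{Q1} - 1.5 \times \text{IQR} \quad \text{or} \quad \text{greater than } \text{Q3} + 1.5 \times \text{IQR}.$$
This is often called the 1.5 × IQR rule.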
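The rule, with its two boundaries Q1 − 1.5 × IQR and Q3 + 1.5 × IQR, can be sketched in Python as follows. The function names `outlier_bounds` and `find_outliers` and the sample numbers are hypothetical, chosen only for illustration; the quartiles are assumed to have been worked out already.

```python
def outlier_bounds(q1, q3, k=1.5):
    """Return the (lower, upper) boundaries for outliers.

    k is the multiplier applied to the IQR; 1.5 gives the usual rule.
    """
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, q1, q3):
    """Return the values lying strictly outside the two boundaries."""
    lower, upper = outlier_bounds(q1, q3)
    return [v for v in values if v < lower or v > upper]


# Illustrative numbers only: Q1 = 10 and Q3 = 20 give an IQR of 10.
lower, upper = outlier_bounds(10, 20)
print(f"Lower boundary: {lower}, upper boundary: {upper}")
print("Outliers:", find_outliers([-6, 4, 12, 18, 35, 36], 10, 20))
```

Note that a value exactly equal to a boundary is not counted as an outlier here, since the rule asks for values less than the lower boundary or greater than the upper one.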
In our next examples, we will use the 1.5 × IQR rule to determine whether or not there are any outliers in the data sets.
Example 4: Identifying an Outlier Using the 1.5 × IQR Rule
The numbers of matches won by 12 teams in the national league are 11, 5, 6, 6, 9, 10, 19, 14, 11, 9, 9, and 6. Is it true or false that 19 is an outlier of the data?
To identify whether 19 is an outlier or not, we will need the interquartile range (IQR), which is $\text{Q3} - \text{Q1}$. So we must identify the quartiles Q1 and Q3. To do this, we first put the data in order of size. 5 is the smallest value so we put this first, followed by the three 6s and so on until we have
$$5,\ 6,\ 6,\ 6,\ 9,\ 9,\ 9,\ 10,\ 11,\ 11,\ 14,\ 19.$$
There are 12 data points in our set and since 25% (or one-quarter) of the data is below Q1 and 25% is above Q3, to find Q1 and Q3 we split the data set into quarters. Since $12 \div 4 = 3$, each quarter contains 3 values.
Q1 is the value at the boundary between the first 25% and the next 25% of values (i.e., between the 3rd and 4th values in the ordered list). Since the values on either side of Q1 are both 6, so is Q1. Similarly, Q3 is between the 9th and 10th values, and since the values on either side of Q3 are both 11, Q3 is also 11. The interquartile range is then
$$\text{IQR} = \text{Q3} - \text{Q1} = 11 - 6 = 5.$$
We have $\text{Q3} = 11$ and $\text{IQR} = 5$, so we can now work out whether or not the value 19 is an outlier. If it is greater than $\text{Q3} + 1.5 \times \text{IQR}$, it is an outlier:
$$\text{Q3} + 1.5 \times \text{IQR} = 11 + 1.5 \times 5 = 18.5.$$
Since a data point in the top quarter must be greater than 18.5 to be classified as an outlier, and 19 is greater than 18.5, we can say that 19 is indeed an outlier. The statement is true.
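The working for this example can be reproduced with a short Python sketch. It uses the 12 values given in the question and finds Q1 and Q3 as the medians of the lower and upper halves of the ordered data, which for this data set matches taking the value between the 3rd and 4th, and between the 9th and 10th, values. The variable names are illustrative.

```python
from statistics import median

# Matches won by the 12 teams, as listed in the question.
wins = [11, 5, 6, 6, 9, 10, 19, 14, 11, 9, 9, 6]

ordered = sorted(wins)
half = len(ordered) // 2

# With 12 values, each half has 6 values, so each median falls
# between the 3rd and 4th values of that half.
q1 = median(ordered[:half])
q3 = median(ordered[half:])
iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = [w for w in ordered if w < lower_bound or w > upper_bound]

print("Ordered data:", ordered)
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"Boundaries: {lower_bound} and {upper_bound}")
print("Outliers:", outliers)
```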
Example 5: Finding Outliers in a Data Set Using the 1.5 × IQR Rule
The table shows the heights, in metres, of the tallest buildings in a city. If there are any outliers in the data, find their values.
To find whether there are any outliers in the data or not, we will calculate the two boundaries for outliers:
$$\text{Q1} - 1.5 \times \text{IQR} \quad \text{and} \quad \text{Q3} + 1.5 \times \text{IQR}.$$
If there are any values in the data set less than the first or greater than the second boundary, we can classify these values as outliers. We will need to work out the quartiles Q1 and Q3 and then the interquartile range (IQR) to calculate the boundaries. And to do this, we need first to put the data in order of size. The smallest value is 502, then 550, 561, and so on until we have the following list:
There are 12 building heights in the list and $12 \div 4 = 3$. So each group of 3 data points represents one quarter of the data set. The lower quartile, Q1, is halfway between the 3rd and 4th values and 25% of the data is below this. The upper quartile, Q3, is halfway between the 9th and 10th values and 25% of the data lies above this.
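To find Q1, we have taken the mean of the 3rd and 4th values with the result that $\text{Q1} = 584$. Similarly, for Q3, the mean of the 9th and 10th values gives us $\text{Q3} = 719.5$.
We can now use these values to find the IQR:
$$\text{IQR} = \text{Q3} - \text{Q1} = 719.5 - 584 = 135.5.$$
Now that we have Q1, Q3, and the IQR, we can use these to work out the boundaries for outliers. The lower bound is
$$\text{Q1} - 1.5 \times \text{IQR} = 584 - 1.5 \times 135.5 = 380.75.$$
There are no values in our data set that are less than 380.75. Hence, there are no outliers among the lower values in the data set. For the upper bound, we have
$$\text{Q3} + 1.5 \times \text{IQR} = 719.5 + 1.5 \times 135.5 = 922.75.$$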
There are no values in our data set that are greater than 922.75. Hence, there are no outliers among the higher values in the data set. If we consider the data in a dot plot as below, 901 is our highest, or maximum, value and 502 is our lowest, that is, the minimum value.
The value 901 looks as if it could be an outlier as it is quite far from where the rest of the data sits on the line.
However, if we mark our outlier boundaries on the plot, as below, we can see that 901 is to the left of the upper bound. That is, it is smaller than the upper bound; hence, it is not an outlier.
We can conclude that since no data values are either less than the lower bound for outliers or greater than the upper bound for outliers, there are no outliers in this data set.
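The final check can be sketched in Python using only the figures stated in the solution: the two boundaries, 380.75 and 922.75, and the minimum and maximum heights, 502 and 901. Every other height lies between the minimum and the maximum, so comparing just these two extremes with the boundaries is enough. The variable names are illustrative.

```python
# Boundaries for outliers, as found in the solution (metres).
lower_bound = 380.75
upper_bound = 922.75

# Smallest and largest building heights in the data set (metres).
minimum_height = 502
maximum_height = 901

# If neither extreme lies outside the boundaries, no value in between can.
low_outlier = minimum_height < lower_bound
high_outlier = maximum_height > upper_bound
has_outliers = low_outlier or high_outlier

print("Outlier among the lower values: ", low_outlier)
print("Outlier among the higher values:", high_outlier)
print("Any outliers in the data set:   ", has_outliers)
```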
We will conclude by reminding ourselves of the key points associated with outliers.
- An “outlier” or “extreme value” in a data set is a data point whose value is either much smaller or much larger than the majority of the data set.
- It is often possible to see, by looking at a data set or a graph of a data set, whether or not there are any potential outliers.
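- To identify outliers mathematically in a data set, we use the interquartile range, $\text{IQR} = \text{Q3} - \text{Q1}$. A data point is classed as an outlier if it is either less than $\text{Q1} - 1.5 \times \text{IQR}$ or greater than $\text{Q3} + 1.5 \times \text{IQR}$. This is often called the 1.5 × IQR rule.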
- Outliers can affect the results of calculations of summary statistics such as the mean of a data set, potentially distorting the information we might gain. Therefore, if there are one or more outliers in a data set, we should consider each of these values separately to determine the nature of the outlier.
An outlier could be the result of a human or computing error, in which case it should be removed from the data set. Or an outlier might be a genuine value, in which case the person analyzing the data must judge whether it is more useful to keep the data point in the set for further analysis or not.