Video Transcript
In this video, we’ll learn how to
find the Spearman’s rank correlation coefficient. You’ll already be familiar with the
concept of correlation. You’ll know that Pearson’s product
moment correlation coefficient can give an indication of the existence, strength,
and direction of a linear relationship between two quantitative, that is, numerical,
variables. But Pearson’s coefficient can only
be calculated for quantitative data. If our data is nonnumerical, that
is, descriptive or qualitative, and has some order or ranking, we can’t use
Pearson’s correlation coefficient, but we can use Spearman’s rank correlation
coefficient.
In this video, we’ll see how to
calculate Spearman’s correlation coefficient using the formula and to determine if
and what type of association we have between paired data sets. That’s bivariate data. We can also calculate Spearman’s
correlation coefficient for numerical data. And it’s useful in particular if we
have, for example, some outliers in our data set.
Remember that when we plot paired
numerical data on a scatter plot, we’re looking for a relationship between the two
variables. If we have a linear relationship,
we can use Pearson’s product moment correlation coefficient to determine the
strength and direction of the relationship. We know that if the correlation
coefficient is close to positive one, we have a strong direct or positive
correlation between the variables. And if 𝑟 is close to negative one,
we have a strong inverse or negative correlation. If 𝑟 is zero, we have no
correlation. And if our relationship is
nonlinear, then we can’t use Pearson’s correlation coefficient. And we recall that Pearson’s
correlation coefficient takes values from negative one to plus one.
Now, with ranked or ordered data,
again, the correlation coefficient lies between negative one and positive one. But now the interpretation is
slightly different. If Spearman’s rank correlation
coefficient is exactly one, we have perfect agreement or association
between the ranks, and a value close to one indicates strong agreement. If 𝑟 s is zero, then there is no
agreement or association between the ranks of our bivariate data. And if 𝑟 s is negative one, we
have perfect opposing or inverse association between the ranks of our bivariate
data.
Note also that sometimes Spearman’s
rank correlation coefficient is referred to as Spearman’s 𝜌. That’s the Greek letter 𝜌. To calculate Spearman’s
coefficient, if our data is not already ranked, this is our first step. We then find the differences
between the ranks, that’s 𝑑 𝑖, for each data pair. We then square each of these and
take the sum of the squares. And if 𝑛 is the number of
bivariate data points, 𝑖 takes values from one to 𝑛, and so we have 𝑛 differences
squared.
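For reference, the steps just described correspond to the formula that is read out later in this transcript ("one minus six times the sum of all the differences squared over 𝑛 times 𝑛 squared minus one"). Written in symbols, with 𝑟 s typeset as r_s and 𝑑 𝑖 as d_i, it would look like this:

```latex
r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^{2}}{n\left(n^{2} - 1\right)},
\qquad -1 \le r_s \le 1
```

Here d_i is the difference between the two ranks of the i-th data pair and n is the number of pairs.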
Now, considering this formula and
what we know about possible values of Spearman’s rank coefficient, that is, negative
one is less than or equal to 𝑟 s, which is less than or equal to one, is it true or
false that when the ranks of each of two corresponding elements in two groups of
data 𝑋 and 𝑌 are identical, Spearman’s rank correlation coefficient is equal to
one? Well, we know that Spearman’s rank
correlation coefficient is used to determine the relationship between the order or
ranking of bivariate data and that if the correlation coefficient is equal to one,
we have perfect correlation or perfect agreement. So let’s look at this from the
perspective of an example.
Suppose we have two judges, 𝑋 and
𝑌, ranking five cakes from best, which is one, to worst, which is five. And suppose the ranking of the
judges agrees exactly. If we now work out the
differences in ranks, then because the judges agree, the differences are all equal
to zero. And then, of course, all the
differences squared are equal to zero.
And remember that Spearman’s rank
correlation coefficient is one minus six times the sum of all the differences
squared over 𝑛 times 𝑛 squared minus one. And in our example, all the
differences squared are equal to zero. So the sum of the differences
squared is also zero. So that in our formula, in the
numerator, we have zero. We have five cakes, so 𝑛 is equal
to five. And our correlation coefficient is
one minus six times zero over five times five squared minus one. And since our second term is equal
to zero, since anything multiplied by zero is zero, our correlation coefficient is
equal to one.
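The cake example can be checked with a few lines of Python. This is only a minimal sketch: the function name `spearman_from_ranks` and the particular order in which the cakes are listed are assumptions made for illustration, since the transcript only says that the two judges agree exactly.

```python
def spearman_from_ranks(rank_x, rank_y):
    """Spearman's coefficient from two lists of ranks,
    using 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(rank_x)
    sum_d_squared = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))


# Hypothetical listing of five cakes: both judges give identical ranks.
judge_x = [1, 2, 3, 4, 5]
judge_y = [1, 2, 3, 4, 5]

r_s = spearman_from_ranks(judge_x, judge_y)
print(r_s)  # 1.0

# For contrast, completely reversed ranks give perfect inverse association.
r_s_reversed = spearman_from_ranks(judge_x, [5, 4, 3, 2, 1])
print(r_s_reversed)  # -1.0
```

Because every difference is zero in the first case, the fraction vanishes and only the leading one remains.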
So certainly, in our example, where
the ranks of the two groups are identical, Spearman’s rank correlation coefficient
is indeed equal to one. But if we think in more general
terms, the difference for each data pair is its rank in 𝑌 subtracted from its rank
in 𝑋. And if the two ranks agree, then
their difference is zero. And if this is true for all 𝑖,
then 𝑑 𝑖 squared is equal to zero so that the sum of the 𝑑 𝑖 squared is also
zero. And if the sum of the 𝑑 𝑖
squared, the differences squared is equal to zero, then our second term will always
be zero.
And if our second term is zero,
then Spearman’s rank correlation coefficient must be equal to one. So the statement that, when the
ranks of each of two corresponding elements in two groups of data 𝑋 and 𝑌 are identical,
Spearman’s rank correlation coefficient is equal to one, is true. In this example, we used data that
was already ranked. But more often than not, we start
off with a bivariate data set, and we have to rank the data ourselves.
Is it true or false that when
Spearman’s rank correlation coefficient for two groups of data equals one, it means
that the data points perfectly lie on a straight line?
We know that when Spearman’s rank
correlation coefficient is equal to one, we have perfect agreement between the ranks
of the data. And if Spearman’s rank correlation
coefficient is equal to one, then the term containing the sum of the differences
squared must be equal to zero. So let’s look at this final
example. Suppose we have the time in minutes
it took for five students to take a test and their marks as a percentage. And now suppose we rank both our
time and our marks, taking the shortest time and the lowest marks as one and the
highest as five. And now, if we calculate the
difference in ranks, each of the differences is zero because the ranks are in
perfect agreement.
Now, if we square all the
differences, each of these is equal to zero because zero squared is zero. And so the sum of the differences
squared is also zero. And if we put this into our
formula, the sum of the 𝑑 𝑖 squared is equal to zero, so the second term is equal
to zero as we would expect. But now suppose we plot our
original data. We can see from our scatter plot
that although Spearman’s rank is equal to one, the data points themselves do not lie
perfectly on a straight line. And this means that our statement
is false.
In general, the fact that the ranks
of the data are equal means that if we plotted the ranks, they would lie on a
perfect straight line. But this is not necessarily the
case for the original data.
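To see this numerically, here is a small sketch in Python. The times and marks below are hypothetical values chosen for illustration, because the transcript does not list the numbers shown on screen. The helper names `ranks` and `spearman` are also assumptions.

```python
def ranks(values):
    """Rank from lowest (1) to highest (n). Assumes no repeated values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def spearman(x, y):
    """Spearman's coefficient via 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    rank_x, rank_y = ranks(x), ranks(y)
    n = len(x)
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))


# Hypothetical data: test time in minutes and mark as a percentage.
times = [20, 25, 30, 40, 60]
marks = [40, 45, 70, 75, 90]

r_s = spearman(times, marks)
print(r_s)  # 1.0, the ranks agree perfectly

# Slopes between consecutive points (the data is listed in increasing time).
# On a straight line these slopes would all be equal.
slopes = [
    (marks[i + 1] - marks[i]) / (times[i + 1] - times[i])
    for i in range(len(times) - 1)
]
on_a_line = len(set(slopes)) == 1
print(slopes)     # [1.0, 5.0, 0.5, 0.75]
print(on_a_line)  # False
```

The ranks line up exactly, yet the slopes between neighbouring points differ, so the original data does not sit on a single straight line.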
Let’s look now at an example of how
to calculate Spearman’s rank correlation coefficient for some quantitative bivariate
data.
Find the Spearman’s rank
correlation coefficient between the product price and its lifetime from the given
data. Round your answer to four decimal
places.
We’re given a table with lifetime
in years and price in dollars. And we’re asked to find Spearman’s
rank correlation coefficient between the paired data. We use the term paired because each
pair of data refers uniquely to one product so that the product with a lifetime of
one year has a price of 79 dollars, for example. Now to use the given formula to
calculate Spearman’s rank correlation coefficient, we need to know the number of
pairs of data 𝑛. And we need to know the difference
in ranks for each pair of data, and we then work out the sum of the differences
squared.
Now, since the lifetime data is
actually ordered sequentially already, that is, it goes from one to six with no
omissions, the lifetime data is already ranked. So we can simply use the data
itself as the rank. However, for the sake of clarity,
let’s write this down again in a new row. And next we need to rank our price
data. Noticing that a low price
corresponds to a low rank in lifetime, we can begin our price ranking at one also so
that we rank the price 79 as one. Our next lowest price is 103
dollars, which can be ranked as second. Our third lowest is 105, which is
ranked third, and so on so that 125 is ranked fourth, 160 dollars is ranked fifth,
and 214 dollars is ranked sixth.
Our next step is to find the
difference in ranks for each pair of data. We subtract the price rank from the
lifetime rank so that, in the first column, we have one minus one is equal to
zero. And for a lifetime of five years
and a price of 160 dollars, we have five minus five is equal to zero. Next, four minus four is equal to
zero, two minus three is negative one, six minus six is zero, and three minus two is
equal to positive one. Our next calculation is the
difference in ranks squared so that we have zero squared is zero and so on for the
rest of our differences. And now to use our Spearman’s rank
correlation coefficient, we need the sum of the differences squared, that is, zero
plus zero plus zero plus one plus zero plus one, which is equal to two.
It’s worth noting at this point
that if we were to sum the differences in ranks, we get zero, and this should always
be the case. In our case, we have zero plus zero
plus zero plus negative one plus zero plus positive one, and that’s equal to
zero. In order to use the formula, we
also need to know the number of data pairs, and we have six data pairs so 𝑛 is
equal to six.
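A short derivation shows why the differences in ranks always sum to zero. The notation R(x_i) and R(y_i) for the two ranks of the i-th pair is introduced here only for this sketch. Each variable uses the places one to n exactly once (averaged ranks for ties preserve the same total), so both rank sums are equal:

```latex
\sum_{i=1}^{n} d_i
  = \sum_{i=1}^{n} R(x_i) - \sum_{i=1}^{n} R(y_i)
  = \frac{n(n+1)}{2} - \frac{n(n+1)}{2}
  = 0
```

This makes the sum of the differences a quick check on the ranking before squaring.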
So now making some room, we have
everything we need for our formula so that Spearman’s rank correlation coefficient
for this data is one minus six times two all over six times six squared minus
one. That is one minus 12 over six times 35, where six times 35 is 210, so we have
one minus 12 over 210, which is approximately equal to one minus
0.05714. This gives us Spearman’s rank
correlation coefficient approximately equal to 0.94286. And so to four decimal places,
Spearman’s rank correlation coefficient for this data is 0.9429. Since this value is very close to
positive one, we can interpret this as a very strong direct relationship or
association between a product lifetime in years and its price in dollars. That is, the higher the price, the
longer the product lasts.
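The whole calculation can be reproduced in Python as a check. This is a sketch under one assumption: the table itself is not included in the transcript, so the column order below is reconstructed from the rank differences read out above. The helper name `ranks` is also hypothetical.

```python
def ranks(values):
    """Rank from lowest (1) to highest (n). Assumes no repeated values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


# Column order reconstructed from the narrated differences (an assumption).
lifetime_years = [1, 5, 4, 2, 6, 3]
price_dollars = [79, 160, 125, 105, 214, 103]

lifetime_rank = ranks(lifetime_years)   # [1, 5, 4, 2, 6, 3]
price_rank = ranks(price_dollars)       # [1, 5, 4, 3, 6, 2]

# Difference in ranks for each pair: lifetime rank minus price rank.
d = [a - b for a, b in zip(lifetime_rank, price_rank)]
sum_d_squared = sum(di ** 2 for di in d)

n = len(lifetime_years)
r_s = 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))

print(d)               # [0, 0, 0, -1, 0, 1]
print(sum(d))          # 0
print(sum_d_squared)   # 2
print(round(r_s, 4))   # 0.9429
```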
It’s perhaps worth noting that had
our coefficient been negative at negative 0.9429, our interpretation would be the
exact opposite. In that case, we would interpret
the value as the higher the price, the shorter the lifetime. The relationship would still be
extremely strong since now negative 0.9429 is very close to negative one. But in this case, it would be an
inverse association. Often when we have bivariate data
that we wish to find Spearman’s rank correlation coefficient for, we find that we
have tied ranks.
This occurs when two or more data points in a set are
identical. Their rank is then the average of the place numbers they take up in the
ordered list. Suppose, for example, we have a
data set for the variable 𝑋 with values 20, 30, 20, 10, and five. If we wish to rank our data from
low to high, we note that five is the lowest value, so this comes with rank one. 10 is the next lowest value, so
this has rank two.
But now we have two values of 20 so
that the value of 20 takes up both third and fourth places in our ordered list. So we take the average of the place
numbers that these two 20s take up. That’s three plus four divided by
two and that’s equal to 3.5 so that both instances of 20 are ranked 3.5. And since third and fourth places
are now taken up, we rank our final piece of data fifth.
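The averaging rule for tied ranks can be sketched in Python as follows. The function name `average_ranks` is an assumption made for illustration. The data is the set 20, 30, 20, 10, five from the explanation above.

```python
def average_ranks(values):
    """Rank from low to high. Tied values share the mean of the
    place numbers they occupy in the ordered list."""
    ordered = sorted(values)
    result = []
    for v in values:
        first_place = ordered.index(v) + 1          # first place taken by v
        last_place = first_place + ordered.count(v) - 1  # last place taken by v
        result.append((first_place + last_place) / 2)
    return result


x = [20, 30, 20, 10, 5]
print(average_ranks(x))  # [3.5, 5.0, 3.5, 2.0, 1.0]
```

Both values of 20 take third and fourth places, so each receives (3 + 4) / 2 = 3.5, and the value 30 is ranked fifth.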
So let’s see how this works in an
example.
The table represents the power
output and rotor diameter of several helicopters. Find the Spearman’s rank
correlation coefficient, and round your answer to four decimal places.
We’re given a set of paired data
for the power output and rotor diameter of some helicopters. We use the term paired data because
each pair of data is unique to one helicopter. So, for example, the helicopter
with a power output of 1,218 kilowatts has a rotor diameter of 10.2 meters. And to calculate Spearman’s rank
correlation coefficient, we’ll use the formula given. In this formula, 𝑛 corresponds to
the number of data pairs. 𝑑 𝑖 corresponds to the difference
in ranks for each pair, where 𝑖 takes values from one to 𝑛, and we calculate the
sum of the differences squared.
The first thing we need to do then
is to rank each of our two data sets. And to do this, let’s make some
room. If we begin by ranking the power
output, we could start at either the lowest or the highest power output. It should make no difference to the
Spearman’s correlation coefficient, provided we stick to the same direction for the
rotor diameter rankings. So let’s start with the lowest power
output, which is 944, which we rank as one. And to avoid confusion later on,
let’s strike this out. Our next lowest power output is
1,218, so we can strike this out and rank this two. And the next lowest is 1,864, which
we can rank third. 3,324 can be ranked fourth, 3,552
is ranked fifth, 3,758 is ranked sixth, and our highest power output is 4,698, which
is ranked seventh.
And now for our rotor diameters,
our lowest value is 10.2 meters. But this occurs twice, so
effectively we have tied ranks for the first place. How this works statistically,
however, is we take the average of the places that these data points would
occupy. That is first and second places so
that the ranks of the two data points with values 10.2 are one plus two over two. That is the first place plus the
second place, divided by two, which is 1.5, so that both of our instances of a rotor
diameter of 10.2 meters are ranked 1.5. And we can strike these two
out.
Now, our third lowest value is 14,
so we can strike this out. And since first and second places
have already been taken by the 10.2 values, we must rank this third. Our next lowest value is 16.2,
which we rank fourth, followed by 16.3, which is ranked fifth, followed by 17.7,
which is ranked sixth, and finally 18.59, which is ranked seventh.
Now, to use our formula, we need
the differences in ranks squared for each data pair. So let’s first take the differences
in ranks. To do this, we subtract the
diameter rank from the power rank for each pair so that, in our first data column,
we have two minus 1.5, which is 0.5, for our next column, three minus three, which
is zero, one minus 1.5, which is negative 0.5. We have seven minus seven is zero,
five minus four is one, four minus five is negative one, and six minus six is
zero.
Our next step is to work out the
differences squared. In our first column, 0.5 squared is
0.25. In our second data column, zero
squared is zero. In our third column, negative 0.5
squared is 0.25. In our fourth column, zero squared
is zero. In our fifth column, one squared is
one. In our sixth column, negative one
squared is one. And in our final column, zero
squared is zero.
Now for our formula, we want the sum
of the differences squared. That is 0.25 plus zero plus 0.25
plus zero plus one plus one plus zero, which is 2.5. Now, before we use the formula,
let’s just check that the sum of the differences is equal to zero as it should
be. We have 0.5 plus zero plus negative
0.5 plus zero plus one plus negative one plus zero, and that is indeed equal to
zero.
Now we have seven pairs of data so
that our 𝑛 is equal to seven. And so Spearman’s rank correlation
coefficient is one minus six times 2.5 over seven times seven squared minus one. That is one minus 15 over 336. If you do this on your calculator,
it’s very important at this point to separate the one from the fraction. And to do this, we calculate 15
divided by 336; that’s approximately 0.04464. Subtracting this from one, Spearman’s rank correlation
coefficient for this data is 0.9554 to four decimal places.
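As a check on the helicopter example, the same steps can be run in Python. This is a sketch with two assumptions: the column order is reconstructed from the differences read out above, since the table is not part of the transcript, and the helper name `average_ranks` is hypothetical.

```python
def average_ranks(values):
    """Rank from low to high. Tied values share the mean of the
    place numbers they occupy in the ordered list."""
    ordered = sorted(values)
    result = []
    for v in values:
        first_place = ordered.index(v) + 1
        last_place = first_place + ordered.count(v) - 1
        result.append((first_place + last_place) / 2)
    return result


# Column order reconstructed from the narrated differences (an assumption).
power_kw = [1218, 1864, 944, 4698, 3552, 3324, 3758]
diameter_m = [10.2, 14, 10.2, 18.59, 16.2, 16.3, 17.7]

power_rank = average_ranks(power_kw)        # [2, 3, 1, 7, 5, 4, 6] as floats
diameter_rank = average_ranks(diameter_m)   # [1.5, 3, 1.5, 7, 4, 5, 6] as floats

# Difference in ranks for each pair: power rank minus diameter rank.
d = [p - q for p, q in zip(power_rank, diameter_rank)]
sum_d_squared = sum(di ** 2 for di in d)

n = len(power_kw)
r_s = 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))

print(d)               # [0.5, 0.0, -0.5, 0.0, 1.0, -1.0, 0.0]
print(sum(d))          # 0.0
print(sum_d_squared)   # 2.5
print(round(r_s, 4))   # 0.9554
```

This follows the formula used in the video. Statistical software that handles tied ranks by a different method may report a slightly different value for the same data.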
We complete this video by noting
some key points. Spearman’s rank correlation
coefficient applies to ordered bivariate data. It takes values from negative one
to positive one. A value of 𝑟 s close to positive or negative
one corresponds to strong direct or inverse agreement, and the sum of the
differences in ranks is always equal to zero.