Video: Calculating the Spearman’s Correlation Coefficient for a Given Dataset

Video Transcript

Find the Spearman’s correlation coefficient between sales and advertising from the given data.

So, we can see we’ve been given a table, which consists of four pairs of data for advertising and sales, presumably the amount of money spent on advertising and then the amount of money taken in sales, or perhaps the number of units sold. We’re asked to find the Spearman’s correlation coefficient between these two variables for this set of data.

Now, the Spearman’s correlation coefficient, or Spearman’s rank correlation coefficient as it’s also called, is a way of quantifying the degree of rank correlation between two variables. It measures the tendency for one variable to increase as the other does, although not necessarily in a linear way. For example, in this fictional data set here, we can see that every time there is an increase in 𝑥, there is an increase in 𝑦, although the relationship between the two variables is not a straight line. This would correspond to a value of one for the Spearman’s rank correlation coefficient, which we often denote as 𝑟 with a subscript 𝑠. And that’s the maximum value that it can take.

In fact, the value of this correlation coefficient is between negative one and positive one inclusive. A value of positive one means that there is perfect positive rank correlation between the pairs of data, and a value of negative one means the opposite: there is perfect negative rank correlation between the pairs of data. That means that the largest value in 𝑥 would be paired with the smallest value in 𝑦, the second largest value in 𝑥 would be paired with the second smallest value in 𝑦, and so on.
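To make the two extreme values concrete, here is a minimal Python sketch. The data (𝑥 from one to five, with 𝑦 equal to 𝑥 cubed or its negative) and the helper name `spearman_no_ties` are illustrative assumptions, not part of the video, and the helper only handles data with no tied values.

```python
from fractions import Fraction

def spearman_no_ties(xs, ys):
    """Spearman's coefficient via 1 - 6*sum(d^2) / (n*(n^2 - 1)).

    Hypothetical helper for illustration; assumes no tied values.
    """
    n = len(xs)
    # Rank 1 goes to the smallest value; with no ties, rank = position in sorted order.
    rank_x = {v: i + 1 for i, v in enumerate(sorted(xs))}
    rank_y = {v: i + 1 for i, v in enumerate(sorted(ys))}
    sum_d2 = sum((rank_x[x] - rank_y[y]) ** 2 for x, y in zip(xs, ys))
    return 1 - Fraction(6 * sum_d2, n * (n * n - 1))

xs = [1, 2, 3, 4, 5]
increasing = [x ** 3 for x in xs]      # always rises with x, but not in a straight line
decreasing = [-(x ** 3) for x in xs]   # always falls as x rises

print(spearman_no_ties(xs, increasing))   # 1
print(spearman_no_ties(xs, decreasing))   # -1
```

Neither relationship is linear, yet the ranks line up perfectly (or perfectly in reverse), which is all the coefficient looks at.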

The Spearman’s rank correlation coefficient doesn’t use the original raw data itself. Instead, it uses their ranks, which we’ll look at how to assign in a moment. We have a formula for calculating it. It is one minus six multiplied by the sum of 𝑑_(𝑖) squared over 𝑛 multiplied by 𝑛 squared minus one. Now, 𝑑_(𝑖) means the difference in the ranks for the 𝑖th pair of data. That’s the pair of data 𝑥_(𝑖), 𝑦_(𝑖). And 𝑛 is the number of pairs of data. So, in this case, 𝑛 will be equal to four. So, before we can calculate this correlation coefficient, we first need to assign the ranks to each of our variables. So, we add two new rows to our table for the advertising rank and the sales rank. It doesn’t matter whether we choose to assign rank one to the smallest or largest piece of data as long as we’re consistent about what we do for both variables. So, let’s choose to assign rank one to the smallest.
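Written out in symbols, the formula just described reads as follows. This is a transcription of the spoken formula, with the whole fraction subtracted from one:

```latex
r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n\left(n^2 - 1\right)}
```

Here 𝑑_(𝑖) is the difference in the ranks for the 𝑖th pair and 𝑛 is the number of pairs, so 𝑛 equals four for this data.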

Looking at the row of data for advertising, we can see that the smallest piece of data is this value here of 800. So, that gets rank one. We then see that we have two values which are the same. There are two values of 1000. Now, these would take the second and third places in an ordered list of the advertising data. So, we assign them each the average of these ranks. That’s the average of two and three, which is 2.5. As these pieces of data are the same, they both get the same rank. Finally, the largest piece of data for advertising is 1500. And this would be the fourth value in an ordered list. So, this gets rank four.

We then apply the ranks to the sales data in the same way. And straightaway, we see that the two smallest values are identical. They’re both 4500. These would be the first and second values in an ordered list of the sales data. So, they get an average rank of 1.5. That’s an average of one and two. We then assign rank three to the next smallest piece of data, that’s 5000, and finally rank four to the largest piece of data, that’s 6500. By assigning the ranks to the tied pieces of data in this way, we ensure that the sum of the ranks awarded to each variable will be the same. In this case, the sum is 10.
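The ranking rule for tied values can be sketched in Python as below. The table itself is not reproduced in the transcript, so the column order used here is an assumption reconstructed from the values and rank differences read out in the narration, and `average_ranks` is a hypothetical helper name.

```python
def average_ranks(values):
    """Rank 1 = smallest; tied values share the mean of the positions they occupy."""
    ordered = sorted(values)
    ranks = {}
    for v in set(values):
        first = ordered.index(v) + 1           # first 1-based position of v in the ordered list
        last = first + ordered.count(v) - 1    # last position occupied by v
        ranks[v] = (first + last) / 2          # average of the occupied positions
    return [ranks[v] for v in values]

# Column order reconstructed from the narration (an assumption).
advertising = [1000, 800, 1000, 1500]
sales = [5000, 4500, 4500, 6500]

advertising_rank = average_ranks(advertising)
sales_rank = average_ranks(sales)

print(advertising_rank)   # [2.5, 1.0, 2.5, 4.0]
print(sales_rank)         # [3.0, 1.5, 1.5, 4.0]
print(sum(advertising_rank), sum(sales_rank))   # 10.0 10.0
```

Both rows of ranks sum to 10, matching the check described in the transcript.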

Next, we need to calculate the difference in the ranks awarded to each pair of data. Now, it doesn’t matter which way around we subtract the ranks, although again we should be consistent for each pair. So, we’ll take the rank of the advertising data and subtract the rank of the sales data. This is going to give our values of 𝑑_(𝑖). Remember, 𝑑_(𝑖) was the difference in the ranks awarded to the 𝑖th pair of data. So, first, we have 2.5 minus three, which gives negative 0.5, then one minus 1.5, which also gives negative 0.5, 2.5 minus 1.5, which is one, and four minus four, which is zero.

Now, there’s a helpful check that we can perform at this stage. It should always be the case that the sum of the differences in the ranks is equal to zero. We can see that we have some positive and some negative values. And if we sum our values, we have negative 0.5 plus negative 0.5, which is negative one, plus one, which is zero. And then adding zero, we still have zero. So, it is indeed the case that the sum of this row of our table is zero, which gives us a little bit more confidence that what we’ve done so far is correct.

Finally, for the table, we need a row in which we calculate each of these differences squared because looking at our formula for the Spearman’s correlation coefficient, we can see that it uses the sum of 𝑑_(𝑖) squared, not the sum of 𝑑_(𝑖). And this is important because as we’ve seen, the sum of 𝑑_(𝑖) will always be equal to zero. So filling in this row of the table, we have negative 0.5 squared for each of the first two values, which is 0.25, one squared, which is one, and zero squared, which is zero. Now, it’s for this reason that it doesn’t actually matter which way around we subtract the ranks because we’re going to square the differences anyway. And whether we square negative 0.5 or 0.5, we’ll still get the same result of 0.25.
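The two new rows of the table, and the zero-sum check, can be reproduced with a short Python sketch. The rank lists are typed in directly from the values read out in the transcript; the variable names are illustrative.

```python
# Ranks as read out in the transcript, in the table's column order.
advertising_rank = [2.5, 1.0, 2.5, 4.0]
sales_rank = [3.0, 1.5, 1.5, 4.0]

# Difference in ranks for each pair: advertising rank minus sales rank.
d = [a - s for a, s in zip(advertising_rank, sales_rank)]
d_squared = [di ** 2 for di in d]

# Subtracting the other way round flips every sign but leaves the squares unchanged.
d_reversed_squared = [(s - a) ** 2 for a, s in zip(advertising_rank, sales_rank)]

print(d, sum(d))                   # [-0.5, -0.5, 1.0, 0.0] 0.0
print(d_squared, sum(d_squared))   # [0.25, 0.25, 1.0, 0.0] 1.5
print(d_squared == d_reversed_squared)   # True
```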

Next, we need to find the sum of these squared differences. So, we have 0.25 plus 0.25 plus one plus zero, which is equal to 1.5. And now, we’re ready to substitute into our formula for calculating the Spearman’s correlation coefficient. So, substituting the sum of 𝑑_(𝑖) squared equals 1.5 and 𝑛 equals four, we have that 𝑟 sub 𝑠 is equal to one minus six multiplied by 1.5 over four multiplied by four squared minus one. Now, just notice at this point that that one is not included in the numerator of the fraction. It’s one minus and then we have the fraction. A common mistake is to think that that fraction line extends all the way and that one is a part of the numerator. It isn’t. So, by drawing the one a little bit bigger, we’ve hopefully avoided making that mistake.

So, six times 1.5 is nine. And in the denominator, four squared is 16 minus one is 15. So, we have one minus nine over four multiplied by 15. We can simplify a factor of three in numerator and denominator of our fraction, giving one minus three over four times five or one minus three over 20. That gives 17 over 20. And if we convert to a decimal, we see that the value of the Spearman’s correlation coefficient between sales and advertising is 0.85. This tells us that there is a fairly strong degree of positive rank correlation between advertising and sales in this data set. As the value for advertising increases, so too does the value for sales in general, although not necessarily in a linear way. We could of course confirm this by producing a scatter plot of the two variables.
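The final substitution can be checked with exact fractions in Python. This is a sketch using the sum of squared differences and the value of 𝑛 stated in the transcript; it also evaluates the misreading warned about earlier, where the one is wrongly taken as part of the numerator.

```python
from fractions import Fraction

sum_d_squared = Fraction(3, 2)   # 1.5, the sum of squared rank differences
n = 4                            # number of pairs of data

# Correct reading: one minus the whole fraction.
r_s = 1 - (6 * sum_d_squared) / (n * (n ** 2 - 1))

# Common misreading: treating the one as part of the numerator.
misread = (1 - 6 * sum_d_squared) / (n * (n ** 2 - 1))

print(r_s, float(r_s))   # 17/20 0.85
print(misread)           # -2/15
```

One caveat: this shortcut formula is exact when no ranks are tied. With ties, as here, software that instead computes Pearson’s correlation on the average ranks can report a slightly different value, so a software check may not reproduce 0.85 exactly.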
