In this video, we’re gonna learn about Pearson’s product moment correlation coefficient, or just Pearson’s correlation coefficient, for short. Now, you should already know a bit about correlation. For example, about positive and negative correlation, or direct and inverse correlation, and the fact that there are different strengths of correlation depending on how good your line of best fit is at predicting the value of one variable given the value of the other variable. Put simply, when you draw your line of best fit on a scatterplot, it will have either a positive or a negative slope or gradient. And the points will either be very close to that line or spread out further away. Pearson’s product moment correlation coefficient is a number that helps us to quantify and interpret the type and strength of the correlation on a scale of negative one through zero to positive one.
Now very quickly, you should recall that if your scatterplot looks something like this, then the higher values of 𝑥 tend to be associated with higher values of 𝑦; lower values of 𝑥 tend to be associated with lower values of 𝑦. And the line of best fit has a positive slope or gradient, so we call this positive correlation. And the fact that the points are all pretty close to that line of best fit, tells us that it’s strong correlation. And in this case, we’ve got a line of best fit that’s got a negative slope or gradient, and we can see that larger 𝑥 values tend to be associated with lower 𝑦 values and vice versa. So we call this “negative correlation”. And the fact that the points aren’t quite as close to the line of best fit as they were in the last example, tells us it’s slightly weaker correlation.
Now just making rough judgements about how close the points are to that line of best fit is not really very mathematical. So there are several different methods of calculating a correlation coefficient, a number that quantifies how strong the correlation is. Now they will come out with slightly different values but they all work on a scale from negative one for perfect negative correlation, which would look something like that with all points exactly on the line of best fit, through zero for a graph like that, where there’s no correlation at all, all the way up to positive one for a graph like that, where all the points sit exactly on the line of best fit. And that line of best fit has a positive slope.
And here are a few examples. So in the top left one, we’ve got a correlation coefficient of nought point seven six. And we can see that’s getting fairly close to one, and it is suggesting that there is a positive slope. And those points are fairly close to that line. And we can see that, as the correlation coefficient gets close to zero, there’s no clear trend there. It’s not clear that high values of 𝑥 are associated with high values of 𝑦 or low values of 𝑥 are associated with high values of 𝑦. It’s just a bit of a mess, a random splattering of points. Now somewhere in the middle, right around nought point four one here, we can see, if I split the data into four quadrants, so going through the mean 𝑥 point and the mean 𝑦 point here, we can see that there are slightly more bits of data in this and this quadrant. So there’s a very slight tendency towards higher 𝑥s being associated with higher 𝑦s and lower 𝑥s with lower 𝑦s and so on. But it’s not a strong correlation. So that lack of clarity in the correlation is reflected in this correlation coefficient of nought point four one. It’s not really close to zero. It’s not really close to one.
Let’s see an example then of how to use Pearson’s correlation coefficient to work out how much correlation we think there is between two sets of data. Now, we’ve got the gross domestic product, in billions of pounds, for the United Kingdom, and the number of fires that were deliberately started in England and Wales that year. And we’ve got this for nineteen fifty, nineteen sixty, nineteen seventy, nineteen eighty and nineteen ninety. So we’ve got five data points over a forty-year period and we’re interested in seeing whether there’s any correlation between the state of the economy, the size of the gross domestic product, and the number of fires that were started deliberately that year.
Now first of all, we’re gonna do a scatterplot of this data and draw in a line of best fit, so we can get a feel for what sort of correlation we get. Then we’re going to calculate the Pearson’s product moment correlation coefficient to quantify how much correlation we think there is between these two sets of data. When you’re choosing your variables, we normally have 𝑥 as the independent variable, the thing that we can change that’s causing things to happen, and 𝑦 as the dependent variable, the thing that’s affected by changing the value of 𝑥. Now the problem we’ve got with this set of data is this: is it the case that in excellent economy years people celebrate by going out and starting fires, or is it that the people going round deliberately starting fires are causing the UK economy to grow every year? Well, in reality, I suspect the two things are completely unrelated. One isn’t causing the other at all. So, just arbitrarily, I’m gonna use 𝑥 as the GDP figure and 𝑦 as the number of fires.
So this is what the scatterplot looks like, and it does suggest some sort of positive correlation. And if we added a line of best fit, that’s roughly what it would look like. Let’s go on and calculate the strength of the correlation in this case then. We usually call the Pearson’s product moment correlation coefficient 𝑟, and here’s the formula for working it out. Now that looks pretty horrible, but let’s go through and explain what everything means. This symbol here means “the sum of”. So this bit means the sum of the 𝑥𝑦s. So we take each 𝑥-value and multiply it by the corresponding 𝑦-value, and then we take those results and we add those all up. This bit is the sum of all the 𝑥-values, and this bit is the sum of all the 𝑦-values. So we multiply those together and then divide by the number of pieces of data that we have. For this bit, we take all of our 𝑥-values and we square them, and then we add up all those squares. For this bit, we’re just adding up all the 𝑥-values, and then we’re taking that answer and squaring it, and dividing by how many bits of data we’ve got. And then we do the same for the corresponding 𝑦-values. And then we multiply those two results together, and take the square root of that, and then complete this calculation.
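The on-screen formula isn’t captured in this transcript. Going by the spoken description, term by term, it is presumably the following, where 𝑛 is the number of data pairs and each sum runs over all of them:

```latex
r = \frac{\sum xy - \dfrac{\sum x \, \sum y}{n}}
         {\sqrt{\left(\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}\right)
                \left(\sum y^2 - \dfrac{\left(\sum y\right)^2}{n}\right)}}
```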
Now a shorter version of that formula is this one here: 𝑆𝑥𝑦 over the square root of 𝑆𝑥𝑥 times 𝑆𝑦𝑦. And in that case, all of this lot here is 𝑆𝑥𝑦, all of this lot here is 𝑆𝑥𝑥, and all of this lot here is 𝑆𝑦𝑦. Now don’t worry about this too much for now, but 𝑆𝑥𝑥 over the number of pieces of data you’ve got gives us an answer which is the variance of the 𝑥 data, a measure of the variability of the 𝑥-data. And 𝑆𝑦𝑦 over 𝑛 is the variance of the 𝑦-data or a measure of the variability of the 𝑦-data. And 𝑆𝑥𝑦 over 𝑛 is called the covariance of the 𝑥- and 𝑦-data, the measure of the variability of the 𝑥- and 𝑦-data together.
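As a rough sketch of how that shorter formula might be turned into code, here is one way to do it in Python. The function names `s_values` and `pearson_r` are made up for this illustration, and the little data set at the bottom is invented purely to check the arithmetic. It is not data from the video.

```python
from math import sqrt

def s_values(xs, ys):
    """Return (Sxx, Syy, Sxy) for paired lists of x- and y-values."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sxx = sum(x * x for x in xs) - sum_x ** 2 / n
    syy = sum(y * y for y in ys) - sum_y ** 2 / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    return sxx, syy, sxy

def pearson_r(xs, ys):
    """Pearson's r as Sxy over the square root of Sxx times Syy."""
    sxx, syy, sxy = s_values(xs, ys)
    return sxy / sqrt(sxx * syy)

# Invented, perfectly linear data (y = 2x), so r should come out as 1.
xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]
print(pearson_r(xs, ys))  # 1.0
```

Dividing each of the three values returned by `s_values` by 𝑛 would give the variances and the covariance described above.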
So let’s see how to use this all in practice. Well, first we need to extend the table of values that we had to start off with. We’re gonna add columns for the square of each 𝑥-value, the square of each 𝑦-value, and the product of each 𝑥 and 𝑦 pair. Now we’re gonna add a row to the bottom, where we’re gonna sum up all those values. So first, all of the 𝑥-values added together gives us three thousand five hundred and eighty-eight. If I add all the 𝑦-values together, I get forty-four thousand and thirty-eight. So now I’m gonna square each of the 𝑥-values in turn. So three hundred and sixty-nine squared is one hundred and thirty-six thousand one hundred and sixty-one, five hundred and fifteen squared is two hundred and sixty-five thousand two hundred and twenty-five, and so on. So if I add up all those 𝑥 squared values, I get two million nine hundred and forty-two thousand four hundred and four. Now, working through the 𝑦-values, five hundred and forty-five squared is two hundred and ninety-seven thousand and twenty-five, eight hundred and ninety-one squared is seven hundred and ninety-three thousand eight hundred and eighty-one, and so on. And if I add all those values up, I get eight hundred and ninety-two million seven hundred and forty-three thousand three hundred and ninety-six.
Now I need to multiply together the 𝑥- and the 𝑦-values. So three hundred and sixty-nine times five hundred and forty-five gives me two hundred and one thousand one hundred and five, five hundred and fifteen times eight hundred and ninety-one gives me four hundred and fifty-eight thousand eight hundred and sixty-five, and so on. And if I add up all the 𝑥𝑦-values, I get forty-four million four hundred and seventy-eight thousand nine hundred and sixty-nine.
So now I can calculate the 𝑆𝑥𝑥 value, so that’s the sum of the 𝑥 squareds minus the sum of 𝑥 all squared, and then, that’s divided by 𝑛, the number of pieces of data. Well I’ve got one, two, three, four, five bits of data. So 𝑛 equals five. And when I do that calculation, I get three hundred and sixty-seven thousand six hundred and fifty-five point two.
Now to work out 𝑆𝑦𝑦, that’s the sum of the 𝑦 squared terms minus the sum of the 𝑦-terms all squared over 𝑛. So, there’s my sum of 𝑦 squareds. That’s the sum of the 𝑦s and I’m gonna square that value. And 𝑛 is five, so I’m dividing by five. And that gives me five hundred and four million eight hundred and seventy-four thousand three hundred and seven point two.
And now, let’s work out 𝑆𝑥𝑦. Well, that’s the sum of the 𝑥𝑦 products, and then we’re taking away the sum of the 𝑥s times the sum of the 𝑦s all over 𝑛. So, the sum of the 𝑥𝑦s is that. The sum of the 𝑥s was three thousand five hundred and eighty-eight, and the sum of the 𝑦s was forty-four thousand and thirty-eight. So, we multiply those together and then divide that by 𝑛, which is five, and that gives us twelve million eight hundred and seventy-seven thousand three hundred point two.
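If you want to check those three results without a calculator, a few lines of Python will do it. This is only a sketch. The numbers are the column totals read out above, and the variable names are my own.

```python
# Column totals from the GDP and fires table, as read out in the video.
n = 5
sum_x = 3588          # sum of the x-values (GDP)
sum_y = 44038         # sum of the y-values (fires)
sum_x2 = 2942404      # sum of the x squared values
sum_y2 = 892743396    # sum of the y squared values
sum_xy = 44478969     # sum of the xy products

sxx = sum_x2 - sum_x ** 2 / n
syy = sum_y2 - sum_y ** 2 / n
sxy = sum_xy - sum_x * sum_y / n

# Rounded to one decimal place to hide floating-point noise.
print(round(sxx, 1), round(syy, 1), round(sxy, 1))
# 367655.2 504874307.2 12877300.2
```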
So, let’s write our 𝑆𝑥𝑥, 𝑆𝑦𝑦 and 𝑆𝑥𝑦 values down here. And I can assure you it’s just a pure coincidence that they all ended in nought point two. They don’t normally do that, and that’s just a quirk of these particular numbers.
Now if you are doing a test or an exam, at this point in the question, they might ask you for the variance of 𝑥, or the variance of 𝑦, or the covariance of 𝑥 and 𝑦. So what you do is take those values you’ve got there and divide them by 𝑛. So each of those values divided by five would give us these answers over here. But of course, that’s not what we’re looking for today. We’re looking for the Pearson’s product moment correlation coefficient, which is 𝑆𝑥𝑦 over the square root of 𝑆𝑥𝑥 times 𝑆𝑦𝑦. So plugging those values in for 𝑆𝑥𝑦, 𝑆𝑥𝑥, and 𝑆𝑦𝑦 gives us that calculation, which gives us an answer of nought point nine four five one seven six three one one seven, and so on and so on, as far as your calculator will go. Now typically, you’d give your answer to one or two decimal places, depending on what the question asked for. So rounding to two decimal places tells us that our Pearson’s product moment correlation coefficient is nought point nine five, which is very close to one. So that’s strong positive correlation.
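Here is that last step as a short Python sketch, starting from the three 𝑆 values worked out above. The variable names are just for illustration, and the variances and covariance use the divide-by-𝑛 definition described in this video.

```python
from math import sqrt

n = 5
sxx = 367655.2
syy = 504874307.2
sxy = 12877300.2

# Variances and covariance: each S value divided by n.
var_x = sxx / n
var_y = syy / n
cov_xy = sxy / n

# Pearson's r: Sxy over the square root of Sxx times Syy.
r = sxy / sqrt(sxx * syy)
print(round(r, 2))  # 0.95
```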
Now actually, the interpretation of that number is far from easy. Now some people have suggested this as a set of guidelines for which words you might use to describe which values. But it really depends on what type of data you’re using. And remember, an 𝑟-value of nought point four is still closer to zero, no correlation, than it is to one, perfect positive correlation. So even medium association isn’t suggesting much of a link between the two sets of data. But the 𝑟-value can be useful when comparing associations between different pairs of data sets. For example, is GDP more closely associated with the number of fires started deliberately in the country or the proportion of the working age population who are employed? The actual 𝑟-values may be difficult to interpret, but if one is higher than the other, then it indicates a greater degree of association between that pair of variables. Now it’s also true that there are lots of other statistical methods that can help you to interpret the significance of the association between variables. But they’re beyond the scope of this video. Also, one or two outliers or erroneous bits of data can have a huge effect on the line of best fit and the Pearson’s correlation coefficient score.
Take a look at these two scatterplots here. Now they’re basically the same data except the one on the left has got this outlying piece of data included, and that’s been removed from the one on the right. Now if that was a genuine piece of data, then we couldn’t remove it from our data set. But if it was a mistake, somebody had written down the wrong figure, then we would be okay to remove it. Now in the first case, we’ve got a correlation coefficient of nought point three six, which indicates weak positive correlation, or not very much correlation between the two. But if that was a dodgy piece of data and we got rid of it, our correlation coefficient would go up to nought point seven six, which would be quite strong positive correlation. Now if you didn’t feel that you could justify removing that piece of data based on the evidence that you had, then it may be more appropriate to use a different method of correlation analysis, such as Spearman’s rank correlation or Kendall’s tau correlation, because they’re less sensitive to individual extreme data values than the Pearson’s correlation coefficient is.
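To give a feel for why a rank-based method is less affected by extreme values, here is a minimal Python sketch. It assumes Spearman’s rank correlation can be computed as Pearson’s 𝑟 applied to the ranks of the data, which holds when there are no tied values, and this simple `ranks` helper doesn’t handle ties. The function names and the data are invented for illustration. They are not the data behind the two scatterplots.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r as Sxy over the square root of Sxx times Syy."""
    n = len(xs)
    sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    syy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Rank 1 for the smallest value; assumes no tied values."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def spearman_r(xs, ys):
    """Spearman's rank correlation: Pearson's r on the ranks (no ties)."""
    return pearson_r(ranks(xs), ranks(ys))

# Invented data: y always rises with x, but the last y-value is extreme.
xs = [1, 2, 3, 4, 5]
ys = [1, 8, 27, 64, 125]

print(pearson_r(xs, ys) < 1)   # True: the extreme value pulls r below 1
print(spearman_r(xs, ys))      # 1.0: the ranks line up perfectly
```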
So, back to our original question then. The Pearson’s correlation coefficient between the UK gross domestic product, in billions of pounds, and the number of fires that were started deliberately that year, in England and Wales, is nought point nine five. What does that actually mean? Well, it doesn’t imply causality. So, it’s not the case that when the GDP is good and high, then people deliberately go out and start fires. Even though it does seem to be the case that in years that the economy is doing well, there are more deliberately started fires. So you’re fairly limited in what you can actually say. There are all sorts of reasons why there may not be a real actual link between those two things, a causal link between them. For example, we haven’t got many data points, so it could be that we’ve just picked out some unrepresentative bits of data from the general situation. It could just be that there’s a coincidence in this data, or even that maybe there has been a change in the way that data has been recorded for the GDP or for the number of fires started deliberately, which makes it look like there’s an association between the two. It could even be the case that some other factor is causing both of these things to occur. Another thing to consider is that there may not be a linear relationship between these two, even though we’ve got a pretty high correlation coefficient. It could be that actually a curve of best fit would work much better. In fact, when I add in the data for the years two thousand and two thousand and ten, we can see that a curve would fit this data much better than a straight line. Again, it’s beyond the scope of this video to actually show you how to deal with that, but it’s certainly something that’s worth noticing and commenting on in your analysis of your data.
So, we’ve talked about how to calculate the value of the Pearson’s product moment correlation coefficient and what it means, and we’ve also included some reasons why you need to be very cautious about how you use it to interpret your data.
So I’ll just post more data on the screen now for you to have a go and try to do a calculation for yourself. So I recommend that you pause the video, and then I’ll go through the answer in just a moment. So you’ve gotta find the value of the Pearson’s product moment correlation coefficient for the data shown, giving your answer to two decimal places. Okay. So remember, the first thing we’ve got to do is add three columns for 𝑥 squared, 𝑦 squared, and 𝑥𝑦 and add a row at the bottom for all the totals. Now, we need to square all the 𝑥-values. So five squared is twenty-five, seven squared is forty-nine, and so on. Then we can square all the 𝑦-values. So three squared is nine, eight squared is sixty-four, and so on. And then, we can multiply all of the 𝑥 and 𝑦 pairs together. So five times three is fifteen, seven times eight is fifty-six, and so on. Then we need to add up each of those columns. And we can note that we’ve got five pieces of data, so 𝑛 is equal to five. Then, we can use each of these formulae to work out the 𝑆𝑥𝑥, 𝑆𝑦𝑦 and 𝑆𝑥𝑦 values.
So, starting with 𝑆𝑥𝑥, the sum of the 𝑥 squareds is four hundred and three, the sum of 𝑥 is forty-three, so I need to square that and then divide by five, the number of data points, and that gives us thirty-three point two. Then, working out 𝑆𝑦𝑦, the sum of the 𝑦 squared terms is two hundred and ninety; the sum of 𝑦 is thirty-six so we’ve got to square that. So two hundred and ninety minus thirty-six squared over the number of pieces of data, five, gives us thirty point eight.
See? I told you not all the numbers end in nought point two. And for 𝑆𝑥𝑦, we’ve got the sum of the 𝑥𝑦s, which is three hundred and thirty-eight, minus the sum of the 𝑥s times the sum of the 𝑦s. So that’s forty-three times thirty-six all over the number of pieces of data, which is five. That gives us twenty-eight point four. And to work out the Pearson’s product moment correlation coefficient, we take the 𝑆𝑥𝑦 value and divide it by the square root of the 𝑆𝑥𝑥 value times the 𝑆𝑦𝑦 value. So that’s twenty-eight point four over the square root of thirty-three point two times thirty point eight, which gives us, to two decimal places, nought point eight nine. Now my calculator said nought point eight eight eight one two four six eight two four and so on. But the question only asked for the answer to two decimal places.
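If you’d like to check your own working for this exercise, here’s a short Python sketch that starts from the column totals read out above, with 𝑆𝑦𝑦 equal to thirty point eight as calculated earlier. The full table isn’t reproduced in the transcript, so only the totals are used, and the variable names are my own.

```python
from math import sqrt

# Column totals for the practice data, as read out in the video.
n = 5
sum_x, sum_y = 43, 36
sum_x2, sum_y2, sum_xy = 403, 290, 338

sxx = sum_x2 - sum_x ** 2 / n        # about 33.2
syy = sum_y2 - sum_y ** 2 / n        # about 30.8
sxy = sum_xy - sum_x * sum_y / n     # about 28.4

r = sxy / sqrt(sxx * syy)
print(round(r, 2))  # 0.89
```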