### Video Transcript

In this video, we’re gonna learn about Pearson’s product moment correlation
coefficient, or just Pearson’s correlation coefficient, for short. Now, you should already know a bit about correlation. For example, about
positive and negative correlation, or direct and inverse correlation, and the fact that there
are different strengths of correlation depending on how good your line of best fit is at
predicting the value of one variable given the value of the other variable. Put simply, when you draw your line of best fit on a scatterplot, it will
either have a positive or negative slope or gradient. And the points will either be very
close to that line or spread out further away. Pearson’s product moment correlation coefficient is a number that helps us to
quantify and interpret the type and strength of the correlation on a scale of negative
one through zero to positive one.

Now very quickly, you should recall that if your scatterplot looks something
like this, then the higher values of 𝑥 tend to be associated with higher values
of 𝑦; lower values of 𝑥 tend to be associated with lower values of
𝑦. And the line of best fit has a positive slope or gradient, so we call this
positive correlation. And the fact that the points are all pretty close to that line of best
fit tells us that it’s strong correlation. And in this case, we’ve got a line of best fit that’s got a negative slope
or gradient, and we can see that larger 𝑥 values tend to be associated with lower
𝑦 values and vice versa. So we call this “negative correlation”. And the fact that the points aren’t
quite as close to the line of best fit as they were in the last example tells us it’s
slightly weaker correlation.

Now just making rough judgements about how close the points are to that line
of best fit is not really very mathematical. So there are several different methods of
calculating a correlation coefficient, a number that quantifies how strong the correlation is.
Now they will come out with slightly different values but they all work on a scale from
negative one for perfect negative correlation, which would look something like that with all points exactly on the line of
best fit, through zero for a graph like that, where there’s no correlation
at all, all the way up to positive one for a graph like that, where all
the points sit exactly on the line of best fit. And that line of best fit has a positive
slope.

And here are a few examples. So in the top left one, we’ve got a correlation
coefficient of nought point seven six. And we can see that’s getting fairly close
to one, and it is suggesting that there is a positive slope. And those points are
fairly close to that line. And we can see that, as the correlation coefficient gets close to
zero, there’s no clear trend there. It’s not clear that high values of
𝑥 are associated with high values of 𝑦 or low values of
𝑥 are associated with high values of 𝑦. It’s just a bit of a
mess, a random splattering of points. Now somewhere in the middle, right around nought
point four one here, we can see, if I split the data into four quadrants, so going through the mean
𝑥 point and the mean 𝑦 point here, we can see that there are
slightly more bits of data in this and this quadrant. So there’s a very slight tendency
towards higher 𝑥s being associated with higher 𝑦s and lower
𝑥s with lower 𝑦s and so on. But it’s not a strong correlation. So that lack of clarity in the correlation is reflected in this correlation
coefficient of nought point four one. It’s not really close to zero.
It’s not really close to one.

Let’s see an example then of how to use Pearson’s correlation coefficient to
work out how much correlation we think there is between two sets of data. Now, we’ve got the gross domestic product, in billions of pounds, for the United
Kingdom, and the number of fires that were deliberately started in England and Wales that
year. And we’ve got this for nineteen fifty, nineteen sixty,
nineteen seventy, nineteen eighty and nineteen
ninety. So we’ve got five data points over a forty-year period and we’re
interested in seeing whether there’s any correlation between the state of the economy, the
size of the gross domestic product, and the number of fires that were started deliberately that
year.

Now first of all, we’re gonna do a scatterplot of this data and draw in a
line of best fit, so we can get a feel for what sort of correlation that we get. Then we’re going to calculate the Pearson’s product moment correlation
coefficient to quantify how much correlation we think there is, between these two sets of
data. When you’re choosing your variables, we normally have 𝑥 as the
independent variable, the thing that we can change that’s causing things to happen, and
𝑦 the dependent variable, the thing that’s affected by changing the value of
𝑥. Now the problem we’ve got with this set of data is this: is it the case that in
excellent economy years people celebrate by going out and starting fires, or is it that
people going round deliberately starting fires are causing the UK economy to grow every year? Well, in reality, I suspect the two things are completely unrelated. One
isn’t causing the other at all. But just randomly, I’m gonna use 𝑥 as the GDP
figure and 𝑦 as the number of fires.

So this is what the scatterplot looks like, and it does suggest some sort of
positive correlation. And if we added a line of best fit, that’s roughly what it would look like.
Let’s go on and calculate the strength of the correlation in this case then. We usually call the Pearson’s product moment correlation coefficient
𝑟, and here’s the formula for working it out. Now that looks pretty horrible, but let’s go through and explain what
everything means. This symbol here means “the sum of”. So this bit means the sum of the
𝑥𝑦. So we take each 𝑥-value and multiply it by the
corresponding 𝑦-value, and then we take those results and we add those all up. This bit is the sum of all the 𝑥-values, and this bit is the
sum of all the 𝑦-values. So we multiply those together and then divide by the
number of pieces of data that we have. For this bit, we take all of our 𝑥-values and we square them, and then
we add up all those squares. For this bit, we’re just adding up all the 𝑥-values, and then we’re
taking that answer and squaring it, and dividing by how many bits of data we’ve got. And then
we do the same for the corresponding 𝑦-values. And then we multiply those two
results together, and take the square root of that, and then complete this calculation.
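
The formula itself is only shown on screen in the video. Written out from the spoken description above, it would look something like this, where 𝑛 is the number of data pairs and each sum runs over all of them:

$$
r = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}\right)\left(\sum y^2 - \dfrac{\left(\sum y\right)^2}{n}\right)}}
$$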

Now a shorter version of that formula is this one here: 𝑆𝑥𝑦 over the
square root of 𝑆𝑥𝑥 times 𝑆𝑦𝑦. And in that case, all of this lot here is 𝑆𝑥𝑦, all of this lot
here is 𝑆𝑥𝑥, and all of this lot here is 𝑆𝑦𝑦. Now don’t worry about this too much for now, but 𝑆𝑥𝑥 over the
number of pieces of data you’ve got gives us an answer which is the variance of the
𝑥 data, a measure of the variability of the 𝑥-data. And 𝑆𝑦𝑦 over 𝑛 is the variance of the 𝑦-data or a
measure of the variability of the 𝑦-data. And 𝑆𝑥𝑦 over 𝑛 is called the covariance of the 𝑥-
and 𝑦-data, the measure of the variability of the 𝑥- and
𝑦-data together.
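
As a reconstruction from the description above (the on-screen version is not part of the transcript), the three shorthand quantities and the shorter formula can be written as:

$$
\begin{aligned}
S_{xx} &= \sum x^2 - \frac{\left(\sum x\right)^2}{n}, &
S_{yy} &= \sum y^2 - \frac{\left(\sum y\right)^2}{n}, &
S_{xy} &= \sum xy - \frac{\sum x \sum y}{n}, \\
r &= \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}, &
\operatorname{Var}(x) &= \frac{S_{xx}}{n}, &
\operatorname{Cov}(x, y) &= \frac{S_{xy}}{n}.
\end{aligned}
$$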

So let’s see how to use this all in practice. Well, first we need to extend the table of values that we had to start off
with. We’re gonna add columns for the square of each 𝑥-value, the
square of each 𝑦-value, and the product of each 𝑥 and 𝑦
pair. Now we’re gonna add a row to the bottom, where we’re gonna sum up all those
values. So first, all of the 𝑥 values added together gives us three thousand five hundred and eighty-eight. If I add all the
𝑦 values together, I get forty-four thousand and thirty-eight. So now I’m gonna
square each of the 𝑥-values in turn. So three hundred and sixty-nine squared is one hundred and thirty-six
thousand one hundred and sixty-one, five hundred and fifteen squared is two hundred and sixty-five thousand
two hundred and twenty-five, and so on. So if I add up all those 𝑥 squared values, I get two million nine hundred and forty-two thousand four hundred and
four. Now, working through the 𝑦-values, five hundred and
forty-five squared is two hundred and ninety-seven thousand and twenty-five, eight hundred and ninety-one squared is seven hundred and ninety-three
thousand eight hundred and eighty-one, and so on. And if I add all those values up, I get eight hundred and ninety-two million seven hundred and
forty-three thousand three hundred and ninety-six.

Now I need to multiply together the 𝑥- and the 𝑦-values. So three hundred and sixty-nine times five hundred and forty-five
gives me two hundred and one thousand one hundred and five, five hundred and fifteen times eight hundred and ninety-one gives me
four hundred and fifty-eight thousand eight hundred and sixty-five, and so on. And if I add up all the 𝑥𝑦-values, I get forty-four million four hundred and seventy-eight thousand nine
hundred and sixty-nine.
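
The extra table columns can be sketched in a few lines of Python. Only the first two (𝑥, 𝑦) pairs are read out in the transcript, so this minimal example uses just those two rows; the variable names are illustrative, and the remaining three rows are left out rather than guessed.

```python
# Only the two (x, y) pairs read aloud in the transcript are used here;
# the other three rows of the table are not given, so they are omitted.
pairs = [(369, 545), (515, 891)]

rows = []
for x, y in pairs:
    rows.append({
        "x": x,
        "y": y,
        "x_squared": x * x,  # column for the square of each x-value
        "y_squared": y * y,  # column for the square of each y-value
        "xy": x * y,         # column for the product of each x and y pair
    })

for row in rows:
    print(row)
```

With the full table, summing each column would give the totals used in the next step.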

So now I can calculate the 𝑆𝑥𝑥 value, so that’s the sum
of the 𝑥 squareds
minus the sum of 𝑥 all squared, and then, that’s divided by 𝑛, the number of pieces of data.
Well I’ve got one, two, three, four, five bits of data. So 𝑛 equals five. And when I do that calculation, I get three hundred and sixty-seven thousand six hundred and fifty-five
point two.

Now to work out 𝑆𝑦𝑦, that’s the sum of the 𝑦
squared terms minus the sum of the 𝑦-terms all squared over
𝑛. So, there’s my sum of 𝑦 squareds. That’s the sum of the 𝑦s and I’m gonna square that value. And 𝑛 is five, so I’m dividing by five. And that gives me five hundred and four million eight hundred and
seventy-four thousand three hundred and seven point two.

And now, let’s work out 𝑆𝑥𝑦. Well, that’s the sum of the
𝑥𝑦 products, and then we’re taking away the sum of the
𝑥s times the sum of the 𝑦s all over 𝑛. So, the sum of
the 𝑥𝑦s is that. The sum of the 𝑥s was three thousand five hundred and
eighty-eight, and the sum of the 𝑦s was forty-four thousand and
thirty-eight. So, we multiply those together and then divide that by 𝑛,
which is five, and that gives us twelve million eight hundred and seventy-seven
thousand three hundred point two.

So, let’s write our 𝑆𝑥𝑥, 𝑆𝑦𝑦 and 𝑆𝑥𝑦
values down here. And I can assure you it’s just a pure coincidence that they all ended in
nought point two. They don’t normally do that, and that’s just a quirk of these
particular numbers.
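
The three calculations can be checked with a short Python sketch that starts from the column totals quoted in the video. The variable names are just illustrative choices.

```python
# Column totals as read out in the video.
n = 5
sum_x = 3588
sum_y = 44038
sum_x_squared = 2942404
sum_y_squared = 892743396
sum_xy = 44478969

# Sum of the squares minus (the sum, all squared) over n.
s_xx = sum_x_squared - sum_x ** 2 / n
s_yy = sum_y_squared - sum_y ** 2 / n
# Sum of the products minus (sum of x times sum of y) over n.
s_xy = sum_xy - sum_x * sum_y / n

print(round(s_xx, 1), round(s_yy, 1), round(s_xy, 1))
```

Rounded to one decimal place, the three results agree with the values worked out above.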

Now if you are doing a test or an exam, at this point in the question, they
might ask you for the variance of 𝑥, or the variance of 𝑦, or the
covariance of 𝑥 and 𝑦. So what you do is take those values you’ve
got there and divide them by 𝑛. So each of those values divided by five would give us these
answers over here. But of course, that’s not what we’re looking for today. We’re looking for the
Pearson’s product moment correlation coefficient, which is 𝑆𝑥𝑦 over the square root of
𝑆𝑥𝑥 times 𝑆𝑦𝑦. So plugging those values in for 𝑆𝑥𝑦, 𝑆𝑥𝑥, and
𝑆𝑦𝑦 gives us that calculation, which gives us an answer of nought point nine four five one seven six three one one seven,
and so on and so on, as far as your calculator will go. Now typically, you’d normally give
your answer to one or two decimal places, depending on what the question asked for. So
rounding to two decimal places tells us that our Pearson’s product moment correlation coefficient is
nought point nine five, which is very close to one. So that’s strong positive correlation.
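
As a small sketch of this last step, the variance, covariance, and 𝑟 can all be computed from the three values found earlier. Dividing by 𝑛 follows the convention used in the video, and the variable names are illustrative.

```python
import math

n = 5
s_xx = 367655.2
s_yy = 504874307.2
s_xy = 12877300.2

# Variance and covariance as described in the video: divide by n.
variance_x = s_xx / n
variance_y = s_yy / n
covariance_xy = s_xy / n

# Pearson's product moment correlation coefficient.
r = s_xy / math.sqrt(s_xx * s_yy)

print(variance_x, variance_y, covariance_xy)
print(round(r, 2))  # 0.95
```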

Now actually, the interpretation of that number is far from easy. Now some people have suggested this as a set of guidelines for which words
you might use to describe which values. But it really depends on what type of data you’re
using. And remember, an 𝑟-value of nought point four is still
closer to zero, no correlation, than it is to one, perfect positive correlation.
So even medium association isn’t suggesting much of a link between the two sets of data. But the 𝑟-value can be useful when comparing associations
between different pairs of data sets. For example, is GDP more closely associated with the
number of fires started deliberately in the country or the proportion of the working age
population who are employed? The actual 𝑟-values may be difficult to interpret, but if one is
higher than the other, then it indicates a greater degree of association between that pair of
variables. Now it’s also true that there are lots of other statistical methods that can
help you to interpret the significance of the association between variables. But they’re
beyond the scope of this video. Also, one or two outliers or erroneous bits of data can have a
huge effect on the line of best fit and the Pearson’s correlation coefficient score.

Take a look at these two scatterplots here. Now they’re basically the same
data except the one on the left has got this outlying piece of data included, and that’s been
removed from the one on the right. Now if that was a genuine piece of data, then we couldn’t remove it from our
data set. But if it was a mistake, somebody had written down the wrong figure, then we would be
okay to remove it. Now in the first case, we’ve got a correlation coefficient of nought
point three six, which indicates weak, positive, or not very much correlation between
the two. But if that was a dodgy piece of data and then we got rid of it, our correlation
coefficient would go up to nought point seven six, which would be quite strong
positive correlation. Now if you didn’t feel that you could justify removing that piece of data
based on the evidence that you had, then it may be more appropriate to use a different method
of correlation analysis, such as Spearman’s rank correlation or Kendall’s tau correlation,
because they’re less sensitive than the Pearson’s correlation coefficient to individual data
values being more extreme.
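
To illustrate that last point, here is a minimal sketch using made-up data, not the data from the video. Spearman's rank correlation is computed here as Pearson's coefficient applied to the ranks of the values, and the simple `ranks` helper assumes there are no tied values. Making the final 𝑦-value far more extreme changes Pearson's coefficient a lot, but leaves the ranks, and so Spearman's coefficient, unchanged.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r using the S_xy / sqrt(S_xx * S_yy) form."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    return s_xy / math.sqrt(s_xx * s_yy)

def ranks(values):
    """Rank from 1 (smallest) upward; assumes no tied values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman_r(xs, ys):
    """Spearman's rank correlation: Pearson's r on the ranks."""
    return pearson_r(ranks(xs), ranks(ys))

# Made-up illustrative data: the last point is off-trend in both cases,
# but far more extreme in the second list.
xs = [1, 2, 3, 4, 5, 6]
ys_mild = [2, 3, 5, 4, 6, 1]
ys_extreme = [2, 3, 5, 4, 6, -20]

pearson_mild = pearson_r(xs, ys_mild)
pearson_extreme = pearson_r(xs, ys_extreme)
spearman_mild = spearman_r(xs, ys_mild)
spearman_extreme = spearman_r(xs, ys_extreme)

print(pearson_mild, pearson_extreme)
print(spearman_mild, spearman_extreme)
```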

So, back to our original question then. The Pearson’s correlation coefficient
between the UK gross domestic product, in billions of pounds, and the number of fires that were
started deliberately that year, in England and Wales, is nought point nine five.
What does that actually mean? Well, it doesn’t imply causality. So, it’s not the case that when the GDP is
good and high, then people deliberately go out and start fires. Even though, it does seem to be the case that in years that the economy is
doing well, there are more deliberately started fires. So you’re fairly limited in what you can actually say. There are all sorts
of reasons why there may not be a real actual link between those two things, a causal link
between them. For example, we haven’t got many data points, so it could be that we’ve just
picked out some unrepresentative bits of data from the general situation. It could just be that there’s a coincidence in this data, or even that maybe
there has been a change in the way that data has been recorded for the GDP or for the number
of fires started deliberately, which makes it look like there’s an association between the
two. It could even be the case that some other factor is causing both of these
things to occur. Another thing to consider is that there may not be a linear relationship
between these two, even though we’ve got a pretty high correlation coefficient. It could be
that actually a curve of best fit would work much better. In fact, when I add in the data for two thousand and two
thousand and ten, we can see that a curve would fit this data much better than a
straight line. That’s beyond the scope of this video, again to actually show you how to
deal with that, but it’s certainly something that’s worth noticing and commenting on in your
analysis of your data.
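
A quick sketch with made-up data (not the GDP figures) shows how a high coefficient can hide a curved relationship. Here every point lies exactly on the curve 𝑦 = 𝑥², yet Pearson's coefficient still comes out close to one, so a high value on its own doesn't tell you that a straight line is the right model.

```python
import math

# Made-up illustrative data: every point lies exactly on y = x squared.
xs = [1, 2, 3, 4, 5]
ys = [x ** 2 for x in xs]

n = len(xs)
sum_x, sum_y = sum(xs), sum(ys)
s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n

r = s_xy / math.sqrt(s_xx * s_yy)
print(round(r, 2))  # 0.98
```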

So, we’ve talked about how to calculate the value of the Pearson’s product
moment correlation coefficient and what it means, and we’ve also included some reasons why you
need to be very cautious about how you use it to interpret your data.

So I’ll just post more data on the screen now for you to have a go and try to
do a calculation for yourself. So I recommend that you pause the video, and then I’ll go
through the answer in just a moment. So you’ve gotta find the value of the Pearson’s product
moment correlation coefficient for the data shown, giving your answer to two decimal places. Okay. So remember, the first thing we’ve got to do is add three columns for
𝑥 squared, 𝑦 squared, and 𝑥𝑦 and add a row at the bottom for all the totals. Now, we need to square all the 𝑥-values. So five squared is twenty-five, seven squared is
forty-nine, and so on. Then we can square all the 𝑦-values. So three squared is
nine, eight squared is sixty-four, and so on. And then, we can multiply
all of the 𝑥 and 𝑦 pairs together. So five times three is
fifteen, seven times eight is fifty-six, and so on. Then we need to add up each of those columns. And we can note that we’ve got five pieces of data, so 𝑛 is equal to
five. Then, we can use each of these formulae to work out the 𝑆𝑥𝑥,
𝑆𝑦𝑦 and 𝑆𝑥𝑦 values. So, starting with 𝑆𝑥𝑥, the sum of the 𝑥 squareds is four
hundred and three, the sum of 𝑥 is forty-three, so I need to square
that and then divide by five, the number of data points, and that gives us
thirty-three point two. Then, working out 𝑆𝑦𝑦, the sum of the 𝑦 squared terms is
two hundred and ninety; the sum of 𝑦 is thirty-six so we’ve got to square
that. So two hundred and ninety minus thirty-six squared over the number of pieces of
data, five, gives us thirty point eight.

See? I told you not all the numbers end in nought point two. And for 𝑆𝑥𝑦, we’ve got the sum of the 𝑥𝑦s, which is
three hundred and thirty-eight minus the sum of the 𝑥s times the sum
of the 𝑦s. So that’s forty-three times thirty-six all over the
number of pieces of data, which is five. That gives us twenty-eight point
four. And to work out the Pearson’s product moment correlation coefficient, we take
the 𝑆𝑥𝑦 value and divide it by the square root of the 𝑆𝑥𝑥 value
times the 𝑆𝑦𝑦 value. So that’s twenty-eight point four over the square root
of thirty-three point two times thirty point eight which gives us, to two decimal
places, nought point eight nine. Now my calculator said nought point eight eight eight one two four six
eight two four and so on. But the question only asked for the answer to two decimal places.