Lesson Video: Pearson’s Correlation Coefficient Mathematics

In this video, we will learn how to calculate and use Pearson’s correlation coefficient, r, to describe the strength and direction of a linear relationship.

Video Transcript

In this video, we’ll learn how to calculate and use Pearson’s correlation coefficient 𝑟 to describe the strength and direction of a linear relationship. We’ll begin by reminding ourselves of some of the terms and ideas related to correlation, which we’ll explore with some examples. And then, we’ll calculate Pearson’s product-moment correlation coefficient by hand using the formula.

Bivariate data is data where two numerical or quantitative variables are uniquely paired across subjects in an experiment. Suppose, for example, that we have 𝑛 people in a sample and we measure their heights and their weights. For each person or subject, we have a unique pair of measurements. If we call 𝑋 height in meters and 𝑌 weight in kilograms, then the pair of measurements for each subject or person gives us a data point in our bivariate data set. Now, suppose we’d like to know if there’s a relationship or correlation between a person’s height and weight. To give us some idea, we first plot our data on a scatter plot. And if we find that our data follow a linear pattern, then we can say there’s a linear correlation between 𝑋 and 𝑌 or, in this case, height and weight.

It’s important to remember though when looking at correlation that we’re not saying that a change in one variable causes a change in another variable. We’re simply describing the relationship between the variables. And a scatter plot can give us some information about our data. We can see from this scatter plot, for example, that someone who is quite tall might be expected to be relatively heavy. If there is correlation between our variables, the scatter plot can tell us the direction of our correlation. If our 𝑋- and 𝑌-values increase together, then we say we have positive or direct correlation. And if as the 𝑋-values increase, the 𝑌-values decrease, then we say we have negative or inverse correlation.

If our scatter plot indicates no pattern at all, then we have no correlation. And if we have a nonlinear relationship between 𝑋 and 𝑌, then of course there’s no linear correlation. We can also to some extent tell from a scatter plot how strong the linear relationship is by how closely the points sit together in the linear pattern. So, for example, in the left-hand diagram, where the points are closely following a linear pattern, we say the correlation between 𝑋 and 𝑌 is very strong, whereas the data on the right-hand diagram is loosely spread in the linear pattern. And we would say that this is weak or moderate direct linear correlation.

So from our scatter diagrams, we have an idea of the direction and the strength of the correlation. But since we’re mathematicians, we’d like something a little more precise to measure our relationships. And this is where the correlation coefficient comes in. This idea was developed by an English mathematician called Karl Pearson and hence is known as Pearson’s correlation coefficient or Pearson’s product-moment correlation coefficient. It’s denoted by 𝑟 subscript 𝑥𝑦 or simply 𝑟. 𝑟 takes values from negative one to plus one. And the closer it is to either positive or negative one, the stronger the linear relationship or correlation. Let’s look now at our first example where we estimate Pearson’s correlation coefficient from a scatter plot.

What is the most likely value of the product-moment correlation coefficient for the data shown in the diagram? Is it (A) negative 0.58, (B) zero, (C) negative 0.94, (D) 0.78, or (E) 0.37?

When estimating Pearson’s correlation coefficient from a scatter plot, there are two things we look at. The first is the direction of the linear pattern, which in our case is top left to bottom right. The second thing is the spread of the data points around a possible line of best fit. That is, how close our data points are to a potential line of best fit. Let’s consider first the direction of the linear relationship. Generally speaking, we know that if our linear pattern is from bottom left to top right, then we have direct or positive linear correlation. Conversely, if our data follow a linear pattern from top left to bottom right, we say our data is negatively or inversely correlated.

The product-moment correlation coefficient for directly or positively correlated data takes values between zero and one, whereas if our data are negatively or inversely correlated, the coefficient is between negative one and zero. With our data, the linear pattern is from top left to bottom right. So our data correspond to the second case, negative or inverse correlation. Our coefficient then must be between negative one and zero. This means we can eliminate option (D) and option (E) since they’re both positive. We’ve established that the direction of our correlation is negative.

So now, let’s look at the spread of the data about a possible line of best fit. We know that the wider the spread away from a possible line of best fit, the weaker the correlation. And the closer the data are to a potential line of best fit, the stronger the correlation. The product-moment correlation coefficient takes values between negative one and positive one. And the closer the coefficient is to positive or negative one, the stronger the correlation. On the other hand, the closer the correlation coefficient to zero, the weaker the correlation. In the given data plot, most of the points are very close to a possible line of best fit. And remembering that our coefficient is negative, this means our coefficient must be close to negative one. We can certainly eliminate option (B) since we know that a correlation coefficient of zero means no correlation at all.

So we’re left with options (A) and (C). Option (A), with a value of negative 0.58, would indicate a moderate correlation. That’s because it’s just over halfway between zero and negative one. So since our correlation is very strong, we can eliminate option (A). Option (C) is the closest to negative one, with a value negative 0.94. So the most likely product-moment correlation coefficient for the data shown is option (C), with a value of negative 0.94.

It’s worth noting also that if all of the data points lie exactly on the line, we have either perfect direct or positive linear correlation or perfect inverse or negative linear correlation. If we have perfect direct correlation, the coefficient 𝑟 is equal to one. And with perfect inverse correlation, the coefficient is equal to negative one. Let’s now look at some examples where we’ll interpret different values of Pearson’s correlation coefficient.
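As a quick check of these two limiting cases, here is a minimal sketch that computes 𝑟 for points lying exactly on a line. The data values and the helper name `pearson_r` are made up for this illustration. The mean-deviation form used here is one of the equivalent ways of writing the coefficient and is not necessarily the form shown in the video.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from paired data, using deviations from the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical points, chosen to lie exactly on a straight line.
xs = [1, 2, 3, 4]
rising = [2 * x + 1 for x in xs]     # on the line y = 2x + 1
falling = [10 - 3 * x for x in xs]   # on the line y = 10 - 3x

r_rising = pearson_r(xs, rising)     # perfect direct correlation
r_falling = pearson_r(xs, falling)   # perfect inverse correlation
print(r_rising, r_falling)
```

The slope of the line does not matter here. Only its sign decides whether the coefficient comes out as positive one or negative one.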

Which of the following correlation coefficients indicates the weakest inverse correlation? Is it option (A) negative 0.48, option (B) negative 0.22, option (C) negative 0.75, or option (D) negative 0.83?

We’re given four correlation coefficients, and we want to determine which of these indicates the weakest inverse correlation. We know that Pearson’s correlation coefficient takes values between negative one and positive one. And we know that if the value is between negative one and zero, then we have inverse or negative correlation. So all of the given options represent an inverse relationship. If the coefficient is between zero and positive one, then our correlation is positive or direct. We also know that the closer the coefficient is to positive or negative one, the stronger the correlation and that the closer the coefficient is to zero, the weaker the correlation. And what this means is that the greater the magnitude of the correlation coefficient, the stronger the correlation.

So now, if we look at the magnitudes of our four options, then the magnitude, that is, the absolute value, of option (A) is 0.48. The magnitude of option (B) is 0.22. The magnitude of option (C) is 0.75. And the magnitude of option (D) is 0.83. Remember, we’re looking for the correlation coefficient that indicates the weakest correlation. This means the correlation coefficient with the smallest magnitude, that is, whose magnitude is closest to zero. We see that our option (B) has a magnitude closest to zero, and this indicates that (B) represents the weakest correlation. Our answer is therefore (B) with a value of negative 0.22.
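The comparison of magnitudes can be sketched in a few lines of code. The dictionary and variable names are only illustrative. The values are the four options from the question.

```python
# The four answer options from the question above.
options = {"A": -0.48, "B": -0.22, "C": -0.75, "D": -0.83}

# Strength depends only on the magnitude |r|, so the weakest
# correlation is the option whose absolute value is smallest.
weakest_label = min(options, key=lambda label: abs(options[label]))
strongest_label = max(options, key=lambda label: abs(options[label]))

print("weakest:", weakest_label, options[weakest_label])
print("strongest:", strongest_label, options[strongest_label])
```

Comparing the raw values instead of their magnitudes would pick option (D), since negative 0.83 is the smallest number, which is why we take the absolute value first.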

Let’s now consider another example.

Which of the following is the most appropriate interpretation of a product-moment correlation coefficient of 0.8? Is it (A) a strong negative linear correlation, (B) a moderate negative linear correlation, (C) a moderate positive linear correlation, (D) a strong positive linear correlation, or (E) no correlation?

We know that Pearson’s product-moment correlation coefficient 𝑟 subscript 𝑥𝑦 or just 𝑟 takes values between negative one and positive one. We also know that if 𝑟 is less than zero and greater than or equal to negative one, then we have inverse or negative correlation and that if 𝑟 is greater than zero and less than or equal to positive one, we have direct or positive correlation. We’re asked which of the given options is the most appropriate interpretation of a product-moment correlation coefficient with a value of 0.8. Since this value is positive, we know that we have direct or positive correlation. This means we can eliminate any of our options that specify negative correlation. So we can eliminate options (A) and (B), since these both specify a negative correlation. We can also eliminate option (E), since no correlation would give us a correlation coefficient of zero. And our correlation coefficient is nonzero; it’s 0.8.

This leaves us with options (C) and (D), a moderate positive linear correlation or a strong positive linear correlation. If we consider the magnitude of the correlation coefficient, the stronger the correlation, the closer the magnitude to one. And the closer the magnitude is to zero, the weaker the correlation. This means then that approximately midway between zero and positive or negative one, we have moderate correlation. Since the given coefficient is 0.8, which is close to positive one, we can therefore say that this represents strong positive correlation. And so the most appropriate interpretation of a product-moment correlation coefficient of 0.8 is option (D), a strong positive linear correlation.
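The reasoning in this example can be sketched as a small function. The cut-off values 0.3 and 0.7 and the name `describe_correlation` are assumptions made for this illustration. The lesson only says that magnitudes near one are strong, magnitudes about midway are moderate, and magnitudes near zero are weak, so other reasonable cut-offs exist.

```python
def describe_correlation(r, moderate_from=0.3, strong_from=0.7):
    """Describe r in words. The two cut-offs are assumed, not from the lesson."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")
    if r == 0:
        return "no linear correlation"
    # The sign gives the direction, the magnitude gives the strength.
    direction = "positive" if r > 0 else "negative"
    size = abs(r)
    if size >= strong_from:
        strength = "strong"
    elif size >= moderate_from:
        strength = "moderate"
    else:
        strength = "weak"
    return f"{strength} {direction} linear correlation"

print(describe_correlation(0.8))    # the coefficient from this example
print(describe_correlation(-0.58))  # option (A) from the first example
```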

So now that we know how to interpret Pearson’s product-moment correlation coefficient, let’s look at how we might actually calculate it. There are a few equivalent ways to write the formula for Pearson’s product-moment correlation coefficient. And the one we’re going to use is shown. You may, however, see the correlation coefficient written as 𝑆 subscript 𝑥𝑦 over the square root of 𝑆 subscript 𝑥𝑥 multiplied by 𝑆 subscript 𝑦𝑦, where the terms are as shown. Looking now at our formula, we recall that the capital Σ represents the sum, 𝑛 represents the number of data pairs, and 𝑥𝑦 stands for the product of the 𝑥- and 𝑦-values within each data pair. Let’s look at an example of how to use the formula when we’re given the summary statistics.
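In symbols, the form of the formula that matches the worked examples that follow can be written as below. The definitions of 𝑆 subscript 𝑥𝑦, 𝑆 subscript 𝑥𝑥, and 𝑆 subscript 𝑦𝑦 given here are the standard ones. They are an assumption about what was shown on screen, since the transcript does not reproduce them.

```latex
\[
r_{xy} = \frac{n\sum xy - \sum x \sum y}
              {\sqrt{n\sum x^{2} - \left(\sum x\right)^{2}}\;
               \sqrt{n\sum y^{2} - \left(\sum y\right)^{2}}}
\]

\[
r_{xy} = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}},
\qquad
S_{xy} = \sum xy - \frac{\sum x \sum y}{n},
\quad
S_{xx} = \sum x^{2} - \frac{\left(\sum x\right)^{2}}{n},
\quad
S_{yy} = \sum y^{2} - \frac{\left(\sum y\right)^{2}}{n}
\]
```

Multiplying the numerator and denominator of the second form by 𝑛 gives the first form, so the two always produce the same value.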

A data set can be summarized by the following. 𝑛 is equal to eight. The sum of the 𝑥-values is 78. The sum of the 𝑦-values is negative 73. The sum of the products 𝑥𝑦 is negative 752. The sum of the 𝑥 squareds is 792. The sum of the 𝑦-values squared is 735. Calculate the product-moment correlation coefficient for this data set, giving your answer to three decimal places.

We’re given the summary statistics for a set of bivariate data that we can use to calculate Pearson’s product-moment correlation coefficient. In the formula shown, we’re given a value for 𝑛, and that’s equal to eight. And this is the number of data pairs in our data set. We’re given the sum of the 𝑥-values, and that’s 78, the sum of the 𝑦-values, which is negative 73, the sum of the products 𝑥𝑦, that’s negative 752, the sum of the 𝑥 squared values, which is 792, and the sum of the 𝑦-values squared, which is 735. So all we need to complete our formula are the square of the sum of the 𝑥-values and the square of the sum of the 𝑦-values. These are not the same as the sums of the squares that we were given.

The sum of our 𝑥-values, all squared, is 78 squared. And that’s 6084. The sum of the 𝑦-values, all squared, is negative 73 squared, which is 5329. And adding these to our list and making some space, we can now substitute our summary statistics into our formula for the correlation coefficient. We have the product-moment correlation coefficient 𝑟 subscript 𝑥𝑦 as shown. And evaluating our numerator and the two square roots in the denominator, we have negative 322 divided by the square root of 252 multiplied by the square root of 551. And this evaluates to negative 0.864 to three decimal places.

The correlation coefficient must lie between negative one and positive one. And in our case, this is true. And in fact, since our correlation coefficient is close to negative one, we can interpret this as a strong negative correlation. The product-moment correlation coefficient for the data set summarized by the given statistics is negative 0.864 to three decimal places. Note that you may also see the correlation coefficient written simply as 𝑟.
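The substitution in this example can be checked with a short sketch. The function name `pearson_r_from_sums` and its parameter names are chosen for this illustration. The numbers are the summary statistics from the question.

```python
import math

def pearson_r_from_sums(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Pearson's r from summary statistics (sums form of the formula)."""
    numerator = n * sum_xy - sum_x * sum_y
    root_x = math.sqrt(n * sum_x2 - sum_x ** 2)
    root_y = math.sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / (root_x * root_y)

# Intermediate values, matching the ones quoted in the transcript.
numerator = 8 * -752 - 78 * -73   # -322
under_root_x = 8 * 792 - 78 ** 2  # 252
under_root_y = 8 * 735 - 73 ** 2  # 551

r = pearson_r_from_sums(n=8, sum_x=78, sum_y=-73,
                        sum_xy=-752, sum_x2=792, sum_y2=735)
print(numerator, under_root_x, under_root_y)
print(round(r, 3))  # -0.864
```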

In our final example, we’ll calculate Pearson’s correlation coefficient from scratch.

The data table shows the high jump and long jump results achieved by 15 competitors in the women’s heptathlon in the 2016 Rio Olympics. Calculate, to the nearest thousandth, the value of the product-moment correlation coefficient between the long jump and high jump results. What does this correlation coefficient reveal about the relationship between the long jump and high jump results?

We’re given a table of values for two variables, long jump and high jump scores for 15 women athletes in the Rio Olympics. This is bivariate data, which means that two measurements are recorded for each individual athlete, how far they jumped in the long jump and how high they jumped in the high jump. So, for example, athlete one jumped 5.51 meters in the long jump and 1.65 meters in the high jump. And there are two parts to this question. We’re first asked to calculate the product-moment correlation coefficient and then we’re asked for an interpretation of this value.

For the first part of the question, we’re going to use the formula shown for the correlation coefficient 𝑟 subscript 𝑥𝑦 which you may see written as 𝑟. And to use this formula, we recall that the capital Σ symbol signifies the sum, also that 𝑛 is the number of data points or pairs. In our case, we have 15 athletes so that 𝑛 is equal to 15. So let’s begin by calling the long jump in meters the variable 𝑋 and the high jump in meters the variable 𝑌. To calculate our coefficient, we’re going to need the various expressions within the formula. We’ll need the products 𝑥𝑦, the 𝑥-values squared, and the 𝑦-values squared. And so, we add some rows to our table to help us with our calculations. We also add a column to the end of our table for our sums.

So let’s first work out the products 𝑥𝑦 for each athlete. For our first athlete, the product is 5.51 multiplied by 1.65. And that’s equal to 9.0915. And so we put this into the first empty cell of our new row for 𝑥𝑦. Similarly, for our second athlete, we have 5.72 multiplied by 1.77, which is 10.1244. And this goes into the second cell for the 𝑥𝑦 row. And so we continue in this way, filling in the rest of the row, where we’ve restricted ourselves to three decimal places for the sake of space. In the second new row of our table, we want the 𝑥-values squared. So for example, our first entry will be 5.51 squared, and that’s 30.3601. Putting this in our table, we restrict ourselves now to two decimal places for the sake of space. And squaring the remainder of our 𝑥-values, we can fill in the table as shown.
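The way these rows are filled in can be sketched as follows. Only the first two athletes’ results are quoted in this transcript, so the sketch uses just those two pairs, and the variable names are chosen for illustration.

```python
# Only the first two athletes' results are quoted in the transcript;
# the remaining 13 columns of the table are not reproduced here.
long_jump = [5.51, 5.72]   # x-values, in meters
high_jump = [1.65, 1.77]   # y-values, in meters

# New row 1: the product xy for each athlete.
xy_row = [x * y for x, y in zip(long_jump, high_jump)]

# New row 2: the square of each x-value.
x_sq_row = [x ** 2 for x in long_jump]

for i, (xy, xx) in enumerate(zip(xy_row, x_sq_row), start=1):
    print(f"athlete {i}: xy = {xy:.4f}, x squared = {xx:.4f}")
```

The row of 𝑦-values squared is built in the same way from the high jump results.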

Next, we take the squares of the high jump values, that’s the 𝑦-values squared, and fill in our table as shown. So now let’s fill in our sums column, where the sum of the 𝑋’s, for example, is the long jump scores all summed. And that’s equal to 91.43. The sum of all high jump scores, that’s the sum of the 𝑌-values, is 27.21. The sum of the products 𝑥𝑦 is 166.1151; that’s to four decimal places. The sum of the 𝑥-values squared is 558.4923. And the sum of the squares of the 𝑦’s is 49.4361. So in our sums column, we have the sum of the 𝑋-values, the sum of the 𝑌-values, the sum of the product 𝑥𝑦, the sum of the 𝑥-values squared, and the sum of the 𝑦-values squared. So now, we have everything we need for our formula.

With 𝑛 equal to 15, that’s the number of athletes, and all of the sums from our table, our correlation coefficient can be calculated as shown. Using our calculators, we can evaluate the numerator and denominator as shown. So we have 3.9162 divided by 4.5567, each to four decimal places, which is 0.8594 to four decimal places. To three decimal places then, that is, to the nearest thousandth, the value of the product-moment correlation coefficient between the long jump and high jump results is 0.859.
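As a cross-check, here is a minimal sketch that recomputes the coefficient from the sums quoted above using the equivalent 𝑆 subscript 𝑥𝑦 over the square root of 𝑆 subscript 𝑥𝑥 times 𝑆 subscript 𝑦𝑦 form. The definitions of the three 𝑆 terms are the standard ones and are an assumption about what the video displayed. The sums are the four-decimal-place values from the table, so the result is approximate.

```python
import math

# Sums quoted in the transcript for the 15 athletes.
n = 15
sum_x, sum_y = 91.43, 27.21          # long jump and high jump totals
sum_xy = 166.1151                    # sum of the products xy
sum_x2, sum_y2 = 558.4923, 49.4361   # sums of the squares

# Standard definitions of the S terms (assumed, see the note above).
s_xy = sum_xy - sum_x * sum_y / n
s_xx = sum_x2 - sum_x ** 2 / n
s_yy = sum_y2 - sum_y ** 2 / n

r = s_xy / math.sqrt(s_xx * s_yy)
print(round(r, 4))  # 0.8594
print(round(r, 3))  # 0.859
```

Each 𝑆 term is the corresponding part of the sums formula divided by 𝑛, so the factors of 𝑛 cancel and the value agrees with the one found above.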

For the second part of the question regarding the relationship between the long jump and high jump results, our coefficient is very close to positive one. This means that there’s a strong positive, that is, direct, linear correlation between long jump and high jump results for the women athletes in the Rio Olympics.

Let’s now complete this video by reminding ourselves of some of the key points we’ve covered. We know that correlation does not mean causation. It simply indicates that a linear relationship exists between two variables and gives us some idea of the strength and direction of the relationship. We know that the product-moment correlation coefficient applies to bivariate numerical data. The coefficient takes values between negative one and positive one. The closer 𝑟 is to negative or positive one, the stronger the correlation. And conversely, the closer 𝑟 is to zero, the weaker the correlation between the two variables.

And then, if the correlation coefficient is equal to zero, there’s no linear correlation. A positive correlation coefficient indicates a direct or positive linear relationship between the variables, whereas a negative coefficient indicates an inverse or negative linear correlation. And to actually calculate Pearson’s product-moment correlation coefficient for a bivariate data set, we use the formula shown where the coefficient can be labeled either 𝑟 subscript 𝑥𝑦 or simply 𝑟.
