Video: Pearson’s Correlation Coefficient | Nagwa

Video: Pearson’s Correlation Coefficient

In this video, we will learn how to calculate and use Pearson’s correlation coefficient 𝑟 to describe the strength and direction of a linear relationship.

18:34

Video Transcript

In this video, we’ll learn how to calculate and use Pearson’s correlation coefficient 𝑟 to describe the strength and direction of a linear relationship. We’ll begin by reminding ourselves of some of the terms and ideas related to correlation, which we’ll explore with some examples. And then we’ll calculate Pearson’s product-moment correlation coefficient by hand using the formula.

Bivariate data is data where two numerical or quantitative variables are uniquely paired across the subjects in an experiment. Suppose we have 𝑛 people in a sample and we measure their heights and weights. For each person or subject, we have a unique pair of measurements. If we call 𝑥 height in meters and 𝑦 weight in kilograms, then the pair of measurements for each person or subject gives us a data point in our bivariate data set. Now suppose we’d like to know if there’s a relationship or correlation between a person’s height and weight. To give us some idea, we first plot our data on a scatter plot. And if we find that our data follows a linear pattern, then we can say there’s a linear correlation between 𝑥 and 𝑦 or height and weight.

It’s important to remember though when looking at correlation that we’re not saying that a change in one variable causes a change in another variable. We’re simply describing the relationship between the variables. And a scatter plot can give us some information about our data. We can see from this scatter plot, for example, that someone who is quite tall might be expected to be relatively heavy. And if there is correlation between our variables, the scatter plot can tell us the direction of correlation. If our 𝑥- and 𝑦-values increase together, then we say we have positive or direct correlation. And if as the 𝑥-values increase the 𝑦-values decrease, then we say we have negative or inverse correlation. If, on the other hand, there’s no pattern at all, we say there’s no correlation between 𝑥 and 𝑦. And if we have a nonlinear relationship between 𝑥 and 𝑦, then there’s no linear correlation.

We can also, to some extent, tell from a scatter plot how strong the linear relationship is by how closely the points sit together in the linear pattern. So, for example, in the left-hand diagram, where the points closely follow a linear pattern, we say the correlation is very strong, whereas the data in the right-hand diagram is loosely spread around the linear pattern. And we would say that this is weak or moderate direct linear correlation. This is all very well, but we’re mathematicians and we want something a little more precise to measure our relationships. And this is where the correlation coefficient comes in.

This idea was developed by an English mathematician called Karl Pearson and hence is known as Pearson’s correlation coefficient or Pearson’s product-moment correlation coefficient. And it’s denoted by 𝑟 subscript 𝑥𝑦 or simply 𝑟. 𝑟 takes values from negative one to plus one. And the closer it is to either positive or negative one, the stronger the linear relationship or correlation. Let’s look at our first example where we estimate Pearson’s correlation coefficient from a scatter plot.

What is the most likely value of the product-moment correlation coefficient for the data shown in the diagram? Is it (A) negative 0.58, (B) zero, (C) negative 0.94, (D) 0.78, or (E) 0.37?

In estimating Pearson’s correlation coefficient from a scatter plot, there are two things we look at. The first is the direction of the linear pattern, which in our case is top left to bottom right. And the second thing is the spread of the data points around a possible line of best fit, that is, how close our data points are to a potential line of best fit. Generally speaking, we know that if a linear pattern of data is from bottom left to top right, then we have positive or direct correlation. Conversely, if our data follow a linear pattern from top left to bottom right, we say our data is negatively or inversely correlated. And if our data is directly correlated, that’s positively, then our coefficient is between zero and one, whereas if our data is inversely correlated, the coefficient is between negative one and zero.

In our case, our linear pattern is from top left to bottom right, so ours is the second case. This means our correlation coefficient must be between negative one and zero. And this means we can eliminate both (D) and (E) since these are both positive. And now if we look at the spread of the data, we know that the wider the spread away from a potential line of best fit, the weaker the correlation and that the closer the data points are to a potential line of best fit, the stronger the correlation. We know that Pearson’s correlation coefficient takes values from negative one to positive one and that the closer the coefficient is to positive or negative one, the stronger the correlation. And we know that the closer the correlation coefficient gets to zero, the weaker the correlation.

In the given plot, most of the data points are very close to a possible line of best fit. And remembering that our coefficient is negative, this means our coefficient must be close to negative one. We can eliminate (B) since we know that a correlation coefficient of zero means there’s no linear correlation at all, and we have very strong correlation. And so we’re left with option (A) and option (C). Option (A) with the value negative 0.58 would indicate a moderate correlation. That’s because it’s just over halfway between zero and negative one. And since our correlation is very strong, we can eliminate option (A). Option (C) is the closest to negative one with a value of negative 0.94. So the most likely value of the product-moment correlation coefficient for the data shown is option (C), negative 0.94.

It’s worth noting also that if all of the data points lie exactly on the line, we have either perfect direct (positive) linear correlation or perfect inverse (negative) linear correlation. In the case of perfect direct correlation, the coefficient 𝑟 is equal to one. And for perfect inverse correlation, the coefficient 𝑟 is equal to negative one. Let’s now look at some examples where we’ll interpret different values of Pearson’s correlation coefficient.
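These boundary cases can be checked numerically. The following is a minimal sketch using the sums-based formula introduced later in the video; the function name `pearson_r` and the small data sets are made up for illustration. It also includes a symmetric curve to illustrate the earlier point that a nonlinear relationship can show no linear correlation.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r computed as S_xy / sqrt(S_xx * S_yy) (illustrative helper)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    s_xx = sum(x * x for x in xs) - n * x_bar ** 2
    s_yy = sum(y * y for y in ys) - n * y_bar ** 2
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical points lying exactly on a rising line: perfect direct correlation.
xs = [1, 2, 3, 4, 5]
r_up = pearson_r(xs, [2 * x + 1 for x in xs])

# Hypothetical points lying exactly on a falling line: perfect inverse correlation.
r_down = pearson_r(xs, [10 - 3 * x for x in xs])

# A symmetric parabola: a clear relationship, but not a linear one.
curve_xs = [-3, -2, -1, 0, 1, 2, 3]
r_curve = pearson_r(curve_xs, [x ** 2 for x in curve_xs])

print(r_up, r_down, r_curve)
```

Up to floating-point rounding, the three results are one, negative one, and zero respectively.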

Which of the following correlation coefficients indicates the weakest inverse correlation? Is it (A) negative 0.48, (B) negative 0.22, (C) negative 0.75, or (D) negative 0.83?

We’re given four correlation coefficients and we want to determine which of these represents the weakest inverse correlation. We know that Pearson’s correlation coefficient takes values between negative one and plus one. We know that if our coefficient is between negative one and zero, we have inverse correlation. That’s negative correlation. And if the value of 𝑟 is between zero and positive one, then we have direct or positive correlation. We also know that the closer the coefficient is to positive or negative one, the stronger the correlation and that the closer the coefficient is to zero, the weaker the correlation. And what this means is that the greater the magnitude of the correlation coefficient, the stronger the correlation.

So if we now look at the magnitudes of our four options, the magnitude of option (A) is 0.48, the magnitude of option (B) is 0.22, the magnitude of option (C) is 0.75, and the magnitude of option (D) is 0.83. And remember, we’re looking for the correlation coefficient that indicates the weakest correlation. This means the correlation coefficient with the smallest magnitude, that is, whose magnitude is closest to zero. And we can see that option (B) has the magnitude closest to zero, so it indicates the weakest inverse correlation. So our answer is option (B), negative 0.22.
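The comparison of magnitudes can be sketched in a few lines of Python. The dictionary name and keys are just labels for the four options in the question.

```python
# The four answer options from the question, keyed by their labels.
options = {"A": -0.48, "B": -0.22, "C": -0.75, "D": -0.83}

# Strength depends on the magnitude |r|, not on the sign.
weakest = min(options, key=lambda label: abs(options[label]))
strongest = max(options, key=lambda label: abs(options[label]))

print(weakest, options[weakest])      # B -0.22
print(strongest, options[strongest])  # D -0.83
```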

Let’s consider another example.

Which of the following is the most appropriate interpretation of a product-moment correlation coefficient of 0.8? Is it (A) a strong negative linear correlation, or (B) a moderate negative linear correlation, or (C) a moderate positive linear correlation, (D) a strong positive linear correlation, or (E) no correlation?

We know that Pearson’s product-moment correlation coefficient takes values between negative one and positive one. We also know that if 𝑟 is less than zero and greater than or equal to negative one, then we have inverse or negative correlation and that if 𝑟 is greater than zero and less than or equal to positive one, we have direct or positive correlation. We’re asked which of the given options is the most appropriate interpretation of a product-moment correlation coefficient of positive 0.8. And the correlation coefficient of positive 0.8 means that we have direct or positive correlation. And noting that the product-moment correlation coefficient applies to linear correlation, we can eliminate any of our options that interpret the coefficient as a negative linear correlation.

This means we can eliminate option (A) and option (B) since these both specify negative linear correlation. We can also eliminate option (E) since this specifies no correlation and we do not have zero correlation. This leaves us with option (C) and option (D), a moderate positive linear correlation or a strong positive linear correlation. Now consider the magnitude of the correlation coefficient. The closer the magnitude is to one, the stronger the correlation, and the closer the magnitude is to zero, the weaker the correlation. This means that a magnitude approximately midway between zero and one indicates moderate correlation.

Since the given coefficient is 0.8, which is close to positive one, we can say this represents strong positive linear correlation. We can therefore eliminate option (C), which refers to a moderate positive linear correlation. And so the most appropriate interpretation of a product-moment correlation coefficient of 0.8 is option (D) a strong positive linear correlation.
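The interpretation rules used in these examples can be sketched as a small function. Note that the video gives no exact cutoffs between weak, moderate, and strong, so the thresholds 0.3 and 0.7 below are assumptions chosen only to agree with the examples worked so far, and the function name `describe` is hypothetical.

```python
def describe(r, weak_below=0.3, strong_from=0.7):
    """Describe r in words. The cutoffs are illustrative assumptions, not a standard."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")
    if r == 0:
        return "no linear correlation"
    # The sign gives the direction; the magnitude gives the strength.
    direction = "positive" if r > 0 else "negative"
    size = abs(r)
    if size >= strong_from:
        strength = "strong"
    elif size >= weak_below:
        strength = "moderate"
    else:
        strength = "weak"
    return f"{strength} {direction} linear correlation"

print(describe(0.8))    # strong positive linear correlation
print(describe(-0.58))  # moderate negative linear correlation
print(describe(-0.22))  # weak negative linear correlation
```

Different textbooks draw these boundaries in different places, so the labels near a cutoff should be read loosely.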

Now that we know how to interpret Pearson’s product-moment correlation coefficient, let’s look at how we might actually calculate it.

There are a few equivalent ways to write the formula for Pearson’s product-moment correlation coefficient and the one we’ll use is shown. Recalling that capital Σ, the Greek letter, represents the sum, we have the sum of the products 𝑥𝑦 minus the product of 𝑛, which is the number of data pairs, with 𝑥 bar, which is the mean of the 𝑥-values, and 𝑦 bar, which is the mean of the 𝑦-values, divided by the product of two square roots. These are the square roots of the sum of the 𝑥- and 𝑦-values squared minus 𝑛, the number of data pairs, times the mean squared for 𝑥 and 𝑦. This can be abbreviated to 𝑆 𝑥𝑦 over the square root of 𝑆 𝑥𝑥 times 𝑆 𝑦𝑦, where 𝑆 𝑥𝑦 is the covariance of 𝑥 and 𝑦, which is a measure of how 𝑥 and 𝑦 change together, and where 𝑆 𝑥𝑥 and 𝑆 𝑦𝑦 are the variation in 𝑥 and 𝑦, respectively. These are often referred to as the sums of squares.
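The on-screen formula is not part of the transcript, so the following is a reconstruction from the spoken description above, written in LaTeX. The layout is an assumption, but each term follows the wording of the paragraph.

```latex
\[
r_{xy}
  = \frac{\sum xy - n\,\bar{x}\,\bar{y}}
         {\sqrt{\sum x^{2} - n\,\bar{x}^{2}}\;\sqrt{\sum y^{2} - n\,\bar{y}^{2}}}
  = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}},
\]
\[
S_{xy} = \sum xy - n\,\bar{x}\,\bar{y}, \qquad
S_{xx} = \sum x^{2} - n\,\bar{x}^{2}, \qquad
S_{yy} = \sum y^{2} - n\,\bar{y}^{2}.
\]
```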

Let’s first look at an example of how to use the abbreviated form to calculate the correlation coefficient. And in our final example, we’ll use the full formula to calculate the correlation coefficient for a data set from scratch.

A data set has summary statistics 𝑆 𝑥𝑥 is equal to 36.875, 𝑆 𝑦𝑦 is 73.875, and 𝑆 𝑥𝑦 is 32.375. Calculate the product-moment correlation coefficient for this data set, giving your answer correct to three decimal places.

We are given the summary statistics for a data set, where 𝑆 𝑥𝑥 is 36.875; that’s the variation in 𝑥. 𝑆 𝑦𝑦 is 73.875; that’s the variation in 𝑦. And 𝑆 𝑥𝑦 is 32.375; that’s the covariance of 𝑥 and 𝑦. And using these summary statistics, we want to calculate the product-moment correlation coefficient. To do this, we can use the abbreviated form of the correlation coefficient formula. That is, the correlation coefficient 𝑟 𝑥𝑦 is 𝑆 𝑥𝑦 divided by the square root of 𝑆 𝑥𝑥 times 𝑆 𝑦𝑦.

Since we’re given the summary statistics, we simply need to substitute these into the formula. So that 𝑟 𝑥𝑦 is 32.375, that’s 𝑆 𝑥𝑦, divided by the square root of the product of 36.875, which is 𝑆 𝑥𝑥, and 73.875, which is 𝑆 𝑦𝑦. That is 32.375 over the square root of 2724.140625, which is 32.375 divided by 52.19330 to five decimal places. And that’s approximately 0.62029. And so to three decimal places, the product-moment correlation coefficient for this data set is 𝑟 𝑥𝑦 is equal to 0.620.
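The same substitution can be checked with a short Python sketch. The variable names are chosen for illustration; the three values are the summary statistics given in the question.

```python
import math

# Summary statistics given in the question.
s_xx = 36.875
s_yy = 73.875
s_xy = 32.375

# Abbreviated form: r = S_xy / sqrt(S_xx * S_yy).
r = s_xy / math.sqrt(s_xx * s_yy)

print(f"{r:.3f}")  # 0.620
```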

In our final example, we’ll calculate Pearson’s correlation coefficient from scratch.

The data table shows the high jump and long jump results achieved by 15 competitors in the women’s heptathlon in the 2016 Rio Olympics. Calculate to the nearest thousandth the value of the product-moment correlation coefficient between the long jump and high jump results. What does the correlation coefficient reveal about the relationship between the long jump and high jump results?

We’re given the table of values for two variables: long jump and high jump scores for 15 women athletes in the Rio Olympics. This is bivariate data, which means that two measurements are recorded for each individual athlete: how far that athlete jumped in the long jump and how high they jumped in the high jump. So, for example, athlete one jumped 5.51 meters in the long jump and 1.65 meters in the high jump. And there are two parts to this question. We’re asked to calculate the product-moment correlation coefficient and to determine what this coefficient reveals about the relationship between the long jump and high jump results.

For the first part of the question, we’ll use the formula for the product-moment correlation coefficient shown, recalling that the capital Greek letter Σ means the sum, 𝑥 bar is the mean of the 𝑥-values, and 𝑦 bar is the mean of the 𝑦-values. 𝑛 is the number of data points or, in our case, this is the number of athletes, and 𝑟 𝑥𝑦 is the coefficient. So let’s begin by calling the long jump in meters the variable 𝑥 and the high jump in meters the variable 𝑦.

To calculate our coefficient, we’re going to need the various expressions within the formula, for example, the product 𝑥𝑦. And so we add some rows to our table. We’ve also added a column to the end of our table where we can take the sums. So let’s first work out the product 𝑥𝑦 for each athlete. For our first athlete, the product 𝑥𝑦, which is long jump times high jump, is 5.51 times 1.65. And that’s equal to 9.0915, and so we put this in the first new cell of our table. And similarly, for our second athlete, we have the product 5.72 times 1.77, and that’s equal to 10.1244, which goes in our second cell. And if we continue on in this way for all our 15 athletes, we have the products shown, where we’ve limited ourselves to three decimal places for the sake of space.

In the second new row of our table, we want the values of 𝑥 squared; that’s the long jump value squared. So, for example, our first value squared is 5.51 squared, which is 30.3601. For our second athlete, the long jump score squared is 5.72 squared and that’s 32.7184, which to two decimal places just for our table is 32.72. And continuing in this way, we can fill in the 𝑥 squared values for all of our athletes. And we can take the squares of the high jump values for 𝑦 squared. So, for example, the high jump score for athlete one was 1.65 meters and 1.65 squared is 2.7225, which we’ve rounded to 2.723 just for the table.
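The extra table rows can be sketched in Python for the two athletes whose results are quoted above. The full table of 15 athletes is not reproduced in the transcript, so only these two pairs are used, and the variable names are illustrative.

```python
# (long jump x, high jump y) in meters for the first two athletes quoted above.
athletes = [(5.51, 1.65), (5.72, 1.77)]

# For each athlete, build the three extra entries: xy, x squared, y squared.
rows = [
    (round(x * y, 4), round(x * x, 4), round(y * y, 4))
    for x, y in athletes
]

for xy, x_sq, y_sq in rows:
    print(xy, x_sq, y_sq)
```

The first row gives 9.0915, 30.3601, and 2.7225, matching the values worked out by hand.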

So now let’s fill in our sums column, where the sum of the 𝑥’s, for example, is the long jump scores all summed. And that’s 91.43. The sum of all the high jump scores, that’s the 𝑦-values, is 27.21. The sum of the 𝑥𝑦’s is 166.1151 to four decimal places. The sum of the 𝑥 squareds is 558.4923. And the sum of the 𝑦 squareds is 49.4361. We have 15 athletes, so we know 𝑛 is 15. So if we first work out our means, we have 𝑥 bar is the sum of the 𝑥’s over 𝑛. That’s 91.43 over 15, and that’s 6.0953 to four decimal places. And similarly, we find the mean of the 𝑦’s is 1.814. And making some space, we can write down the sum of 𝑥𝑦. That’s 166.1151. And we can write down the sums of the 𝑥 squareds and the 𝑦 squareds. So we have everything we need for our formula.

Substituting all of these values into our formula, our numerator evaluates to 0.26108 and our denominator evaluates to 0.30378. Dividing then gives us 𝑟 approximately equal to 0.8594. So to three decimal places, the correlation coefficient 𝑟 is equal to 0.859. Since this value is close to positive one, we can say that there’s a very strong direct correlation or positive linear relationship between long jump and high jump results for the women athletes in the Rio Olympics.
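The substitution can be reproduced from the column sums alone. This is a minimal sketch with illustrative variable names; it assumes the sums quoted in the video are exact enough for three-decimal accuracy.

```python
import math

# Sums quoted in the video for the 15 athletes.
n = 15
sum_x, sum_y = 91.43, 27.21
sum_xy, sum_x2, sum_y2 = 166.1151, 558.4923, 49.4361

# Means of the long jump (x) and high jump (y) results.
x_bar = sum_x / n
y_bar = sum_y / n

# Numerator and the two sums of squares from the formula.
s_xy = sum_xy - n * x_bar * y_bar
s_xx = sum_x2 - n * x_bar ** 2
s_yy = sum_y2 - n * y_bar ** 2

numerator = s_xy
denominator = math.sqrt(s_xx * s_yy)
r = numerator / denominator

print(f"{numerator:.5f} {denominator:.5f} {r:.3f}")
```

The numerator and denominator come out at about 0.26108 and 0.30378, giving 𝑟 of 0.859 to three decimal places.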

Let’s now complete this video by reminding ourselves of some of the key points we’ve covered. We know that correlation does not mean causation. It simply indicates that a linear relationship exists between two variables, and it gives us some idea of the strength and direction of the relationship. We know that the product-moment correlation coefficient applies to bivariate numerical data. It takes values between negative one and positive one. And the closer 𝑟 is to negative or positive one, the stronger the correlation. And conversely, the closer 𝑟 is to zero, the weaker the correlation between the two variables. And if the correlation coefficient is zero, then there is no linear correlation at all.

If the coefficient is positive, we say there is a direct or positive linear correlation between the variables, whereas if the coefficient is negative, we say there is an inverse or negative linear correlation. And to actually calculate Pearson’s product-moment correlation coefficient for a bivariate data set, we use the formula shown.
