In this explainer, we will learn how to calculate and use Pearson’s correlation coefficient, $r$, to describe the strength and direction of a linear relationship.
You may recall learning about correlation, when two sets of data have a statistical relationship with each other. For linear correlation, we can determine how strongly two data sets are correlated with one another by how closely they follow a line of best fit, as seen below.
Strong correlation: all points are close to the line of best fit.
Weak correlation: all points follow the line of best fit, but some are further away than others.
No correlation: there is no clear line of best fit and the points are scattered.
However, this visual approach does not always make the exact strength of the correlation clear; therefore, another method of measuring the strength of correlation is needed.
For linear correlation, we can use Pearson’s correlation coefficient (also known as the Pearson product-moment correlation coefficient) to determine the strength of linear correlation between two sets of data. The coefficient, known as $r$, can take values in the interval $[-1, 1]$ and can tell us how strongly two variables are correlated depending on the value that $r$ takes.
How To: Determining the Strength of Linear Correlation Using Pearson’s Correlation Coefficient
In order to determine the strength of linear correlation between two sets of data, we can use Pearson’s correlation coefficient, $r$. More specifically,
- if two variables have a strong positive (direct) correlation, then $r$ is close to 1;
- if two variables have a weak positive (direct) correlation, then $r$ is positive, but closer to 0 than 1;
- if two variables have a strong negative (inverse) correlation, then $r$ is close to $-1$;
- if two variables have a weak negative (inverse) correlation, then $r$ is negative, but closer to 0 than $-1$;
- if there is no correlation, then $r$ is close to 0.
This can also be seen on the number line below.
We can compare the correlation coefficient with scatter diagrams to help us visualize different correlations, as seen in the diagrams below.
In the following example, we will demonstrate how to use the definition of Pearson’s correlation coefficient to determine the strength and direction of correlation.
Example 1: Determining the Type of Correlation between Two Variables Using the Value of the Correlation Coefficient
Which of the following is the most appropriate interpretation of a product-moment correlation coefficient of 0.8?
- A strong positive linear correlation
- No correlation
- A moderate positive linear correlation
- A strong negative linear correlation
- A moderate negative linear correlation
Answer
As the product-moment correlation coefficient tells us the strength of a linear association between two variables, we can use the number line below to help us determine the interpretation of a coefficient of 0.8.
As 0.8 is a positive number, we know the data sets are positively correlated. As 0.8 is relatively close to 1, we know it is strongly correlated. Therefore, there is a strong positive linear correlation.
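The reasoning in this answer can be sketched as a small Python function. This is only an illustration: the function name `describe_correlation` and the cutoffs 0.1, 0.4, and 0.7 are assumptions made for the sketch, since there is no single agreed boundary between "weak", "moderate", and "strong".

```python
def describe_correlation(r: float) -> str:
    """Give a rough verbal description of a correlation coefficient r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie in the interval [-1, 1]")
    size = abs(r)
    # Assumed cutoffs (hypothetical; conventions differ between sources).
    if size < 0.1:
        return "no correlation"
    if size < 0.4:
        strength = "weak"
    elif size < 0.7:
        strength = "moderate"
    else:
        strength = "strong"
    # The sign of r gives the direction of the relationship.
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} linear correlation"


print(describe_correlation(0.8))  # strong positive linear correlation
```

The sign of the coefficient decides the direction and its distance from zero decides the strength, which mirrors the two steps used in the answer.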
We can also use the definition of Pearson’s correlation coefficient to help match the description of the correlation of two data sets with the most appropriate correlation coefficient.
Example 2: Determining the Most Appropriate Correlation Coefficient given a Description of the Correlation
Which of the following correlation coefficients indicates the weakest inverse correlation?
Answer
We know that an inverse correlation has a negative product-moment correlation coefficient, which is the case for all the options in the example. We also know that the weaker the correlation, the closer the value to zero. Using a number line can help us determine which value is most appropriate.
Therefore, since is closest to zero, it indicates the weakest inverse correlation.
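The same comparison can be sketched in Python. The candidate values below are hypothetical and chosen only for illustration; they are not the options from the example.

```python
# Hypothetical candidate coefficients, chosen only for illustration.
candidates = [-0.9, -0.75, -0.3, -0.6, -0.45]

# An inverse correlation has r < 0; the weakest one is the value closest to zero.
inverse = [r for r in candidates if r < 0]
weakest_inverse = min(inverse, key=abs)

print(weakest_inverse)  # -0.3
```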
In the next example, we will use the definition of Pearson’s correlation coefficient to match the most appropriate correlation coefficient to a scatter diagram of a set of data.
Example 3: Determining the Most Appropriate Correlation Coefficient given a Scatter Diagram
What is the most likely value of the product-moment correlation coefficient for the data shown in the diagram?
- 0
- 0.78
- 0.37
Answer
The product-moment correlation coefficient tells us how well a set of points fit a line of best fit. Therefore, by adding a line of best fit, we can more easily determine the strength and direction of correlation.
An estimate of the line of best fit has been applied to the data. As the line of best fit has a negative slope, we know the data have an inverse correlation, so the correlation coefficient lies between $-1$ and 0. Since the majority of the data points lie on, or close to, the line of best fit, there must be a strong negative correlation. We might, for instance, estimate that the interval in which the correlation coefficient lies is . The option of would represent a moderate negative correlation, and therefore, our answer is .
We have seen that we can estimate and interpret the value of Pearson’s correlation coefficient and its relationship with a line of best fit. Next, we will determine how Pearson’s correlation coefficient is calculated.
Definition: Pearson’s Correlation Coefficient
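The correlation coefficient $r$ measures the strength of correlation between two variables $x$ and $y$ and is calculated by using the formula
$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left(n\sum x^2 - \left(\sum x\right)^2\right)\left(n\sum y^2 - \left(\sum y\right)^2\right)}},$$
where $n$ is the number of paired values of $x$ and $y$.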
Pearson’s correlation coefficient is used for continuous bivariate data in order to determine the strength and direction of linear correlation between the two data sets.
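As a minimal sketch, the sum-based formula $r = \frac{n\sum xy - \sum x \sum y}{\sqrt{(n\sum x^2 - (\sum x)^2)(n\sum y^2 - (\sum y)^2)}}$ can be written in Python as follows. The function name `pearson_r` and the small data set are assumptions made for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's correlation coefficient from paired values, using sums."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired values")
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    if denominator == 0:
        raise ValueError("r is undefined when either variable is constant")
    return numerator / denominator


# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(round(pearson_r(xs, ys), 3))  # 0.775
```

The guard on the denominator reflects that the coefficient cannot be computed when all the values of one variable are identical.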
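If you read more widely around the topic of Pearson’s correlation coefficient, you will come across another form of the formula, which is
$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}},$$
where $S_{xx}$, $S_{yy}$, and $S_{xy}$ are defined as
$$S_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n}, \quad S_{yy} = \sum y^2 - \frac{\left(\sum y\right)^2}{n}, \quad S_{xy} = \sum xy - \frac{\sum x \sum y}{n}.$$
In the next example, we will calculate the product-moment correlation coefficient when summary statistics, such as the values of the sum of each of $x$, $y$, $x^2$, $y^2$, and $xy$, as well as the value of $n$, are given.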
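The alternative form can also be sketched in Python. Here $S_{xx} = \sum x^2 - (\sum x)^2/n$, $S_{yy} = \sum y^2 - (\sum y)^2/n$, and $S_{xy} = \sum xy - \sum x \sum y / n$, with $r = S_{xy}/\sqrt{S_{xx} S_{yy}}$. The helper name `s_statistics` and the data values are assumptions made for illustration.

```python
from math import sqrt


def s_statistics(xs, ys):
    """Return (S_xx, S_yy, S_xy) for paired values xs and ys."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    return s_xx, s_yy, s_xy


# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

s_xx, s_yy, s_xy = s_statistics(xs, ys)
r = s_xy / sqrt(s_xx * s_yy)
print(s_xx, s_yy, s_xy)  # 10.0 6.0 6.0
print(round(r, 3))       # 0.775
```

Multiplying the numerator and denominator of this form by $n$ gives the sum-based formula, so both forms produce the same value of $r$.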
Example 4: Calculating the Product-Moment Correlation Coefficient given Summary Statistics
A data set can be summarized by the following:
Calculate the product-moment correlation coefficient for this data set, giving your answer correct to three decimal places.
Answer
Approach 1:
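In order to calculate the product-moment correlation coefficient, we use the formula
$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left(n\sum x^2 - \left(\sum x\right)^2\right)\left(n\sum y^2 - \left(\sum y\right)^2\right)}},$$
where $n$ is the number of paired values of $x$ and $y$.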
Substituting , , , , , and , we get
Since the value of $r$ lies in the interval $[-1, 1]$, the answer is appropriate:
Approach 2:
First, we need to calculate the summary statistics $S_{xx}$, $S_{yy}$, and $S_{xy}$:
It is helpful when calculating $S_{xx}$ and $S_{yy}$ to check that they are positive, as these two summary statistics should not be negative. $S_{xy}$, however, can be negative, and its sign determines whether the data sets are positively or negatively correlated.
Next, we can calculate the correlation coefficient using the formula
$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}.$$
Since the value of $r$ lies in the interval $[-1, 1]$, the answer is appropriate:
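Both approaches can be checked against each other with a short Python sketch. The summary statistics below are hypothetical values made up for illustration; they are not the values from this example.

```python
from math import sqrt

# Hypothetical summary statistics, made up for illustration only.
n = 5
sum_x, sum_y = 15, 20
sum_x2, sum_y2, sum_xy = 55, 86, 66

# Approach 1: substitute the sums directly into the formula for r.
r1 = (n * sum_xy - sum_x * sum_y) / sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

# Approach 2: compute S_xx, S_yy, and S_xy first.
s_xx = sum_x2 - sum_x ** 2 / n
s_yy = sum_y2 - sum_y ** 2 / n
s_xy = sum_xy - sum_x * sum_y / n
if s_xx <= 0 or s_yy <= 0:
    raise ValueError("S_xx and S_yy should be positive; check the sums")
r2 = s_xy / sqrt(s_xx * s_yy)

print(round(r1, 3), round(r2, 3))  # 0.775 0.775
```

Agreement between the two results, together with a value inside $[-1, 1]$, is a useful check on the arithmetic.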
In the following example, we will calculate the correlation coefficient from a set of continuous bivariate data and use this to determine the strength and direction of correlation.
Example 5: Calculating Pearson’s Correlation Coefficient from a Set of Continuous Bivariate Data and Using This to Determine the Strength and Direction of Correlation
The data table shows the high jump and long jump results achieved by 15 competitors in the women’s heptathlon in the 2016 Rio Olympics.
Long Jump (m) | 5.51 | 5.72 | 5.81 | 5.88 | 5.91 | 6.05 | 6.08 | 6.10 | 6.16 | 6.19 | 6.31 | 6.31 | 6.34 | 6.48 | 6.58 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
High Jump (m) | 1.65 | 1.77 | 1.83 | 1.77 | 1.77 | 1.77 | 1.8 | 1.77 | 1.8 | 1.86 | 1.86 | 1.83 | 1.89 | 1.86 | 1.98 |
- Calculate, to the nearest thousandth, the value of the product-moment correlation coefficient between the long jump and high jump results.
- What does this correlation coefficient reveal about the relationship between the long jump and high jump results?
- There is a moderate negative linear correlation between the long jump and high jump results.
- There is a strong positive linear correlation between the long jump and high jump results.
- There is a moderate positive linear correlation between the long jump and high jump results.
- There is a strong negative linear correlation between the long jump and high jump results.
- There is no real correlation between the long jump and high jump results.
Answer
Part 1
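In order to calculate the product-moment correlation coefficient, we calculate $\sum x$, $\sum y$, $\sum x^2$, $\sum y^2$, $\sum xy$, and $n$ and then substitute into the formula
$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left(n\sum x^2 - \left(\sum x\right)^2\right)\left(n\sum y^2 - \left(\sum y\right)^2\right)}},$$
where $n$ is the number of paired values of $x$ and $y$.
It is helpful to add extra rows/columns in order to calculate $x^2$, $y^2$, and $xy$ and then calculate the sum of each of these.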
Long Jump, $x$ | High Jump, $y$ | $x^2$ | $y^2$ | $xy$ |
---|---|---|---|---|
5.51 | 1.65 | 30.3601 | 2.7225 | 9.0915 |
5.72 | 1.77 | 32.7184 | 3.1329 | 10.1244 |
5.81 | 1.83 | 33.7561 | 3.3489 | 10.6323 |
5.88 | 1.77 | 34.5744 | 3.1329 | 10.4076 |
5.91 | 1.77 | 34.9281 | 3.1329 | 10.4607 |
6.05 | 1.77 | 36.6025 | 3.1329 | 10.7085 |
6.08 | 1.8 | 36.9664 | 3.24 | 10.944 |
6.1 | 1.77 | 37.21 | 3.1329 | 10.797 |
6.16 | 1.8 | 37.9456 | 3.24 | 11.088 |
6.19 | 1.86 | 38.3161 | 3.4596 | 11.5134 |
6.31 | 1.86 | 39.8161 | 3.4596 | 11.7366 |
6.31 | 1.83 | 39.8161 | 3.3489 | 11.5473 |
6.34 | 1.89 | 40.1956 | 3.5721 | 11.9826 |
6.48 | 1.86 | 41.9904 | 3.4596 | 12.0528 |
6.58 | 1.98 | 43.2964 | 3.9204 | 13.0284 |
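Using $n = 15$, the number of data points, and the sums of $x$, $y$, $x^2$, $y^2$, and $xy$ from the table above, namely $\sum x = 91.43$, $\sum y = 27.21$, $\sum x^2 = 558.4923$, $\sum y^2 = 49.4361$, and $\sum xy = 166.1151$, we can now substitute into the formula in order to calculate $r$:
$$r = \frac{15(166.1151) - (91.43)(27.21)}{\sqrt{\left(15(558.4923) - 91.43^2\right)\left(15(49.4361) - 27.21^2\right)}} \approx 0.859.$$
Since the value of $r$ lies in the interval $[-1, 1]$, the answer 0.859 is appropriate.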
Part 2
Having calculated the value of $r$, we can then use this to determine the strength and direction of correlation. As $r$ is positive, the data sets are positively correlated. As $r$ is relatively close to 1, the data sets are strongly correlated. Therefore, there is a strong positive linear correlation between the long jump and high jump results. This is answer B.
In our last example, we will calculate the correlation coefficient where $S_{xx}$, $S_{yy}$, and $S_{xy}$ are given, using the other form of the formula for Pearson’s correlation coefficient discussed.
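The calculation in this example can be reproduced with a short Python sketch that uses the long jump and high jump values from the data table. The variable names are assumptions made for illustration.

```python
from math import sqrt

# Long jump (x) and high jump (y) results in metres, from the data table.
long_jump = [5.51, 5.72, 5.81, 5.88, 5.91, 6.05, 6.08, 6.10,
             6.16, 6.19, 6.31, 6.31, 6.34, 6.48, 6.58]
high_jump = [1.65, 1.77, 1.83, 1.77, 1.77, 1.77, 1.80, 1.77,
             1.80, 1.86, 1.86, 1.83, 1.89, 1.86, 1.98]

n = len(long_jump)
sum_x, sum_y = sum(long_jump), sum(high_jump)
sum_x2 = sum(x * x for x in long_jump)
sum_y2 = sum(y * y for y in high_jump)
sum_xy = sum(x * y for x, y in zip(long_jump, high_jump))

# Substitute the sums into the formula for r.
r = (n * sum_xy - sum_x * sum_y) / sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

print(n, round(r, 3))  # 15 0.859
```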
Example 6: Calculating Pearson’s Correlation Coefficient given $S_{xx}$, $S_{yy}$, and $S_{xy}$
A data set has summary statistics , , and . Calculate the product-moment correlation coefficient for this data set, giving your answer correct to three decimal places.
Answer
The formula for Pearson’s correlation coefficient states that $r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$.
Therefore, substituting , , and gives
As we know $r$ lies in the interval $[-1, 1]$, we can deduce that $r = 0.620$ is an appropriate value.
In this explainer, we have learned how to calculate Pearson’s correlation coefficient and how to interpret its meaning. Let’s recap the key points.
Key Points
- Pearson’s correlation coefficient, $r$, takes values in the interval $[-1, 1]$ and tells us how strongly two continuous variables are linearly correlated:
- If $r$ is close to 1, then they are strongly and directly correlated.
- If $r$ is positive, but closer to 0 than 1, then they are weakly and directly correlated.
- If $r$ is close to $-1$, then they are strongly and inversely correlated.
- If $r$ is negative, but closer to 0 than $-1$, then they are weakly and inversely correlated.
- If $r$ is close to 0, then there is no correlation.
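- Pearson’s correlation coefficient is calculated by using the formula $r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left(n\sum x^2 - \left(\sum x\right)^2\right)\left(n\sum y^2 - \left(\sum y\right)^2\right)}}$, where $x$ represents the values of one variable, $y$ represents the values of the other variable, and $n$ represents the number of data points.
- We can also use the alternative form of Pearson’s correlation coefficient, using the formula $r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$, where $S_{xx}$, $S_{yy}$, and $S_{xy}$ are the summary statistics defined as $S_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n}$, $S_{yy} = \sum y^2 - \frac{\left(\sum y\right)^2}{n}$, and $S_{xy} = \sum xy - \frac{\sum x \sum y}{n}$. Here, $x$ represents the values of one variable, $y$ represents the values of the other variable, and $n$ represents the number of data points.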
- We can use Pearson’s correlation coefficient for continuous bivariate data when either a set of data or summary statistics are given.