Lesson Explainer: Pearson’s Correlation Coefficient

In this explainer, we will learn how to calculate and use Pearson’s correlation coefficient, $r$, to describe the strength and direction of a linear relationship.

You may recall learning about correlation, when two sets of data have a statistical relationship with each other. For linear correlation, we can determine how strongly two data sets are correlated with one another by how closely they follow a line of best fit, as seen below.

Strong correlation: all points are close to the line of best fit.

Weak correlation: the points roughly follow the line of best fit, but many of them lie some distance away from it.

No correlation: there is no clear line of best fit and the points are scattered.

However, judging by eye does not always make the exact strength of the correlation clear; therefore, a more precise way of measuring the strength of correlation is needed.

For linear correlation, we can use Pearson’s correlation coefficient (also known as the Pearson product-moment correlation coefficient) to determine the strength of linear correlation between two sets of data. The coefficient, denoted $r$, takes values in the interval $[-1, 1]$, and its value tells us how strongly, and in which direction, the two variables are correlated.

How To: Determining the Strength of Linear Correlation Using Pearson’s Correlation Coefficient

In order to determine the strength of linear correlation between two sets of data, we can use Pearson’s correlation coefficient, $r$. More specifically,

  • if two variables have a strong positive (direct) correlation, then $r$ is close to 1;
  • if two variables have a weak positive (direct) correlation, then $r$ is positive, but closer to 0 than 1;
  • if two variables have a strong negative (inverse) correlation, then $r$ is close to $-1$;
  • if two variables have a weak negative (inverse) correlation, then $r$ is negative, but closer to 0 than $-1$;
  • if there is no correlation, then $r$ is close to 0.

This can also be seen on the number line below.

We can compare the correlation coefficient with scatter diagrams to help us visualize different correlations, as seen in the diagrams below.

In the following example, we will demonstrate how to use the definition of Pearson’s correlation coefficient to determine the strength and direction of correlation.

Example 1: Determining the Type of Correlation between Two Variables Using the Value of the Correlation Coefficient

Which of the following is the most appropriate interpretation of a product-moment correlation coefficient of 0.8?

  1. A strong positive linear correlation
  2. No correlation
  3. A moderate positive linear correlation
  4. A strong negative linear correlation
  5. A moderate negative linear correlation

Answer

As the product-moment correlation coefficient tells us the strength of a linear association between two variables, we can use the number line below to help us determine the interpretation of a coefficient of 0.8.

As 0.8 is a positive number, we know the data sets are positively correlated. As 0.8 is relatively close to 1, we know it is strongly correlated. Therefore, there is a strong positive linear correlation.

We can also use the definition of Pearson’s correlation coefficient to help match the description of the correlation of two data sets with the most appropriate correlation coefficient.

Example 2: Determining the Most Appropriate Correlation Coefficient given a Description of the Correlation

Which of the following correlation coefficients indicates the weakest inverse correlation?

  1. $-0.48$
  2. $-0.22$
  3. $-0.75$
  4. $-0.83$

Answer

We know that an inverse correlation has a negative product-moment correlation coefficient, which is the case for all the options in the example. We also know that the weaker the correlation, the closer the value to zero. Using a number line can help us determine which value is most appropriate.

Therefore, since $-0.22$ is closest to zero, it indicates the weakest inverse correlation.
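
The comparison above can also be sketched in a few lines of Python. This is only an illustration: the helper name `weakest_inverse` is hypothetical, and it assumes "weakest inverse correlation" means the negative coefficient with the smallest absolute value.

```python
# Candidate coefficients from Example 2 (all negative, so all inverse).
candidates = [-0.48, -0.22, -0.75, -0.83]


def weakest_inverse(values):
    """Return the negative coefficient closest to zero (hypothetical helper)."""
    negatives = [v for v in values if v < 0]  # keep inverse correlations only
    return min(negatives, key=abs)            # smallest magnitude = weakest


print(weakest_inverse(candidates))  # -0.22
```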

In the next example, we will use the definition of Pearson’s correlation coefficient to match the most appropriate correlation coefficient to a scatter diagram of a set of data.

Example 3: Determining the Most Appropriate Correlation Coefficient given a Scatter Diagram

What is the most likely value of the product-moment correlation coefficient for the data shown in the diagram?

  1. $-0.58$
  2. 0
  3. $-0.94$
  4. 0.78
  5. 0.37

Answer

The product-moment correlation coefficient tells us how well a set of points fits a line of best fit. Therefore, by adding a line of best fit, we can more easily determine the strength and direction of correlation.

An estimate of the line of best fit has been added to the data. As the line of best fit has a negative slope, we know the data has an inverse correlation, so the correlation coefficient lies between $-1$ and 0. Since the majority of the data points lie on, or close to, the line of best fit, there must be a strong negative correlation. We might, for instance, estimate that the correlation coefficient lies in the interval $[-1, -0.8]$. The option of $-0.58$ would represent a moderate negative correlation, and therefore, our answer is $-0.94$.

Note that, in this example, we only had two negative answers to choose from. As a correlation coefficient of $-0.58$ indicates only a moderate inverse correlation, we can eliminate this answer, leaving us with $-0.94$.

We have seen that we can estimate and interpret the value of Pearson’s correlation coefficient and its relationship with a line of best fit. Next, we will determine how Pearson’s correlation coefficient is calculated.

Definition: Pearson’s Correlation Coefficient

The correlation coefficient $r$ determines the strength of correlation between two variables $x$ and $y$ and is calculated by using the formula
$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}},$$
where $n$ is the number of paired values of $x$ and $y$.

Pearson’s correlation coefficient is used for continuous bivariate data in order to determine the strength and direction of linear correlation between the two data sets.

If you read more widely around the topic of Pearson’s correlation coefficient, you will come across another form of the formula, which is
$$r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}},$$
where $S_{xx}$, $S_{yy}$, and $S_{xy}$ are defined as
$$S_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n}, \qquad S_{yy} = \sum y^2 - \frac{\left(\sum y\right)^2}{n}, \qquad S_{xy} = \sum xy - \frac{\sum x \sum y}{n}.$$
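
As a quick check that the two forms agree (a sketch using only the definitions above), multiplying each summary statistic by $n$ recovers the terms of the first formula, and the factors of $n$ then cancel:

```latex
\begin{align*}
nS_{xy} &= n\sum xy - \sum x \sum y, \\
nS_{xx} &= n\sum x^2 - \left(\sum x\right)^2, \\
nS_{yy} &= n\sum y^2 - \left(\sum y\right)^2, \\
r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}}
  &= \frac{nS_{xy}}{\sqrt{nS_{xx}}\,\sqrt{nS_{yy}}}
   = \frac{n\sum xy - \sum x \sum y}
          {\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,
           \sqrt{n\sum y^2 - \left(\sum y\right)^2}}.
\end{align*}
```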

In the next example, we will calculate the product-moment correlation coefficient when summary statistics, such as the values of the sum of each of $x$, $y$, $xy$, $x^2$, and $y^2$, as well as the value of $n$, are given.

Example 4: Calculating the Product-Moment Correlation Coefficient given Summary Statistics

A data set can be summarized by the following:
$$n = 8, \quad \sum x = 78, \quad \sum y = -73, \quad \sum xy = -752, \quad \sum x^2 = 792, \quad \text{and} \quad \sum y^2 = 735.$$

Calculate the product-moment correlation coefficient for this data set, giving your answer correct to three decimal places.

Answer

Approach 1:

In order to calculate the product-moment correlation coefficient, we use the formula
$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}},$$
where $n$ is the number of paired values of $x$ and $y$.

Substituting $n = 8$, $\sum x = 78$, $\sum y = -73$, $\sum xy = -752$, $\sum x^2 = 792$, and $\sum y^2 = 735$, we get
$$r = \frac{8 \times (-752) - 78 \times (-73)}{\sqrt{8 \times 792 - (78)^2}\,\sqrt{8 \times 735 - (-73)^2}} = \frac{-322}{\sqrt{252}\,\sqrt{551}} = -0.864 \quad \text{(correct to 3 decimal places)}.$$

Since the value of $r$ lies in the interval $[-1, 1]$, the answer $-0.864$ is appropriate: $r = -0.864$.
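
The substitution in Approach 1 can be reproduced with a short Python sketch. The function name `pearson_from_sums` and its parameter names are hypothetical, chosen only to mirror the symbols in the formula; this is an illustration of the arithmetic rather than a library routine.

```python
import math


def pearson_from_sums(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Pearson's r from the six summary values (hypothetical helper)."""
    numerator = n * sum_xy - sum_x * sum_y
    # Each square root corresponds to one factor in the denominator.
    root_x = math.sqrt(n * sum_x2 - sum_x ** 2)
    root_y = math.sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / (root_x * root_y)


# Summary statistics from Example 4.
r = pearson_from_sums(n=8, sum_x=78, sum_y=-73, sum_xy=-752,
                      sum_x2=792, sum_y2=735)
print(round(r, 3))  # -0.864
```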

Approach 2:

First, we need to calculate the summary statistics $S_{xx}$, $S_{yy}$, and $S_{xy}$:
$$S_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n} = 792 - \frac{(78)^2}{8} = 31.5,$$
$$S_{yy} = \sum y^2 - \frac{\left(\sum y\right)^2}{n} = 735 - \frac{(-73)^2}{8} = 68.875,$$
$$S_{xy} = \sum xy - \frac{\sum x \sum y}{n} = -752 - \frac{(78)(-73)}{8} = -40.25.$$

It is helpful when calculating $S_{xx}$ and $S_{yy}$ to check that they are positive, as these two summary statistics should not be negative. $S_{xy}$, however, can be negative and will determine whether the data sets are positively or negatively correlated.

Next, we can calculate the correlation coefficient using the formula
$$r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}} = \frac{-40.25}{\sqrt{31.5 \times 68.875}} = -0.864 \quad \text{(correct to three decimal places)}.$$

Since the value of $r$ lies in the interval $[-1, 1]$, the answer $-0.864$ is appropriate: $r = -0.864$.
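
Approach 2 can be sketched in the same way. The helper `summary_statistics` below is a hypothetical name; it simply evaluates the three definitions of $S_{xx}$, $S_{yy}$, and $S_{xy}$ before forming the ratio.

```python
import math


def summary_statistics(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Return (S_xx, S_yy, S_xy) from the raw sums (hypothetical helper)."""
    s_xx = sum_x2 - sum_x ** 2 / n
    s_yy = sum_y2 - sum_y ** 2 / n
    s_xy = sum_xy - sum_x * sum_y / n
    return s_xx, s_yy, s_xy


# Summary statistics from Example 4.
s_xx, s_yy, s_xy = summary_statistics(8, 78, -73, -752, 792, 735)
print(s_xx, s_yy, s_xy)  # 31.5 68.875 -40.25

# S_xx and S_yy should be positive; the sign of S_xy gives the direction.
r = s_xy / math.sqrt(s_xx * s_yy)
print(round(r, 3))  # -0.864
```

Both approaches give the same value, as expected, since the two formulas differ only by a common factor of $n$ in the numerator and denominator.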

In the following example, we will calculate the correlation coefficient from a set of continuous bivariate data and use this to determine the strength and direction of correlation.

Example 5: Calculating Pearson’s Correlation Coefficient from a Set of Continuous Bivariate Data and Using This to Determine the Strength and Direction of Correlation

The data table shows the high jump and long jump results achieved by 15 competitors in the women’s heptathlon in the 2016 Rio Olympics.

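| Long Jump (m) | 5.51 | 5.72 | 5.81 | 5.88 | 5.91 | 6.05 | 6.08 | 6.10 | 6.16 | 6.19 | 6.31 | 6.31 | 6.34 | 6.48 | 6.58 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| High Jump (m) | 1.65 | 1.77 | 1.83 | 1.77 | 1.77 | 1.77 | 1.8 | 1.77 | 1.8 | 1.86 | 1.86 | 1.83 | 1.89 | 1.86 | 1.98 |

  1. Calculate, to the nearest thousandth, the value of the product-moment correlation coefficient between the long jump and high jump results.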
  2. What does this correlation coefficient reveal about the relationship between the long jump and high jump results?
    1. There is a moderate negative linear correlation between the long jump and high jump results.
    2. There is a strong positive linear correlation between the long jump and high jump results.
    3. There is a moderate positive linear correlation between the long jump and high jump results.
    4. There is a strong negative linear correlation between the long jump and high jump results.
    5. There is no real correlation between the long jump and high jump results.

Answer

Part 1

In order to calculate the product-moment correlation coefficient, we calculate $\sum x$, $\sum y$, $\sum x^2$, $\sum y^2$, $\sum xy$, and $n$ and then substitute into the formula
$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}},$$
where $n$ is the number of paired values of $x$ and $y$.

It is helpful to add extra rows/columns in order to calculate $x^2$, $y^2$, and $xy$ and then calculate the sum of each of these.

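| Long Jump, $x$ | High Jump, $y$ | $x^2$ | $y^2$ | $xy$ |
|---|---|---|---|---|
| 5.51 | 1.65 | 30.3601 | 2.7225 | 9.0915 |
| 5.72 | 1.77 | 32.7184 | 3.1329 | 10.1244 |
| 5.81 | 1.83 | 33.7561 | 3.3489 | 10.6323 |
| 5.88 | 1.77 | 34.5744 | 3.1329 | 10.4076 |
| 5.91 | 1.77 | 34.9281 | 3.1329 | 10.4607 |
| 6.05 | 1.77 | 36.6025 | 3.1329 | 10.7085 |
| 6.08 | 1.8 | 36.9664 | 3.24 | 10.944 |
| 6.1 | 1.77 | 37.21 | 3.1329 | 10.797 |
| 6.16 | 1.8 | 37.9456 | 3.24 | 11.088 |
| 6.19 | 1.86 | 38.3161 | 3.4596 | 11.5134 |
| 6.31 | 1.86 | 39.8161 | 3.4596 | 11.7366 |
| 6.31 | 1.83 | 39.8161 | 3.3489 | 11.5473 |
| 6.34 | 1.89 | 40.1956 | 3.5721 | 11.9826 |
| 6.48 | 1.86 | 41.9904 | 3.4596 | 12.0528 |
| 6.58 | 1.98 | 43.2964 | 3.9204 | 13.0284 |
| $\sum x = 91.43$ | $\sum y = 27.21$ | $\sum x^2 = 558.4923$ | $\sum y^2 = 49.4361$ | $\sum xy = 166.1151$ |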

Using $n = 15$, the number of data points, and the sum of $x$, $y$, $x^2$, $y^2$, and $xy$ from the table above, we can now substitute into the formula in order to calculate $r$:
$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}}$$
$$r = \frac{15 \times 166.1151 - 91.43 \times 27.21}{\sqrt{15 \times 558.4923 - (91.43)^2}\,\sqrt{15 \times 49.4361 - (27.21)^2}} = \frac{3.9162}{\sqrt{17.9396}\,\sqrt{1.1574}} = 0.859 \quad \text{(to three decimal places)}.$$

Since the value of $r$ lies in the interval $[-1, 1]$, the answer 0.859 is appropriate: $r = 0.859$.
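
The whole calculation for Part 1, from the raw table to $r$, can be sketched in Python as follows. The function name `pearson_r` is hypothetical; it builds the five sums and then applies the same formula used in this example.

```python
import math

# Heptathlon results from the table in Example 5 (metres).
long_jump = [5.51, 5.72, 5.81, 5.88, 5.91, 6.05, 6.08, 6.10,
             6.16, 6.19, 6.31, 6.31, 6.34, 6.48, 6.58]
high_jump = [1.65, 1.77, 1.83, 1.77, 1.77, 1.77, 1.80, 1.77,
             1.80, 1.86, 1.86, 1.83, 1.89, 1.86, 1.98]


def pearson_r(xs, ys):
    """Pearson's r for paired data via the sums formula (hypothetical helper)."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must contain the same number of values")
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = pearson_r(long_jump, high_jump)
print(round(r, 3))  # 0.859
```

On Python 3.10 or later, the standard library function `statistics.correlation` should return the same value and can serve as an independent check.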

Part 2

Having calculated the value of $r$, we can then use this to determine the strength and direction of correlation. As $r$ is positive, the data sets are positively correlated. As $r$ is relatively close to 1, the data sets are strongly correlated. Therefore, there is a strong positive linear correlation between the long jump and high jump results. This is option 2.

In our last example, we will calculate the correlation coefficient where $S_{xx}$, $S_{yy}$, and $S_{xy}$ are given, using the other form of the formula for Pearson’s correlation coefficient discussed.

Example 6: Calculating Pearson’s Correlation Coefficient given $S_{xx}$, $S_{yy}$, and $S_{xy}$

A data set has summary statistics $S_{xx} = 36.875$, $S_{yy} = 73.875$, and $S_{xy} = 32.375$. Calculate the product-moment correlation coefficient for this data set, giving your answer correct to three decimal places.

Answer

The formula for Pearson’s correlation coefficient states that
$$r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}}.$$

Therefore, substituting $S_{xx} = 36.875$, $S_{yy} = 73.875$, and $S_{xy} = 32.375$ gives
$$r = \frac{32.375}{\sqrt{36.875 \times 73.875}}$$
$$r = 0.620 \quad \text{(correct to three decimal places)}.$$

As we know $r$ lies in the interval $[-1, 1]$, we can deduce 0.620 to be an appropriate value: $r = 0.620$.
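
A small Python sketch of this calculation is given below. The function name `pearson_from_s` is hypothetical, and the input check reflects the earlier observation that $S_{xx}$ and $S_{yy}$ should not be negative (a value of zero is also rejected here, since it would make the denominator zero).

```python
import math


def pearson_from_s(s_xx, s_yy, s_xy):
    """Pearson's r from S_xx, S_yy and S_xy (hypothetical helper)."""
    if s_xx <= 0 or s_yy <= 0:
        raise ValueError("S_xx and S_yy must both be positive")
    return s_xy / math.sqrt(s_xx * s_yy)


# Summary statistics from Example 6.
r = pearson_from_s(s_xx=36.875, s_yy=73.875, s_xy=32.375)
print(f"{r:.3f}")  # 0.620
```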

In this explainer, we have learned how to calculate Pearson’s correlation coefficient and how to interpret its meaning. Let’s recap the key points.

Key Points

  • Pearson’s correlation coefficient, $r$, tells us how strongly two continuous variables are linearly correlated:
    • If $r$ lies in the interval $]0.5, 1]$, then they are strongly and directly correlated.
    • If $r$ lies in the interval $]0.1, 0.5]$, then they are weakly and directly correlated.
    • If $r$ lies in the interval $[-1, -0.5[$, then they are strongly and inversely correlated.
    • If $r$ lies in the interval $[-0.5, -0.1[$, then they are weakly and inversely correlated.
    • If $r$ lies in the interval $[-0.1, 0.1]$, then there is no correlation.
  • Pearson’s correlation coefficient is calculated by using the formula $$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}},$$ where $x$ represents the values of one variable, $y$ represents the values of the other variable, and $n$ represents the number of data points.
  • We can also use the alternative form of Pearson’s correlation coefficient, using the formula $$r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}},$$ where $S_{xx}$, $S_{yy}$, and $S_{xy}$ are the summary statistics defined as $$S_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n}, \qquad S_{yy} = \sum y^2 - \frac{\left(\sum y\right)^2}{n}, \qquad S_{xy} = \sum xy - \frac{\sum x \sum y}{n};$$ and $x$ represents the values of one variable; $y$ represents the values of the other variable; and $n$ represents the number of data points.
  • We can use Pearson’s correlation coefficient for continuous bivariate data when either a set of data or summary statistics are given.
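
The interval rules in the key points can be expressed as a short Python sketch. The function name `describe_correlation` is hypothetical, and the cut-offs 0.1 and 0.5 are taken directly from the intervals listed above; other sources may draw the boundaries differently.

```python
def describe_correlation(r):
    """Label r using the key-point intervals (hypothetical helper)."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie in the interval [-1, 1]")
    if abs(r) <= 0.1:
        return "no correlation"              # [-0.1, 0.1]
    strength = "strong" if abs(r) > 0.5 else "weak"
    direction = "direct" if r > 0 else "inverse"
    return f"{strength} {direction} correlation"


print(describe_correlation(0.859))   # strong direct correlation
print(describe_correlation(-0.22))   # weak inverse correlation
print(describe_correlation(0.05))    # no correlation
```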
