Video Transcript
In this video, we’ll learn how to
calculate and use Pearson’s correlation coefficient 𝑟 to describe the strength and
direction of a linear relationship. We’ll begin by reminding ourselves
of some of the terms and ideas related to correlation, which we’ll explore with some
examples. And then, we’ll calculate Pearson’s
product-moment correlation coefficient by hand using the formula.
Bivariate data is data where two
numerical or quantitative variables are uniquely paired across subjects in an
experiment. Suppose, for example, that we have
𝑛 people in a sample and we measure their heights and their weights. For each person or subject, we have
a unique pair of measurements. If we call 𝑋 height in meters and
𝑌 weight in kilograms, then the pair of measurements for each subject or person
gives us a data point in our bivariate data set. Now, suppose we’d like to know if
there’s a relationship or correlation between a person’s height and weight. To give us some idea, we first plot
our data on a scatter plot. And if we find that our data follow
a linear pattern, then we can say there’s a linear correlation between 𝑋 and 𝑌 or,
in this case, height and weight.
It’s important to remember though
when looking at correlation that we’re not saying that a change in one variable
causes a change in another variable. We’re simply describing the
relationship between the variables. And a scatter plot can give us some
information about our data. We can see from this scatter plot,
for example, that someone who is quite tall might be expected to be relatively
heavy. If there is correlation between our
variables, the scatter plot can tell us the direction of our correlation. If our 𝑋- and 𝑌-values increase
together, then we say we have positive or direct correlation. And if as the 𝑋-values increase,
the 𝑌-values decrease, then we say we have negative or inverse correlation.
If our scatter plot indicates no
pattern at all, then we have no correlation. And if we have a nonlinear
relationship between 𝑋 and 𝑌, then of course there’s no linear correlation. We can also to some extent tell
from a scatter plot how strong the linear relationship is by how closely the points
sit together in the linear pattern. So, for example, in the left-hand
diagram, where the points are closely following a linear pattern, we say the
correlation between 𝑋 and 𝑌 is very strong, whereas the data on the right-hand
diagram is loosely spread in the linear pattern. And we would say that this is weak
or moderate direct linear correlation.
So from our scatter diagrams, we
have an idea of the direction and the strength of the correlation. But since we’re mathematicians,
we’d like something a little more precise to measure our relationships. And this is where the correlation
coefficient comes in. This idea was developed by an
English mathematician called Karl Pearson and hence is known as Pearson’s
correlation coefficient or Pearson’s product-moment correlation coefficient. It’s denoted by 𝑟 subscript 𝑥𝑦
or simply 𝑟. 𝑟 takes values from negative one
to plus one. And the closer it is to either
positive or negative one, the stronger the linear relationship or correlation. Let’s look now at our first example
where we estimate Pearson’s correlation coefficient from a scatter plot.
What is the most likely value of
the product-moment correlation coefficient for the data shown in the diagram? Is it (A) negative 0.58, (B) zero,
(C) negative 0.94, (D) 0.78, or (E) 0.37?
When estimating Pearson’s
correlation coefficient from a scatter plot, there are two things we look at. The first is the direction of the
linear pattern, which in our case is top left to bottom right. The second thing is the spread of
the data points around a possible line of best fit. That is, how close our data points
are to a potential line of best fit. Let’s consider first the direction
of the linear relationship. Generally speaking, we know that if
our linear pattern is from bottom left to top right, then we have direct or positive
linear correlation. Conversely, if our data follow a
linear pattern from top left to bottom right, we say our data is negatively or
inversely correlated.
The product-moment correlation
coefficient for directly or positively correlated data takes values between zero and
one, whereas if our data are negatively or inversely correlated, the coefficient is
between negative one and zero. With our data, the linear pattern
is from top left to bottom right. So our data correspond to the
second case, negative or inverse correlation. Our coefficient then must be
between negative one and zero. This means we can eliminate option
(D) and option (E) since they’re both positive. We’ve established that the
direction of our correlation is negative.
So now, let’s look at the spread of
the data about a possible line of best fit. We know that the wider the spread
away from a possible line of best fit, the weaker the correlation. And the closer the data are to a
potential line of best fit, the stronger the correlation. The product-moment correlation
coefficient takes values between negative one and positive one. And the closer the coefficient is
to positive or negative one, the stronger the correlation. On the other hand, the closer the
correlation coefficient to zero, the weaker the correlation. In the given data plot, most of the
points are very close to a possible line of best fit. And remembering that our
coefficient is negative, this means our coefficient must be close to negative
one. We can certainly eliminate option
(B) since we know that a correlation coefficient of zero means no linear correlation at
all.
So we’re left with options (A) and
(C). Option (A), with a value of
negative 0.58, would indicate a moderate correlation. That’s because it’s just over
halfway between zero and negative one. So since our correlation is very
strong, we can eliminate option (A). Option (C) is the closest to
negative one, with a value negative 0.94. So the most likely product-moment
correlation coefficient for the data shown is option (C), with a value of negative
0.94.
It’s worth noting also that if all
of the data points lie exactly on the line, we have either perfect direct or
positive linear correlation or perfect inverse or negative linear correlation. If we have perfect direct
correlation, the coefficient 𝑟 is equal to one. And with perfect inverse
correlation, the coefficient is equal to negative one. Let’s now look at some examples
where we’ll interpret different values of Pearson’s correlation coefficient.
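As a quick check of the perfect-correlation cases just described, here is a minimal Python sketch. It assumes a standard sums-based form of Pearson's formula, and the helper name `pearson_r` and the two small data sets are made up for illustration; they do not come from the video.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r for paired lists, using a standard sums-based formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

xs = [1, 2, 3, 4]
rising = [2 * x + 1 for x in xs]    # illustrative points lying exactly on y = 2x + 1
falling = [10 - 3 * x for x in xs]  # illustrative points lying exactly on y = 10 - 3x

r_rising = pearson_r(xs, rising)    # perfect direct correlation
r_falling = pearson_r(xs, falling)  # perfect inverse correlation
print(r_rising, r_falling)
```

Points on a rising line should give a coefficient of one, and points on a falling line a coefficient of negative one, whatever the steepness of the line.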
Which of the following correlation
coefficients indicates the weakest inverse correlation? Is it option (A) negative 0.48,
option (B) negative 0.22, option (C) negative 0.75, or option (D) negative 0.83?
We’re given four correlation
coefficients, and we want to determine which of these indicates the weakest inverse
correlation. We know that Pearson’s correlation
coefficient takes values between negative one and positive one. And we know that if the value is
between negative one and zero, then we have inverse or negative correlation. So all of the given options
represent an inverse relationship. If the coefficient is between zero
and positive one, then our correlation is positive or direct. We also know that the closer the
coefficient is to positive or negative one, the stronger the correlation and that
the closer the coefficient is to zero, the weaker the correlation. And what this means is that the
greater the magnitude of the correlation coefficient, the stronger the
correlation.
So now, if we look at the
magnitudes of our four options, then the magnitude, that is, the absolute value, of
option (A) is 0.48. The magnitude of option (B) is
0.22. The magnitude of option (C) is
0.75. And the magnitude of option (D) is
0.83. Remember, we’re looking for the
correlation coefficient that indicates the weakest correlation. This means the correlation
coefficient with the smallest magnitude, that is, whose magnitude is closest to
zero. We see that our option (B) has a
magnitude closest to zero, and this indicates that (B) represents the weakest
correlation. Our answer is therefore (B) with a
value of negative 0.22.
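The comparison of magnitudes above can be sketched in a few lines of Python. The dictionary name `options` is just a label chosen for this illustration; the four values are the ones given in the question.

```python
# The four candidate coefficients from the question, keyed by option letter.
options = {"A": -0.48, "B": -0.22, "C": -0.75, "D": -0.83}

# The weakest correlation is the coefficient whose absolute value is closest to zero.
weakest = min(options, key=lambda label: abs(options[label]))
print(weakest, options[weakest])
```

Swapping `min` for `max` would pick out the strongest correlation instead, which here would be option (D).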
Let’s now consider another
example.
Which of the following is the most
appropriate interpretation of a product-moment correlation coefficient of 0.8? Is it (A) a strong negative linear
correlation, (B) a moderate negative linear correlation, (C) a moderate positive
linear correlation, (D) a strong positive linear correlation, or (E) no
correlation?
We know that Pearson’s
product-moment correlation coefficient 𝑟 subscript 𝑥𝑦 or just 𝑟 takes values
between negative one and positive one. We also know that if 𝑟 is less
than zero and greater than or equal to negative one, then we have inverse or
negative correlation and that if 𝑟 is greater than zero and less than or equal to
positive one, we have direct or positive correlation. We’re asked which of the given
options is the most appropriate interpretation of a product-moment correlation
coefficient with a value of 0.8. Since this value is positive, we
know that we have direct or positive correlation. This means we can eliminate any of
our options that specify negative correlation. So we can eliminate options (A) and
(B), since these both specify a negative correlation. We can also eliminate option (E),
since no correlation would give us a correlation coefficient of zero. And our correlation coefficient is
nonzero; it’s 0.8.
This leaves us with options (C) and
(D), a moderate positive linear correlation or a strong positive linear
correlation. If we consider the magnitude of the
correlation coefficient, the stronger the correlation, the closer the magnitude to
one. And the closer the magnitude is to
zero, the weaker the correlation. This means then that approximately
midway between zero and positive or negative one, we have moderate correlation. Since the given coefficient is 0.8,
which is close to positive one, we can therefore say that this represents strong
positive correlation. And so the most appropriate
interpretation of a product-moment correlation coefficient of 0.8 is option (D), a
strong positive linear correlation.
So now that we know how to
interpret Pearson’s product-moment correlation coefficient, let’s look at how we
might actually calculate it. There are a few equivalent ways to
write the formula for Pearson’s product-moment correlation coefficient. And the one we’re going to use is
shown. You may, however, see the
correlation coefficient written as 𝑆 subscript 𝑥𝑦 over the square root of 𝑆
subscript 𝑥𝑥 multiplied by 𝑆 subscript 𝑦𝑦, where the terms are as shown. Looking now at our formula, we
recall that the capital Σ represents the sum, 𝑛 represents the number of data
pairs, and 𝑥𝑦 stands for the product of the 𝑥- and 𝑦-values within each data
pair. Let’s look at an example of how to
use the formula when we’re given the summary statistics.
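The on-screen formulas are not reproduced in this transcript. The standard forms below are an assumption about what was shown, although they do reproduce the intermediate values quoted in the worked examples that follow (for instance, a numerator of negative 322 and radicands of 252 and 551).

```latex
r_{xy} = \frac{n\sum xy - \sum x \sum y}
              {\sqrt{n\sum x^{2} - \left(\sum x\right)^{2}}\,
               \sqrt{n\sum y^{2} - \left(\sum y\right)^{2}}}

% Equivalent form using the S-notation:
r_{xy} = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}, \qquad
S_{xy} = \sum xy - \frac{\sum x \sum y}{n}, \quad
S_{xx} = \sum x^{2} - \frac{\left(\sum x\right)^{2}}{n}, \quad
S_{yy} = \sum y^{2} - \frac{\left(\sum y\right)^{2}}{n}
```

Multiplying the numerator and denominator of the second form by 𝑛 gives the first form, which is why the two are equivalent.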
A data set can be summarized by the
following. 𝑛 is equal to eight. The sum of the 𝑥-values is 78. The sum of the 𝑦-values is
negative 73. The sum of the products 𝑥𝑦 is
negative 752. The sum of the squares of the 𝑥-values is
792. The sum of the squares of the 𝑦-values is
735. Calculate the product-moment
correlation coefficient for this data set, giving your answer to three decimal
places.
We’re given the summary statistics
for a set of bivariate data that we can use to calculate Pearson’s product-moment
correlation coefficient. In the formula shown, we’re given a
value for 𝑛, and that’s equal to eight. And this is the number of data
pairs in our data set. We’re given the sum of the
𝑥-values, and that’s 78, the sum of the 𝑦-values, which is negative 73, the sum of
the products 𝑥𝑦, that’s negative 752, the sum of the squares of the 𝑥-values, which is
792, and the sum of the squares of the 𝑦-values, which is 735. So all we still need to complete our
formula are the square of the sum of the 𝑥-values and the square of the sum of the
𝑦-values.
The square of the sum of our 𝑥-values is
78 squared. And that’s 6084. The square of the sum of the 𝑦-values
is negative 73 squared, which is 5329. And adding these to our list and
making some space, we can now substitute our summary statistics into our formula for
the correlation coefficient. We have the product-moment
correlation coefficient 𝑟 subscript 𝑥𝑦 as shown. And evaluating our numerator and
the two square roots in the denominator, we have negative 322 divided by the square
root of 252 multiplied by the square root of 551. And this evaluates to negative
0.864 to three decimal places.
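The substitution can be checked with a short Python sketch. It assumes the standard sums-based formula (the on-screen version is not in the transcript), and the variable names are chosen for this illustration only; the numbers are the summary statistics given in the question.

```python
import math

# Summary statistics given in the question.
n = 8
sum_x, sum_y = 78, -73
sum_xy, sum_x2, sum_y2 = -752, 792, 735

# Numerator: n * (sum of xy) minus (sum of x) * (sum of y).
numerator = n * sum_xy - sum_x * sum_y

# The quantities under each square root in the denominator.
radicand_x = n * sum_x2 - sum_x ** 2
radicand_y = n * sum_y2 - sum_y ** 2

r = numerator / (math.sqrt(radicand_x) * math.sqrt(radicand_y))
print(numerator, radicand_x, radicand_y, round(r, 3))
```

The three intermediate values match the negative 322, 252, and 551 quoted above, and the rounded result agrees with negative 0.864.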
The correlation coefficient must
lie between negative one and positive one. And in our case, this is true. And in fact, since our correlation
coefficient is close to negative one, we can interpret this as a strong negative
correlation. The product-moment correlation
coefficient for the data set summarized by the given statistics is negative 0.864 to
three decimal places. Note that you may also see the
correlation coefficient written simply as 𝑟.
In our final example, we’ll
calculate Pearson’s correlation coefficient from scratch.
The data table shows the high jump
and long jump results achieved by 15 competitors in the women’s heptathlon in the
2016 Rio Olympics. Calculate, to the nearest thousandth,
the value of the product-moment correlation coefficient between the long jump and
high jump results. What does this correlation
coefficient reveal about the relationship between the long jump and high jump
results?
We’re given a table of values for
two variables, long jump and high jump scores for 15 women athletes in the Rio
Olympics. This is bivariate data, which means
that two measurements are recorded for each individual athlete, how far they jumped
in the long jump and how high they jumped in the high jump. So, for example, athlete one jumped
5.51 meters in the long jump and 1.65 meters in the high jump. And there are two parts to this
question. We’re first asked to calculate the
product-moment correlation coefficient and then we’re asked for an interpretation of
this value.
For the first part of the question,
we’re going to use the formula shown for the correlation coefficient 𝑟 subscript
𝑥𝑦 which you may see written as 𝑟. And to use this formula, we recall
that the capital Σ symbol signifies the sum, also that 𝑛 is the number of data
points or pairs. In our case, we have 15 athletes so
that 𝑛 is equal to 15. So let’s begin by calling the long
jump in meters the variable 𝑋 and the high jump in meters the variable 𝑌. To calculate our coefficient, we’re
going to need the various expressions within the formula. We’ll need the products 𝑥𝑦, the
𝑥-values squared, and the 𝑦-values squared. And so, we add some rows to our
table to help us with our calculations. We also add a column to the end of
our table for our sums.
So let’s first work out the
products 𝑥𝑦 for each athlete. For our first athlete, the product
is 5.51 multiplied by 1.65. And that’s equal to 9.0915. And so we put this into the first
empty cell of our new row for 𝑥𝑦. Similarly, for our second athlete,
we have 5.72 multiplied by 1.77, which is 10.1244. And this goes into the second cell
for the 𝑥𝑦 row. And so we continue in this way,
filling in the rest of the row, where we’ve restricted ourselves to three decimal
places for the sake of space. In the second new row of our table,
we want the 𝑥-values squared. So for example, our first entry
will be 5.51 squared, and that’s 30.3601. Putting this in our table, we
restrict ourselves now to two decimal places for the sake of space. And squaring the remainder of our
𝑥-values, we can fill in the table as shown.
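The row calculations can be sketched in Python for the two athletes whose results are quoted in this transcript. The full data table is not reproduced here, so only those two pairs are used; the list names are chosen for this illustration.

```python
# Results for athletes 1 and 2 only, as quoted in the transcript.
long_jump = [5.51, 5.72]  # x-values, in meters
high_jump = [1.65, 1.77]  # y-values, in meters

# Entries for the xy row and the x-squared row, rounded to four decimal places.
products = [round(x * y, 4) for x, y in zip(long_jump, high_jump)]
x_squared = [round(x * x, 4) for x in long_jump]

print(products)
print(x_squared)
```

The same two list comprehensions would fill in the whole row if the lists held all 15 results.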
Next, we take the squares of the
high jump values, that’s the 𝑦-values squared, and fill in our table as shown. So now let’s fill in our sums
column, where the sum of the 𝑋’s, for example, is the long jump scores all
summed. And that’s equal to 91.43. The sum of all high jump scores,
that’s the sum of the 𝑌-values, is 27.21. The sum of the products 𝑥𝑦 is
166.1151; that’s to four decimal places. The sum of the 𝑥-values squared is
558.4923. And the sum of the squares of the
𝑦’s is 49.4361. So in our sums column, we have the
sum of the 𝑋-values, the sum of the 𝑌-values, the sum of the product 𝑥𝑦, the sum
of the 𝑥-values squared, and the sum of the 𝑦-values squared. So now, we have everything we need
for our formula.
With 𝑛 equal to 15, that’s the number of
athletes, and all of the sums from our table, our correlation coefficient can be
calculated as shown. Using our calculators, we can
evaluate the numerator and denominator as shown. So we have 3.9162 divided by
4.5567, each to four decimal places, which is 0.8594 to four decimal places. To three decimal places then, that
is, to the nearest thousandth, the value of the product-moment correlation
coefficient between the long jump and high jump results is 0.859.
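As a check on the calculator work, here is a minimal Python sketch that starts from the sums quoted above. It assumes the standard sums-based formula, and the variable names are illustrative; the individual results are not needed once the five sums are known.

```python
import math

# Number of athletes and the five sums from the sums column.
n = 15
sum_x, sum_y = 91.43, 27.21
sum_xy, sum_x2, sum_y2 = 166.1151, 558.4923, 49.4361

numerator = n * sum_xy - sum_x * sum_y
denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
               * math.sqrt(n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(round(numerator, 4), round(denominator, 4), round(r, 3))
```

Rounding only at the very end, as here, avoids carrying rounding error from the intermediate values into the final coefficient.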
For the second part of the question
regarding the relationship between the long jump and high jump results, our
coefficient is very close to positive one. This means that there’s a strong
positive, that is, direct, linear correlation between long jump and high jump
results for the women athletes in the Rio Olympics.
Let’s now complete this video by
reminding ourselves of some of the key points we’ve covered. We know that correlation does not
mean causation. It simply indicates that a linear
relationship exists between two variables and gives us some idea of the strength and
direction of the relationship. We know that the product-moment
correlation coefficient applies to bivariate numerical data. The coefficient takes values
between negative one and positive one. The closer 𝑟 is to negative or
positive one, the stronger the correlation. And conversely, the closer 𝑟 is to
zero, the weaker the correlation between the two variables.
And then, if the correlation
coefficient is equal to zero, there’s no linear correlation. A positive correlation coefficient
indicates a direct or positive linear relationship between the variables, whereas a
negative coefficient indicates an inverse or negative linear correlation. And to actually calculate Pearson’s
product-moment correlation coefficient for a bivariate data set, we use the formula
shown where the coefficient can be labeled either 𝑟 subscript 𝑥𝑦 or simply
𝑟.