Video Transcript
In this video, we’ll learn how to
calculate and use Pearson’s correlation coefficient 𝑟 to describe the strength and
direction of a linear relationship. We’ll begin by reminding ourselves
of some of the terms and ideas related to correlation, which we’ll explore with some
examples. And then we’ll calculate Pearson’s
product-moment correlation coefficient by hand using the formula.
Bivariate data is data where two
numerical or quantitative variables are uniquely paired across the subjects in an
experiment. Suppose we have 𝑛 people in a
sample and we measure their heights and weights. For each person or subject, we have
a unique pair of measurements. If we call 𝑥 height in meters and
𝑦 weight in kilograms, then the pair of measurements for each person or subject
gives us a data point in our bivariate data set. Now suppose we’d like to know if
there’s a relationship or correlation between a person’s height and weight. To give us some idea, we first plot
our data on a scatter plot. And if we find that our data
follows a linear pattern, then we can say there’s a linear correlation between 𝑥
and 𝑦 or height and weight.
It’s important to remember though
when looking at correlation that we’re not saying that a change in one variable
causes a change in another variable. We’re simply describing the
relationship between the variables. And a scatter plot can give us some
information about our data. We can see from this scatter plot,
for example, that someone who is quite tall might be expected to be relatively
heavy. And if there is correlation between
our variables, the scatter plot can tell us the direction of correlation. If our 𝑥- and 𝑦-values increase
together, then we say we have positive or direct correlation. And if as the 𝑥-values increase
the 𝑦-values decrease, then we say we have negative or inverse correlation. If, on the other hand, there’s no
pattern at all, we say there’s no correlation between 𝑥 and 𝑦. And if we have a nonlinear
relationship between 𝑥 and 𝑦, then there’s no linear correlation.
We can also, to some extent, tell
from a scatter plot how strong the linear relationship is by how closely the points
sit together in the linear pattern. So, for example, in the left-hand
diagram, where points are closely following a linear pattern, we say the correlation
is very strong, whereas the data on the right-hand diagram is loosely spread in the
linear pattern. And we would say that this is weak
or moderate direct linear correlation. This is all very well, but we’re
mathematicians and we want something a little more precise to measure our
relationships. And this is where the correlation
coefficient comes in.
This idea was developed by an
English mathematician called Karl Pearson and hence is known as Pearson’s
correlation coefficient or Pearson’s product-moment correlation coefficient. And it’s denoted by 𝑟 subscript
𝑥𝑦 or simply 𝑟. 𝑟 takes values from negative one
to plus one. And the closer it is to either
positive or negative one, the stronger the linear relationship or correlation. Let’s look at our first example
where we estimate Pearson’s correlation coefficient from a scatter plot.
What is the most likely value of
the product-moment correlation coefficient for the data shown in the diagram? Is it (A) negative 0.58, (B) zero,
(C) negative 0.94, (D) 0.78, or (E) 0.37?
In estimating Pearson’s correlation
coefficient from a scatter plot, there are two things we look at. The first is the direction of the
linear pattern, which in our case is top left to bottom right. And the second thing is the spread
of the data points around a possible line of best fit, that is, how close our data
points are to a potential line of best fit. Generally speaking, we know that if
a linear pattern of data is from bottom left to top right, then we have positive or
direct correlation. Conversely, if our data follow a
linear pattern from top left to bottom right, we say our data is negatively or
inversely correlated. And if our data is directly
correlated, that’s positively, then our coefficient is between zero and one, whereas
if our data is inversely correlated, the coefficient is between negative one and
zero.
In our case, our linear pattern is
from top left to bottom right, so ours is the second case. This means our correlation
coefficient must be between negative one and zero. And this means we can eliminate
both (D) and (E) since these are both positive. And now if we look at the spread of
the data, we know that the wider the spread away from a potential line of best fit,
the weaker the correlation and that the closer the data points are to a potential
line of best fit, the stronger the correlation. We know that Pearson’s correlation
coefficient takes values from negative one to positive one and that the closer the
coefficient is to positive or negative one, the stronger the correlation. And we know that the closer the
correlation coefficient gets to zero, the weaker the correlation.
In the given plot, most of the data
points are very close to a possible line of best fit. And remembering that our
coefficient is negative, this means our coefficient must be close to negative
one. We can eliminate (B) since we know
that a correlation coefficient of zero means there’s no linear correlation at all, and we
have very strong correlation. And so we’re left with option (A)
and option (C). Option (A) with the value negative
0.58 would indicate a moderate correlation. That’s because it’s just over
halfway between zero and negative one. And since our correlation is very
strong, we can eliminate option (A). Option (C) is the closest to
negative one with a value negative 0.94. So the most likely product-moment
correlation coefficient for the data shown is option (C), negative 0.94.
It’s worth noting also that if all
of the data points lie exactly on the line, we have either perfect direct (positive)
linear correlation or perfect inverse (negative) linear correlation. In the case of perfect direct
correlation, the coefficient 𝑟 is equal to one. And for perfect inverse
correlation, the coefficient 𝑟 is equal to negative one. Let’s now look at some examples
where we’ll interpret different values of Pearson’s correlation coefficient.
Which of the following correlation
coefficients indicates the weakest inverse correlation? Is it (A) negative 0.48, (B)
negative 0.22, (C) negative 0.75, or (D) negative 0.83?
We’re given four correlation
coefficients and we want to determine which of these represents the weakest inverse
correlation. We know that Pearson’s correlation
coefficient takes values between negative one and plus one. We know that if our coefficient is
between negative one and zero, we have inverse correlation. That’s negative correlation. And if the value of 𝑟 is between
zero and positive one, then we have direct or positive correlation. We also know that the closer the
coefficient is to positive or negative one, the stronger the correlation and that
the closer the coefficient is to zero, the weaker the correlation. And what this means is that the
greater the magnitude of the correlation coefficient, the stronger the
correlation.
So if we now look at the magnitudes
of our four options, the magnitude of option (A) is 0.48, the magnitude of option
(B) is 0.22, the magnitude of option (C) is 0.75, and the magnitude of option (D) is
0.83. And remember, we’re looking for the
correlation coefficient that indicates the weakest correlation. This means the correlation
coefficient with the smallest magnitude, that is, whose magnitude is closest to
zero. And we can see that our option (B)
has a magnitude closest to zero. And since the coefficient with the
smallest magnitude is option (B), this indicates the weakest inverse
correlation. So option (B), negative 0.22, is
our answer.
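The comparison of magnitudes we’ve just made can be sketched in a few lines of Python. This is only an illustration of the reasoning; the names `options` and `weakest` are our own and are not part of the question.

```python
# Candidate correlation coefficients from the question, keyed by option label.
options = {"A": -0.48, "B": -0.22, "C": -0.75, "D": -0.83}

# The weakest correlation is the coefficient whose magnitude is closest to zero.
weakest = min(options, key=lambda label: abs(options[label]))

print(weakest, options[weakest])  # B -0.22
```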
Let’s consider another example.
Which of the following is the most
appropriate interpretation of a product-moment correlation coefficient of 0.8? Is it (A) a strong negative linear
correlation, or (B) a moderate negative linear correlation, or (C) a moderate
positive linear correlation, (D) a strong positive linear correlation, or (E) no
correlation?
We know that Pearson’s
product-moment correlation coefficient takes values between negative one and
positive one. We also know that if 𝑟 is less
than zero and greater than or equal to negative one, then we have inverse or
negative correlation and that if 𝑟 is greater than zero and less than or equal to
positive one, we have direct or positive correlation. We’re asked which of the given
options is the most appropriate interpretation of a product-moment correlation
coefficient of positive 0.8. And the correlation coefficient of
positive 0.8 means that we have direct or positive correlation. And noting that the product-moment
correlation coefficient applies to linear correlation, we can eliminate any of our
options, which interpret the coefficient as a negative linear correlation.
This means we can eliminate option
(A) and option (B) since these both specify negative linear correlation. We can also eliminate option (E)
since this specifies no correlation and we do not have zero correlation. This leaves us with option (C) and
option (D), a moderate positive linear correlation or a strong positive linear
correlation. Consider the magnitude of the
correlation coefficient: the stronger the correlation, the closer the magnitude is to
one, and the closer the magnitude is to
zero, the weaker the correlation. This means that a coefficient approximately midway between zero
and positive or negative one indicates moderate correlation.
Since the given coefficient is 0.8,
which is close to positive one, we can say this represents strong positive linear
correlation. We can therefore eliminate option
(C), which refers to a moderate positive linear correlation. And so the most appropriate
interpretation of a product-moment correlation coefficient of 0.8 is option (D) a
strong positive linear correlation.
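As a rough sketch, the way we’ve been reading off direction and strength could be written as a small Python function. The function name `describe` and the two cutoff values are assumptions made purely for illustration; the video does not fix exact boundaries between weak, moderate, and strong.

```python
def describe(r, moderate_cutoff=0.3, strong_cutoff=0.7):
    """Return a rough verbal description of a correlation coefficient r.

    The two cutoffs are illustrative assumptions, not fixed rules.
    """
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")
    if r == 0:
        return "no linear correlation"
    # The sign gives the direction, the magnitude gives the strength.
    direction = "positive" if r > 0 else "negative"
    size = abs(r)
    if size >= strong_cutoff:
        strength = "strong"
    elif size >= moderate_cutoff:
        strength = "moderate"
    else:
        strength = "weak"
    return f"{strength} {direction} linear correlation"


print(describe(0.8))    # strong positive linear correlation
print(describe(-0.58))  # moderate negative linear correlation
```

With these assumed cutoffs, 0.8 is described as strong and negative 0.58 as moderate, which matches how we interpreted those values in the examples.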
So now we know how to interpret
Pearson’s product-moment correlation coefficient, let’s look at how we might
actually calculate it.
There are a few equivalent ways to
write the formula for Pearson’s product-moment correlation coefficient and the one
we’ll use is shown. Recalling that capital Σ, the Greek
letter, represents the sum, we have the sum of the products 𝑥𝑦 minus the product
of 𝑛, which is the number of data pairs, with 𝑥 bar, which is the mean of the
𝑥-values, and 𝑦 bar, which is the mean of the 𝑦-values, divided by the product of
two square roots. The first is the square root of the
sum of the squared 𝑥-values minus 𝑛, the number of data pairs, times the
square of the mean of 𝑥, and the second is the same expression for 𝑦. This can be abbreviated to 𝑆 𝑥𝑦
over the square root of 𝑆 𝑥𝑥 times 𝑆 𝑦𝑦, where 𝑆 𝑥𝑦 is the covariance of 𝑥
and 𝑦, which is a measure of how 𝑥 and 𝑦 change together, and where 𝑆 𝑥𝑥 and
𝑆 𝑦𝑦 are the variation in 𝑥 and 𝑦, respectively. These are often referred to as the
sums of squares.
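The formula itself appears on screen in the video and is not reproduced in this transcript. Written out from the verbal description, it would read as follows, where each sum runs over all 𝑛 data pairs.

```latex
r_{xy}
  = \frac{\sum xy - n\,\bar{x}\,\bar{y}}
         {\sqrt{\sum x^{2} - n\,\bar{x}^{2}}\;\sqrt{\sum y^{2} - n\,\bar{y}^{2}}}
  = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}},
\qquad
S_{xy} = \sum xy - n\,\bar{x}\,\bar{y},\quad
S_{xx} = \sum x^{2} - n\,\bar{x}^{2},\quad
S_{yy} = \sum y^{2} - n\,\bar{y}^{2}.
```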
Let’s first look at an example of
how to use the abbreviated form to calculate the correlation coefficient. And in our final example, we’ll use
the full formula to calculate the correlation coefficient for a data set from
scratch.
A data set has summary statistics
𝑆 𝑥𝑥 is equal to 36.875, 𝑆 𝑦𝑦 is 73.875, and 𝑆 𝑥𝑦 is 32.375. Calculate the product-moment
correlation coefficient for this data set, giving your answer correct to three
decimal places.
We are given the summary statistics
for a data set, where 𝑆 𝑥𝑥 is 36.875; that’s the variation in 𝑥. 𝑆 𝑦𝑦 is 73.875; that’s the
variation in 𝑦. And 𝑆 𝑥𝑦 is 32.375; that’s the
covariance of 𝑥 and 𝑦. And using these summary statistics,
we want to calculate the product-moment correlation coefficient. To do this, we can use the
abbreviated form of the correlation coefficient formula. That is, the correlation
coefficient 𝑟 𝑥𝑦 is 𝑆 𝑥𝑦 divided by the square root of 𝑆 𝑥𝑥 times 𝑆
𝑦𝑦.
Since we’re given the summary
statistics, we simply need to substitute these into the formula. So that 𝑟 𝑥𝑦 is 32.375, that’s
𝑆 𝑥𝑦, divided by the square root of the product of 36.875, which is 𝑆 𝑥𝑥, and
73.875, which is 𝑆 𝑦𝑦. That is 32.375 over the square root
of 2724.140625, which is 32.375 divided by 52.19330 to five decimal places. And that’s approximately
0.62029. And so to three decimal places, the
product-moment correlation coefficient for this data set is 𝑟 𝑥𝑦 is equal to
0.620.
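We can check this arithmetic with a short Python sketch. The variable names are our own; the three values are the summary statistics given in the question.

```python
import math

# Summary statistics given in the question.
s_xx = 36.875
s_yy = 73.875
s_xy = 32.375

# Abbreviated formula: r = S_xy / sqrt(S_xx * S_yy).
r = s_xy / math.sqrt(s_xx * s_yy)

print(f"{r:.3f}")  # 0.620
```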
In our final example, we’ll
calculate Pearson’s correlation coefficient from scratch.
The data table shows the high jump
and long jump results achieved by 15 competitors in the women’s heptathlon in the
2016 Rio Olympics. Calculate to the nearest thousandth
the value of the product-moment correlation coefficient between the long jump and
high jump results. What does the correlation
coefficient reveal about the relationship between the long jump and high jump
results?
We’re given the table of values for
two variables: long jump and high jump scores for 15 women athletes in the Rio
Olympics. This is bivariate data, which means
that two measurements are recorded for each individual athlete: how far that athlete
jumped in the long jump and how high they jumped in the high jump. So, for example, athlete one jumped
5.51 meters in the long jump and 1.65 meters in the high jump. And there are two parts to this
question. We’re asked to calculate the
product-moment correlation coefficient and to determine what this coefficient
reveals about the relationship between the long jump and high jump results.
For the first part of the question,
we’ll use the formula for the product-moment correlation coefficient shown,
recalling that the capital Greek letter Σ means the sum, 𝑥 bar is the mean of the
𝑥-values, and 𝑦 bar is the mean of the 𝑦-values. 𝑛 is the number of data points or,
in our case, this is the number of athletes, and 𝑟 𝑥𝑦 is the coefficient. So let’s begin by calling the long
jump in meters the variable 𝑥 and the high jump in meters the variable 𝑦.
To calculate our coefficient, we’re
going to need the various expressions within the formula, for example, the product
𝑥𝑦. And so we add some rows to our
table. We’ve also added a column to the
end of our table where we can take the sums. So let’s first work out the product
𝑥𝑦 for each athlete. For our first athlete, the product
𝑥𝑦, which is long jump times high jump, is 5.51 times 1.65. And that’s equal to 9.0915, and so
we put this in the first new cell of our table. And similarly, for our second
athlete, we have the product 5.72 times 1.77, and that’s equal to 10.1244, which
goes in our second cell. And if we continue on in this way
for all our 15 athletes, we have the products shown, where we’ve limited ourselves to
three decimal places for the sake of space.
In the second new row of our table,
we want the values of 𝑥 squared; that’s the long jump value squared. So, for example, our first value
squared is 5.51 squared, which is 30.3601. For our second athlete, the long
jump score squared is 5.72 squared and that’s 32.7184, which to two decimal places
just for our table is 32.72. And continuing in this way, we can
fill in the 𝑥 squared values for all of our athletes. And we can take the squares of the
high jump values for 𝑦 squared. So, for example, the high jump score
for athlete one was 1.65 meters, and 1.65 squared is 2.7225, which we’ve
rounded to 2.723 just for the table.
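If we were filling in these extra rows with Python rather than by hand, a minimal sketch might look like this. Only the first two athletes’ results are quoted in this transcript, so the lists below hold just those two pairs; the list names are our own.

```python
# Long jump (x) and high jump (y) results, in meters, for the first two
# athletes quoted in the worked example; the remaining rows are omitted here.
long_jump = [5.51, 5.72]
high_jump = [1.65, 1.77]

# One new table row per quantity needed by the formula.
products = [x * y for x, y in zip(long_jump, high_jump)]  # xy
x_squared = [x ** 2 for x in long_jump]                   # x squared
y_squared = [y ** 2 for y in high_jump]                   # y squared

for xy, x2, y2 in zip(products, x_squared, y_squared):
    print(f"{xy:.4f} {x2:.4f} {y2:.4f}")
```

Summing each of these lists over all 15 athletes would give the totals needed for the formula.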
So now let’s fill in our sums
column, where the sum of the 𝑥’s, for example, is the long jump scores all
summed. And that’s 91.43. The sum of all the high jump
scores, that’s the 𝑦-values, is 27.21. The sum of the 𝑥𝑦’s is 166.1151
to four decimal places. The sum of the 𝑥 squareds is
558.4923. And the sum of the 𝑦 squareds is
49.4361. We have 15 athletes, so we know 𝑛
is 15. So if we first work out our means,
we have 𝑥 bar is the sum of the 𝑥’s over 𝑛. That’s 91.43 over 15, and that’s
6.0953 to four decimal places. And similarly, we find the mean of
the 𝑦’s is 1.814. And making some space, we can write
down the sum of 𝑥𝑦. That’s 166.1151. And we can write down the sums of
the 𝑥 squareds and the 𝑦 squareds. So we have everything we need for
our formula.
Substituting all of these values
into our formula, our numerator evaluates to 0.26108 and our denominator evaluates
to 0.30378. Dividing then gives us 𝑟
approximately equal to 0.8594. So to three decimal places, the
correlation coefficient 𝑟 is equal to 0.859. Since this value is close to
positive one, we can say that there’s a very strong direct correlation or positive
linear relationship between long jump and high jump results for the women athletes
in the Rio Olympics.
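The substitution into the full formula can be reproduced with a short Python sketch, using only the sums quoted above. The variable names are our own, and because the sums are rounded values from the table, the last decimal places may differ slightly from a calculation on the raw data.

```python
import math

# Sums quoted in the worked example for the 15 athletes.
n = 15
sum_x = 91.43       # sum of long jump results
sum_y = 27.21       # sum of high jump results
sum_xy = 166.1151   # sum of the products xy
sum_x2 = 558.4923   # sum of x squared
sum_y2 = 49.4361    # sum of y squared

mean_x = sum_x / n
mean_y = sum_y / n

# Numerator: sum of xy minus n times the product of the means.
numerator = sum_xy - n * mean_x * mean_y

# Denominator: product of the two square roots.
denominator = (math.sqrt(sum_x2 - n * mean_x ** 2)
               * math.sqrt(sum_y2 - n * mean_y ** 2))

r = numerator / denominator
print(f"{r:.3f}")  # 0.859
```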
Let’s now complete this video by
reminding ourselves of some of the key points we’ve covered. We know that correlation does not
mean causation. It simply indicates that a linear
relationship exists between two variables, and it gives us some idea of the strength
and direction of the relationship. We know that the product-moment
correlation coefficient applies to bivariate numerical data. It takes values between negative one
and positive one. And the closer 𝑟 is to negative or
positive one, the stronger the correlation. And conversely, the closer 𝑟 is to
zero, the weaker the correlation between the two variables. And if the correlation coefficient
is zero, then there is no linear correlation at all.
If the coefficient is positive, we
say there is a direct or positive linear correlation between the variables, whereas
if the coefficient is negative, we say there is an inverse or negative linear
correlation. And to actually calculate Pearson’s
product-moment correlation coefficient for a bivariate data set, we use the formula
shown.