Video Transcript
In this video, we’re gonna learn
about linear correlation. There are lots of situations where
we have two sets of data related to individuals or events, and we call this
bivariate data. For example, students’ scores in math and English tests. Each student took both tests. So we have two sets of numbers
related to individual students.
We can use one set for the
“𝑥”-coordinates and the other for the “𝑦”-coordinates and plot all the data as
points on a scatterplot. Then we can examine any patterns
that may emerge in the scatterplot to see if they suggest any association between
the two data sets. One type of pattern that can emerge
is a straight line relationship. This has turned out to be so useful
in scientific and statistical analysis that techniques have been developed to
quantify and interpret linear correlation between two associated sets of data. So we’re gonna talk about linear
correlation and the terminology that we use to describe it.
Let’s start by describing an
experiment that I do with my math students. I give each student a
different-sized circle and ask them to measure the diameter and circumference and
then we gather in all the results. This sounds pretty easy maybe, but
they only have straight rulers to measure with. So they need to be quite creative
about how they measure the circumference, and I don’t let them calculate it if they
happen to know about “𝜋” and the formula.
So we have two bits of data about
each circle and we use the diameters as the “𝑥”-coordinates and the circumferences
as the “𝑦”-coordinates and we plot all these points on a scatterplot.
So here’s the data that I gather
for one class, and here’s the scatterplot. Now the first thing that jumps off
the page is this point here, which looks very different to all the others. Most points are close to a straight
line running something like this, but the other point is a long way from the
pack. In fact, it turned out to be due to
a student who read out their diameter and circumference the wrong way round. So we were able to swap the “𝑥”-
and “𝑦”-coordinates over to correct them. But if the student who made the
mistake hadn’t been in the room to explain what they’d done, then we’d have had a
tricky decision to make. Why’s that point so far away from
the others? Was it a genuine circle that was very different from all the others, or was there some kind of mistake? You shouldn’t just throw away data
because it looks different. You need to find out more about it:
is it real or is it a mistake? If it’s real, then you need to take
it into account in your analysis.
So after our correction, this is
what the scatterplot looked like with a new line of best fit. The line of best fit that we’ve
drawn is positioned in such a way that it minimises the overall vertical distance to
all of the points, like these orange lines here. It’s called a least squares
regression line. But we’re not gonna go into the
detail of how we calculate that just now. We’re just gonna draw it by eye,
trying our ruler in lots of different positions until we find a route that is as
close as possible to as many of the points as possible with a nice even balance of
points above and below the line along its entire length.
So we’ve got points above and below
here, we’ve got points above and below here, and we’ve also got points above and
below in the middle here.
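The least squares idea described here can also be computed directly rather than by eye. Here’s a minimal sketch in Python using NumPy, with made-up diameter and circumference measurements standing in for the class data (the numbers are illustrative, not the actual class results):

```python
import numpy as np

# Made-up measurements standing in for the class data: diameters (inches)
# and somewhat noisy measured circumferences (inches).
diameters = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
circumferences = np.array([3.2, 6.2, 9.5, 12.4, 15.8, 18.9])

# A degree-1 polyfit finds the slope and intercept that minimise the sum of
# squared vertical distances to the points: the least squares regression line.
slope, intercept = np.polyfit(diameters, circumferences, 1)
print(f"circumference = {slope:.2f} * diameter + {intercept:.2f}")
```

Since the underlying relationship here is circumference = 𝜋 × diameter, the fitted slope comes out close to 3.14.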
And now we can use the line of best
fit to make predictions. For example, if we had a circle
that had a diameter of “two” inches, we could draw a line up to our line of best fit
and across to the “𝑦”-axis. And that looks like it would have a
circumference of between “six” and “six and a half” inches.
So without actually having to do
measurements on the circle, if you know the diameter of a circle, you can use this
graph to make a prediction about what its circumference would be. And likewise if we know the
circumference, we could make a prediction about the diameter. So if we had a circle with a
circumference of “twenty” inches, we could draw a line across from the “𝑦”-axis to
our line of best fit and then down to the “𝑥”-axis. And it looks like that’s just under
“six point five” inches in diameter.
We could even go as far as
calculating the equation of that line of best fit and using that to make our
predictions. So for example, if we had a
diameter of “three” inches, “𝑥 will be equal to three”. We can plug that into our equation
and then that would give us an answer of “nine point four” inches for the
circumference, which is a bit easier and probably more accurate than reading off of
that scale.
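Using the equation rather than reading off the graph can be sketched as a one-line function. The slope of 3.1 is from the video’s fitted line; the intercept of 0.1 is inferred from this worked prediction (a diameter of 3 giving 9.4) rather than stated, so treat it as an assumption:

```python
# Fitted line of best fit: circumference = 3.1 * diameter + 0.1.
# The slope 3.1 is from the video; the intercept 0.1 is inferred from the
# worked example (diameter 3 -> circumference 9.4), not stated directly.
SLOPE = 3.1
INTERCEPT = 0.1

def predict_circumference(diameter):
    """Predict a circle's circumference (inches) from its diameter (inches)."""
    return SLOPE * diameter + INTERCEPT

print(round(predict_circumference(3), 1))  # 9.4, matching the worked example
```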
Now looking at the equation, we can
see that the slope or the gradient is “positive three point one”. And because the pattern of dots
make a pretty close fit to a straight line and that line has this positive slope as
we’ve just seen, we say that the points are positively correlated, or if you want to
be really accurate, positively linearly correlated. And if the points had suggested a
line with a negative slope, then we’d have said that they had negative
correlation. So the terms positive and negative
correlation are statements about bivariate data.
So if higher values on one aspect
of data are associated with higher values on the other aspect of data and lower
values on one aspect of data are associated with lower values on the other aspect of
data, we call that positive correlation. And if high values on one aspect of
data are associated with low values on the other aspect of data, we call that
negative correlation. And some people call positive
correlation direct correlation and negative correlation inverse correlation. So they’re terms that you might
also come across.
But that doesn’t really cover it
all. Sometimes there’s no correlation
between two data sets. For example, if you plotted the
number of doughnuts people can eat without licking their lips against the number of
books that they’ve read over the past year, you might expect a scatterplot looking
something like this. There’s no association between the
two at all; there’s no correlation. Knowing how many books someone has
read over the past year tells you nothing about how many doughnuts they’re likely to
be able to eat without licking their lips and vice versa. Okay then, we’ve got a basic idea
of what correlation is now: it’s a way to describe apparent associations between
data sets or even the lack of association between them. Let’s go through a summary of what
the basic types are. We’ve got positive or direct
correlation, negative or inverse correlation, and no correlation. But there are also different
strengths of correlation. So strong correlation is when the
points are closer to a line of best fit. Weaker correlation is when they’re
scattered a bit more randomly further away from that line of best fit; there’s a bit
more variation going on there.
So for example with weak positive
correlation, you still get higher data values on one data aspect associated with
higher values on the other data aspect and lower with lower and so on. But the picture is a little bit
more confused; it’s not quite so clear that they’re correlated. And likewise with negative
correlation, you’ve still got high values on one data aspect associated with
low values on the other data aspect. But those points don’t conform to
that line of best fit so clearly.
Now this strong and weak
correlation idea is all a bit fluffy and woolly. If we drew the axes slightly
differently and used a different scale, we could make correlation look stronger or
weaker by having the points more spaced out or closer to the line. So that’s not really that
great. But luckily we have something
called a correlation coefficient which quantifies the strength of the
correlation. And this is a number that runs on a
scale from “negative one” for perfect negative correlation through “zero” for no
correlation up to “positive one” for perfect positive correlation.
So perfect negative correlation would be when all of the points sit exactly on a line of best fit with a negative slope. And in perfect positive correlation, all the points would sit exactly on a line of best fit with a positive slope. So in both of those cases, our line
of best fit would make perfect predictions of one thing from the other.
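That scale is easy to see numerically. Here’s a short sketch with NumPy’s corrcoef, using tiny made-up data sets for each point on the scale (the zero-correlation example uses a symmetric up-then-down pattern so the result comes out exactly zero):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])

# Perfect positive correlation: every point sits exactly on an upward line.
r_pos = np.corrcoef(x, 2 * x + 1)[0, 1]    # approximately 1

# Perfect negative correlation: every point sits exactly on a downward line.
r_neg = np.corrcoef(x, -3 * x + 10)[0, 1]  # approximately -1

# No linear correlation: a symmetric up-then-down pattern.
r_zero = np.corrcoef([1.0, 2.0, 3.0], [1.0, 2.0, 1.0])[0, 1]  # 0

print(r_pos, r_neg, r_zero)
```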
So going back to our circle
measuring task that we did with my students, that should’ve given us perfect
positive correlation between the diameter and the circumference of a circle. We know that there’s a formula that
exactly describes this relationship: the circumference is “𝜋 times the
diameter”. Now the only reason that it didn’t
come out perfect was that the students weren’t able to measure the circles with “a
hundred percent” accuracy. But we did see a pretty strong
positive correlation. And we had a good deal of
confidence that the predictions of one aspect based on the other using our line of
best fit were going to be quite reliable because all the data points were close to
that line. The line was a good predictor for
the data points that we gathered.
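That “strong but not perfect” situation can be simulated: below is a sketch of the circle experiment with the exact relationship C = 𝜋 × d plus some invented measurement noise (the noise level is an assumption, chosen to mimic ruler-based measuring, not the actual class data):

```python
import math
import numpy as np

rng = np.random.default_rng(42)

# Simulated class data: the true relationship is C = pi * d, but each
# circumference is measured with a bit of ruler error (invented noise level).
diameters = np.arange(1.0, 11.0)                  # 1 to 10 inches
noise = rng.uniform(-0.3, 0.3, size=diameters.size)
circumferences = math.pi * diameters + noise

r = np.corrcoef(diameters, circumferences)[0, 1]
print(round(r, 4))  # strong positive correlation: close to, but below, 1
```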
So going back to our scale, we had
correlation which was quite strong. It was probably in this region, not
“one” but approaching “one”.
So in the real world, things are
quite messy. So we would probably never
expect to get perfect positive or perfect negative correlation. We would always be operating in
this kind of zone in between here somewhere and we’ll be looking at the tendency:
are we sort of generally closer to “negative one” or are we generally closer to
“zero” or are we generally closer to “one”?
So the value of the correlation
coefficient tells us how reliable the predictions made using our line of best fit
are. Close to “negative one” or
“positive one”, that means they’re quite reliable. Closer to “zero”, they’re totally
unreliable.
So let’s have a look at these two
scatterplots. So there’re two classes, A and B,
and they both did a math test and an English test. And we’ve used the English scores
as our “𝑥”-coordinates; and the math scores as our “𝑦”-coordinates. So for class A, we’ve got this
particular pattern. Everybody scored about “fifty” on
English, but there’s a complete range of scores on math. And for class B everybody scored
about “fifty” on math, but there’s a complete range of scores on English.
Now those points suggest a pretty
clear line of best fit in each case. So for class A, the line of best
fit would be vertical; and for class B, the line of best fit would be
horizontal. So how strong do you think the
correlation is in each case? Well, in fact, in both cases, we’ve
got “zero” or no correlation. And that’s because knowing one of
the scores tells you nothing about the other. There’s no predictability of one
score based on the other score. In class A, if I know someone
scored “fifty” for English, that doesn’t tell me anything at all about what they
might have scored on their math test. People who scored “fifty” for
English scored a whole range of different scores on their math test. And likewise for class B, if I know
somebody scored “fifty” on math, that doesn’t enable me to predict what score they
got on their English test because people who scored “fifty” on math scored the
complete range of different scores on their English test.
This means that although the points
suggest a pretty good line of best fit because it’s exactly horizontal or exactly
vertical, you can’t use one score to make a prediction about the other for any
individual student. This means there is no correlation
between the two. Correlation is all about the
predictive power of one piece of data for another piece of data.
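The class A case can be checked with a short sketch. When one variable is constant, the covariance (the numerator of the correlation coefficient) is zero, and so is that variable’s standard deviation (part of the denominator), so the coefficient is formally undefined; the usual reading, as here, is “no correlation”. The scores below are invented to match the description:

```python
import numpy as np

# Class A (invented scores): everyone got about 50 in English, while the
# math scores cover a whole range.
english = np.array([50.0, 50.0, 50.0, 50.0, 50.0])
maths = np.array([20.0, 35.0, 50.0, 65.0, 80.0])

# Covariance: the numerator of Pearson's correlation coefficient.
covariance = np.mean((english - english.mean()) * (maths - maths.mean()))
print(covariance)      # 0.0: knowing the English score tells you nothing

# The English standard deviation is also 0, so r would be 0/0: undefined,
# conventionally read as "no correlation".
print(english.std())   # 0.0
```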
Now correlation is also about
association between data within a certain range. For example, one March I planted
some sunflower seeds in my garden and I measured how tall the plants were every
day. By the end of September, I’d
gathered a lot of data. And there was a pretty strong
positive correlation between the number of days that had passed since I planted the
seeds and the height of my plants, which were about “twelve” feet tall by that
stage. Now by extending that pattern, I
confidently predicted that by the end of the following January my plants would be
“twenty” feet tall and I wondered if that would be a world record. Of course I was wrong. Autumn came. They stopped growing, they died,
they fell over, and they rotted.
Although the data that I gathered
was very good at estimating how tall the plants would have been over the time that I
was gathering the data in this region here, it turned out to be very bad at making
predictions about the future. Using patterns to make estimations
within the range of data you’ve collected is called interpolation. And this could be very reliable if
the data has strong positive or strong negative correlation. But trying to use those patterns to
make predictions about the future or beyond the range of the data that you’ve
collected is called extrapolation. And it could be very unreliable
even in data that was perfectly correlated within the data range that you
gathered.
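The sunflower story can be sketched with invented numbers: fit a line to roughly linear growth over the observed days, then compare a prediction inside the range with one far outside it (the heights are illustrative, not the actual measurements):

```python
import numpy as np

# Invented sunflower data: heights measured every 30 days from planting
# (day 0, March) to the end of September (day 180), growing to 12 feet.
days = np.arange(0, 181, 30)
heights = 12.0 * days / 180.0          # feet; roughly linear growth

slope, intercept = np.polyfit(days, heights, 1)

# Interpolation: day 90 is inside the observed range, so this is reliable.
print(slope * 90 + intercept)          # about 6 feet

# Extrapolation: day 330 (the following January) is far outside the range.
# The line predicts over 20 feet, but the real plants died in autumn.
print(slope * 330 + intercept)
```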
Another thing, although we’ve been
talking about correlation in this video, really — and as we mentioned this a couple
of times — we mean linear correlation: how well the data fits a straight line
pattern. Sometimes though the data doesn’t
fit a straight line so well, but maybe it would fit a curve.
Take this data about the number of
visits to the UK between “nineteen seventy-eight” and “nineteen ninety-nine” for
example. If we fit a linear pattern through the middle here, we can see that although it’s quite a good line of best fit, a pattern emerges: at the ends, the line is tending to underpredict the number of thousands of visits made each year, but in the middle it’s overpredicting. So although it looks like a reasonable line of best fit, there’s a pattern to the way in which it’s making its prediction errors.
If we fitted more of a curve like
this, then there’s a mix of underestimates and overestimates moving along that
line. So it’s a slightly better predictor
of the number of visits based on what year it is.
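That comparison can be sketched by fitting both a straight line and a quadratic curve to invented visit numbers with a curved trend (the figures are made up to stand in for the UK-visits data, not the real series):

```python
import numpy as np

# Invented stand-in for the UK visits data: a curved (accelerating) trend
# in thousands of visits per year from 1978 to 1999.
years = np.arange(1978, 2000)
t = years - 1978
visits = 10_000 + 300 * t + 25 * t ** 2

def sum_sq_residuals(degree):
    """Total squared prediction error of a degree-`degree` polynomial fit."""
    coeffs = np.polyfit(t, visits, degree)
    return float(np.sum((visits - np.polyval(coeffs, t)) ** 2))

# The straight line makes systematically patterned errors; the curve soaks
# that pattern up, so its residuals are smaller.
print(sum_sq_residuals(1) > sum_sq_residuals(2))  # True
```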
So although nonlinear correlation
is beyond the scope of this video, we did just want you to be aware that it is
something that does exist. So we’ve seen strong or weak positive or direct correlation: the closer the correlation coefficient is to “one”,
the stronger the correlation. And we’ve seen strong or weak
negative or inverse correlation: in this case, the closer the correlation
coefficient is to “negative one”, the stronger the correlation.
And we’ve seen examples of no
correlation. Now this can happen if you’ve got
this random splatter of points that look like this or if you’ve got a completely
vertical or completely horizontal line of best fit. When the correlation coefficient is
close to “zero”, knowing one piece of data doesn’t help you to predict what the
other one would be. So for example, if we knew what
their math score was, it wouldn’t help us to predict what their English score was
because there’s a whole range of different values that it could’ve been.
We’ve also seen that when we’ve got
good strong correlation doing interpolation, making predictions of one piece of data
based on the other within the range of data we’ve got, can be quite reliable. But trying to do extrapolation or
make predictions beyond the data range that we’ve gathered can give us very bad
results indeed.
One last thing, correlation tells
you about association, not necessarily causality. It could just be a coincidence that
two sets of data correlate or maybe there’s some other underlying factor affecting
both sets of data. For example, between “two thousand”
and “two thousand and nine”, an analysis of the average amount of margarine consumed
per person by people in the United States each year correlated very strongly with
the divorce rate per thousand people in the state of Maine that year. That’s just a coincidence. How could the number of divorces in
one particular state be affected by how much margarine was being consumed elsewhere
in the country?
There’s also a very weak negative
correlation between how yellow people’s teeth are and how long they live. Now there’s no causal link between
the two. But shorter lifespans and having
yellow teeth are both caused by smoking tobacco. So perhaps that aspect is causing
this underlying apparent weak correlation between those two other pieces of
data.