### Video Transcript

In this video, we’re gonna learn about linear correlation. There are lots of
situations where we have two sets of data related to individuals or events, and we call this
bivariate data. For example, students’ scores in math tests and English tests. Each student
took both tests. So we have two sets of numbers related to individual students.

We can use one set for the “𝑥”-coordinates and the other for the
“𝑦”-coordinates and plot all the data as points on a scatterplot. Then we can examine any patterns that may emerge in the scatterplots to see
if they suggest any association between the two data sets. One type of pattern that can emerge is a straight line relationship. This has
turned out to be so useful in scientific and statistical analysis that techniques have been
developed to quantify and interpret linear correlation between two associated sets of data. So we’re gonna talk about linear correlation and the terminology that we use
to describe it.

Let’s start by describing an experiment that I do with my math students. I
give each student a different-sized circle and ask them to measure the diameter and
circumference and then we gather in all the results. This sounds pretty easy maybe, but they only have straight rulers to measure
with. So they need to be quite creative about how they measure the circumference, and I don’t
let them calculate it if they happen to know about “𝜋” and the formula.

So we have two bits of data about each circle and we use the diameters as the
“𝑥”-coordinates and the circumferences as the “𝑦”-coordinates and we
plot all these points on a scatterplot.

So here’s the data that I gather for one class, and here’s the scatterplot. Now the first thing that jumps off the page is this point here, which looks
very different to all the others. Most points are close to a straight line running something like this, but the
other point is a long way from the pack. In fact, it turned out to be due to a student who read
out their diameter and circumference the wrong way round. So we were able to swap the
“𝑥”- and “𝑦”-coordinates over to correct them. But if the student who made the mistake hadn’t been in the room to explain
what they’d done, then we’d have had a tricky decision to make. Why’s that point so far away
from the others? Because it was a genuine circle which was very different to all the others or
was there some kind of mistake? You shouldn’t just throw away data because it looks different. You need to
find out more about it: is it real or is it a mistake? If it’s real, then you need to take it
into account in your analysis.

So after our correction, this is what the scatterplot looked like with a new
line of best fit. The line of best fit that we’ve drawn is positioned in such a way that it
minimises the sum of the squares of the vertical distances to all of the points, like these orange lines here.
It’s called a least squares regression line. But we’re not gonna go into the detail of how we
calculate that just now. We’re just gonna draw it by eye, trying our ruler in lots of different
positions until we find a route that is as close as possible to as many of the points as
possible with a nice even balance of points above and below the line along its entire
length.
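
As a rough sketch of what the least squares regression line involves, the Python below computes a slope and intercept from the standard formulas and totals the squared vertical distances. The function names and the diameter and circumference values are made up for illustration; they are not the class's actual measurements.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = sum of products of deviations / sum of squared x deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least squares line always passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_errors(xs, ys, slope, intercept):
    """Total of the squared vertical distances from the points to the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Hypothetical measurements in inches (assumed values, not the real class data).
diameters = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
circumferences = [3.3, 6.1, 9.6, 12.4, 15.9, 18.7]

slope, intercept = least_squares_line(diameters, circumferences)
best = sum_squared_errors(diameters, circumferences, slope, intercept)
print(slope, intercept, best)
```

Nudging the slope or the intercept away from the returned values should only make the total squared error larger, which is the sense in which this line is the "best" fit.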

So we’ve got points above and below here, we’ve got points above and below
here, and we’ve also got points above and below in the middle here.

And now we can use the line of best fit to make predictions. For example, if we
had a circle that had a diameter of “two” inches, we could draw a line up to our line of best fit and across to the “𝑦”-axis. And that looks like it would have a circumference of between “six” and “six and a half” inches.

So without actually having to do measurements on the circle if you know the
diameter of a circle, you can use this graph to make a prediction about what its circumference
would be. And likewise if we know the circumference, we could make a prediction about the
diameter. So if we had a circle with a circumference of “twenty” inches, we could draw a line across from the “𝑦”-axis to our line of best
fit and then down to the “𝑥”-axis. And it looks like that’s just under “six point five” inches in diameter.

We could even go as far as calculating the equation of that line of best fit
and using that to make our predictions. So for example, if we had a diameter of “three” inches, “𝑥 will be equal to three”. We can plug that into our equation and then that would give us an answer of “nine point four”
inches for the circumference, which is a bit easier and probably more accurate than reading off of that
scale.
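
Here is a small Python sketch of making predictions from the equation of the line. The equation itself is only shown on screen; the video gives its slope as 3.1, and the intercept of 0.1 used below is an assumption, chosen only because it agrees with the 9.4-inch answer. The function names are hypothetical.

```python
SLOPE = 3.1      # slope of the line of best fit, as stated in the video
INTERCEPT = 0.1  # ASSUMPTION: the transcript does not state the intercept


def predict_circumference(diameter):
    """Read 'up and across': y = SLOPE * x + INTERCEPT."""
    return SLOPE * diameter + INTERCEPT


def predict_diameter(circumference):
    """Read 'across and down': the same equation rearranged for x."""
    return (circumference - INTERCEPT) / SLOPE


print(round(predict_circumference(3), 1))  # 9.4
print(round(predict_diameter(20), 2))      # 6.42
```

With these assumed values, a 2-inch diameter gives a circumference between 6 and 6.5 inches, and a 20-inch circumference gives a diameter just under 6.5 inches, in line with the readings taken from the graph.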

Now looking at the equation, we can see that the slope or the gradient is
“positive three point one”. And because the pattern of dots makes a pretty close fit to a straight line and
that line has this positive slope as we’ve just seen, we say that the points are positively
correlated, or if you want to be really accurate, positively linearly correlated. And if the points had suggested a line with a negative slope, then we’d have
said that they had negative correlation. So the terms positive and negative correlation are statements about bivariate
data.

So if higher values on one aspect of data are associated with higher values
on the other aspect of data and lower values on one aspect of data are associated with lower
values on the other aspect of data, we call that positive correlation. And if high values on one aspect of data are associated with low values on
the other aspect of data, we call that negative correlation. And some people call positive correlation direct correlation and negative
correlation inverse correlation. So they’re terms that you might also come across.

But that doesn’t really cover it all. Sometimes there’s no correlation between
two data sets. For example, if you plotted the number of doughnuts people can eat without licking
their lips against the number of books that they’ve read over the past year, you might expect a
scatterplot looking something like this. There’s no association between the two at all; there’s
no correlation. Knowing how many books someone has read over the past year tells you nothing
about how many doughnuts they’re likely to be able to eat without licking their lips and vice
versa. Okay then, we’ve got a basic idea of what correlation is now: it’s a way to
describe apparent associations between data sets or even the lack of association between them. Let’s go through a summary of what the basic types are. We’ve got positive or direct correlation, negative or inverse correlation, and
no correlation. But there are also different strengths of correlation. So strong correlation is when the points are closer to a line of best fit.
Weaker correlation is when they’re scattered a bit more randomly further away from that line
of best fit; there’s a bit more variation going on there.

So for example with weak positive correlation, you still get higher data
values on one data aspect associated with higher values on the other data aspect and
lower with lower and so on. But the picture is a little bit more confused; it’s not quite so clear
that they’re correlated. And likewise with negative correlation, you’ve still got high values on one
data aspect associated with low values on the other data aspect. But those points don’t conform
to that line of best fit so clearly.

Now this strong and weak correlation idea is all a bit fluffy and woolly. If we drew the axes slightly differently and used a different scale, we could
make correlation look stronger or weaker by having the points more spaced out or closer to the
line. So that’s not really that great. But luckily we have something called a correlation coefficient which
quantifies the strength of the correlation. And this is a number that runs on a scale from
“negative one” for perfect negative correlation through “zero” for no correlation up to “positive one”
for perfect positive correlation.

So perfect negative correlation would be when all of the points exactly sit
on that line of best fit. In perfect positive correlation, all the points would exactly fit on
that line of best fit. So in both of those cases, our line of best fit would make perfect predictions
of one thing from the other.
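
A minimal sketch of the correlation coefficient in Python, assuming the one meant here is the usual Pearson coefficient. The data values and names are made up for illustration. It also checks the earlier point about scales: converting one axis from inches to centimetres leaves the coefficient unchanged, even though the picture would look different.

```python
import math


def correlation_coefficient(xs, ys):
    """Pearson's correlation coefficient r, a value from -1 to 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [1, 2, 3, 4, 5]
rising = [2 * x + 1 for x in xs]     # every point exactly on a rising line
falling = [10 - 3 * x for x in xs]   # every point exactly on a falling line
noisy = [2.9, 6.8, 9.1, 13.2, 15.4]  # hypothetical points near a rising line

r_rising = correlation_coefficient(xs, rising)    # perfect positive
r_falling = correlation_coefficient(xs, falling)  # perfect negative
r_noisy = correlation_coefficient(xs, noisy)      # strong, but not perfect

# Rescaling an axis (inches to centimetres) does not change r.
xs_cm = [2.54 * x for x in xs]
r_noisy_cm = correlation_coefficient(xs_cm, noisy)

print(r_rising, r_falling, r_noisy, r_noisy_cm)
```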

So going back to our circle measuring task that we did with my students, that
should’ve given us perfect positive correlation between the diameter and the circumference
of a circle. We know that there’s a formula that exactly describes this relationship: the
circumference is “𝜋 times the diameter”. Now the only reason that it didn’t come out perfect was that the students
weren’t able to measure the circles with “a hundred percent” accuracy. But we did see a pretty strong
positive correlation. And we had a good deal of confidence that the predictions of one aspect
based on the other using our line of best fit were going to be quite reliable because all the
data points were close to that line. The line was a good predictor for the data points that we gathered.

So going back to our scale, we had correlation which was quite strong. It was
probably in this region, not “one” but approaching “one”.

So in the real world, things are quite messy. So we would probably never
expect to get perfect positive or perfect negative correlation. We would always be operating in
this kind of zone in between here somewhere and we’ll be looking at the tendency:
are we sort of generally closer to “negative one” or are we generally closer to “zero” or are
we generally closer to “one”?

So the value of the correlation coefficient tells us how reliable the
predictions made using our line of best fit are. Close to “negative one” or “positive one”, that
means they’re quite reliable. The closer to “zero”, the less reliable they are.

So let’s have a look at these two scatterplots. So there’re two classes, A and
B, and they both did a math test and an English test. And we’ve used the English
scores as our “𝑥”-coordinates; and the math scores as our “𝑦”-coordinates. So for
class A, we’ve got this particular pattern. Everybody scored about “fifty” on English, but there’s a
complete range of scores on math. And for class B everybody scored about “fifty” on math, but
there’s a complete range of scores on English.

Now those points suggest a pretty clear line of best fit in each case. So for
class A, the line of best fit would be vertical; and for class B, the line of best fit would
be horizontal. So how strong do you think the correlation is in each case? Well, in fact, in both
cases we’ve got “zero” or no correlation. And that’s because knowing one of the scores tells you nothing about the
other. There’s no predictability of one score based on the other score. In class A, if I know someone scored “fifty” for English, that doesn’t tell me
anything at all about what they might have scored in their math test. People who scored “fifty” for
English scored a whole range of different scores on their math test. And likewise for class B,
if I know somebody scored “fifty” on math, that doesn’t enable me to predict what score they got on
their English test because people who scored “fifty” on math scored the complete range of different
scores on their English test.

This means that although the points suggest a pretty good line of best fit
because it’s exactly horizontal or exactly vertical, you can’t use one score to make a
prediction about the other for any individual student. This means there is no correlation
between the two. Correlation is all about the predictive power of one piece of data for
another piece of data.
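
To see the class A and class B situation in numbers, here is a sketch using made-up scores (not the data shown in the video) and the usual Pearson formula, which is an assumption about which coefficient is meant. One score stays at about fifty while the other covers a wide range, and the coefficient comes out as zero in both classes.

```python
import math


def correlation_coefficient(xs, ys):
    """Pearson's correlation coefficient r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    # If every score in one list were exactly equal, sxx or syy would be 0
    # and this division would fail: r is undefined for a constant score.
    return sxy / math.sqrt(sxx * syy)


# Hypothetical scores: class A is near-vertical, class B is near-horizontal.
class_a_english = [49, 51, 50, 51, 49]
class_a_math = [20, 20, 50, 80, 80]
class_b_english = [20, 20, 50, 80, 80]
class_b_math = [49, 51, 50, 51, 49]

r_a = correlation_coefficient(class_a_english, class_a_math)
r_b = correlation_coefficient(class_b_english, class_b_math)
print(r_a, r_b)  # 0.0 0.0
```

The comment in the function marks one subtlety: if everybody scored exactly fifty, the formula would divide by zero, so strictly the coefficient would be undefined rather than zero.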

Now correlation is also about association between data within a certain range.
For example, one March I planted some sunflower seeds in my garden and I measured how tall the
plants were every day. By the end of September, I’d gathered a lot of data. And there was a pretty
strong positive correlation between the number of days that had passed since I planted the
seeds and the height of my plants, which were about “twelve” feet tall by that stage. Now by extending that pattern, I confidently predicted that by the end of the
following January my plants would be “twenty” feet tall and I wondered if that would be a world
record. Of course I was wrong. Autumn came. They stopped growing, they died, they fell over,
and they rotted.

Although the data that I gathered was very good at estimating how tall the
plants would have been over the time that I was gathering the data in this region here, it turned out to be very bad at making predictions about the future. Using patterns to make estimations within the range of data you’ve collected
is called interpolation. And this could be very reliable if the data has strong positive or
strong negative correlation. But trying to use those patterns to make predictions about the
future or beyond the range of the data that you’ve collected is called extrapolation. And
it could be very unreliable even in data that was perfectly correlated within the data range
that you gathered.
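
One way to keep the distinction in view is to label each prediction as it is made. In the sketch below, the growth rate, the day numbers, and the function name are all assumptions for illustration; they are not the real sunflower measurements.

```python
def predict_with_label(x, slope, intercept, observed_xs):
    """Return (estimate, kind), where kind says whether x lies inside
    the range of x-values that were actually collected."""
    estimate = slope * x + intercept
    inside = min(observed_xs) <= x <= max(observed_xs)
    kind = "interpolation" if inside else "extrapolation"
    return estimate, kind


# Hypothetical: heights measured daily for the first 200 days after planting,
# with an assumed line of best fit of 0.06 feet of growth per day.
days_observed = list(range(0, 201))
SLOPE_FT_PER_DAY = 0.06
INTERCEPT_FT = 0.0

mid_summer = predict_with_label(100, SLOPE_FT_PER_DAY, INTERCEPT_FT, days_observed)
next_january = predict_with_label(330, SLOPE_FT_PER_DAY, INTERCEPT_FT, days_observed)
print(mid_summer)    # inside the measured range
print(next_january)  # beyond it, so treat the estimate with suspicion
```

The arithmetic is identical in both calls; only the label warns that the second estimate relies on the pattern continuing beyond anything that was measured.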

Another thing, although we’ve been talking about correlation in this video,
really, as we’ve mentioned a couple of times, we mean linear correlation: how well the data
fits a straight line pattern. Sometimes though the data doesn’t fit a straight line so well, but maybe it
would fit a curve.

Take this data about the number of visits to the UK between “nineteen seventy-eight” and “nineteen ninety-nine” for
example. If we fit a linear pattern through the middle here, we can see that although it’s quite
a good line of best fit, there’s a pattern emerging: at the ends, the line is tending
to underpredict the number of thousands of visits made each year, but in the middle it’s overpredicting.
So although it looks like a reasonable line of best fit, there’s a pattern to the
way in which it’s making errors in its predictions.

If we fitted more of a curve like this, then there’s a mix of
underestimates and overestimates moving along that line. So it’s a slightly better predictor of
the number of visits based on what year it is.
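
That kind of pattern in the errors can be checked by looking at the residuals, meaning the actual value minus the value the line predicts. The sketch below uses made-up curved data (simple squares, not the UK visits figures) and a hand-written least squares fit; all names are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data that follows a curve rather than a straight line.
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [x ** 2 for x in xs]

slope, intercept = least_squares_line(xs, ys)

# Residual = actual - predicted. Positive means the line underpredicts.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
print(residuals)  # [5.0, 0.0, -3.0, -4.0, -3.0, 0.0, 5.0]
```

The residuals are positive at both ends and negative in the middle, so the straight line underpredicts at the ends and overpredicts in the middle. A systematic run of signs like this, rather than a random mix, is a hint that a curve might fit better.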

So although nonlinear correlation is beyond the scope of this video, we did
just want you to be aware that it is something that does exist. So we’ve seen strong or weak positive or direct correlation: the closer the correlation coefficient is to “one”, the stronger the correlation. And we’ve seen strong or weak negative or inverse correlation: in this case, the closer the correlation coefficient is to “negative one”, the
stronger the correlation.

And we’ve seen examples of no correlation. Now this can happen if you’ve got this
random splatter of points that look like this or if you’ve got a completely vertical or
completely horizontal line of best fit. When the correlation coefficient is close to “zero”, knowing one piece of data
doesn’t help you to predict what the other one would be. So for example, if we knew what their
math score was, it wouldn’t help us to predict what their English score was because there’s a
whole range of different values that it could’ve been.

We’ve also seen that when we’ve got good strong correlation, doing
interpolation, making predictions of one piece of data based on the other within the range of
data we’ve got, can be quite reliable. But trying to do extrapolation or make predictions beyond the data range that
we’ve gathered can give us very bad results indeed.

One last thing, correlation tells you about association, not necessarily
causality. It could just be a coincidence that two sets of data correlate or maybe
there’s some other underlying factor affecting both sets of data. For example, between “two thousand” and “two thousand and nine”, an analysis found that the average amount of
margarine consumed per person by people in the United States each year correlated very
strongly with the divorce rate per thousand people in the state of Maine that year. That’s just
a coincidence. How could the number of divorces in one particular state be affected by how
much margarine was being consumed elsewhere in the country?

There’s also a very weak negative correlation between how yellow people’s
teeth are and how long they live. Now there’s no causal link between the two. But shorter lifespans and having
yellow teeth are both caused by smoking tobacco. So perhaps that underlying factor is causing this
apparent weak correlation between those two other pieces of data.