### Video Transcript

In this video, we’re gonna learn about linear correlation. There are lots of situations where we have two sets of data related to individuals or events, and we call this bivariate data. For example, students’ scores on a math test and on an English test. Each student took both tests. So we have two sets of numbers related to individual students.

We can use one set for the “𝑥”-coordinates and the other for the “𝑦”-coordinates and plot all the data as points on a scatterplot. Then we can examine any patterns that may emerge in the scatterplots to see if they suggest any association between the two data sets. One type of pattern that can emerge is a straight line relationship. This has turned out to be so useful in scientific and statistical analysis that techniques have been developed to quantify and interpret linear correlation between two associated sets of data. So we’re gonna talk about linear correlation and the terminology that we use to describe it.

Let’s start by describing an experiment that I do with my math students. I give each student a different-sized circle and ask them to measure the diameter and circumference and then we gather in all the results. This sounds pretty easy maybe, but they only have straight rulers to measure with. So they need to be quite creative about how they measure the circumference, and I don’t let them calculate it if they happen to know about “𝜋” and the formula.

So we have two bits of data about each circle and we use the diameters as the “𝑥”-coordinates and the circumferences as the “𝑦”-coordinates and we plot all these points on a scatterplot.

So here’s the data that I gathered for one class, and here’s the scatterplot. Now the first thing that jumps off the page is this point here, which looks very different to all the others. Most points are close to a straight line running something like this, but the other point is a long way from the pack. In fact, it turned out to be due to a student who read out their diameter and circumference the wrong way round. So we were able to swap the “𝑥”- and “𝑦”-coordinates over to correct them. But if the student who made the mistake hadn’t been in the room to explain what they’d done, then we’d have had a tricky decision to make. Why’s that point so far away from the others? Was it because it was a genuine circle which was very different to all the others, or was there some kind of mistake? You shouldn’t just throw away data because it looks different. You need to find out more about it: is it real or is it a mistake? If it’s real, then you need to take it into account in your analysis.

So after our correction, this is what the scatterplot looked like with a new line of best fit. The line of best fit that we’ve drawn is positioned in such a way that it minimises the overall vertical distance to all of the points (strictly, the sum of the squares of those vertical distances), like these orange lines here. It’s called a least squares regression line. But we’re not gonna go into the detail of how we calculate that just now. We’re just gonna draw it by eye, trying our ruler in lots of different positions until we find a route that is as close as possible to as many of the points as possible, with a nice even balance of points above and below the line along its entire length.
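As a rough sketch of what sits behind that least squares regression line, here’s one way it could be computed in Python. The measurements below are made up for illustration (they aren’t the class data from the video), and the name `least_squares_line` is just a label chosen for this example.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the line that minimises the sum of
    squared vertical distances from the points to the line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how x varies on its own
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical measurements in inches (assumed, not the video's data)
diameters = [1.0, 2.0, 3.0, 4.0, 5.0]
circumferences = [3.2, 6.1, 9.6, 12.4, 15.8]

slope, intercept = least_squares_line(diameters, circumferences)
print(f"circumference = {slope:.2f} * diameter + {intercept:.2f}")
```

With these made-up numbers the slope comes out close to “𝜋”, which is what you’d hope for from reasonably careful measurements of circles.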

So we’ve got points above and below here, we’ve got points above and below here, and we’ve also got points above and below in the middle here.

And now we can use the line of best fit to make predictions. For example, if we had a circle that had a diameter of “two” inches, we could draw a line up to our line of best fit and across to the “𝑦”-axis. And that looks like it would have a circumference of between “six” and “six and a half” inches.

So without actually having to do measurements on the circle, if you know the diameter of a circle, you can use this graph to make a prediction about what its circumference would be. And likewise if we know the circumference, we could make a prediction about the diameter. So if we had a circle with a circumference of “twenty” inches, we could draw a line across from the “𝑦”-axis to our line of best fit and then down to the “𝑥”-axis. And it looks like that’s just under “six point five” inches in diameter.

We could even go as far as calculating the equation of that line of best fit and using that to make our predictions. So for example, if we had a diameter of “three” inches, “𝑥 will be equal to three”. We can plug that into our equation and then that would give us an answer of “nine point four” inches for the circumference, which is a bit easier and probably more accurate than reading off of that scale.
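Here’s a small sketch of making those predictions from an equation rather than from the graph. The slope of 3.1 is the value quoted in this video; the intercept of 0.1 is an assumption, picked only so that a diameter of three inches gives roughly the 9.4 inches mentioned above. The function names are hypothetical too.

```python
SLOPE = 3.1      # slope of the line of best fit, as quoted in the video
INTERCEPT = 0.1  # assumed value, not stated in the video


def predict_circumference(diameter):
    """Read 'up and across': diameter in, predicted circumference out."""
    return SLOPE * diameter + INTERCEPT


def predict_diameter(circumference):
    """Read 'across and down': circumference in, predicted diameter out."""
    return (circumference - INTERCEPT) / SLOPE


c_for_3 = predict_circumference(3)
c_for_2 = predict_circumference(2)
d_for_20 = predict_diameter(20)
print(round(c_for_3, 2), round(c_for_2, 2), round(d_for_20, 2))
```

Under that assumed intercept, a two-inch diameter lands between six and six and a half inches, and a twenty-inch circumference lands just under six and a half inches, in line with the readings taken from the graph.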

Now looking at the equation, we can see that the slope or the gradient is “positive three point one”. And because the pattern of dots makes a pretty close fit to a straight line and that line has this positive slope as we’ve just seen, we say that the points are positively correlated, or if you want to be really accurate, positively linearly correlated. And if the points had suggested a line with a negative slope, then we’d have said that they had negative correlation. So the terms positive and negative correlation are statements about bivariate data.

So if higher values on one aspect of data are associated with higher values on the other aspect of data and lower values on one aspect of data are associated with lower values on the other aspect of data, we call that positive correlation. And if high values on one aspect of data are associated with low values on the other aspect of data, we call that negative correlation. And some people call positive correlation direct correlation and negative correlation inverse correlation. So they’re terms that you might also come across.

But that doesn’t really cover it all. Sometimes there’s no correlation between two data sets. For example, if you plotted the number of doughnuts people can eat without licking their lips against the number of books that they’ve read over the past year, you might expect a scatterplot looking something like this. There’s no association between the two at all; there’s no correlation. Knowing how many books someone has read over the past year tells you nothing about how many doughnuts they’re likely to be able to eat without licking their lips and vice versa.

Okay then, we’ve got a basic idea of what correlation is now: it’s a way to describe apparent associations between data sets or even the lack of association between them. Let’s go through a summary of what the basic types are. We’ve got positive or direct correlation, negative or inverse correlation, and no correlation.

But there are also different strengths of correlation. So strong correlation is when the points are closer to a line of best fit. Weaker correlation is when they’re scattered a bit more randomly further away from that line of best fit; there’s a bit more variation going on there.

So for example with weak positive correlation, you still get higher data values on one data aspect associated with higher values on the other data aspect and lower with lower and so on. But the picture is a little bit more confused; it’s not quite so clear that they’re correlated. And likewise with weak negative correlation, you’ve still got high values on one data aspect associated with low values on the other data aspect. But those points don’t conform to that line of best fit so clearly.

Now this strong and weak correlation idea is all a bit fluffy and woolly. If we drew the axes slightly differently and used a different scale, we could make correlation look stronger or weaker by having the points more spaced out or closer to the line. So that’s not really that great. But luckily we have something called a correlation coefficient which quantifies the strength of the correlation. And this is a number that runs on a scale from “negative one” for perfect negative correlation through “zero” for no correlation up to “positive one” for perfect positive correlation.

So perfect negative correlation would be when all of the points exactly sit on that line of best fit. In perfect positive correlation, all the points would exactly fit on that line of best fit. So in both of those cases, our line of best fit would make perfect predictions of one thing from the other.
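To make that scale a bit more concrete, here’s a minimal sketch that computes a correlation coefficient in Python. It assumes the coefficient meant here is the usual Pearson one for linear correlation; the data sets and the name `pearson_r` are made up for the example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient: a value between -1 and 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [1, 2, 3, 4, 5]
perfect_up = [2 * x + 1 for x in xs]     # every point exactly on a rising line
perfect_down = [10 - 3 * x for x in xs]  # every point exactly on a falling line
noisy_up = [3.2, 6.1, 9.6, 12.4, 15.8]   # hypothetical, slightly scattered

r_up = pearson_r(xs, perfect_up)
r_down = pearson_r(xs, perfect_down)
r_noisy = pearson_r(xs, noisy_up)

# Changing the scale of one axis (say inches to centimetres) changes how
# the plot looks, but not the coefficient.
r_rescaled = pearson_r([x * 2.54 for x in xs], noisy_up)

print(r_up, r_down, round(r_noisy, 4), round(r_rescaled, 4))
```

The last two values agree, which is the point of having a number: unlike the look of a scatterplot, it doesn’t depend on how the axes happen to be drawn.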

So going back to our circle measuring task that we did with my students, that should’ve given us perfect positive correlation between the diameter and the circumference of a circle. We know that there’s a formula that exactly describes this relationship: the circumference is “𝜋 times the diameter”. Now the only reason that it didn’t come out perfect was that the students weren’t able to measure the circles with “a hundred percent” accuracy. But we did see a pretty strong positive correlation. And we had a good deal of confidence that the predictions of one aspect based on the other using our line of best fit were going to be quite reliable because all the data points were close to that line. The line was a good predictor for the data points that we gathered.

So going back to our scale, we had correlation which was quite strong. It was probably in this region, not “one” but approaching “one”.

So in the real world, things are quite messy. So we would probably never expect to get perfect positive or perfect negative correlation. We would always be operating in this kind of zone in between here somewhere and we’ll be looking at the tendency: are we sort of generally closer to “negative one” or are we generally closer to “zero” or are we generally closer to “one”?

So the value of the correlation coefficient tells us how reliable the predictions made using our line of best fit are. Close to “negative one” or “positive one”, that means they’re quite reliable. The closer it gets to “zero”, the less reliable they become, and at “zero” they’re totally unreliable.

So let’s have a look at these two scatterplots. So there’re two classes, A and B, and they both did a math test and an English test. And we’ve used the English scores as our “𝑥”-coordinates; and the math scores as our “𝑦”-coordinates. So for class A, we’ve got this particular pattern. Everybody scored about “fifty” on English, but there’s a complete range of scores on math. And for class B everybody scored about “fifty” on math, but there’s a complete range of scores on English.

Now those points suggest a pretty clear line of best fit in each case. So for class A, the line of best fit would be vertical; and for class B, the line of best fit would be horizontal. So how strong do you think the correlation is in each case? Well, in fact, in both cases we’ve got “zero” or no correlation. And that’s because knowing one of the scores tells you nothing about the other. There’s no predictability of one score based on the other score. In class A, if I know someone scored “fifty” for English, that doesn’t tell me anything at all about what they might have scored on their math test. People who scored “fifty” for English scored a whole range of different scores on their math test. And likewise for class B, if I know somebody scored “fifty” on math, that doesn’t enable me to predict what score they got on their English test because people who scored “fifty” on math scored the complete range of different scores on their English test.

This means that although the points suggest a pretty good line of best fit because it’s exactly horizontal or exactly vertical, you can’t use one score to make a prediction about the other for any individual student. This means there is no correlation between the two. Correlation is all about the predictive power of one piece of data for another piece of data.
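Here’s a small sketch of the class A and class B situation with made-up scores, again assuming the Pearson coefficient. The scores are hypothetical and hand-picked so that one subject stays at about fifty while the other covers a wide range.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient: a value between -1 and 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical scores: one subject spans the full range,
# the other hovers around fifty.
wide_range = [10, 20, 30, 40, 50, 60, 70, 80, 90]
about_fifty = [50, 49, 51, 50, 50, 50, 51, 49, 50]

# Class A: English (x) is about fifty, math (y) covers the range.
r_class_a = pearson_r(about_fifty, wide_range)

# Class B: English (x) covers the range, math (y) is about fifty.
r_class_b = pearson_r(wide_range, about_fifty)

print(r_class_a, r_class_b)
```

One caveat: if every student scored exactly fifty on one test, that set of scores would have no spread at all, the division inside `pearson_r` would be by zero, and the coefficient would be undefined rather than zero.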

Now correlation is also about association between data within a certain range. For example, one March I planted some sunflower seeds in my garden and I measured how tall the plants were every day. By the end of September, I’d gathered a lot of data. And there was a pretty strong positive correlation between the number of days that had passed since I planted the seeds and the height of my plants, which were about “twelve” feet tall by that stage. Now by extending that pattern, I confidently predicted that by the end of the following January my plants will be “twenty” feet tall and I wondered if that would be a world record. Of course I was wrong. Autumn came. They stopped growing, they died, they fell over, and they rotted.

Although the data that I gathered was very good at estimating how tall the plants would have been over the time that I was gathering the data in this region here, it turned out to be very bad at making predictions about the future. Using patterns to make estimations within the range of data you’ve collected is called interpolation. And this could be very reliable if the data has strong positive or strong negative correlation. But trying to use those patterns to make predictions about the future or beyond the range of the data that you’ve collected is called extrapolation. And it could be very unreliable even in data that was perfectly correlated within the data range that you gathered.
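As a sketch of that distinction, the code below fits a line to some made-up sunflower measurements and labels each prediction as interpolation or extrapolation, depending on whether the day falls inside the range that was measured. The numbers are hypothetical (a perfectly straight one foot every fifteen days), and `fit_line` and `predict_height` are names invented for this example.

```python
def fit_line(xs, ys):
    """Least squares slope and intercept for the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_height(day, days, heights):
    """Return (predicted height, 'interpolation' or 'extrapolation')."""
    slope, intercept = fit_line(days, heights)
    inside = min(days) <= day <= max(days)
    kind = "interpolation" if inside else "extrapolation"
    return slope * day + intercept, kind


# Hypothetical data: days since planting and height in feet
days = [0, 30, 60, 90, 120, 150, 180]
heights = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0, 12.0]

mid_height, mid_kind = predict_height(75, days, heights)
far_height, far_kind = predict_height(300, days, heights)

print(round(mid_height, 1), mid_kind)   # inside the measured range
print(round(far_height, 1), far_kind)   # well beyond the measured range
```

The line happily produces a height of about twenty feet for day 300, even though nothing in the data says the plants will still be growing, or even alive, by then. The label is the only warning you get.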

Another thing: although we’ve been talking about correlation in this video, really, as we’ve mentioned a couple of times, we mean linear correlation: how well the data fits a straight line pattern. Sometimes though the data doesn’t fit a straight line so well, but maybe it would fit a curve.

Take this data about the number of visits to the UK between “nineteen seventy-eight” and “nineteen ninety-nine” for example. If we fit a linear pattern through the middle here, we can see that although it’s quite a good line of best fit, there’s this pattern emerging: at the ends, the line is tending to underpredict the number of thousands of visits made each year, but in the middle it’s overpredicting. So although it looks like a reasonable line of best fit, there’s a pattern to the way in which it’s making errors in its predictions.
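One way to spot that kind of pattern in the errors is to look at the sign of each error along the line. The sketch below does that for some made-up curved data (not the UK visits data), fitting a straight line by least squares and marking each point with “+” where the line underpredicts and “-” where it overpredicts.

```python
def fit_line(xs, ys):
    """Least squares slope and intercept for the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data that really follows a curve (y = x squared)
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [x ** 2 for x in xs]

slope, intercept = fit_line(xs, ys)

# Error at each point: actual value minus the line's prediction.
# Positive means the line underpredicts; negative means it overpredicts.
errors = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
signs = ["+" if e > 0 else "-" for e in errors]

print(signs)
```

The signs come out in runs, underpredicting at both ends and overpredicting through the middle, rather than mixed up at random. That’s a hint that a curve might describe the data better than a straight line.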

If we fitted more of a curve like this, then there’s a mix of underestimates and overestimates moving along that line. So it’s a slightly better predictor of the number of visits based on what year it is.

So although nonlinear correlation is beyond the scope of this video, we did just want you to be aware that it is something that does exist. So we’ve taken a look at strong or weak positive or direct correlation: the closer the correlation coefficient is to “one”, the stronger the correlation. And we’ve seen strong or weak negative or inverse correlation: in this case, the closer the correlation coefficient is to “negative one”, the stronger the correlation.

And we’ve seen examples of no correlation. Now this can happen if you’ve got this random splatter of points that look like this or if you’ve got a completely vertical or completely horizontal line of best fit. When the correlation coefficient is close to “zero”, knowing one piece of data doesn’t help you to predict what the other one would be. So for example, if we knew what their math score was, it wouldn’t help us to predict what their English score was because there’s a whole range of different values that it could’ve been.

We’ve also seen that when we’ve got good strong correlation, doing interpolation (making predictions of one piece of data based on the other within the range of data we’ve got) can be quite reliable. But trying to do extrapolation, or make predictions beyond the data range that we’ve gathered, can give us very bad results indeed.

One last thing, correlation tells you about association, not necessarily causality. It could just be a coincidence that two sets of data correlate or maybe there’s some other underlying factor affecting both sets of data. For example, between “two thousand” and “two thousand and nine”, the average amount of margarine consumed per person in the United States each year correlated very strongly with the divorce rate per thousand people in the state of Maine that year. That’s just a coincidence. How could the number of divorces in one particular state be affected by how much margarine was being consumed elsewhere in the country?

There’s also a very weak negative correlation between how yellow people’s teeth are and how long they live. Now there’s no causal link between the two. But shorter lifespans and having yellow teeth are both caused by smoking tobacco. So perhaps that underlying factor is causing this apparent weak correlation between those two other pieces of data.
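Here’s a toy sketch of how an underlying factor can produce that sort of weak correlation. All of the numbers are invented purely for illustration: within each group (smokers and non-smokers) the yellowness scores and lifespans are arranged to have no correlation at all, but smokers are shifted towards yellower teeth and shorter lives. The Pearson coefficient is assumed again.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient: a value between -1 and 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Invented data: tooth yellowness on an arbitrary scale, lifespan in years
yellow_non_smokers = [1, 5, 9, 5]
life_non_smokers = [80, 90, 80, 70]

yellow_smokers = [3, 7, 11, 7]   # shifted towards yellower teeth
life_smokers = [75, 85, 75, 65]  # shifted towards shorter lives

r_non_smokers = pearson_r(yellow_non_smokers, life_non_smokers)
r_smokers = pearson_r(yellow_smokers, life_smokers)

# Pool everyone together, ignoring who smokes
r_everyone = pearson_r(
    yellow_non_smokers + yellow_smokers,
    life_non_smokers + life_smokers,
)

print(r_non_smokers, r_smokers, round(r_everyone, 3))
```

In this made-up data there’s no link between teeth and lifespan inside either group, yet the pooled data shows a weak negative correlation. It only appears because smoking moves both measurements at once.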