Video Transcript
In this video, we’ll learn how to
deal with linear correlation and distinguish between different types of
correlation. Let’s think about what happens when
we plot a scatter diagram. A scatter diagram can be used to
represent bivariate data, where one set of data is paired with another set of
data. For instance, we might look to plot
the daily precipitation in New York City versus fried-chicken sales in pounds. Looking at the scatter diagram,
there appears to be a pattern, or trend. As the daily
precipitation increases, so do the fried-chicken sales. In this case, we might say that
these two data sets have a correlation, meaning there appears to be some sort of
relationship between them.
It’s worth noting though that
whilst we might appear to find correlation, that doesn’t necessarily mean that
causation exists. In other words, we cannot
necessarily assume that daily precipitation actually causes fried-chicken sales to
increase.
Now, with that in mind, let’s fully
define the word correlation. We say that two data sets are
correlated when there appears to be a relationship between them. We can use a scatter diagram to
identify whether this correlation exists. Now, more specifically, if we plot
these points on a scatter diagram and they mainly appear to lie along a straight
line, then they’re said to be linearly correlated. Similarly, if they follow some
nonlinear trend, such as a curve or a logarithmic trend, then they’re said to be
nonlinearly correlated. And, of course, if no such trend
exists, there’s said to be no correlation.
Consider the linear correlation we
discussed. A scatter diagram showing two
variables that are linearly correlated might look a little bit like this. Similarly, it could look a little
something like this. The data points in either case
appear to lie approximately along a straight line. For nonlinear correlation, the points
might look a little something like this. In this case, the line of best fit
is a curve. Finally, if there is no
correlation, our scatter diagram might look a little something like this. In each of these cases, we’ve
considered whether we can actually draw a line of best fit through each of our
points. The shape of the line of best fit
then tells us information about the type of correlation, if it exists.
So, with this in mind, let’s look
at how to compare a line of best fit with data on a scatter diagram. And this will help us determine
whether the data is linearly correlated.
Can we use the line of best fit to
describe the trend in the data? Why?
And then we have a scatter diagram
with a line of best fit drawn. Let’s imagine this supposed line of
best fit wasn’t drawn on the diagram. How would we construct our own line
of best fit? How would we find a line that more
accurately describes the trend in the data given by the blue points? Well, it might look a little
something like this. Yes, as the values of 𝑥 increase,
the values of 𝑦 also increase. But we can see that this is not
necessarily in a straight line. This means 𝑥 and 𝑦 do appear to
be correlated. But we would say they are
nonlinearly correlated. The line of best fit is not a
straight line.
And so this would not be a sensible
line of best fit to describe the trend in the data. We certainly wouldn’t want to use
this line of best fit to make predictions or estimates based on the data we’re
given, because this data is not linearly correlated. It doesn’t approximately follow a
straight line.
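One way to see why a straight line is a poor fit for curved data is to look at the leftover errors, the residuals. The sketch below is only an illustration: it assumes the line of best fit is found by ordinary least squares (the video draws its lines by eye), and the data set and the function names `fit_line` and `residuals` are made up for this example.

```python
def fit_line(xs, ys):
    """Least-squares straight line y = slope * x + intercept (assumed method)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residuals(xs, ys):
    """Vertical gap between each point and the fitted straight line."""
    slope, intercept = fit_line(xs, ys)
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


# Made-up curved data: y increases with x, but along a curve, not a line.
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [x ** 2 for x in xs]

curved_residuals = residuals(xs, ys)
print(curved_residuals)
```

For this data the residuals are positive at both ends and negative in the middle. That systematic pattern, rather than a random scatter either side of the line, is a sign that the trend is nonlinear.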
Now, whilst this wouldn’t be a
sensible line of best fit to describe the trend in the data, we did say that both
the line of best fit and the apparent trend in the data show that as the values of
𝑥 increase, the values of 𝑦 also appear to increase. And there are some phrases we can
use to describe this. We say that two data sets are
positively correlated, or directly correlated, if one data set increases as the
other increases. In the case of positive linear
correlation, the data points might look a little something like this. If data sets are negatively
correlated, or inversely correlated, then as one set increases, the other will
decrease, and vice versa. In the case of two data sets that
have negative linear correlation, the points appear to follow a line which slopes
downwards, as we see.
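The direction of a correlation can also be checked numerically. The following is a minimal sketch, assuming we judge direction by the sign of the sum of products of deviations from the means, which always has the same sign as the slope of a least-squares line of best fit. The function name `direction` and the sample data are hypothetical.

```python
def direction(xs, ys):
    """Return 'positive', 'negative', or 'neither' for paired data.

    Assumption: direction is judged by the sign of the sum of
    (x - mean_x) * (y - mean_y), which matches the sign of the slope
    of a least-squares line through the points.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxy > 0:
        return "positive"
    if sxy < 0:
        return "negative"
    return "neither"


# Made-up data: y rises with x in the first set and falls in the second.
print(direction([1, 2, 3, 4], [2, 4, 5, 9]))
print(direction([1, 2, 3, 4], [9, 6, 4, 1]))
```

Note that this only gives the direction. It says nothing about whether the trend is linear or how strong it is.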
So, with this in mind, let’s
determine whether data is positively or negatively correlated or not at all
correlated using a line of best fit.
What type of correlation exists
between the two variables in the scatter plot shown?
When we think about correlation, we
think about linear correlation, in other words, points that approximately follow a
straight line, and nonlinear correlation, where the points might
follow a different type of trend, for example, a curve. And if variables are linearly
correlated, we say that they can be positively linearly correlated or negatively
linearly correlated, depending on the direction of the line of best fit. So, let’s consider the graph we’ve
been given here and see if we can draw a line of best fit.
The line of best fit, of course,
does not need to go through the origin, the point zero, zero, although here it does
appear that it might. And that line of best fit should
roughly follow the trend of our points. We might now notice that our line
of best fit slopes upwards. In other words, it has a positive
slope. So this tells us that as the values
of 𝑥 increase, so do the values of 𝑦. In this case then, the variables 𝑥
and 𝑦 are positively correlated. Specifically, since these points
also approximately follow a straight line, we can say that the correlation is
linear. And so we fully answered the
question. The type of correlation that exists
is positive linear correlation.
Now, in this example, we were given
a scatter diagram of a data set. This might not always be the
case. We might instead be given a
description of the type of variables. As we’ll now see, we’ll then need
to use our understanding of how variables relate to one another as a way of
determining whether they are positively or negatively correlated or not correlated
at all.
Suppose variable 𝑥 is the number
of hours you work and variable 𝑦 is your salary. You suspect that the more hours you
work, the higher your salary is. Does this follow a positive
correlation, a negative correlation, or no correlation?
We’re told that variable 𝑥 is the
number of hours worked, whilst variable 𝑦 is the salary. And we’re looking to find a
relationship, if it exists, between these two variables. Now, in fact, the suspicion is that
the more hours you work, the higher your salary is. So, let’s attempt to plot this on a
scatter graph. Variable 𝑥 is the number of hours
worked, whilst 𝑦 is the salary, so we can label the axes as shown. Let’s make up some starting
figures. Let’s imagine that if you work 15
hours a week, you earn an annual salary of 20,000 pounds. You might then assume that if you
work 30 hours a week, you earn an annual salary of 40,000 pounds. Assuming that the more hours you
work, the higher your salary is, we could add extra points on our scatter graph as
shown.
We notice that the points plotted
approximately follow a straight line and that this straight line has a positive
slope. It slopes upward. Since this line slopes upwards, we
can say that the two variables 𝑥 and 𝑦 must have positive correlation. Now, we also assumed that this was
positive linear correlation, but that might not be the case. We only know that the higher the
number of hours, the higher the salary, which means that this is an example of
positive correlation.
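The point that positive correlation need not be linear can be shown with a small sketch. It uses the two figures from the example, 15 hours for 20,000 pounds and 30 hours for 40,000 pounds, plus three extra points that are invented purely for illustration. The function name `rises_together` is hypothetical, and checking that every step goes up is an idealized test that real, noisy data would rarely pass exactly.

```python
# (hours per week, annual salary in pounds); only the 15-hour and 30-hour
# figures come from the example, the other three points are made up.
hours_and_salary = [
    (15, 20_000),
    (20, 27_000),
    (25, 31_000),
    (30, 40_000),
    (35, 52_000),
]


def rises_together(pairs):
    """True if salary goes up every time hours go up (idealized check)."""
    ordered = sorted(pairs)
    salaries = [salary for _, salary in ordered]
    return all(later > earlier for earlier, later in zip(salaries, salaries[1:]))


def salary_steps(pairs):
    """Extra pounds earned per extra hour between neighbouring points."""
    ordered = sorted(pairs)
    return [
        (s2 - s1) / (h2 - h1)
        for (h1, s1), (h2, s2) in zip(ordered, ordered[1:])
    ]


print(rises_together(hours_and_salary))
print(salary_steps(hours_and_salary))
```

Here the salary always rises with the hours, so the correlation is positive, but the steps are not all the same size, so the points do not lie exactly on one straight line.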
Now, in this example, we modeled
our data points as lying very close to some straight line. How far the data points
actually lie from a line of best fit describes the strength of the
correlation. For instance, suppose we’re
interested in positive linear correlation. If all the points lie very close to
the line of best fit, as in this example, we can say that’s an example of strong
correlation. If, however, the points are quite
far away from the line of best fit, as in this example, then we say that there is
weak correlation. Of course, eventually this weak
correlation turns into no correlation as the points get further and further away
from the line of best fit. With this in mind, let’s determine
the strength of correlation in our next example.
State which of the scatter diagrams
shows bivariate data with a stronger correlation.
And then there are two diagrams to
choose from. Remember, when we think about the
strength of a correlation, we’re determining how close the points are to a line of
best fit. The closer the points are, the
stronger the correlation. So, it makes sense to begin by
drawing the line of best fit on both of our diagrams. The line of best fit on diagram one
might look a little something like this. The points approximately follow a
straight line, so there is linear correlation here. Specifically, as the values of 𝑥
increase, so do the values of 𝑦. So, we can say that 𝑥 and 𝑦 are
positively linearly correlated.
In diagram two, our line of best
fit looks quite similar. But we notice that all of the
points are a little bit further away from the line itself. This means in diagram two, the
correlation is less strong. We might say it’s weak. And so the answer is diagram
one. Scatter diagram one shows
bivariate data with a stronger correlation.
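The comparison of the two diagrams can be sketched in code. This is an illustration under several assumptions: the two data sets are invented stand-ins for the diagrams, the line of best fit is taken to be the least-squares line, and closeness is measured as the average vertical distance from that line. The names `fit_line` and `mean_distance` are hypothetical.

```python
def fit_line(points):
    """Least-squares line through (x, y) points; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_distance(points):
    """Average vertical distance of the points from their line of best fit."""
    slope, intercept = fit_line(points)
    return sum(abs(y - (slope * x + intercept)) for x, y in points) / len(points)


# Made-up stand-ins for the two scatter diagrams.
diagram_one = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1)]
diagram_two = [(1, 4.0), (2, 1.5), (3, 8.0), (4, 5.5), (5, 11.0)]

# A smaller average distance means the points hug the line more closely.
if mean_distance(diagram_one) < mean_distance(diagram_two):
    stronger = "diagram one"
else:
    stronger = "diagram two"
print(stronger)
```

Both made-up data sets slope upwards, but the points in the first sit much closer to their line, which matches the idea of a stronger correlation.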
We’ve now looked at how two
different variables can be related and what it means for them to have a linear or
nonlinear relationship. We’ve considered how to describe
the relationship between variables in terms of positive, negative, or no
correlation. And we’ve looked at how strongly
correlated variables are based on how close they are to a line of best fit. With all this in mind, let’s recap
the key points from this lesson.
In this video, we learned that if
two variables follow a trend of some description, they’re said to be correlated. If we model these points on a
scatter diagram and they appear to follow approximately a straight line, then linear
correlation exists. Then, if the line of best fit
constructed appears to slope upwards, in other words, its slope is positive, then
they have positive correlation. And if that line of best fit slopes
downwards, if it has negative slope, then those variables are said to be negatively
correlated. Now, if neither of these is true,
in other words, if a line of best fit cannot be constructed, then we said that there
was no correlation. Finally, we saw that we can
determine the strength of the correlation by considering how close all of the points
lie to the line of best fit.