### Video Transcript

In this video, we’re gonna look at
the least squares regression method of finding the equation of a line of best fit
through points on a scatter plot. We’ll also talk a little bit about
the theory behind the method and how to use and interpret the regression equation
when you’ve worked out what it is. Let’s look at an example of a
situation where you might want to calculate the equation of the least squares
regression line.

Some students did an experiment in
which they hung objects of various masses from a spring. And they measured the length of the
spring in each case. Find the equation of the least
squares regression line. Well we can see, for example, when
they hung a mass of ten kilograms from the spring, it had a length of twelve
centimetres. When they hung twenty kilograms, it
was sixteen centimetres and so on. So this is bivariate data. There are two variables: mass and
length. And each pair of pieces of data
relates to a specific event. So placing a mass of twenty
kilograms on the spring extends it to a length of sixteen centimetres.

Now the first thing that we need to
think about is which is the dependent variable. And which is the independent
variable. Now the independent variable is the
one that you’d normally control or change. And it causes changes in the value
of the other variable, the dependent variable. So the dependent variable depends
on the value of the other variable. Now we’re choosing which masses to
put on the spring. And that’s causing the change in
the length of the spring. So length is the dependent
variable. And the mass is the independent
variable. Now we tend to call the independent
variable 𝑥 and the dependent variable 𝑦. So let’s add those letters to our
table.

Now we can plot these points on a
scatter plot. And it looks like we’ve got very
strong positive correlation between the two variables. In fact, when you calculate the
Pearson’s product moment correlation coefficient, it comes out to be 0.99 to two
decimal places. So yep, indeed that is very strong
positive correlation. So how do we go about drawing a
suitable line of best fit? Well we could just try laying down
our ruler in various positions until it looks like the points are generally as close
as they can be to the line. Then we can read off a couple of
pairs of coordinates from the line and work out its equation. However, luckily there’s a more
methodical and consistent way of going about it: calculating the equation of the least squares regression line. And the way this works is to
specify a line which minimizes the sum of the squares of the residuals of each of
the points.

So, for example, if we call that
point one, this distance here, this vertical distance between the point and the
line, is called the residual. So if the equation of our line of
best fit was 𝑦 equals 𝑎 plus 𝑏𝑥, we could plug 𝑥-coordinates in there and make
predictions about what the corresponding 𝑦-coordinates would be. So the residual of each point is the difference between our predicted value and the actual value that we observed when we did the experiment. So for these points here, our
estimates were under estimates of the actual values. And for these points here, our
estimates were overestimates of the actual values that we observed. So we could think of some of the
residuals as being positive and some of the residuals as being negative.

So if we tried out lots and lots of
different lines of best fit, and for each one added up all the residuals. Then you might expect that lines
with a sum of the residuals close to zero would be a better line of best fit than
lines with larger sums of residuals. But the thing is some spectacularly
bad lines of best fit might be drawn. So that all the positive residuals
exactly balance all of the negative residuals and give us a sum of zero. So to get around this potential
problem, the least squares regression model takes the squares of all the residuals,
so that the results are all positive. And then it finds the line that
minimizes the sum of the squares of all these residuals. Hence the name least squares
regression.
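Here’s a tiny Python sketch of that idea. The three points and the two candidate lines are made up purely for illustration; they aren’t the spring data from the example.

```python
# Three made-up points that lie exactly on the line y = x.
points = [(1, 1), (2, 2), (3, 3)]

def residuals(a, b):
    """Residuals y - (a + b*x) for the candidate line y = a + b*x."""
    return [y - (a + b * x) for x, y in points]

# A spectacularly bad horizontal line, y = 2: the positive and
# negative residuals cancel, so their plain sum is zero...
bad = residuals(a=2, b=0)
print(sum(bad))                    # 0

# ...but the sum of SQUARED residuals exposes it as a poor fit.
print(sum(r * r for r in bad))     # 2

# The true line y = x has a sum of squared residuals of zero.
good = residuals(a=0, b=1)
print(sum(r * r for r in good))    # 0
```

So judging lines by the sum of squared residuals rules out the “cancelling” lines that a plain sum of residuals would let through.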

Now we’re not gonna go into detail
here about how to derive the exact formula. But we are gonna talk about how to
use it. The formula then for the least
squares regression line is 𝑦 is equal to 𝑎 plus 𝑏𝑥, where 𝑏 is 𝑆𝑥𝑦 over 𝑆𝑥𝑥. Here 𝑆𝑥𝑦 is 𝑛 times the covariance of 𝑥 and 𝑦, and 𝑆𝑥𝑥 is 𝑛 times the variance of 𝑥, so the factor of 𝑛 cancels in the ratio. Also the value of 𝑎 is equal to
the mean 𝑦-value minus 𝑏 times the mean 𝑥-value. Now remember, to work out the value of 𝑆𝑥𝑦, we multiply each pair of 𝑥- and 𝑦-coordinates together and then add up all of those products. Then we take the sum of all the 𝑥s, times that by the sum of all the 𝑦s, and divide that result by the number of pieces of data that we’ve got. And we subtract that from the sum of the products. And to work out the value of 𝑆𝑥𝑥, we square each of the pieces of 𝑥-data and then add up all those squares. And then from that, we take away the square of the sum of all the 𝑥-values divided by the number of pieces of data that we’ve got.
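That recipe can be transcribed directly into a few lines of Python. This is just a sketch of the two formulas; `least_squares` is a made-up helper name, not a standard library function.

```python
def least_squares(xs, ys):
    """Return (a, b) for the least squares regression line y = a + b*x."""
    n = len(xs)
    # Sxy: sum of the products minus (sum of xs)(sum of ys)/n.
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    # Sxx: sum of the squares minus (sum of xs) squared over n.
    sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    b = sxy / sxx
    a = sum(ys) / n - b * sum(xs) / n   # mean y minus b times mean x
    return a, b

# Sanity check on points that lie exactly on y = 1 + 2x.
print(least_squares([1, 2, 3, 4], [3, 5, 7, 9]))   # (1.0, 2.0)
```

On data that lie exactly on a line, the fit recovers that line, which is a useful quick check of the formulas.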

Okay it all looks pretty horrible
on paper at the moment. But when we go through the example,
I’m sure you’ll find it much easier. Firstly, we take the table of
values that we were given in the question. And then we need to extend it a
bit. We need to create a couple of extra
columns and an extra row at the bottom. So firstly, the columns of the 𝑥
squared values, so that’s each individual 𝑥-value squared, and then the 𝑥𝑦
values. So that’s for each data pair, we
take the 𝑥-value and multiply it by the 𝑦-value. And then we create a row at the
bottom for all of our totals. So first of all, let’s add up all
the 𝑥s and then add up all the 𝑦s. Well ten plus twenty plus thirty
plus forty plus fifty is a hundred and fifty. And if I add up all five 𝑦-values,
I get an answer of 120. Now, I’ve got five pairs of
data. So that’s 𝑛 equals five. And squaring each 𝑥-value, 10
squared is 100. 20 squared is 400 and so on. Then adding them all up, I get
5500.

Now I’m gonna do 𝑥 times 𝑦. So 10 times 12 is 120. 20 times 16 is 320 and so
on. And then if I add all of those up,
I get 4230. Now I can calculate the individual
values. So 𝑆𝑥𝑦, remember, was the sum of the 𝑥 times 𝑦s minus the sum of the 𝑥s times the sum of the 𝑦s divided by the
number of pieces of data. Well the sum of the 𝑥𝑦s is
4230. Sum of the 𝑥s is 150. The sum of the 𝑦s is 120. And I’ve got five pieces of
data. So popping that into my calculator,
I get 630. And 𝑆𝑥𝑥, I’ve gotta sum the 𝑥
squared column. And I’ve gotta sum the 𝑥 column,
square that value, and divide by the number of pieces of data. Well the total of all the
𝑥-squares added together is 5500. And the sum of the 𝑥s is 150. So I’ve gotta square 150 and divide
by the number of pieces of the data, five. And when I pop that into my
calculator, I get 1000. So just making a note of those over
on the left while I make some space to do some more calculations on the right. The 𝑏-value in the equation of my
straight line is equal to 𝑆𝑥𝑦 over 𝑆𝑥𝑥. So that’s 630 divided by 1000,
which is 0.63.

Now to calculate the 𝑎-value is a
little bit trickier. I need to work out the mean
𝑦-coordinate and the mean 𝑥-coordinate and then also take into account the answer
I got for 𝑏 in that first part. So to work out the mean 𝑦-value, I
just have to add up all the 𝑦s and divide by how many there are. So that’s 120 divided by five,
which is 24. And same process again for the 𝑥s,
just add up all the 𝑥-values and divide by how many there are. So that’s 150 divided by five,
which is thirty. So the mean 𝑥-value is thirty. The mean 𝑦-value is 24. So 𝑎 then is the mean 𝑦-value
minus 𝑏 times the mean 𝑥-value, which is 24 minus 0.63 times 30. And that gives us an answer of
5.1. So again just making a note of
those values so I can carry on on the right hand side with more working out. The equation of our line of best
fit is 𝑦 is equal to 𝑎 plus 𝑏𝑥. So our least squares regression
line is 𝑦 is equal to 5.1 plus 0.63 times the 𝑥-coordinate.
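All of that arithmetic can be checked with a few lines of Python, using only the totals worked out from the table (n = 5, sum of 𝑥 = 150, sum of 𝑦 = 120, sum of 𝑥𝑦 = 4230, sum of 𝑥² = 5500):

```python
n, sum_x, sum_y, sum_xy, sum_x2 = 5, 150, 120, 4230, 5500

sxy = sum_xy - sum_x * sum_y / n     # 4230 - 150*120/5 = 630
sxx = sum_x2 - sum_x ** 2 / n        # 5500 - 150**2/5  = 1000
b = sxy / sxx                        # 630 / 1000 = 0.63
a = sum_y / n - b * sum_x / n        # 24 - 0.63*30 = 5.1 (up to rounding)
print(sxy, sxx, b, round(a, 2))      # 630.0 1000.0 0.63 5.1
```

The `round` at the end just tidies up ordinary floating point error; the values match the ones worked out by hand above.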

Well, that’s great. So we’ve now got an equation that
enables us to make predictions about the length of the spring given different masses
that were hanging from it. So, for example, if we were gonna
hang a mass of 37 kilograms from the spring, we just put an 𝑥-value of 37 into that
equation. And we’d make our prediction that
𝑦 is equal to 28.41 centimetres. That’s how long we’d expect the
spring to be with that mass hanging from it. Now because our scatter plot showed
that we had very strong positive correlation, we’d expect that equation to make
pretty reasonable estimates of 𝑦-values given certain 𝑥-values. Well, that is, we’d expect them to
be good estimates, if we use 𝑥-values between about 10 and 50. In other words, if we use the
equation to interpolate the 𝑦-values.
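Making that kind of prediction is a single substitution into the fitted line; `predict_length` below is just an illustrative name for it.

```python
def predict_length(mass):
    """Predicted spring length (cm) from the fitted line y = 5.1 + 0.63x."""
    return 5.1 + 0.63 * mass

# Interpolating at a mass of 37 kg, as in the example above.
print(round(predict_length(37), 2))   # 28.41
```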

Now, we gathered 𝑥-data in that range. We don’t know if that same equation
will be true outside that range. For example, if we put a mass of 60
or 70 or 80 kilograms on the spring, it might snap altogether. So our equation just simply
wouldn’t work. So using the equation to make
predictions of 𝑦-values based on 𝑥-values in the range that we’ve gathered data
for is called interpolation. But extending beyond that range and
making predictions with 𝑥-values less than 10 or greater than 50 is called
extrapolation. And as we said, extrapolating is
generally a bad idea. Cause we’re just not gonna be so
confident that the rules still apply for those data values. And our equation may not hold.

Now we could use the equation to
make a prediction about the length of the spring without any weight hanging from it at all. So we’d put in an 𝑥-value of zero. And we’d have the equation 𝑦
equals 5.1 plus 0.63 times zero. So the length of the spring with no
weights added to it would be 5.1 centimetres. Now just quickly, I’m gonna go back
up here and rub this out and change it to a plus. Obviously, I made a bit of a
mistake there. So apologies for that. So going back to our question here
with a mass of zero kilograms, we get a length of spring of 5.1 centimetres. So this is telling us the starting
conditions for a problem if you like. With no masses added, the spring
will be 5.1 centimetres. Now I think you can spot the
potential problem with this. Because we only gathered data with 𝑥-values from 10 to 50 kilograms, we’re extrapolating the equation back to zero here. So that might not necessarily be
true. It might be true. But we don’t have 100 percent
confidence that the equation will still hold for those 𝑥-values.

Now we can also interpret the
parameters in that regression equation. That coefficient of 𝑥, the
multiple of 𝑥 there, 0.63, means that every time I add one more kilogram, so I
increase 𝑥 by one. Then the spring stretches by 0.63
centimetres. And as we’ve just seen that number
there, the 5.1 on its own, when I have an 𝑥-value of zero, then 𝑦 is equal to
5.1. So when no mass is added to the
spring, its length would be 5.1 centimetres. Now this method of least squares
regression analysis seems like magic. Simply process your data. And you get an easy-to-use equation
to make predictions of one value from the other, brilliant!

But remember, you also need to
consider the strength of the correlation before using your least squares regression
line to make predictions. If there’s little or no
correlation, then the equation is gonna give you very unreliable predictions or
estimates. You’d also need to consider the
amount of data that you used to build the model. The more data you have, generally,
the more reliable and more realistic that model will be. And also remember, don’t
extrapolate. Interpolation is quite good. If the correlation’s quite good,
then interpolated values will be quite good predictions. Extrapolated values, you really
don’t know how reliable they’re going to be.

Okay, here’s one for you to
try.

Find the least squares regression
equation for the following data. And use it to estimate the value of
𝑦 when 𝑥 equals nine, then comment on your result.

So we’ve got some data here for 𝑥
and 𝑦. When 𝑥 is one, 𝑦 is twelve. When 𝑥 is two, 𝑦 is seven and so
on. And we’ve given you the formulae
down at the bottom there for you to use. So press pause and then come back
when you’ve answered the question. And I’ll go through the answers.
Right, first we need to add two columns, the 𝑥 squareds and the 𝑥𝑦s. Now we’re gonna fill those in. One squared is one. Two squared is four and so on. And now, the 𝑥𝑦s, one times
twelve is twelve. Two times seven is fourteen and so
on. Now we’ll add a row at the bottom
for all the totals. Now if I add up all the 𝑥-values,
I get fifteen. Adding up all the 𝑦-values gives
me a total of 37. Adding up all the 𝑥 squared values
gives me a total of 55. And adding up all the products of
𝑥 and 𝑦, I get a total of 93. And because I’ve got five sets of
data, 𝑛 is equal to five.

So 𝑆𝑥𝑦 is the sum of the 𝑥𝑦s
minus the sum of the 𝑥s times the sum of the 𝑦s all over 𝑛. So that’s 93 minus 15 times 37 all
over five, which is negative 18. And the 𝑆𝑥𝑥 value is the sum of
the 𝑥 squareds minus the sum of the 𝑥s all squared divided by 𝑛. Well, the sum of the 𝑥 squareds
is 55. The sum of the 𝑥s is 15. And 𝑛 is five. So that becomes 55 minus 15 squared
over five, which is equal to 10. So working out the values of the
parameters for our equation of our straight line, 𝑦 equals 𝑎 plus 𝑏𝑥. The 𝑏-value is 𝑆𝑥𝑦 divided by
𝑆𝑥𝑥. Well that was negative 18 divided
by 10, which is negative 1.8. And the mean 𝑦-value was just the
sum of all the 𝑦s divided by how many there are. That’s 37 divided by five, which is
7.4. And the mean 𝑥-value is the sum of
all the 𝑥-values divided by how many there are. So that’s 15 divided by five, which
is three. So the 𝑎-value then is the mean 𝑦
minus 𝑏 times the mean 𝑥. Now because the 𝑏-value is
negative 1.8, we gotta be quite careful with our negative signs here. So that’s 7.4 minus negative 1.8
times three, which is equal to 12.8.
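As with the first example, this can all be checked from the totals alone (n = 5, sum of 𝑥 = 15, sum of 𝑦 = 37, sum of 𝑥𝑦 = 93, sum of 𝑥² = 55). A short Python check, including the prediction at 𝑥 equals nine:

```python
n, sum_x, sum_y, sum_xy, sum_x2 = 5, 15, 37, 93, 55

sxy = sum_xy - sum_x * sum_y / n     # 93 - 15*37/5 = -18
sxx = sum_x2 - sum_x ** 2 / n        # 55 - 15**2/5 = 10
b = sxy / sxx                        # -18 / 10 = -1.8
a = sum_y / n - b * sum_x / n        # 7.4 - (-1.8)*3 = 12.8
print(b, round(a, 1))                # -1.8 12.8
print(round(a + b * 9, 1))           # prediction at x = 9: -3.4
```

The rounding only mops up floating point error; the numbers agree with the working above.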

So the equation of our least
squares regression line, 𝑦 equals 𝑎 plus 𝑏𝑥. All we need to do then is
substitute in our values for 𝑎 and 𝑏. So that’s the equation. 𝑦 equals 12.8 minus 1.8𝑥. And now we have to substitute in 𝑥
equals nine to make a prediction of the corresponding 𝑦-value. So 𝑦 would be equal to 12.8 minus
1.8 times nine, which would be negative 3.4. Now commenting on the result,
there’re a couple of things I wanna say. One is we’ve extrapolated. Look, the 𝑥-values that we
gathered in terms of our data were from one to five. Well we’ve used an 𝑥-value of
nine. So we’ve extrapolated. So we don’t necessarily know how
reliable that answer’s gonna be. And the other thing I would say is
we don’t know how good the correlation was. We don’t know the Pearson’s
correlation coefficient or any other correlation coefficient for that matter. So even if we had interpolated our
value, we still wouldn’t really know how reliable that answer would be. But the main point to make is that
it was an extrapolated value. So we do need to be cautious about
it.

So in summary, we can work out the
equation of our least squares regression line 𝑦 equals 𝑎 plus 𝑏𝑥 by using 𝑏 is
equal to 𝑆𝑥𝑦 over 𝑆𝑥𝑥. And 𝑎 is equal to the mean
𝑦-value minus 𝑏 times the mean 𝑥-value. So 𝑆𝑥𝑦, remember, is the sum of
the 𝑥 times 𝑦 answers minus the sum of the 𝑥s times the sum of the 𝑦s over how
many pieces of data we’ve got. The 𝑆𝑥𝑥 value is the sum of the
𝑥 squared values minus the sum of the 𝑥-values all squared divided by the number of pieces of data you’ve got. And you know how to work out the
mean 𝑦-value and the mean 𝑥-value. You just add them up and divide by
how many you’ve got. And finally beware of
extrapolation.