Video Transcript
In this video, we're gonna look at the least squares regression method of finding the equation of a line of best fit through points on a scatter plot. We'll also talk a little bit about the theory behind the method and about how to use and interpret the regression equation once you've worked out what it is. Let's look at an example of a situation where you might want to calculate the equation of the least squares regression line.
Some students did an experiment in
which they hung objects of various masses from a spring. And they measured the length of the
spring in each case. Find the equation of the least
squares regression line. Well we can see, for example, when
they hung a mass of ten kilograms from the spring, it had a length of twelve
centimetres. When they hung twenty kilograms, it
was sixteen centimetres and so on. So this is bivariate data. There are two variables: mass and
length. And each pair of pieces of data
relates to a specific event. So placing a mass of twenty
kilograms on the spring extends it to a length of sixteen centimetres.
Now the first thing that we need to think about is which is the dependent variable and which is the independent variable. The independent variable is the one that you'd normally control or change, and it causes changes in the value of the other variable, the dependent variable. So the dependent variable depends on the value of the other variable. Now we're choosing which masses to put on the spring, and that's causing the change in the length of the spring. So length is the dependent variable, and mass is the independent variable. Now we tend to call the independent variable x and the dependent variable y. So let's add those letters to our table.
Now we can plot these points on a scatter plot. And it looks like we've got very strong positive correlation between the two variables. In fact, when you calculate the Pearson product moment correlation coefficient, it comes out to be 0.99 to two decimal places. So yes, indeed, that is very strong positive correlation. So how do we go about drawing a suitable line of best fit? Well, we could just try laying down our ruler in various positions until it looks like the points are generally as close as they can be to the line. Then we can read off a couple of pairs of coordinates from the line and work out its equation. However, luckily there's a more methodical and consistent way of going about it: calculating the equation of the least squares regression line. And the way this works is to specify a line which minimizes the sum of the squares of the residuals of each of the points.
So, for example, if we call that point one, this distance here, this vertical distance between the point and the line, is called the residual. So if the equation of our line of best fit was y equals a plus bx, we could plug x-coordinates in there and make predictions about what the corresponding y-coordinates would be. So the residual of each point is the difference between that prediction, our estimate, and the actual value that we observed when we did the experiment. So for these points here, our estimates were underestimates of the actual values. And for these points here, our estimates were overestimates of the actual values that we observed. So we could think of some of the residuals as being positive and some of the residuals as being negative.
So suppose we tried out lots and lots of different lines of best fit, and for each one added up all the residuals. Then you might expect that lines with a sum of residuals close to zero would be better lines of best fit than lines with larger sums of residuals. But the thing is, some spectacularly bad lines of best fit can be drawn so that all the positive residuals exactly balance all the negative residuals and give us a sum of zero. So to get around this potential problem, the least squares regression method takes the squares of all the residuals, so that the results are all positive, and then finds the line that minimizes the sum of the squares of all those residuals. Hence the name least squares regression.
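That idea can be sketched in a few lines of Python. The video doesn't use any code, and the data points below are made up purely for illustration; the point is just to show that a line can have residuals summing to zero while still fitting badly, which is why we square them.

```python
# Sum of squared residuals (SSE) for a candidate line y = a + b*x.
def sse(a, b, xs, ys):
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Illustrative data only (not the spring experiment).
xs = [1, 2, 3, 4]
ys = [2.1, 3.9, 6.2, 7.8]

# A flat line through the mean of y: the positive and negative residuals
# cancel exactly, so their plain sum is zero, yet the fit is clearly poor.
residual_sum = sum(y - 5.0 for y in ys)  # ~ 0.0
poor_fit = sse(5.0, 0.0, xs, ys)         # ~ 18.9, large SSE

# A line close to the actual trend has a much smaller SSE.
good_fit = sse(0.0, 2.0, xs, ys)         # ~ 0.1
```

Squaring makes every residual count towards the total, so the flat line can no longer masquerade as a good fit.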
Now we're not gonna go into detail here about how to derive the exact formula. But we are gonna talk about how to use it. The formula for the least squares regression line is y equals a plus bx, where b is Sxy over Sxx. Here Sxy divided by n is the covariance of x and y, and Sxx divided by n is the variance of x. And the value of a is equal to the mean y-value minus b times the mean x-value. Now remember, to work out the value of Sxy, we multiply each pair of x- and y-values together and add up all of those products; then we take the sum of all the xs times the sum of all the ys, divide that by the number of pieces of data we've got, and subtract it from the first total. So Sxy = Σxy − (Σx)(Σy)/n. And to work out the value of Sxx, we square each of the x-values and add up all those squares; then from that we subtract the sum of all the x-values, squared, divided by the number of pieces of data we've got. So Sxx = Σx² − (Σx)²/n.
Okay, it all looks pretty horrible on paper at the moment. But when we go through the example, I'm sure you'll find it much easier. Firstly, we take the table of values that we were given in the question. And then we need to extend it a bit. We need to create a couple of extra columns and an extra row at the bottom. So firstly, the column of x squared values, so that's each individual x-value squared, and then the xy values. So that is, for each data pair, we take the x-value and multiply it by the y-value. And then we create a row at the bottom for all of our totals. So first of all, let's add up all the xs and then add up all the ys. Well, ten plus twenty plus thirty plus forty plus fifty is a hundred and fifty. And if I add up all five y-values, I get an answer of 120. Now, I've got five pairs of data, so that's n equals five. And squaring each x-value, 10 squared is 100, 20 squared is 400, and so on. Then adding them all up, I get 5500.
Now I'm gonna do x times y. So 10 times 12 is 120, 20 times 16 is 320, and so on. And then if I add all of those up, I get 4230. Now I can calculate the individual values. So Sxy, remember, was the sum of the x times ys minus the sum of the xs times the sum of the ys divided by the number of pieces of data. Well, the sum of the xys is 4230, the sum of the xs is 150, the sum of the ys is 120, and I've got five pieces of data. So popping that into my calculator, I get 630. And for Sxx, I've gotta sum the x squared column; then I've gotta sum the x column, square that value, and divide by the number of pieces of data. Well, the total of all the x squareds added together is 5500, and the sum of the xs is 150. So I've gotta square 150 and divide by the number of pieces of data, five. And when I pop that into my calculator, I get 1000. So I'm just making a note of those over on the left while I make some space to do some more calculations on the right. The b-value in the equation of my straight line is equal to Sxy over Sxx. So that's 630 divided by 1000, which is 0.63.
Now calculating the a-value is a little bit trickier. I need to work out the mean y-coordinate and the mean x-coordinate and then also take into account the answer I got for b in that first part. So to work out the mean y-value, I just have to add up all the ys and divide by how many there are. So that's 120 divided by five, which is 24. And it's the same process again for the xs: just add up all the x-values and divide by how many there are. So that's 150 divided by five, which is 30. So the mean x-value is 30, and the mean y-value is 24. So a, then, is the mean y-value minus b times the mean x-value, which is 24 minus 0.63 times 30. And that gives us an answer of 5.1. So again, just making a note of those values so I can carry on on the right-hand side with more working out. The equation of our line of best fit is y is equal to a plus bx. So our least squares regression line is y is equal to 5.1 plus 0.63 times the x-coordinate.
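As a quick check of the arithmetic, the whole calculation can be sketched in Python using the column totals we just worked out from the table (the code is my addition, not part of the video):

```python
# Column totals from the spring-experiment table.
n = 5
sum_x, sum_y = 150, 120      # sum of masses (kg), sum of lengths (cm)
sum_x2, sum_xy = 5500, 4230  # sum of x squared, sum of x*y

Sxy = sum_xy - sum_x * sum_y / n  # 4230 - 150*120/5 = 630
Sxx = sum_x2 - sum_x ** 2 / n     # 5500 - 150**2/5 = 1000
b = Sxy / Sxx                     # 0.63
a = sum_y / n - b * (sum_x / n)   # 24 - 0.63*30 = 5.1
# So the fitted line is y = 5.1 + 0.63x, matching the working above.
```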
Well, that's great. So we've now got an equation that enables us to make predictions about the length of the spring given different masses hanging from it. So, for example, if we were gonna hang a mass of 37 kilograms from the spring, we'd just put an x-value of 37 into that equation. And we'd make our prediction that y is equal to 28.41 centimetres. That's how long we'd expect the spring to be with that mass hanging from it. Now because our scatter plot showed that we had very strong positive correlation, we'd expect that equation to make pretty reasonable estimates of y-values given certain x-values. Well, that is, we'd expect them to be good estimates if we use x-values between about 10 and 50. In other words, if we use the equation to interpolate the y-values.
Now, we gathered x-data in that range. We don't know if that same equation will hold outside that range. For example, if we put a mass of 60 or 70 or 80 kilograms on the spring, it might snap altogether, so our equation simply wouldn't work. So using the equation to make predictions of y-values based on x-values in the range that we've gathered data for is called interpolation. But extending beyond that range and making predictions with x-values less than 10 or greater than 50 is called extrapolation. And as we said, extrapolating is generally a bad idea, because we're just not gonna be so confident that the rules still apply for those data values. And our equation may not hold.
Now, we could use the equation to make a prediction about the length of the spring without any weight hanging from it at all. So we put in an x-value of zero, and we'd have the equation y equals 5.1 plus 0.63 times zero. So the length of the spring with no weights added to it would be 5.1 centimetres. Now just quickly, I'm gonna go back up here and rub this out and change it to a plus. Obviously, I made a bit of a mistake there, so apologies for that. So going back to our question here, with a mass of zero kilograms, we get a length of spring of 5.1 centimetres. So this is telling us the starting conditions for the problem, if you like. With no masses added, the spring will be 5.1 centimetres. Now, I think you can spot the potential problem with this. Because we only gathered data with x-values from 10 to 50 kilograms, we're extrapolating the equation back to zero here. So that might not necessarily be true. It might be true. But we don't have 100 percent confidence that the equation will still hold for those x-values.
Now we can also interpret the parameters in that regression equation. The coefficient of x, the multiple of x there, 0.63, means that every time I add one more kilogram, so I increase x by one, the spring stretches by 0.63 centimetres. And as we've just seen, that number there, the 5.1 on its own, means that when I have an x-value of zero, then y is equal to 5.1. So when no mass is added to the spring, its length would be 5.1 centimetres. Now, this method of least squares regression analysis seems like magic. Simply process your data, and you get an easy-to-use equation to make predictions of one value from the other. Brilliant!
But remember, you also need to consider the strength of the correlation before using your least squares regression line to make predictions. If there's little or no correlation, then the equation is gonna give you very unreliable predictions or estimates. You'd also need to consider the amount of data that you used to build the model. The more data you have, generally, the more reliable and more realistic that model will be. And also remember: don't extrapolate. If the correlation's quite good, then interpolated values will be quite good predictions. Extrapolated values, you really don't know how reliable they're going to be.
Okay, here's one for you to try.
Find the least squares regression equation for the following data. And use it to estimate the value of y when x equals nine, then comment on your result.
So we've got some data here for x and y. When x is one, y is twelve. When x is two, y is seven, and so on. And we've given you the formulae down at the bottom there for you to use. So press pause and then come back when you've answered the question, and I'll go through the answers. Right, first we need to add two columns, the x squareds and the xys. Now we're gonna fill those in. One squared is one, two squared is four, and so on. And now the xys: one times twelve is twelve, two times seven is fourteen, and so on. Now we'll add a row at the bottom for all the totals. Now, if I add up all the x-values, I get fifteen. Adding up all the y-values gives me a total of 37. Adding up all the x squared values gives me a total of 55. And adding up all the products of x and y, I get a total of 93. And because I've got five pairs of data, n is equal to five.
So Sxy is the sum of the xys minus the sum of the xs times the sum of the ys, all over n. So that's 93 minus 15 times 37 over five, which is negative 18. And the Sxx value is the sum of the x squareds minus the sum of the xs all squared divided by n. Well, the sum of the x squareds is 55, the sum of the xs is 15, and n is five. So that becomes 55 minus 15 squared over five, which is equal to 10. So, working out the values of the parameters for the equation of our straight line, y equals a plus bx: the b-value is Sxy divided by Sxx. Well, that was negative 18 divided by 10, which is negative 1.8. And the mean y-value was just the sum of all the ys divided by how many there are. That's 37 divided by five, which is 7.4. And the mean x-value is the sum of all the x-values divided by how many there are. So that's 15 divided by five, which is three. So the a-value, then, is the mean y minus b times the mean x. Now because the b-value is negative 1.8, we've gotta be quite careful with our negative signs here. So that's 7.4 minus negative 1.8 times three, which is equal to 12.8.
So, for the equation of our least squares regression line, y equals a plus bx, all we need to do is substitute in our values for a and b. So that's the equation: y equals 12.8 minus 1.8x. And now we have to substitute in x equals nine to make a prediction of the corresponding y-value. So y would be equal to 12.8 minus 1.8 times nine, which would be negative 3.4. Now, commenting on the result, there are a couple of things I wanna say. One is we've extrapolated. Look, the x-values that we gathered in our data were from one to five, and we've used an x-value of nine. So we've extrapolated, and we don't necessarily know how reliable that answer's gonna be. And the other thing I would say is we don't know how good the correlation was. We don't know the Pearson correlation coefficient, or any other correlation coefficient for that matter. So even if we had interpolated our value, we still wouldn't really know how reliable that answer would be. But the main point to make is that it was an extrapolated value. So we do need to be cautious about it.
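The whole exercise can be checked with the same kind of Python sketch as before, again using the column totals rather than the raw table (the code is my addition to the transcript, not something shown in the video):

```python
# Column totals for the exercise data (x runs from 1 to 5).
n = 5
sum_x, sum_y = 15, 37
sum_x2, sum_xy = 55, 93

Sxy = sum_xy - sum_x * sum_y / n  # 93 - 15*37/5 = -18
Sxx = sum_x2 - sum_x ** 2 / n     # 55 - 15**2/5 = 10
b = Sxy / Sxx                     # -1.8
a = sum_y / n - b * (sum_x / n)   # 7.4 - (-1.8)*3 = 12.8

# Predicting y at x = 9: note this is extrapolation,
# since the data only covered x from 1 to 5.
prediction = a + b * 9            # ~ -3.4
```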
So, in summary, we can work out the equation of our least squares regression line, y equals a plus bx, by using b is equal to Sxy over Sxx, and a is equal to the mean y-value minus b times the mean x-value. Sxy, remember, is the sum of the x times y products minus the sum of the xs times the sum of the ys over how many pieces of data we've got. The Sxx value is the sum of the x squared values minus the sum of the x-values all squared, divided by the number of pieces of data you've got. And you know how to work out the mean y-value and the mean x-value: you just add them up and divide by how many you've got. And finally, beware of extrapolation.
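Pulling the summary together, the method fits in one small Python function. This is a sketch of the formulae above, not code from the video; the usage example uses just the first two data pairs from the spring experiment (10 kg gives 12 cm, 20 kg gives 16 cm), so with only two points the fitted line simply passes through both of them.

```python
def least_squares(xs, ys):
    """Return (a, b) for the least squares regression line y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    Sxy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    Sxx = sum(x * x for x in xs) - sum_x ** 2 / n
    b = Sxy / Sxx                    # slope
    a = sum_y / n - b * (sum_x / n)  # intercept: mean y minus b times mean x
    return a, b

# Two pairs from the spring table: the line through (10, 12) and (20, 16)
# is y = 8 + 0.4x.
a, b = least_squares([10, 20], [12, 16])
```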