
Video: Least Squares Regression Line

Tim Burnham

A detailed walk-through of how to use the least squares regression method to find the equation of a line of best fit for the points on a scatterplot, along with an exploration of how to use and interpret the equation when you have found it.

17:15

Video Transcript

In this video, we’re gonna look at the least squares regression method of finding the equation of a line of best fit through points on a scatter plot. We’ll also talk a little bit about the theory behind the method and how to use and interpret the regression equation when you’ve worked out what it is.

Let’s look at an example of a situation where you might want to calculate the equation of the least squares regression line. Some students did an experiment in which they hung objects of various masses from a spring, and they measured the length of the spring in each case. Find the equation of the least squares regression line. Well we can see, for example, when they hung a mass of ten kilograms from the spring, it had a length of twelve centimetres. When they hung twenty kilograms, it was sixteen centimetres and so on. So this is bivariate data. There are two variables: mass and length. And each pair of pieces of data relates to a specific event. So placing a mass of twenty kilograms on the spring extends it to a length of sixteen centimetres. Now the first thing that we need to think about is which is the dependent variable and which is the independent variable. Now the independent variable is the one that you’d normally control or change, and it causes changes in the value of the other variable, the dependent variable. So the dependent variable depends on the value of the other variable. Now we’re choosing which masses to put on the spring and that’s causing the change in the length of the spring. So length is the dependent variable and the mass is the independent variable. Now we tend to call the independent variable 𝑥 and the dependent variable 𝑦, so let’s add those letters to our table.

Now we can plot these points on a scatter plot. And it looks like we’ve got very strong positive correlation between the two variables. In fact, when you calculate the Pearson product moment correlation coefficient, it comes out to be nought point nine nine to two decimal places. So yep, indeed that is very strong positive correlation.

So how do we go about drawing a suitable line of best fit? Well we could just try laying down our ruler in various positions until it looks like the points are generally as close as they can be to the line. Then we can read off a couple of pairs of coordinates from the line and work out its equation. However, luckily there’s a more methodical and consistent way of going about it, calculating the equation of the least squares regression line. And the way this works is to specify a line which minimizes the sum of the squares of the residuals of each of the points. So for example, if we call that point one, this distance here, this vertical distance between the point and the line, is called the residual. So if the equation of our line of best fit was 𝑦 equals 𝑎 plus 𝑏𝑥, we can plug 𝑥-coordinates in there and make predictions about what the corresponding 𝑦-coordinates would be. So the residual of each point is the difference between the value our line predicts and the actual value that we observed when we did the experiment.

So for these points here, our estimates were underestimates of the actual values. And for these points here, our estimates were overestimates of the actual values that we observed. So we could think of some of the residuals as being positive and some of the residuals as being negative. So if we tried out lots and lots of different lines of best fit, and for each one added up all the residuals, then you might expect that lines with a sum of residuals close to zero would be a better line of best fit than lines with larger sums of residuals. But the thing is some spectacularly bad lines of best fit might be drawn so that all the positive residuals exactly balance all of the negative residuals and give us a sum of zero.

So to get around this potential problem, the least squares regression model takes the squares of all the residuals, so that the results are all positive. And then it finds the line that minimizes the sum of the squares of all these residuals, hence the name least squares regression. Now we’re not gonna go into detail here about how to derive the exact formula, but we are gonna talk about how to use it. The formula then for the least squares regression line is 𝑦 is equal to 𝑎 plus 𝑏𝑥 where 𝑏 is 𝑆𝑥𝑦 over 𝑆𝑥𝑥, where 𝑆𝑥𝑦 is 𝑛 times the covariance of 𝑥 and 𝑦, and 𝑆𝑥𝑥 is 𝑛 times the variance of 𝑥. Also the value of 𝑎 is equal to the mean 𝑦-value minus 𝑏 times the mean 𝑥-value.

Now remember, to work out the value of 𝑆𝑥𝑦, we have to multiply each pair of 𝑥- and 𝑦-values together and then add up all of those products. And then from that, we subtract the sum of all the 𝑥s times the sum of all the 𝑦s, divided by the number of pieces of data that we’ve got. And to work out the value of 𝑆𝑥𝑥, we square each of the pieces of 𝑥-data, and then we add up all those squares. And then from that, we take away the square of the sum of all the 𝑥-values, divided by the number of pieces of data that we’ve got.
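Those verbal formulas can be collected into a short Python sketch (this code isn’t from the video; the function name `least_squares_line` is just our own label for illustration):

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))   # sum of each x times its y
    sum_x2 = sum(x * x for x in xs)               # sum of the x squareds

    # S_xy = sum(xy) - sum(x)*sum(y)/n,  S_xx = sum(x^2) - (sum(x))^2/n
    s_xy = sum_xy - sum_x * sum_y / n
    s_xx = sum_x2 - sum_x ** 2 / n

    b = s_xy / s_xx
    a = sum_y / n - b * (sum_x / n)   # a = y-bar minus b times x-bar
    return a, b
```

As a sanity check, feeding it points that lie exactly on 𝑦 equals two plus three 𝑥 should give back 𝑎 equals two and 𝑏 equals three.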

Okay it all looks pretty horrible on paper at the moment, but when we go through the example, I’m sure you’ll find it much easier. Firstly, we take the table of values that we were given in the question, and then we need to extend it a bit. We need to create a couple of extra columns and an extra row at the bottom. So firstly, the columns of the 𝑥 squared values, so that’s each individual 𝑥 value squared, and then the 𝑥𝑦 values. So that’s for each data pair, we take the 𝑥-value and multiply it by the 𝑦-value. And then we create a row at the bottom for all of our totals. So first of all, let’s add up all the 𝑥s and then add up all the 𝑦s. Well ten plus twenty plus thirty plus forty plus fifty is a hundred and fifty. And if I add up all five 𝑦-values, I get an answer of a hundred and twenty. Now, I’ve got five pairs of data, so that’s 𝑛 equals five. And squaring each 𝑥-value, ten squared is a hundred, twenty squared is four hundred, and so on. Then adding them all up, I get five thousand and five hundred. Now I’m gonna do 𝑥 times 𝑦, so ten times twelve is a hundred and twenty. Twenty times sixteen is three hundred and twenty and so on. And then if I add all of those up, I get four thousand two hundred and thirty.

Now I can calculate the individual values. So 𝑆𝑥𝑦, remember, was the sum of the 𝑥 times 𝑦s minus the sum of the 𝑥s times the sum of the 𝑦s, divided by the number of pieces of data. Well the sum of the 𝑥𝑦s is four thousand two hundred and thirty. The sum of the 𝑥s is a hundred and fifty. The sum of the 𝑦s is a hundred and twenty. And I’ve got five pieces of data. So popping that into my calculator, I get six hundred and thirty. And for 𝑆𝑥𝑥, I’ve gotta sum the 𝑥 squared column, and then I’ve gotta sum the 𝑥 column, square that value, divide it by the number of pieces of data, and take that away.

Well the total of all the 𝑥-squares added together is five thousand five hundred. And the sum of the 𝑥s is a hundred and fifty, so I’ve gotta square a hundred and fifty and divide by the number of pieces of data, five. And when I pop that into my calculator, I get one thousand. So just making a note of those over on the left while I make some space to do some more calculations on the right, the 𝑏-value in the equation of my straight line is equal to 𝑆𝑥𝑦 over 𝑆𝑥𝑥. So that’s six hundred and thirty divided by one thousand, which is nought point six three. Now calculating the 𝑎-value is a little bit trickier. I need to work out the mean 𝑦-coordinate and the mean 𝑥-coordinate and then also take into account the answer I got for 𝑏 in that first part.

So to work out the mean 𝑦-value, I just have to add up all the 𝑦s and divide by how many there are. So that’s a hundred and twenty divided by five, which is twenty-four. And same process again for the 𝑥s, just add up all the 𝑥-values and divide by how many there are. So that’s a hundred and fifty divided by five, which is thirty. So the mean 𝑥-value is thirty. The mean 𝑦-value is twenty-four. So 𝑎 then is the mean 𝑦-value minus 𝑏 times the mean 𝑥-value, which is twenty-four minus nought point six three times thirty. And that gives us an answer of five point one. So again just making a note of those values so I can carry on on the right hand side with more working out.

The equation of our line of best fit is 𝑦 is equal to 𝑎 plus 𝑏𝑥. So our least squares regression line is 𝑦 is equal to five point one plus nought point six three times the 𝑥-coordinate. Now that’s great. So we’ve now got an equation that enables us to make predictions about the length of the spring given different masses that were hanging from it. So for example, if we were gonna hang a mass of thirty-seven kilograms from the spring, we just put an 𝑥-value of thirty-seven into that equation. And we’d make our prediction that 𝑦 is equal to twenty-eight point four one centimetres. That’s how long we’d expect the spring to be with that mass hanging from it. Now because our scatter plot showed that we had very strong positive correlation, we’d expect that equation to make pretty reasonable estimates of 𝑦-value given certain 𝑥-values.
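The whole chain of arithmetic above can be replayed in a few lines of Python, working just from the totals in the table (𝑛 = 5, sum of 𝑥s = 150, sum of 𝑦s = 120, sum of 𝑥 squareds = 5500, sum of 𝑥𝑦s = 4230); this is only a check of the video’s numbers, not part of the lesson itself:

```python
n = 5
sum_x, sum_y = 150, 120        # totals from the bottom row of the table
sum_x2, sum_xy = 5500, 4230

s_xy = sum_xy - sum_x * sum_y / n   # 4230 - 150*120/5 = 630
s_xx = sum_x2 - sum_x ** 2 / n      # 5500 - 150^2/5 = 1000
b = s_xy / s_xx                     # 630/1000 = 0.63
a = sum_y / n - b * (sum_x / n)     # 24 - 0.63*30 = 5.1

# Predicted spring length for a 37 kg mass: 5.1 + 0.63*37 = 28.41 cm
prediction = a + b * 37
```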

That is, we’d expect them to be good estimates if we use 𝑥-values between about ten and fifty, in other words, if we use the equation to interpolate the 𝑦-values. Now we gathered 𝑥-data in that range. We don’t know if that same equation will be true outside that range. For example, if we put a mass of sixty or seventy or eighty kilograms on the spring, it might snap altogether. So our equation just simply wouldn’t work. So using the equation to make predictions of 𝑦-values based on 𝑥-values in the range that we’ve gathered data for is called interpolation, but extending beyond that range and making predictions with 𝑥-values less than ten or greater than fifty is called extrapolation. And as we said, extrapolating is generally a bad idea, ’cause we’re just not gonna be so confident that the rules still apply for those data values and our equation may not hold. Now we could use the equation to make a prediction about the length of the spring without any weight hanging from it at all. So we put in an 𝑥-value of zero, and we’d have the equation 𝑦 equals five point one plus nought point six three times zero. So the length of the spring with no weights added to it would be five point one centimetres.

Now just quickly, I’m gonna go back up here and rub this out and change it to a plus. Obviously, we had a bit of a mistake there, so apologies for that. So going back to our question here, with a mass of zero kilograms, we get a length of spring of five point one centimetres. So this is telling us the starting conditions for the problem, if you like. With no masses added, the spring will be five point one centimetres. Now I think you can spot the potential problem with this. Because we only gathered data with 𝑥-values from ten to fifty kilograms, we’re extrapolating the equation back to zero here, so that might not necessarily be true. It might be true, but we don’t have a hundred percent confidence that the equation will still hold for those 𝑥-values.

Now we can also interpret the parameters in that regression equation. That coefficient of 𝑥, the multiple of 𝑥 there, nought point six three means that every time I add one more kilogram, so I increase 𝑥 by one, then the spring stretches by nought point six three centimetres. And as we’ve just seen that number there, the five point one on its own, when I have an 𝑥-value of zero, then 𝑦 is equal to five point one. So when no mass is added to the spring, its length would be five point one centimetres.
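That interpretation of the two parameters can be seen numerically with a tiny sketch (again not from the video, just an illustration using the fitted values): increasing 𝑥 by one increases the prediction by exactly 𝑏, and plugging in 𝑥 equals zero returns the intercept 𝑎.

```python
a, b = 5.1, 0.63   # intercept and slope from the spring example

def predict(x):
    """Predicted spring length in cm for a mass of x kg."""
    return a + b * x

# One more kilogram stretches the prediction by b = 0.63 cm
# (up to floating-point rounding)
one_kg_change = predict(11) - predict(10)

# With no mass at all, the prediction is just the intercept a = 5.1 cm
no_mass_length = predict(0)
```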

Now this method of least squares regression analysis seems like magic. Simply process your data and you get an easy-to-use equation to make predictions of one value from the other, brilliant! But remember, you also need to consider the strength of the correlation before using your least squares regression line to make predictions. If there’s little or no correlation, then the equation is gonna give you very unreliable predictions or estimates. You’d also need to consider the amount of data that you used to build the model. The more data you have, generally, the more reliable and more realistic that model will be. And also remember, don’t extrapolate. If the correlation’s quite good, then interpolated values will be quite good predictions. Extrapolated values, you really don’t know how reliable they’re going to be.

Okay here’s one for you to try. Find the least squares regression equation for the following data, and use it to estimate the value of 𝑦 when 𝑥 equals nine, then comment on your result. So we’ve got some data here for 𝑥 and 𝑦: when 𝑥 is one, 𝑦 is twelve; when 𝑥 is two, 𝑦 is seven; and so on. And we’ve given you the formulae down at the bottom there for you to use. So press pause and then come back when you’ve answered the question and I’ll go through the answers. Right, first we need to add two columns, the 𝑥 squareds and the 𝑥𝑦s. Now we’re gonna fill those in; one squared is one, two squared is four, and so on. And now, the 𝑥𝑦s: one times twelve is twelve, two times seven is fourteen, and so on.

Now we’ll add a row at the bottom for all the totals. Now if I add up all the 𝑥-values, I get fifteen; adding up all the 𝑦-values gives me a total of thirty-seven; adding up all the 𝑥 squared values gives me a total of fifty-five; and adding up all the products of 𝑥 and 𝑦, I get a total of ninety-three. And because I’ve got five sets of data, 𝑛 is equal to five. So 𝑆𝑥𝑦 is the sum of the 𝑥𝑦s minus the sum of the 𝑥s times the sum of the 𝑦s all over 𝑛. So that’s ninety-three minus fifteen times thirty-seven all over five, which is negative eighteen. And the 𝑆𝑥𝑥 value is the sum of the 𝑥 squareds minus the sum of the 𝑥s all squared divided by 𝑛. Well the sum of the 𝑥 squareds is fifty-five, the sum of the 𝑥s is fifteen, and 𝑛 is five, so that becomes fifty-five minus fifteen squared over five, which is equal to ten. So working out the values of the parameters for the equation of our straight line, 𝑦 equals 𝑎 plus 𝑏𝑥. The 𝑏-value is 𝑆𝑥𝑦 divided by 𝑆𝑥𝑥. Well that was negative eighteen divided by ten, which is negative one point eight. And the mean 𝑦-value was just the sum of all the 𝑦s divided by how many there are; that’s thirty-seven divided by five, which is seven point four. And the mean 𝑥-value is the sum of all the 𝑥-values divided by how many there are. So that’s fifteen divided by five, which is three. So the 𝑎-value then is the mean 𝑦 minus 𝑏 times the mean 𝑥.

Now because the 𝑏-value is negative one point eight, we gotta be quite careful with our negative signs here. So that’s seven point four minus negative one point eight times three, which is equal to twelve point eight. So the equation of our least squares regression line, 𝑦 equals 𝑎 plus 𝑏𝑥. All we need to do then is substitute in our values for 𝑎 and 𝑏. So that’s the equation: 𝑦 equals twelve point eight minus one point eight 𝑥. And now we have to substitute in 𝑥 equals nine to make a prediction of the corresponding 𝑦-value. So 𝑦 would be equal to twelve point eight minus one point eight times nine, which would be negative three point four.
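The arithmetic for this exercise can also be checked in Python from the totals stated above (𝑛 = 5, sum of 𝑥s = 15, sum of 𝑦s = 37, sum of 𝑥 squareds = 55, sum of 𝑥𝑦s = 93); since the transcript doesn’t list every individual 𝑦-value, this sketch works from the sums rather than the raw data:

```python
n = 5
sum_x, sum_y = 15, 37         # totals from the bottom row of the table
sum_x2, sum_xy = 55, 93

s_xy = sum_xy - sum_x * sum_y / n   # 93 - 15*37/5 = -18
s_xx = sum_x2 - sum_x ** 2 / n      # 55 - 15^2/5 = 10
b = s_xy / s_xx                     # -18/10 = -1.8
a = sum_y / n - b * (sum_x / n)     # 7.4 - (-1.8)*3 = 12.8

# Estimate at x = 9 (note: this is an extrapolation, since x ran from 1 to 5)
prediction = a + b * 9              # 12.8 - 1.8*9 = -3.4
```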

Now commenting on the result, there are a couple of things I wanna say; one is we’ve extrapolated. Look, the 𝑥-values that we gathered in our data were from one to five. Well we’ve used an 𝑥-value of nine, so we’ve extrapolated. So we don’t necessarily know how reliable that answer’s gonna be. And the other thing I would say is we don’t know how good the correlation was. We don’t know the Pearson correlation coefficient or any other correlation coefficient for that matter. So even if we had interpolated our value, we still wouldn’t really know how reliable that answer would be. But the main point to make is that it was an extrapolated value, so we do need to be cautious about it.

So in summary, we can work out the equation of our least squares regression line 𝑦 equals 𝑎 plus 𝑏𝑥 by using 𝑏 is equal to 𝑆𝑥𝑦 over 𝑆𝑥𝑥 and 𝑎 is equal to the mean 𝑦-value minus 𝑏 times the mean 𝑥-value. So 𝑆𝑥𝑦, remember, is the sum of the 𝑥 times 𝑦 answers minus the sum of the 𝑥s times the sum of the 𝑦s over how many pieces of data we’ve got. The 𝑆𝑥𝑥 value is the sum of the 𝑥 squared values minus the sum of the 𝑥-values all squared divided by the number of pieces of data you’ve got. And you know how to work out the mean 𝑦-value and the mean 𝑥-value. You just add them up and divide by how many you’ve got. And finally, beware of extrapolation.