Video: Least Squares Regression Line | Nagwa

# Video: Least Squares Regression Line

In this video, we will learn how to find and use the least squares regression line equation.

17:14

### Video Transcript

In this video, we’re gonna look at the least squares regression method of finding the equation of a line of best fit through points on a scatter plot. We’ll also talk a little bit about the theory behind the method and how to use and interpret the regression equation when you’ve worked out what it is. Let’s look at an example of a situation where you might want to calculate the equation of the least squares regression line.

Some students did an experiment in which they hung objects of various masses from a spring. And they measured the length of the spring in each case. Find the equation of the least squares regression line. Well we can see, for example, when they hung a mass of ten kilograms from the spring, it had a length of twelve centimetres. When they hung twenty kilograms, it was sixteen centimetres and so on. So this is bivariate data. There are two variables: mass and length. And each pair of pieces of data relates to a specific event. So placing a mass of twenty kilograms on the spring extends it to a length of sixteen centimetres.

Now the first thing that we need to think about is which is the dependent variable. And which is the independent variable. Now the independent variable is the one that you’d normally control or change. And it causes changes in the value of the other variable, the dependent variable. So the dependent variable depends on the value of the other variable. Now we’re choosing which masses to put on the spring. And that’s causing the change in the length of the spring. So length is the dependent variable. And the mass is the independent variable. Now we tend to call the independent variable 𝑥 and the dependent variable 𝑦. So let’s add those letters to our table.

Now we can plot these points on a scatter plot. And it looks like we’ve got very strong positive correlation between the two variables. In fact, when you calculate the Pearson’s product moment correlation coefficient, it comes out to be 0.99 to two decimal places. So yep, indeed that is very strong positive correlation. So how do we go about drawing a suitable line of best fit? Well we could just try laying down our ruler in various positions until it looks like the points are generally as close as they can be to the line. Then we can read off a couple of pairs of coordinates from the line and work out its equation. However, luckily there’s a more methodical and consistent way of going about it: calculating the equation of the least squares regression line. And the way this works is to specify a line which minimizes the sum of the squares of the residuals of each of the points.

So, for example, if we call that point one, this distance here, this vertical distance between the point and the line, is called the residual. So if the equation of our line of best fit was 𝑦 equals 𝑎 plus 𝑏𝑥, we could plug 𝑥-coordinates in there and make predictions about what the corresponding 𝑦-coordinates would be. So the residual of each point is the difference between the value our equation predicts, or estimates, and the actual value that we observed when we did the experiment. So for these points here, our estimates were underestimates of the actual values. And for these points here, our estimates were overestimates of the actual values that we observed. So we could think of some of the residuals as being positive and some of the residuals as being negative.

So if we tried out lots and lots of different lines of best fit and, for each one, added up all the residuals, then you might expect that lines with a sum of the residuals close to zero would be a better line of best fit than lines with larger sums of residuals. But the thing is some spectacularly bad lines of best fit might be drawn so that all the positive residuals exactly balance all of the negative residuals and give us a sum of zero. So to get around this potential problem, the least squares regression model takes the squares of all the residuals, so that the results are all positive. And then it finds the line that minimizes the sum of the squares of all these residuals. Hence the name least squares regression.
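To see why the squaring matters, here’s a minimal Python sketch with a made-up set of three collinear points. The data and the two candidate lines are hypothetical, not from the video; they just illustrate how positive and negative residuals can cancel.

```python
# Hypothetical data: three points that lie exactly on the line y = x.
xs = [1, 2, 3]
ys = [1, 2, 3]

def residuals(a, b):
    # Residual = observed y minus the line's prediction a + b*x.
    return [y - (a + b * x) for x, y in zip(xs, ys)]

# A spectacularly bad "line of best fit": the horizontal line y = 2.
bad = residuals(2, 0)
print(sum(bad))                  # 0 -- positive and negative residuals cancel
print(sum(r * r for r in bad))  # 2 -- squaring exposes the poor fit

# The true line y = x: every residual is zero, so the sum of squares is too.
good = residuals(0, 1)
print(sum(r * r for r in good))  # 0
```

Least squares picks the line with the smallest sum of squared residuals, so here it would prefer 𝑦 equals 𝑥 over the horizontal line, even though both have residual sums of zero or near zero.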

Now we’re not gonna go into detail here about how to derive the exact formula. But we are gonna talk about how to use it. The formula then for the least squares regression line is 𝑦 is equal to 𝑎 plus 𝑏𝑥, where 𝑏 is 𝑆𝑥𝑦 over 𝑆𝑥𝑥. Here 𝑆𝑥𝑦 is 𝑛 times the covariance of 𝑥 and 𝑦, and 𝑆𝑥𝑥 is 𝑛 times the variance of 𝑥. Also the value of 𝑎 is equal to the mean 𝑦-value minus 𝑏 times the mean 𝑥-value. Now remember, to work out the value of 𝑆𝑥𝑦, we have to multiply each pair of 𝑥- and 𝑦-values together and then add up all of those products. Then we take the sum of all the 𝑥s times the sum of all the 𝑦s, divide that by the number of pieces of data that we’ve got, and subtract the result from the first sum. And to work out the value of 𝑆𝑥𝑥, we square each of the pieces of 𝑥-data and then add up all those squares. And then from that, we take away the sum of all the 𝑥-values, all squared, divided by the number of pieces of data that we’ve got.
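Those formulae can be sketched in a few lines of Python. This is a minimal illustration of the method described above, not Nagwa’s code; the function name and the sanity-check data are made up.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares regression line y = a + b*x."""
    n = len(xs)
    # S_xy: sum of x*y products, minus (sum of x)(sum of y)/n.
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    # S_xx: sum of x squared, minus (sum of x) all squared over n.
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    b = s_xy / s_xx
    # a is the mean y-value minus b times the mean x-value.
    a = sum(ys) / n - b * sum(xs) / n
    return a, b

# Sanity check on made-up, perfectly linear data (y = 2x):
a, b = least_squares_line([1, 2, 3], [2, 4, 6])
print(a, b)  # 0.0 2.0
```

Because the check data lie exactly on 𝑦 equals 2𝑥, the fitted line recovers 𝑎 equals zero and 𝑏 equals two exactly.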

Okay it all looks pretty horrible on paper at the moment. But when we go through the example, I’m sure you’ll find it much easier. Firstly, we take the table of values that we were given in the question. And then we need to extend it a bit. We need to create a couple of extra columns and an extra row at the bottom. So firstly, the columns of the 𝑥 squared values, so that’s each individual 𝑥-value squared, and then the 𝑥𝑦 values. So that’s for each data pair, we take the 𝑥-value and multiply it by the 𝑦-value. And then we create a row at the bottom for all of our totals. So first of all, let’s add up all the 𝑥s and then add up all the 𝑦s. Well ten plus twenty plus thirty plus forty plus fifty is a hundred and fifty. And if I add up all five 𝑦-values, I get an answer of 120. Now, I’ve got five pairs of data. So that’s 𝑛 equals five. And squaring each 𝑥-value, 10 squared is 100. 20 squared is 400 and so on. Then adding them all up, I get 5500.

Now I’m gonna do 𝑥 times 𝑦. So 10 times 12 is 120. 20 times 16 is 320 and so on. And then if I add all of those up, I get 4230. Now I can calculate the individual values. So 𝑆𝑥𝑦, remember, was the sum of the 𝑥 times 𝑦s minus the sum of the 𝑥s times the sum of the 𝑦s divided by the number of pieces of data. Well the sum of the 𝑥𝑦s is 4230. The sum of the 𝑥s is 150. The sum of the 𝑦s is 120. And I’ve got five pieces of data. So popping that into my calculator, I get 630. And 𝑆𝑥𝑥, I’ve gotta sum the 𝑥 squared column. And I’ve gotta sum the 𝑥 column, square that value, and divide by the number of pieces of data. Well the total of all the 𝑥-squares added together is 5500. And the sum of the 𝑥s is 150. So I’ve gotta square 150 and divide by the number of pieces of data, five. And when I pop that into my calculator, I get 1000. So just making a note of those over on the left while I make some space to do some more calculations on the right. The 𝑏-value in the equation of my straight line is equal to 𝑆𝑥𝑦 over 𝑆𝑥𝑥. So that’s 630 divided by 1000, which is 0.63.

Now to calculate the 𝑎-value is a little bit trickier. I need to work out the mean 𝑦-coordinate and the mean 𝑥-coordinate and then also take into account the answer I got for 𝑏 in that first part. So to work out the mean 𝑦-value, I just have to add up all the 𝑦s and divide by how many there are. So that’s 120 divided by five, which is 24. And same process again for the 𝑥s, just add up all the 𝑥-values and divide by how many there are. So that’s 150 divided by five, which is thirty. So the mean 𝑥-value is thirty. The mean 𝑦-value is 24. So 𝑎 then is the mean 𝑦-value minus 𝑏 times the mean 𝑥-value, which is 24 minus 0.63 times 30. And that gives us an answer of 5.1. So again just making a note of those values so I can carry on on the right hand side with more working out. The equation of our line of best fit is 𝑦 is equal to 𝑎 plus 𝑏𝑥. So our least squares regression line is 𝑦 is equal to 5.1 plus 0.63 times the 𝑥-coordinate.
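The whole calculation can be checked with a short Python sketch working straight from the column totals quoted in the transcript (the variable names are my own):

```python
# Column totals from the extended table: n, sum of x, sum of y,
# sum of x squared, and sum of x*y.
n, sum_x, sum_y, sum_x2, sum_xy = 5, 150, 120, 5500, 4230

s_xy = sum_xy - sum_x * sum_y / n   # 4230 - 150*120/5 = 630
s_xx = sum_x2 - sum_x ** 2 / n      # 5500 - 150**2/5 = 1000
b = s_xy / s_xx                     # 630/1000 = 0.63
a = sum_y / n - b * sum_x / n       # 24 - 0.63*30 = 5.1

print(round(a, 2), round(b, 2))  # 5.1 0.63
```

This agrees with the worked values above: the regression line is 𝑦 equals 5.1 plus 0.63𝑥.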

Well, that’s great. So we’ve now got an equation that enables us to make predictions about the length of the spring given different masses that were hanging from it. So, for example, if we were gonna hang a mass of 37 kilograms from the spring, we just put an 𝑥-value of 37 into that equation. And we’d make our prediction that 𝑦 is equal to 28.41 centimetres. That’s how long we’d expect the spring to be with that mass hanging from it. Now because our scatter plot showed that we had very strong positive correlation, we’d expect that equation to make pretty reasonable estimates of 𝑦-values given certain 𝑥-values. Well, that is, we’d expect them to be good estimates, if we use 𝑥-values between about 10 and 50. In other words, if we use the equation to interpolate the 𝑦-values.
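As a quick check of that prediction, here’s a two-line Python sketch using the coefficients from the worked example (37 kilograms is the hypothetical mass chosen in the transcript):

```python
# Coefficients a and b from the spring example: y = 5.1 + 0.63x.
a, b = 5.1, 0.63

# Predicted spring length for a 37 kg mass.
y = a + b * 37
print(round(y, 2))  # 28.41
```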

Now, we gathered 𝑥-data in that range. We don’t know if that same equation will be true outside that range. For example, if we put a mass of 60 or 70 or 80 kilograms on the spring, it might snap altogether. So our equation just simply wouldn’t work. So using the equation to make predictions of 𝑦-values based on 𝑥-values in the range that we’ve gathered data for is called interpolation. But extending beyond that range and making predictions with 𝑥-values less than 10 or greater than 50 is called extrapolation. And as we said, extrapolating is generally a bad idea. Cause we’re just not gonna be so confident that the rules still apply for those data values. And our equation may not hold.

Now we could use the equation to make a prediction about the length of the spring without any weight hanging from it at all. So we’d put in an 𝑥-value of zero. And we’d have the equation 𝑦 equals 5.1 plus 0.63 times zero. So the length of the spring with no weights added to it would be 5.1 centimetres. Now just quickly, I’m gonna go back up here and rub this out and change it to a plus. Obviously, I made a bit of a mistake there. So apologies for that. So going back to our question here with a mass of zero kilograms, we get a length of spring of 5.1 centimetres. So this is telling us the starting conditions for the problem, if you like. With no masses added, the spring will be 5.1 centimetres. Now I think you can spot the potential problem with this. Because we only gathered data with 𝑥-values from 10 to 50 kilograms, we’re extrapolating the equation back to zero here. So that might not necessarily be true. It might be true. But we don’t have 100 percent confidence that the equation will still hold for those 𝑥-values.

Now we can also interpret the parameters in that regression equation. That coefficient of 𝑥, the multiple of 𝑥 there, 0.63, means that every time I add one more kilogram, so I increase 𝑥 by one, the spring stretches by 0.63 centimetres. And as we’ve just seen with that number there, the 5.1 on its own, when I have an 𝑥-value of zero, then 𝑦 is equal to 5.1. So when no mass is added to the spring, its length would be 5.1 centimetres. Now this method of least squares regression analysis seems like magic. Simply process your data. And you get an easy-to-use equation to make predictions of one value from the other, brilliant!

But remember, you also need to consider the strength of the correlation before using your least squares regression line to make predictions. If there’s little or no correlation, then the equation is gonna give you very unreliable predictions or estimates. You’d also need to consider the amount of data that you used to build the model. The more data you have, generally, the more reliable and more realistic that model will be. And also remember, don’t extrapolate. Interpolation is quite good. If the correlation’s quite good, then interpolated values will be quite good predictions. Extrapolated values, you really don’t know how reliable they’re going to be.

Okay, here’s one for you to try.

Find the least squares regression equation for the following data. And use it to estimate the value of 𝑦 when 𝑥 equals nine, then comment on your result.

So we’ve got some data here for 𝑥 and 𝑦. When 𝑥 is one, 𝑦 is twelve. When 𝑥 is two, 𝑦 is seven and so on. And we’ve given you the formulae down at the bottom there for you to use. So press pause and then come back when you’ve answered the question. And I’ll go through the answers. Right, first we need to add two columns, the 𝑥 squareds and the 𝑥𝑦s. Now we’re gonna fill those in. One squared is one. Two squared is four and so on. And now, the 𝑥𝑦s, one times twelve is twelve. Two times seven is fourteen and so on. Now we’ll add a row at the bottom for all the totals. Now if I add up all the 𝑥-values, I get fifteen. Adding up all the 𝑦-values gives me a total of 37. Adding up all the 𝑥 squared values gives me a total of 55. And adding up all the products of 𝑥 and 𝑦, I get a total of 93. And because I’ve got five sets of data, 𝑛 is equal to five.

So 𝑆𝑥𝑦 is the sum of the 𝑥𝑦s minus the sum of the 𝑥s times the sum of the 𝑦s all over 𝑛. So that’s 93 minus 15 times 37 all over five, which is negative 18. And the 𝑆𝑥𝑥 value is the sum of the 𝑥 squareds minus the sum of the 𝑥s all squared divided by 𝑛. Well, the sum of the 𝑥 squareds is 55. The sum of the 𝑥s is 15. And 𝑛 is five. So that becomes 55 minus 15 squared over five, which is equal to 10. So working out the values of the parameters for our equation of our straight line, 𝑦 equals 𝑎 plus 𝑏𝑥. The 𝑏-value is 𝑆𝑥𝑦 divided by 𝑆𝑥𝑥. Well that was negative 18 divided by 10, which is negative 1.8. And the mean 𝑦-value was just the sum of all the 𝑦s divided by how many there are. That’s 37 divided by five, which is 7.4. And the mean 𝑥-value is the sum of all the 𝑥-values divided by how many there are. So that’s 15 divided by five, which is three. So the 𝑎-value then is the mean 𝑦 minus 𝑏 times the mean 𝑥. Now because the 𝑏-value is negative 1.8, we gotta be quite careful with our negative signs here. So that’s 7.4 minus negative 1.8 times three, which is equal to 12.8.

So the equation of our least squares regression line is 𝑦 equals 𝑎 plus 𝑏𝑥. All we need to do then is substitute in our values for 𝑎 and 𝑏. So that’s the equation: 𝑦 equals 12.8 minus 1.8𝑥. And now we have to substitute in 𝑥 equals nine to make a prediction of the corresponding 𝑦-value. So 𝑦 would be equal to 12.8 minus 1.8 times nine, which would be negative 3.4. Now commenting on the result, there’re a couple of things I wanna say. One is we’ve extrapolated. Look, the 𝑥-values that we gathered in terms of our data were from one to five. Well we’ve used an 𝑥-value of nine. So we’ve extrapolated. So we don’t necessarily know how reliable that answer’s gonna be. And the other thing I would say is we don’t know how good the correlation was. We don’t know the Pearson’s correlation coefficient or any other correlation coefficient for that matter. So even if we had interpolated our value, we still wouldn’t really know how reliable that answer would be. But the main point to make is that it was an extrapolated value. So we do need to be cautious about it.
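The exercise can be checked the same way, working from the column totals stated in the solution (again, the variable names are my own):

```python
# Column totals from the exercise's table: n, sum of x, sum of y,
# sum of x squared, and sum of x*y.
n, sum_x, sum_y, sum_x2, sum_xy = 5, 15, 37, 55, 93

s_xy = sum_xy - sum_x * sum_y / n   # 93 - 15*37/5 = -18
s_xx = sum_x2 - sum_x ** 2 / n      # 55 - 15**2/5 = 10
b = s_xy / s_xx                     # -18/10 = -1.8
a = sum_y / n - b * sum_x / n       # 7.4 - (-1.8)*3 = 12.8

print(round(a, 2), round(b, 2))  # 12.8 -1.8
print(round(a + b * 9, 2))       # -3.4 (extrapolated, so treat with caution)
```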

So in summary, we can work out the equation of our least squares regression line 𝑦 equals 𝑎 plus 𝑏𝑥 by using 𝑏 is equal to 𝑆𝑥𝑦 over 𝑆𝑥𝑥. And 𝑎 is equal to the mean 𝑦-value minus 𝑏 times the mean 𝑥-value. So 𝑆𝑥𝑦, remember, is the sum of the 𝑥 times 𝑦 answers minus the sum of the 𝑥s times the sum of the 𝑦s over how many pieces of data we’ve got. The 𝑆𝑥𝑥 value is the sum of the 𝑥 squared values minus the sum of the 𝑥-values all squared divided by the number of pieces of data you’ve got. And you know how to work out the mean 𝑦-value and the mean 𝑥-value. You just add them up and divide by how many you’ve got. And finally, beware of extrapolation.
