Lesson Video: Least Squares Regression Line Mathematics

In this video, we will learn how to find and use the least squares regression line equation.

Video Transcript

In this video, we’ll learn how to find and use the least squares regression line equation. The term regression was first used by an English statistician, Sir Francis Galton, in the Victorian era when looking at the relationship between the heights of parents and their children. He found that the children of taller parents grew up to be slightly shorter than their parents, whereas the children of smaller parents grew up to be taller than their parents. He called this effect regression towards mediocrity; that is, the heights tended, or regressed, towards the mean.

Now we use regression analysis to identify and analyze relationships between variables. The method of least squares regression allows us to determine the line of best fit for a set of bivariate data. And in this video, we’ll learn how to find the line of least squares regression using formulae for the coefficients in the equation of the line. We recall that bivariate data is data collected on two quantitative, that is, numerical, variables, where the observations are paired for each subject. Say, for example, 𝑥 is equal to height and 𝑦 is equal to weight. If we have 𝑛 people in our sample, then our data set will consist of 𝑛 pairs of measurements, each pair relating to one person. So, for example, 𝑥 one would be the height of person number one, and 𝑦 one would be the weight of person number one.

Now suppose both a scatter plot and a correlation coefficient indicate that the variables height and weight are linearly related. That is, as one variable increases, the other increases linearly or decreases linearly. Our next step will be to try and model this relationship with the line that fits our data best. That is, we want to find the straight line, 𝑦 is equal to 𝑎 plus 𝑏𝑥, that lies as close as possible to our data points overall. The vertical distance between the point 𝑥𝑖, 𝑦𝑖 and the line is 𝑦𝑖 minus 𝑦 hat, where 𝑦 hat is the 𝑦-value on the line at 𝑥𝑖, that is, the point on the line directly above or below the data point. This signed distance for each point is called the residual or the error. The least squares regression line, which we often see written with a hat above the 𝑦, minimizes the sum of the squared errors, hence the phrase least squares.
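To make the idea of minimizing the squared errors concrete, here is a minimal Python sketch. The four data points are made up for illustration and do not come from the lesson, and the function name is our own. The first line tested, 𝑦 hat is equal to 1.9𝑥, is the one the least squares formulae give for these points; the second, 𝑦 is equal to 2𝑥, is just another plausible-looking candidate.

```python
def sum_squared_errors(xs, ys, a, b):
    """Sum of the squared residuals (y_i - y_hat_i)^2 for the line y_hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, chosen only to illustrate the idea.
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]

# Least squares line for these points (a = 0, b = 1.9) versus another candidate.
sse_least_squares = sum_squared_errors(xs, ys, a=0.0, b=1.9)
sse_other = sum_squared_errors(xs, ys, a=0.0, b=2.0)

print(sse_least_squares, sse_other)
```

The least squares line has the smaller sum of squared errors, even though the other line passes exactly through three of the four points.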

So how do we find the least squares regression line? If 𝑦 hat is equal to 𝑎 plus 𝑏𝑥 is the least squares regression line for a set of bivariate data with variables 𝑥 and 𝑦, then the slope 𝑏 is given by 𝑆 subscript 𝑥𝑦 over 𝑆 subscript 𝑥𝑥, where 𝑆 subscript 𝑥𝑦 is the sum of the products 𝑥𝑦 minus the sum of the 𝑥’s multiplied by the sum of the 𝑦’s divided by 𝑛. And 𝑆 subscript 𝑥𝑥 is the sum of the squares of the 𝑥’s minus the sum of the 𝑥’s all squared divided by 𝑛. And 𝑛 is the number of data pairs. So given a set of bivariate data, this is how we find the slope of our line, 𝑏. The 𝑦-intercept 𝑎 of our line is given by 𝑦 bar, which is the mean of the 𝑦’s, minus 𝑏 times the mean of the 𝑥’s, where we recall that the mean of 𝑦 is the sum of the 𝑦-values divided by 𝑛 and similarly for 𝑥.

We can see from our formulae that in order to find the 𝑦-intercept 𝑎, we first need to find the slope 𝑏 and the mean of the 𝑥-values and the mean of the 𝑦-values. You may see some of these expressions written in slightly different but equivalent forms. So let’s just make a note of some of these. We can also write the slope 𝑏 as 𝑟 multiplied by 𝑆 subscript 𝑦 over 𝑆 subscript 𝑥, where 𝑟 is Pearson’s correlation coefficient, 𝑆 subscript 𝑦 is the standard deviation of 𝑦, and 𝑆 subscript 𝑥 is the standard deviation of 𝑥. If we substitute our expressions for 𝑆 subscript 𝑥𝑦 and 𝑆 subscript 𝑥𝑥 into our formula for 𝑏 and multiply the numerator and denominator by 𝑛, we find that 𝑏 is equal to 𝑛 times the sum of the products 𝑥𝑦 minus the sum of the 𝑥’s multiplied by the sum of the 𝑦’s, all over 𝑛 times the sum of the squares of the 𝑥’s minus the sum of the 𝑥’s all squared. And in fact, we’ll use this in our examples. You may also see 𝑆 subscript 𝑥𝑦 and 𝑆 subscript 𝑥𝑥 written as shown.
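Since the on-screen formulae are not part of this transcript, here they are written out from the spoken descriptions. The first three lines follow the wording of the video directly. The last line is one common equivalent form of 𝑆 subscript 𝑥𝑦 and 𝑆 subscript 𝑥𝑥; it is an assumption that this is the alternative form the video displays.

```latex
\[
\hat{y} = a + bx, \qquad b = \frac{S_{xy}}{S_{xx}}, \qquad a = \bar{y} - b\,\bar{x}
\]
\[
S_{xy} = \sum xy - \frac{\left(\sum x\right)\left(\sum y\right)}{n}, \qquad
S_{xx} = \sum x^{2} - \frac{\left(\sum x\right)^{2}}{n}
\]
\[
b = \frac{n\sum xy - \sum x \sum y}{n\sum x^{2} - \left(\sum x\right)^{2}}
  = r\,\frac{S_{y}}{S_{x}}
\]
\[
S_{xy} = \sum (x - \bar{x})(y - \bar{y}), \qquad
S_{xx} = \sum (x - \bar{x})^{2}
\]
```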

So now that we have the formulae for our coefficients 𝑎 and 𝑏, let’s look at an example where we see how to find the slope 𝑏 of the regression line from summary statistics.

For a given data set, the sum of the 𝑥-values is 47, the sum of the 𝑦-values is 45.75, the sum of the squares of the 𝑥’s is 329, the sum of the squares of the 𝑦’s is 389.3125, the sum of the products 𝑥𝑦 is 310.25, and 𝑛 is equal to eight. Calculate the value of the regression coefficient 𝑏 in the least squares regression model 𝑦 is equal to 𝑎 plus 𝑏𝑥. Give your answer correct to three decimal places.

We’re given the summary statistics for a data set. We have the sum of the 𝑥-values, the sum of the 𝑦-values, the sum of the squares of the 𝑥’s, the sum of the squares of the 𝑦’s, and the sum of the products 𝑥𝑦. And we know that our data set consists of 𝑛 is equal to eight bivariate data pairs. We’re asked to find the coefficient 𝑏, that is, the slope of the line, in the least squares regression model 𝑦 is equal to 𝑎 plus 𝑏𝑥. We use the formula for 𝑏 in terms of these sums, and we begin by writing out our summary statistics.

Since we’re given 𝑛, the sum of the products 𝑥𝑦, the sum of the 𝑥’s, the sum of the 𝑦’s, and the sum of the squares of the 𝑥’s, the only thing left to find to be able to use the formula is the sum of the 𝑥’s all squared. And since the sum of the 𝑥’s is 47, the sum of the 𝑥’s all squared is 47 squared; that is 2209. Substituting our values into the formula then, we have eight, which is 𝑛, multiplied by 310.25, that’s the sum of the products, minus the sum of the 𝑥’s, which is 47, multiplied by the sum of the 𝑦’s, which is 45.75, all over eight multiplied by 329 minus 2209. Evaluating our products, this gives us 2482 minus 2150.25 divided by 2632 minus 2209. Our numerator evaluates to 331.75 and our denominator to 423, and their quotient is approximately 0.78428. That’s to five decimal places.

Hence, to three decimal places, the regression coefficient 𝑏 is equal to 0.784.

And although we’re not actually asked to find the equation of the line and the 𝑦-intercept 𝑎, which is given by 𝑦 bar minus 𝑏 times 𝑥 bar, where 𝑦 bar and 𝑥 bar are the mean of 𝑦 and 𝑥, respectively, we can calculate 𝑎 quite quickly and therefore the equation of the line. The mean of the 𝑦-values is the sum of the 𝑦-values divided by 𝑛; that’s 45.75 divided by eight. That is 5.71875. Similarly, the mean of the 𝑥’s is the sum of the 𝑥’s over 𝑛. That’s 47 over eight, and that’s 5.875. Substituting these values into our formula for 𝑎 then, we have 𝑎 is equal to 5.71875, that’s the mean of the 𝑦’s, minus 0.78428, which is 𝑏 to five decimal places for accuracy, multiplied by 5.875, which is the mean of the 𝑥’s. And so the 𝑦-intercept 𝑎 is equal to 1.111 to three decimal places.

And so the equation of the line of least squares regression for the data set is 𝑦 is equal to 1.111 plus 0.784𝑥, where we’ve calculated our coefficients to three decimal places.
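The calculation in this example can be checked with a short Python sketch. The function name and its parameter names are our own, chosen for illustration; the numbers are the summary statistics given in the question.

```python
def regression_from_sums(n, sum_x, sum_y, sum_xy, sum_x2):
    """Return (a, b) for the line y_hat = a + b*x from summary statistics."""
    # Slope: (n*sum(xy) - sum(x)*sum(y)) / (n*sum(x^2) - (sum(x))^2)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept: mean of y minus b times mean of x
    a = sum_y / n - b * (sum_x / n)
    return a, b


a, b = regression_from_sums(n=8, sum_x=47, sum_y=45.75, sum_xy=310.25, sum_x2=329)
print(f"b = {b:.3f}, a = {a:.3f}")  # b = 0.784, a = 1.111
```

Note that the sum of the squares of the 𝑦’s is not needed for either coefficient, so it is not passed to the function.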

In this example, we were given the summary statistics for a data set. And in our next example, we’ll see how to find the line of least squares regression from the data itself.

The scatter plot shows a set of data for which the linear regression model appears appropriate. The data used to produce this scatter plot is given in the table shown. Calculate the equation of the least squares regression line of 𝑦 on 𝑥, rounding the regression coefficients to the nearest thousandth.

The equation of the line of least squares regression is 𝑦 hat is equal to 𝑎 plus 𝑏𝑥, where 𝑦 hat is the predicted value of 𝑦 for each value of 𝑥, 𝑎 is the 𝑦-intercept, and 𝑏 is the slope of the line. To find the equation of the line, we first find the slope 𝑏, which is given in the formula shown. We then use this value for 𝑏 to find the 𝑦-intercept 𝑎, which is given by the mean of 𝑦 minus 𝑏 multiplied by the mean of 𝑥, where we recall that the mean of the 𝑦-values is given by the sum of the 𝑦-values divided by the number of data pairs 𝑛 and similarly for the mean of 𝑥. In fact, in our case, we have eight data pairs so that 𝑛 is equal to eight. So let’s make a note of this.

Now, to find the coefficients 𝑎 and 𝑏, we’re going to need the various sums shown in the formulae. And in order to calculate these sums, we begin by expanding our table to include a row for the product 𝑥𝑦 and another for the squares of the 𝑥-values. In the first cell of our new row for the product 𝑥𝑦, we have the product of the first 𝑥-value 0.5 with the first 𝑦-value 9.25, and that’s 4.625. So we put this in the first cell of our new row. Our second new entry will be the second 𝑥-value, that’s one, multiplied by the second 𝑦-value 7.6. That is 7.6. And this goes into the second cell for the new row of products. And we can fill in the remaining products 𝑥𝑦, as shown.

The first entry in our second new row is the first 𝑥-value squared. That is 0.5 squared, which is 0.25. And so this goes in the first cell of our second new row. Our second 𝑥-value squared is one squared, which is one. And we can fill in the remainder of the 𝑥 squared values in our second new row as shown. Now remember, we’re trying to find this sum, so our next step is to sum each of the rows. And if we introduce a new column for our sums, then, for example, the sum of our 𝑥-values is 18. And this is the first entry in our new column. Summing our 𝑦-values gives us 45.1. The sum of the products is 78.05. And the sum of the squares of the 𝑥’s is 51.

So now we can use these values to calculate the slope 𝑏 of our line. Substituting into our formula then, we have eight, which is 𝑛, multiplied by 78.05, the sum of our products, minus 18, which is the sum of the 𝑥’s, multiplied by 45.1, the sum of the 𝑦’s, all over eight, which is 𝑛, times 51, which is the sum of the squares of the 𝑥’s, minus 18 squared, which is the sum of the 𝑥’s all squared. Evaluating our products gives us 624.4 minus 811.8 all divided by 408 minus 324. And typing this carefully into our calculator, we find 𝑏 is approximately equal to negative 2.23095. To three decimal places, that is, to the nearest thousandth, that’s negative 2.231.

We can see from the data points on our scatter plot that as the 𝑥-values of the data points increase, the 𝑦-values of the data points decrease. And this is confirmed by the fact that our coefficient 𝑏 is negative, that is, negative 2.231. And now making some space so that we can calculate our 𝑦-intercept 𝑎, we see from our formula that we’re first going to need to calculate the mean of the 𝑥-values and the mean of the 𝑦-values. The mean of the 𝑦-values is 45.1 divided by eight. That is 5.6375. The mean of the 𝑥-values is 18 divided by eight, which is 2.25.

And so, making some space again, we can use these to calculate our coefficient 𝑎. And we have 𝑎 is equal to 5.6375 minus negative 2.23095, which is 𝑏 to five decimal places for accuracy, multiplied by 2.25. This evaluates to approximately 10.65714, which is 10.657 to three decimal places, that is, to the nearest thousandth. The equation of the least squares regression line of 𝑦 on 𝑥 for this data then is 𝑦 hat is equal to 10.657 minus 2.231𝑥 all to the nearest thousandth. Note that we write 𝑦 with a hat on to indicate that this is a predicted value for 𝑦 from the line calculated with the given data. You’ll often see this written simply as 𝑦 is equal to 𝑎 plus 𝑏𝑥.
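The table-based procedure used in this example, adding a row of products and a row of squares and then summing each row, can be sketched in Python as follows. The full data table is not reproduced in this transcript, so the sketch runs on four made-up points that lie exactly on the line 𝑦 is equal to one plus two 𝑥, which makes the result easy to check by eye. The function name is our own.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for y_hat = a + b*x, mirroring the table method."""
    n = len(xs)
    xy_row = [x * y for x, y in zip(xs, ys)]  # new row: products xy
    x2_row = [x * x for x in xs]              # new row: squares of x
    sum_x, sum_y = sum(xs), sum(ys)           # the "sums" column
    sum_xy, sum_x2 = sum(xy_row), sum(x2_row)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b


# Hypothetical data lying exactly on y = 1 + 2x (not the data from the table).
a, b = least_squares_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)  # 1.0 2.0
```

Because the made-up points are perfectly linear, the function recovers the intercept one and the slope two exactly. With the eight data pairs from the table, the same steps would reproduce the sums 18, 45.1, 78.05, and 51 used above.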

Now, so far, we’ve not been given any definition of what the variables 𝑥 and 𝑦 refer to. But when considering real-life variables in the context of regression, if possible, we first establish which of our variables is the dependent and which is the independent variable. Recall that independent variables are variables we may control or change. We believe they have a direct effect on a dependent variable. Another name for independent variables is explanatory variables, and they’re often labeled 𝑥. Dependent variables, on the other hand, are variables that are being tested and are dependent on one or more independent variables. Since they respond to changes in the independent variable or variables, they’re often called response variables, and they’re often labeled 𝑦.

In our next example, we’ll calculate the coefficients for the least squares regression line for real-life data. And so we’ll need to begin by determining which is the dependent and which is the independent variable.

Using the information in the table, find the regression line 𝑦 hat is equal to 𝑎 plus 𝑏𝑥. Round 𝑎 and 𝑏 to three decimal places.

Since we want to find the regression line, we begin by determining which of our variables is the dependent and which is the independent variable. We might expect that the amount of summer crop produced in kilograms is dependent on the amount of land it’s produced on. And so we specify the production in kilograms is the dependent variable 𝑦, whereas cultivated land measured in feddan is the independent variable 𝑥. And note that a feddan is a unit of area measuring just over one acre.

To find the regression line, we must find the slope 𝑏 and the 𝑦-intercept 𝑎. And to find these values, we use the two formulae shown. We first calculate the slope 𝑏 since we’ll need this to calculate the 𝑦-intercept 𝑎. And we see from our formula for 𝑏 that we’re going to need to find various sums, that is, the sum of the products 𝑥𝑦, the sum of the 𝑥-values, the sum of the 𝑦-values, the sum of the squared 𝑥-values, and we’ll also need the sum of the 𝑥’s all squared. And to find the value for 𝑎, we’re going to need the mean of the 𝑦-values, that is, the sum of the 𝑦-values divided by 𝑛, which is the number of data pairs, and similarly for the mean of the 𝑥-values.

In our data set, we have 10 pairs of data so that 𝑛 is equal to 10. And we make a note of this before we start making our calculations. Our next step is to find the sums. And to find the sum of our products 𝑥𝑦 and our 𝑥 squared values, we introduce two new rows to our table. To calculate the products 𝑥𝑦, taking our first 𝑥 and our first 𝑦, we have 126 multiplied by 160. That is 20160. And this goes into the first cell of our first new row. Our second product is our second 𝑥-value multiplied by our second 𝑦-value. That is 13 multiplied by 40, which is 520. And this goes into our second cell in the first new row. We can then complete this row with the products as shown.

The first element in our second new row is the first 𝑥-value squared, that is, 126 squared, which is 15876. And this goes into our second new row. Our second 𝑥-value squared is 13 squared, which is 169. And this goes into the second cell of our second new row. And we continue in this way to complete the row. Our next step is to find the sum for each of the rows. So we introduce a new column. The sum of the 𝑥-values is 967. The sum of the 𝑦-values is 1880. The sum of the products 𝑥𝑦 is 189320. And the sum of the squares of the 𝑥’s is 130977. So now with all our sums, we’re in a position to calculate 𝑏.

Substituting our sums into the formula for 𝑏 with 𝑛 is equal to 10, we have 10 times 189320, that’s the sum of the products 𝑥𝑦, minus 967, which is the sum of the 𝑥’s, multiplied by 1880, which is the sum of the 𝑦’s, all divided by 10, which is 𝑛, multiplied by the sum of the squared 𝑥-values, which is 130977, minus 967 squared. That’s the sum of the 𝑥’s all squared. And evaluating our numerator and denominator, we have 75240 divided by 374681. And this evaluates to approximately 0.20081. To three decimal places then, we have 𝑏 is equal to 0.201.

Now to find the 𝑦-intercept 𝑎, we need to find the means of the 𝑦’s and the 𝑥-values. The mean of the 𝑦’s is the sum of all the 𝑦-values divided by 𝑛. That’s 1880 divided by 10, and that’s 188. Similarly, the mean of the 𝑥-values is the sum of the 𝑥’s divided by 𝑛. And that’s 967 divided by 10, which is 96.7. So now we can use these values together with our slope 𝑏, where we’ll use the value of 𝑏 to five decimal places for accuracy, to calculate the 𝑦-intercept 𝑎. Evaluating this gives us 𝑎 is equal to 168.58167 and so on. That is 168.582 to three decimal places. The line of least squares regression then for this data to three decimal places is 𝑦 hat is equal to 168.582 plus 0.201𝑥.
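Written out as a worked equation, using the sums stated above, the two substitutions in this example are:

```latex
\[
b = \frac{10(189320) - (967)(1880)}{10(130977) - 967^{2}}
  = \frac{1893200 - 1817960}{1309770 - 935089}
  = \frac{75240}{374681} \approx 0.20081
\]
\[
a = \bar{y} - b\,\bar{x} \approx 188 - 0.20081 \times 96.7 \approx 168.58167
\]
```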

We can interpret the slope as follows: for every additional feddan of cultivated land, we expect the production of the summer crop to increase by approximately 0.2 kilograms.

Once we have our line of regression, we can use this to estimate values for the dependent variable for particular values of the independent variable 𝑥. However, if we do this, we must be very careful to restrict ourselves to 𝑥-values within the range of the given data. Let’s consider how this might work using the variables given in this example. Our dependent variable 𝑦 is crop production in kilograms, and our independent variable 𝑥 is cultivated land measured in feddan. Our line of least squares regression, which we’ve just calculated to three decimal places from the given data, is 𝑦 is equal to 168.582 plus 0.201𝑥.

Now suppose we want to know how many kilograms of summer crop we would expect from 100 feddan of cultivated land. Substituting 𝑥 is equal to 100 into our equation gives 𝑦 hat is equal to 188.682 kilograms. That’s to three decimal places. Now it’s fine to use this value of 𝑥 since it lies within the range of 𝑥 for the data, that is, between 13 and 180. So we can use 𝑥 is 100 in the equation of the line to estimate the value for 𝑦, crop production.

Now let’s look at an example of what might happen if we try and predict using an 𝑥-value outside the range of the data. Suppose we let 𝑥 equal zero. This means we’re going to interpret the 𝑦-intercept. If we let 𝑥 equal zero in our equation, we find 𝑦 hat is equal to 168.582. But this tells us that with zero units of cultivated land, crop production is estimated at approximately 169 kilograms, which is absurd since if we have no land, we can’t produce any crops. This is an example of extrapolation where we try and predict outside the range of the given data. Interpolation, on the other hand, is when we try and predict or estimate within the range of the data. This example illustrates that extrapolation should only be used with the utmost caution.
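The difference between interpolation and extrapolation can be illustrated with a small Python sketch. It uses the rounded coefficients from this example and the range of 𝑥-values stated above, 13 to 180. The function name and the idea of returning a flag alongside the estimate are our own choices for illustration.

```python
A, B = 168.582, 0.201    # rounded intercept and slope from the example
X_MIN, X_MAX = 13, 180   # range of cultivated land (feddan) in the data


def predict(x):
    """Return (y_hat, is_interpolation) for the line y_hat = A + B*x."""
    y_hat = A + B * x
    return y_hat, X_MIN <= x <= X_MAX


for x in (100, 0):
    y_hat, inside = predict(x)
    label = "interpolation" if inside else "extrapolation, treat with caution"
    print(f"x = {x}: y_hat = {y_hat:.3f} kg ({label})")
```

For 𝑥 equal to 100, the sketch reports an estimate of 188.682 kilograms and marks it as interpolation. For 𝑥 equal to zero, it still returns 168.582 but flags the value as extrapolation, which matches the warning in the video.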

Let’s complete this video by recalling some of the key points we’ve covered. The least squares regression line 𝑦 hat is equal to 𝑎 plus 𝑏𝑥 is a linear model for bivariate data. The coefficient 𝑏, which is the slope of the line, and 𝑎, which is the 𝑦-intercept, can be calculated using the formulae shown, where 𝑦 bar is the mean of the 𝑦-values and 𝑥 bar is the mean of the 𝑥-values and 𝑛 is the number of data pairs. We may use the regression model to estimate using 𝑥-values within the range of the given data. And that’s called interpolation. However, using 𝑥-values outside the range of the known data to estimate or predict, that’s extrapolation, is not advisable.
