In this explainer, we will learn how to find and use the least squares regression line equation.
The term “regression” was first used by Sir Francis Galton, an English Victorian era statistician, in reference to the heights of children and their parents. Tall parents tended to have children shorter than themselves and vice versa for short parents. He called this effect “regression towards mediocrity”; heights regressed to the mean. Since his findings, regression analysis has been used to identify and analyze relationships between variables. In particular, the method of least squares allows us to determine the line of best fit for a set of bivariate data.
Suppose we have collected measurements for two quantitative variables, $x$ and $y$, to form a set of bivariate data. That is, we have $n$ pairs of data, $(x_i, y_i)$, $i = 1, \dots, n$. Suppose also that both a scatter plot and the correlation coefficient of our data indicate that the variables $x$ and $y$ are linearly related. That is, as one increases, the other increases linearly, or decreases linearly, with the first.
Our next step in analyzing such data is to model this relationship with a line of best fit. This means that we seek the equation of the line that follows the trend of the data, passing as closely as possible to each of the data points. We might try to construct this line by eye; however, there is a technique that allows us to find its exact equation.
Recall that, in general, the equation of a straight line is $y = a + bx$, where $a$ is the $y$-intercept and $b$ is the slope of the line. It is unlikely that a set of bivariate data will lie exactly on a straight line, so to find the equation of the line that fits our data most closely, we find the line whose overall average distance from all of our data points is minimized. This distance, for each point $(x_i, y_i)$, is called the error or residual. It is the difference, $y_i - \hat{y}_i$, between the true value of $y$ for a data point and the predicted value $\hat{y}$, on the line, for the same $x$-value.
The least squares regression line, $\hat{y} = a + bx$, minimizes the sum of the squared differences of the points from the line; hence the phrase “least squares.” We will not cover the derivation of the formulae for the line of best fit here. However, we will demonstrate how to use the formulae to find the coefficients $a$ and $b$ of the line.
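To make the idea of "least squares" concrete, the following minimal sketch computes the sum of squared residuals for two candidate lines. The four data points, the two candidate lines, and the helper name `sum_squared_residuals` are hypothetical and chosen only for illustration.

```python
def sum_squared_residuals(xs, ys, a, b):
    """Sum of (y - y_hat)^2 over all points, where y_hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4]
ys = [2, 3, 5, 4]

# Two candidate lines of the form y_hat = a + b*x.
sse_line_1 = sum_squared_residuals(xs, ys, a=1.5, b=0.8)
sse_line_2 = sum_squared_residuals(xs, ys, a=1.0, b=1.0)

print(f"Candidate 1: {sse_line_1:.2f}")  # Candidate 1: 1.80
print(f"Candidate 2: {sse_line_2:.2f}")  # Candidate 2: 2.00
```

The first candidate has the smaller sum of squared residuals, so of these two lines it is the better fit in the least squares sense.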
Definition: The Least Squares Regression Line
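If $\hat{y} = a + bx$ is the line of least squares regression for a set of bivariate data with variables $x$ and $y$, then
$$b = \frac{S_{xy}}{S_{xx}} \quad \text{and} \quad a = \bar{y} - b\bar{x},$$
where $\bar{x}$ and $\bar{y}$ are the means of $x$ and $y$, and
$$S_{xy} = \sum xy - \frac{\sum x \sum y}{n}, \qquad S_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n}.$$
We may also write the slope as $b = r\dfrac{s_y}{s_x}$, where $r$ is the correlation coefficient and $s_x$ and $s_y$ are the standard deviations of $x$ and $y$, respectively, or, alternatively, by substituting the expressions for $S_{xy}$ and $S_{xx}$ into the formula for the slope, as
$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}.$$
We note further that $S_{xy}$ and $S_{xx}$ are often written as
$$S_{xy} = \sum (x - \bar{x})(y - \bar{y}), \qquad S_{xx} = \sum (x - \bar{x})^2,$$
which are equivalent to the expressions given above. In this form, we can see that $S_{xy}$ is the sum of the products of the differences between each $x$ and the mean of $x$ and each $y$ and the mean of $y$, and $S_{xx}$ is the sum of the squares of the differences between each $x$ and the mean of $x$.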
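The equivalence of the two ways of writing $S_{xy}$ can be checked by expanding the product and using $\bar{x} = \frac{1}{n}\sum x$ and $\bar{y} = \frac{1}{n}\sum y$. The following is a short sketch of that expansion, not a full derivation of the regression formulae.

```latex
\begin{align*}
\sum (x - \bar{x})(y - \bar{y})
  &= \sum xy - \bar{y}\sum x - \bar{x}\sum y + n\bar{x}\bar{y} \\
  &= \sum xy - \frac{\sum x \sum y}{n} - \frac{\sum x \sum y}{n} + \frac{\sum x \sum y}{n} \\
  &= \sum xy - \frac{\sum x \sum y}{n}.
\end{align*}
```

Replacing $y$ with $x$ throughout gives the corresponding result for $S_{xx}$, namely $\sum (x - \bar{x})^2 = \sum x^2 - \frac{\left(\sum x\right)^2}{n}$.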
In practice, we calculate the slope, $b$, first, since $b$ is needed to calculate $a$. Let’s look at an example of how to calculate the least squares regression line from a table of bivariate data.
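The order of calculation described above (slope first, then intercept) can be sketched in Python as follows. The function name `least_squares_line` and the small data set are hypothetical, used only to illustrate the formulae for the slope and the intercept.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y_hat = a + b*x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")

    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)

    s_xy = sum_xy - sum_x * sum_y / n   # S_xy
    s_xx = sum_xx - sum_x ** 2 / n      # S_xx
    if s_xx == 0:
        raise ValueError("all x-values are equal, so the slope is undefined")

    b = s_xy / s_xx                     # slope first ...
    a = sum_y / n - b * sum_x / n       # ... then intercept: a = y_bar - b * x_bar
    return a, b


# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4]
ys = [2, 3, 5, 4]
a, b = least_squares_line(xs, ys)
print(f"y_hat = {a:.3f} + {b:.3f}x")  # y_hat = 1.500 + 0.800x
```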
Example 1: Calculating the Least Squares Regression Line given a Table of Data Summarizing the Sums of the Observed Values
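Use the information in the table to find the equation of the least squares regression line of $y$ on $x$. Write the equation in the form $\hat{y} = a + bx$, where $a$ and $b$ are accurate to three decimal places.
| | $x$ | $y$ | $xy$ | $x^2$ | $y^2$ |
|---|---|---|---|---|---|
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 9 | 41 | 29 | 1 189 | 1 681 | 841 |
| 10 | 42 | 27 | 1 134 | 1 764 | 729 |
| Sum | 310 | 225 | 7 219 | 10 128 | 5 193 |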
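To find the least squares regression line $\hat{y} = a + bx$, we must find the slope, $b$, and the $y$-intercept, $a$. To do this, we use the formulae
$$b = \frac{S_{xy}}{S_{xx}} = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{\left(\sum x\right)^2}{n}}, \qquad a = \bar{y} - b\bar{x},$$
where $\bar{x}$ is the mean of $x$ and $\bar{y}$ is the mean of $y$.
The number of data pairs in our data set is $n = 10$ and in the final row of the table we are given the sums that we need. These are $\sum x = 310$, $\sum y = 225$, $\sum xy = 7\,219$, and $\sum x^2 = 10\,128$. Since we will need the slope, $b$, to calculate $a$, let’s first use the given values to find $b$:
$$b = \frac{7\,219 - \frac{310 \times 225}{10}}{10\,128 - \frac{310^2}{10}} = \frac{7\,219 - 6\,975}{10\,128 - 9\,610} = \frac{244}{518} = \frac{122}{259} \approx 0.471.$$
To calculate the value of the $y$-intercept, $a$, we need the means of the $x$- and the $y$-values. These are
$$\bar{x} = \frac{310}{10} = 31, \qquad \bar{y} = \frac{225}{10} = 22.5.$$
We can now use these values, in fraction form for accuracy, together with our slope $b$, to find $a$:
$$a = \bar{y} - b\bar{x} = \frac{45}{2} - \frac{122}{259} \times 31 = \frac{4\,091}{518} \approx 7.898.$$
Hence, to three decimal places, the least squares regression line is $\hat{y} = 7.898 + 0.471x$.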
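The arithmetic in this example can be checked with exact fractions. The sketch below assumes $n = 10$ and reads $\sum x$, $\sum y$, $\sum xy$, and $\sum x^2$ from the final row of the table; the variable names are illustrative only.

```python
from fractions import Fraction

# Column sums read from the final row of the table in Example 1
# (assumes the table has n = 10 rows of data).
n = 10
sum_x, sum_y = 310, 225
sum_xy, sum_xx = 7219, 10128

s_xy = sum_xy - Fraction(sum_x * sum_y, n)   # S_xy = 244
s_xx = sum_xx - Fraction(sum_x ** 2, n)      # S_xx = 518

b = s_xy / s_xx                               # slope
a = Fraction(sum_y, n) - b * Fraction(sum_x, n)  # intercept

print(b, a)  # 122/259 4091/518
print(f"y_hat = {float(a):.3f} + {float(b):.3f}x")  # y_hat = 7.898 + 0.471x
```

Keeping the slope as an exact fraction until the final step avoids rounding the intercept from an already rounded slope.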
Within our calculations, we have used expressions such as $\sum x$, $\sum xy$, and $\sum x^2$, which are known as summary statistics.
Definition: Summary Statistics
Summary statistics are statistics that we calculate from the observations in a sample data set, which summarize the data in a way that allows us to communicate, and hence interpret, as much information as possible.
In our next example, we will find the least squares regression line directly from the summary statistics.
Example 2: Calculating a Regression Coefficient for a Least Squares Regression Model from Summary Statistics
For a given data set, , , , , , and . Calculate the value of the regression coefficient $b$ in the least squares regression model $\hat{y} = a + bx$. Give your answer correct to three decimal places.
To calculate the regression coefficient $b$ from the summary statistics given, we can use the formula
$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}.$$
Substituting in our values for , , , and , we have
Hence, to three decimal places, the regression coefficient .
Our next example demonstrates how to find the equation of the least squares regression line for a given set of bivariate data.
Example 3: Calculating the Equation of the Least Squares Regression Line
The scatterplot shows a set of data for which a linear regression model appears appropriate.
The data used to produce this scatterplot is given in the table shown.
Calculate the equation of the least squares regression line of $y$ on $x$, rounding the regression coefficients to the nearest thousandth.
The equation of the line of least squares regression is $\hat{y} = a + bx$, where the slope or regression coefficient is
$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}.$$
The $y$-intercept is given by $a = \bar{y} - b\bar{x}$, with $\bar{x}$, the mean of $x$, and $\bar{y}$, the mean of $y$. We have eight paired data points $(x, y)$; hence, $n = 8$. To find the two coefficients, $a$ and $b$, we begin by putting our data into a table with columns for the product $xy$ and $x^2$, since we will need their sums for our calculation. So, for example, in the third column, the first entry is the product $x_1 y_1$ of the first pair, and so on for each pair $(x_i, y_i)$.
Our next step is to sum each of the columns so that we have the sums in the final row.
We can now use these sums in the formula for $b$ to calculate the slope of the regression line:
Note that, from the scatter diagram, we see that as $x$ increases, the $y$-values, in general, decrease and this is confirmed by the fact that the slope, $b$, is negative. To find the value of the constant $a$, we must first calculate the means $\bar{x}$ and $\bar{y}$. Using the sums from our table, these are
Hence, keeping these values in exact fractional form for accuracy, together with our value for , which in exact form is equal to , our -intercept is
The equation of the least squares regression line of on for this data, to the nearest thousandth, is therefore .
In our next example, we will apply our knowledge of the calculation of the least squares regression line to a real-life situation. However, when considering real-life variables in the context of regression, if possible, we first establish which of our variables is the dependent variable and which is the independent variable. These are defined as follows.
Definition: Dependent and Independent Variables
Independent variables are variables that we may control or change and that we believe have a direct effect on a dependent variable. Independent variables are also sometimes called explanatory variables and are often labeled , or , for explanatory variables.
Dependent variables are variables that are being tested and are dependent on independent variables. Dependent variables are often called response variables, as they respond to changes in explanatory variables, and are often labeled $y$.
Example 4: Finding the Equation of a Regression Line in a Regression Model
Using the information in the table, find the regression line $\hat{y} = a + bx$. Round $a$ and $b$ to 3 decimal places.
| Cultivated Land in Feddan | 126 | 13 | 104 | 180 | 38 | 161 | 14 | 99 | 55 | 177 |
| Production of a Summer Crop in Kilograms | 160 | 40 | 80 | 340 | 260 | 200 | 280 | 280 | 140 | 100 |
We begin by determining which of our variables is the independent variable and which is the dependent variable. Since we would expect the amount of a summer crop produced to depend on the amount of land on which it is cultivated, it makes sense that the “production” variable is the dependent variable $y$ and the “land” variable is the independent variable $x$.
To find the equation of the line of least squares regression, $\hat{y} = a + bx$, we must find the slope or regression coefficient $b$ and the $y$-intercept $a$. We have ten pairs of data, that is, ten measurements of the independent variable “cultivated land in feddan” that are paired with ten measurements of the dependent variable “production of a summer crop in kilograms”, so $n = 10$. We can use the following formula to calculate $b$:
$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}.$$
We will therefore need to find the sums $\sum x$, $\sum y$, $\sum xy$, and $\sum x^2$. Let us put our data into a table with columns for the product $xy$ and for $x^2$ so that we may more easily calculate the required sums.
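| Cultivated Land (Feddan), $x$ | Summer Crop (kg), $y$ | $xy$ | $x^2$ |
|---|---|---|---|
| 126 | 160 | 20 160 | 15 876 |
| 13 | 40 | 520 | 169 |
| 104 | 80 | 8 320 | 10 816 |
| 180 | 340 | 61 200 | 32 400 |
| 38 | 260 | 9 880 | 1 444 |
| 161 | 200 | 32 200 | 25 921 |
| 14 | 280 | 3 920 | 196 |
| 99 | 280 | 27 720 | 9 801 |
| 55 | 140 | 7 700 | 3 025 |
| 177 | 100 | 17 700 | 31 329 |
| $\sum x = 967$ | $\sum y = 1\,880$ | $\sum xy = 189\,320$ | $\sum x^2 = 130\,977$ |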
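The sums, which are in the final row, have been calculated for each column, and we may now use these in the formula to find $b$:
$$b = \frac{10(189\,320) - (967)(1\,880)}{10(130\,977) - 967^2} = \frac{1\,893\,200 - 1\,817\,960}{1\,309\,770 - 935\,089} = \frac{75\,240}{374\,681} \approx 0.20081.$$
The $y$-intercept is given by $a = \bar{y} - b\bar{x}$, where $\bar{x}$ is the mean of the $x$-values and $\bar{y}$ is the mean of the $y$-values. These are
$$\bar{x} = \frac{967}{10} = 96.7, \qquad \bar{y} = \frac{1\,880}{10} = 188.$$
To calculate $a$ accurately to three decimal places, we need to substitute a suitably accurate value for $b$. Here, we can substitute the exact fraction, or a decimal accurate to at least five decimal places. Therefore, calculating $a$, we have
$$a = 188 - \frac{75\,240}{374\,681} \times 96.7 \approx 168.582.$$
With the values of our regression coefficient and $y$-intercept to three decimal places, the line of least squares regression is $\hat{y} = 168.582 + 0.201x$.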
We may interpret this as follows: for every additional feddan of cultivated land, we expect the production of the summer crop to increase by approximately 0.2 kg. We might also interpret the value of $a$, since this is the $y$-intercept. However, we need to be careful that our interpretation makes sense within the context of the data. In our case, with $a = 168.582$, we might conclude that, with no cultivated land, that is, $x = 0$, we could expect to produce 168.582 kg of the summer crop, which does not make physical sense. We might perhaps infer that we begin with 168.582 kg of the summer crop from other sources, but we do not know this from the data. This illustrates how care must be taken when considering how variables behave outside of the range of the given data.
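As a check on the working in this example, the following sketch recomputes the column sums and the two coefficients directly from the land and production data in the question. The variable names are illustrative only.

```python
# Data from the table in Example 4.
land = [126, 13, 104, 180, 38, 161, 14, 99, 55, 177]         # x, feddan
crop = [160, 40, 80, 340, 260, 200, 280, 280, 140, 100]      # y, kilograms

n = len(land)
sum_x, sum_y = sum(land), sum(crop)
sum_xy = sum(x * y for x, y in zip(land, crop))
sum_xx = sum(x * x for x in land)

# Slope first, then intercept a = y_bar - b * x_bar.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
a = sum_y / n - b * sum_x / n

print(sum_x, sum_y, sum_xy, sum_xx)           # 967 1880 189320 130977
print(f"y_hat = {a:.3f} + {b:.3f}x")          # y_hat = 168.582 + 0.201x
```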
Once we have a regression model, which in the case of linear data is the least squares line of regression, we may, with care, use our model to estimate values for the dependent variable. We see how this works in our next example.
Example 5: Calculating an Estimated Value for a Variable at a Given Point in a Regression Model
Using the information in the table, estimate the value of when . Give your answer to the nearest integer.
We are given a set of bivariate data where we have six pairs of values for each of the two variables $x$ and $y$. To estimate a $y$-value for a given $x$-value, assuming the data is approximately linear, we must first find the equation of the regression line, $\hat{y} = a + bx$. To do this, we first calculate the slope $b$, using the formula below:
$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}.$$
This requires the sums $\sum x$, $\sum y$, $\sum xy$, and $\sum x^2$, and by putting our data into a table with columns for the product $xy$ and for $x^2$, we can easily calculate these as shown below.
Substituting the necessary sums into our formula for gives
The $y$-intercept satisfies the equation $a = \bar{y} - b\bar{x}$ and we will use our value for $b$ to calculate this. However, first we must find the means of the $x$- and the $y$-values. These are
With these values, we then have
The regression line is, therefore, . Now, if we substitute , we find
Hence, to the nearest integer, when , we estimate .
In this example, we estimated a value of the dependent variable $y$ for a value of $x$ that was within our range of known values. This is called interpolation and the following definition clarifies this.
Definition: Interpolation and Extrapolation
Interpolation: Estimating or predicting a value of the dependent variable from within the range of known values of the independent variable.
Extrapolation: Estimating or predicting a value of the dependent variable from outside the range of known values of the independent variable.
Extrapolation should be used with the utmost caution, if at all. The behavior of the variables may change outside the known range of data leading to errors. Therefore, extrapolation should be avoided where possible.
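One way to keep this distinction visible when using a regression model is to label each estimate according to whether the $x$-value lies inside the observed range. The sketch below is an illustration only: the helper `predict` is hypothetical, and the coefficients (rounded to three decimal places) and the range of $x$-values are taken from the cultivated land data in Example 4.

```python
def predict(x, a, b, x_min, x_max):
    """Return (estimate, kind) for the line y_hat = a + b*x.

    kind is "interpolation" if x lies within [x_min, x_max],
    otherwise "extrapolation".
    """
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return a + b * x, kind


# Rounded coefficients and observed x-range from Example 4.
a, b = 168.582, 0.201
x_min, x_max = 13, 180

for x in (100, 250):
    y_hat, kind = predict(x, a, b, x_min, x_max)
    print(f"x = {x}: y_hat = {y_hat:.1f} kg ({kind})")
```

Here the estimate at $x = 100$ is an interpolation, while the estimate at $x = 250$ is an extrapolation and should be treated with caution.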
We complete this explainer by recalling some of the key points covered.
- The least squares regression line is a linear model for bivariate data sets consisting of data points $(x, y)$, where $x$ is the independent, or explanatory, variable and $y$ is the dependent, or response, variable.
- The least squares regression line is the line for which the sum of the squares of the distances of the data points from the line is a minimum. The equation of the line is $\hat{y} = a + bx$, where $a = \bar{y} - b\bar{x}$, with $\bar{x}$, the mean of $x$, and $\bar{y}$, the mean of $y$, and where $b = \dfrac{S_{xy}}{S_{xx}} = \dfrac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$.
- The slope, $b$, may also be written as $b = r\dfrac{s_y}{s_x}$, where $r$ is the correlation coefficient and $s_x$ and $s_y$ are the standard deviations of $x$ and $y$.
- We may use the least squares regression line to estimate or predict values of the dependent variable using interpolation, that is, using $x$-values within the known range. Extrapolation, that is, using values outside this range to estimate or predict, is not advisable as results may be erroneous.
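The alternative form of the slope, the correlation coefficient multiplied by the ratio of the standard deviations of $y$ and $x$, can be checked numerically. The sketch below uses the cultivated land data from Example 4 and assumes sample standard deviations (divisor $n - 1$); the ratio of the two standard deviations is the same if the divisor $n$ is used for both. Variable names are illustrative only.

```python
import math

# Data from Example 4.
xs = [126, 13, 104, 180, 38, 161, 14, 99, 55, 177]
ys = [160, 40, 80, 340, 260, 200, 280, 280, 140, 100]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n

s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
s_xx = sum((x - mean_x) ** 2 for x in xs)
s_yy = sum((y - mean_y) ** 2 for y in ys)

r = s_xy / math.sqrt(s_xx * s_yy)      # correlation coefficient
sd_x = math.sqrt(s_xx / (n - 1))       # sample standard deviation of x
sd_y = math.sqrt(s_yy / (n - 1))       # sample standard deviation of y

b_direct = s_xy / s_xx                 # slope from S_xy / S_xx
b_from_r = r * sd_y / sd_x             # slope from r * s_y / s_x

print(f"{b_direct:.3f} {b_from_r:.3f}")  # 0.201 0.201
```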