Lesson Explainer: Least Squares Regression Line Mathematics

In this explainer, we will learn how to find and use the least squares regression line equation.

The term β€œregression” was first used by Sir Francis Galton, an English Victorian era statistician, in reference to the heights of children and their parents. Tall parents tended to have children shorter than themselves and vice versa for short parents. He called this effect β€œregression towards mediocrity”; heights regressed to the mean. Since his findings, regression analysis has been used to identify and analyze relationships between variables. In particular, the method of least squares allows us to determine the line of best fit for a set of bivariate data.

Suppose we have collected 𝑛 measurements for two quantitative variables, 𝑋 and π‘Œ, to form a set of bivariate data. That is, we have 𝑛 pairs of data (x_i, y_i), i = 1, …, n. Suppose also that both a scatter plot and the correlation coefficient of our data indicate that the variables 𝑋 and π‘Œ are linearly related. That is, as one increases, the other increases or decreases linearly with the first.

Our next step in analyzing such data is to try and model this relationship with a line of best fit. This means that we seek the equation of the line that maps the path of the data passing as closely as possible to each of the data points. We might try and construct this line by eye; however, there is a technique that can allow us to find its exact equation.

Recall that, in general, the equation of a straight line is y = a + bx, where a is the 𝑦-intercept and b is the slope of the line. It is unlikely that a set of bivariate data will lie exactly on a straight line, so to find the equation of the line that fits our data most closely, we find the line whose overall distance from all of our data points is minimized. This distance, y − ŷ, for each point (x, y), is called the error or residual. It is the difference between the true value of y for a data point and the predicted value, ŷ, on the line for the same x-value.

The least squares regression line, ŷ = a + bx, minimizes the sum of the squared differences of the points from the line; hence the phrase β€œleast squares.” We will not cover the derivation of the formulae for the line of best fit here. However, we will demonstrate how to use the formulae to find the coefficients a and b of the line.

Definition: The Least Squares Regression Line

If ŷ = a + bx is the line of least squares regression for a set of bivariate data with variables 𝑋 and π‘Œ, then the slope is

b = S_xy / S_xx and the y-intercept is a = ȳ − b·x̄,

where

S_xy = Σxy − (Σx)(Σy)/n, S_xx = Σx² − (Σx)²/n, x̄ = (Σx)/n (the mean of x), and ȳ = (Σy)/n (the mean of y).

We may also write the slope b as b = r·(s_y/s_x), where r is the correlation coefficient and s_x and s_y are the standard deviations of x and y, respectively. Alternatively, substituting the expressions for S_xy and S_xx into the formula for the slope gives

b = (nΣxy − (Σx)(Σy)) / (nΣx² − (Σx)²).

We note further that S_xy and S_xx are often written as

S_xy = Σ(x − x̄)(y − ȳ), S_xx = Σ(x − x̄)²,

which are equivalent to the expressions given above. In this form, we can see that S_xy is the sum of the products of the differences of each x from the mean of x with the differences of each y from the mean of y, and S_xx is the sum of the squared differences of each x from the mean of x.
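As a quick numeric sketch in Python (with a small made-up data set of our own), we can confirm that the computational form and the mean-deviation form of S_xy give the same value:

```python
xs = [2, 4, 5, 7, 9]
ys = [3, 5, 4, 8, 10]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# Computational form: S_xy = Σxy − (Σx)(Σy)/n
s_xy_comp = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n

# Mean-deviation form: S_xy = Σ(x − x̄)(y − ȳ)
s_xy_dev = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

print(s_xy_comp, s_xy_dev)  # both forms agree (up to floating-point rounding)
```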

In practice, we calculate the slope, 𝑏, first, since 𝑏 is needed to calculate π‘Ž. Let’s look at an example of how to calculate the least squares regression line from a table of bivariate data.
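As a sketch of the procedure just described (the function name here is our own, not part of the lesson), the two formulae translate directly into Python, with the slope computed first:

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares regression line y-hat = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n  # S_xy
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n                 # S_xx
    b = s_xy / s_xx                # slope first, since a depends on b
    a = sum_y / n - b * sum_x / n  # a = y-bar − b·x-bar
    return a, b

# Points lying exactly on y = 2x give a = 0 and b = 2.
a, b = least_squares_line([1, 2, 3], [2, 4, 6])
```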

Example 1: Calculating the Least Squares Regression Line given a Table of Data Summarizing the Sums of the Observed Values

Use the information in the table to find the equation of the least squares regression line of 𝑦 on π‘₯. Write the equation in the form 𝑦=π‘Žπ‘₯+𝑏, where π‘Ž and 𝑏 are accurate to three decimal places.

π‘₯𝑦π‘₯𝑦π‘₯οŠ¨π‘¦οŠ¨
12218396484324
22219418484361
32320460529400
42618468676324
53123713961529
632247681β€Žβ€‰β€Ž024576
734227481β€Žβ€‰β€Ž156484
837259251β€Žβ€‰β€Ž369625
941291β€Žβ€‰β€Ž1891β€Žβ€‰β€Ž681841
1042271β€Žβ€‰β€Ž1341β€Žβ€‰β€Ž764729
Sum3102257β€Žβ€‰β€Ž21910β€Žβ€‰β€Ž1285β€Žβ€‰β€Ž193

Answer

To find the least squares regression line ŷ = a + bx, we must find the slope, b, and the y-intercept, a. To do this, we use the formulae

b = S_xy / S_xx = (nΣxy − (Σx)(Σy)) / (nΣx² − (Σx)²) and a = ȳ − b·x̄,

where x̄ = (Σx)/n is the mean of x and ȳ = (Σy)/n is the mean of y.

The number of data pairs in our data set is n = 10, and in the final row of the table we are given the sums that we need. These are Σxy = 7219, Σx = 310, Σy = 225, and Σx² = 10128. Since we will need the slope, b, to calculate a, let’s first use the given values to find b:

b = (nΣxy − (Σx)(Σy)) / (nΣx² − (Σx)²) = (10 × 7219 − 310 × 225) / (10 × 10128 − 310²) = (72190 − 69750) / (101280 − 96100) = 2440/5180 = 122/259 = 0.471 (to 3 d.p.).

To calculate the value of the y-intercept, a, we need the means of the x- and the y-values. These are

x̄ = (Σx)/n = 310/10 = 31, ȳ = (Σy)/n = 225/10 = 22.5.

We can now use these values, in fraction form for accuracy, together with our slope b = 122/259, to find a:

a = ȳ − b·x̄ = 225/10 − (122/259) × (310/10) = 7.898 (to three decimal places).

Hence, with the π‘₯-term first, the least squares regression line is 𝑦=0.471π‘₯+7.898.
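The arithmetic above can be cross-checked with a few lines of Python working directly from the summary sums in the table (a sketch for verification, not part of the original working):

```python
n = 10
sum_x, sum_y, sum_xy, sum_x2 = 310, 225, 7219, 10128

# b = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²)
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
# a = y-bar − b·x-bar
a = sum_y / n - b * sum_x / n

print(round(b, 3), round(a, 3))  # 0.471 7.898
```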

Within our calculations, we have used expressions such as x̄, ȳ, and Σxy, which are known as summary statistics.

Definition: Summary Statistics

Summary statistics are statistics that we calculate from the observations in a sample data set, which summarize the data in a way that allows us to communicate, and hence interpret, as much information as possible.

In our next example, we will find the least squares regression line directly from the summary statistics.

Example 2: Calculating a Regression Coefficient for a Least Squares Regression Model from Summary Statistics

For a given data set, Σx = 47, Σy = 45.75, Σx² = 329, Σy² = 389.3125, Σxy = 310.25, and n = 8. Calculate the value of the regression coefficient b in the least squares regression model y = a + bx. Give your answer correct to three decimal places.

Answer

To calculate the regression coefficient b from the summary statistics given, we can use the formula

b = S_xy / S_xx = (nΣxy − (Σx)(Σy)) / (nΣx² − (Σx)²).

Substituting in our values for Σx, Σy, Σx², and Σxy, together with n = 8, we have

b = (8 × 310.25 − 47 × 45.75) / (8 × 329 − 47²) = (2482 − 2150.25) / (2632 − 2209) = 331.75/423 ≈ 0.78428.

Hence, to three decimal places, the regression coefficient is b = 0.784.
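Equivalently (a sketch using the S_xy/S_xx form of the slope rather than the expanded formula), the same value follows from the summary statistics:

```python
n = 8
sum_x, sum_y, sum_xy, sum_x2 = 47, 45.75, 310.25, 329

s_xy = sum_xy - sum_x * sum_y / n   # S_xy = Σxy − (Σx)(Σy)/n
s_xx = sum_x2 - sum_x ** 2 / n      # S_xx = Σx² − (Σx)²/n
b = s_xy / s_xx

print(round(b, 3))  # 0.784
```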

Our next example demonstrates how to find the equation of the least squares regression line for a given set of bivariate data.

Example 3: Calculating the Equation of the Least Squares Regression Line

The scatterplot shows a set of data for which a linear regression model appears appropriate.

The data used to produce this scatterplot is given in the table shown.

π‘₯0.511.522.533.54
𝑦9.257.68.256.55.454.51.751.8

Calculate the equation of the least squares regression line of 𝑦 on π‘₯, rounding the regression coefficients to the nearest thousandth.

Answer

The equation of the line of least squares regression is ŷ = a + bx, where the slope, or regression coefficient, is

b = S_xy / S_xx = (nΣxy − (Σx)(Σy)) / (nΣx² − (Σx)²).

The y-intercept is given by a = ȳ − b·x̄, with x̄ = (Σx)/n, the mean of x, and ȳ = (Σy)/n, the mean of y. We have eight paired data points (x, y); hence, n = 8. To find the two coefficients, a and b, we begin by putting our data into a table with columns for the product xy and for x², since we will need their sums for our calculation. So, for example, in the third column, the first entry is 0.5 × 9.25 = 4.625, and so on for each pair (x, y).

π‘₯𝑦π‘₯𝑦π‘₯
0.59.254.6250.25
17.67.6001.00
1.58.2512.3752.25
26.513.0004.00
2.55.4513.6256.25
34.513.5009.00
3.51.756.12512.25
41.87.20016.00
Sum

Our next step is to sum each of the columns so that we have the sums in the final row.

π‘₯𝑦π‘₯𝑦π‘₯
0.59.254.6250.25
17.67.6001.00
1.58.2512.3752.25
26.513.0004.00
2.55.4513.6256.25
34.513.5009.00
3.51.756.12512.25
41.87.20016.00
Sumο„šπ‘₯=18.00ο„šπ‘¦=45.10ο„šπ‘₯𝑦=78.05ο„šπ‘₯=51.00

We can now use these sums in the formula for b to calculate the slope of the regression line:

b = (8 × 78.05 − 18.0 × 45.1) / (8 × 51.0 − 18.0²) = (624.4 − 811.8) / (408 − 324) = −187.4/84 = −937/420 = −2.231 (to the nearest thousandth).

Note that, from the scatter diagram, we see that as x increases, the y-values, in general, decrease, and this is confirmed by the fact that the slope, b, is negative. To find the value of the constant a = ȳ − b·x̄, we must first calculate the means x̄ and ȳ. Using the sums from our table, these are

x̄ = (Σx)/n = 18/8 = 2.25, ȳ = (Σy)/n = 45.1/8 = 451/80 = 5.6375.

Hence, keeping these values in exact fractional form for accuracy, together with our value for b, which in exact form is −937/420, our y-intercept is

a = 451/80 − (−937/420) × (18/8) = 10.657 (to the nearest thousandth).

The equation of the least squares regression line of 𝑦 on π‘₯ for this data, to the nearest thousandth, is therefore ̂𝑦=10.657βˆ’2.231π‘₯.
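If NumPy is available, `numpy.polyfit` with degree 1 fits the same least squares line and serves as a convenient cross-check of the hand calculation (note that it returns the highest-degree coefficient first):

```python
import numpy as np

xs = [0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4]
ys = [9.25, 7.6, 8.25, 6.5, 5.45, 4.5, 1.75, 1.8]

b, a = np.polyfit(xs, ys, 1)  # slope b, then intercept a
print(round(a, 3), round(b, 3))  # 10.657 -2.231
```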

In our next example, we will apply our knowledge of the calculation of the least squares regression line to a real-life situation. However, when considering real-life variables in the context of regression, if possible, we first establish which of our variables is the dependent variable and which is the independent variable. These are defined as follows.

Definition: Dependent and Independent Variables

Independent variables are variables that we may control or change and that we believe have a direct effect on a dependent variable. Independent variables are also sometimes called explanatory variables and are often labeled x, or x_i, i = 1, …, n, for n explanatory variables.

Dependent variables are variables that are being tested and are dependent on independent variables. Dependent variables are often called response variables as they respond to changes in explanatory variables and are often labeled 𝑦.

Example 4: Finding the Equation of a Regression Line in a Regression Model

Using the information in the table, find the regression line ̂𝑦=π‘Ž+𝑏π‘₯. Round π‘Ž and 𝑏 to 3 decimal places.

Cultivated Land in Feddan | 126 | 13 | 104 | 180 | 38 | 161 | 14 | 99 | 55 | 177
Production of a Summer Crop in Kilograms | 160 | 40 | 80 | 340 | 260 | 200 | 280 | 280 | 140 | 100

Answer

We begin by determining which of our variables is the independent variable and which is the dependent variable. Since we would expect the amount of a summer crop produced to depend on the amount of land on which it is cultivated, it makes sense that the β€œproduction” variable is the dependent variable (𝑦) and the β€œland” variable is the independent variable (π‘₯).

To find the equation of the line of least squares regression, ŷ = a + bx, we must find the slope, or regression coefficient, b and the y-intercept a. We have ten pairs of data, that is, ten measurements of the independent variable β€œcultivated land in feddan” paired with ten measurements of the dependent variable β€œproduction of a summer crop in kilograms”, so n = 10. We can use the following formula to calculate b:

b = S_xy / S_xx = (nΣxy − (Σx)(Σy)) / (nΣx² − (Σx)²).

We will therefore need to find the sums Σxy, Σx, Σy, and Σx². Let us put our data into a table with columns for the product xy and for x² so that we may more easily calculate the required sums.

Cultivated Land (Feddan) x | Summer Crop (kg) y | xy | x²
126 | 160 | 20160 | 15876
13 | 40 | 520 | 169
104 | 80 | 8320 | 10816
180 | 340 | 61200 | 32400
38 | 260 | 9880 | 1444
161 | 200 | 32200 | 25921
14 | 280 | 3920 | 196
99 | 280 | 27720 | 9801
55 | 140 | 7700 | 3025
177 | 100 | 17700 | 31329
Σx = 967 | Σy = 1880 | Σxy = 189320 | Σx² = 130977

The sums, which are in the final row, have been calculated for each column, and we may now use these in the formula to find b:

b = (10 × 189320 − 967 × 1880) / (10 × 130977 − 967²) = (1893200 − 1817960) / (1309770 − 935089) = 75240/374681 ≈ 0.20081 = 0.201 (to 3 d.p.).

The y-intercept is given by a = ȳ − b·x̄, where x̄ is the mean of the x-values and ȳ is the mean of the y-values. These are

x̄ = (Σx)/n = 967/10 = 96.7, ȳ = (Σy)/n = 1880/10 = 188.

To calculate a accurately to three decimal places, we need to substitute a suitably accurate value for b. Here, we can substitute the exact fraction, or a decimal accurate to at least five decimal places. Therefore, calculating a, we have

a = 188 − 96.7 × 0.20081… = 188 − 19.41840… = 168.582 (to 3 d.p.).

With the values of our regression coefficient and 𝑦-intercept to three decimal places, the line of least squares regression is ̂𝑦=0.201π‘₯+168.582.

We may interpret this as follows: for every additional unit of cultivated land in feddan, we expect the production of the summer crop to increase by approximately 0.2 kg. We might also interpret the value of π‘Ž, since this is the 𝑦-intercept. However, we need to be careful that our interpretation makes sense within the context of the data. In our case, with π‘Ž=168.582, we might conclude that, with no cultivated land, that is, π‘₯=0, we could expect to produce 168.582 kg of the summer crop, which does not make physical sense. We might perhaps infer that we begin with 168.582 kg of the summer crop from other sources, but we do not know this from the data. This illustrates how care must be taken when considering how variables behave outside of the range of the given data.

Once we have a regression model, which in the case of linear data is the least squares line of regression, we may, with care, use our model to estimate values for the dependent variable. We see how this works in our next example.

Example 5: Calculating an Estimated Value for a Variable at a Given Point in a Regression Model

Using the information in the table, estimate the value of 𝑦 when π‘₯=13. Give your answer to the nearest integer.

π‘₯ 23 9 24 15 7 12
𝑦 22 24 25 13 21 9

Answer

We are given a set of bivariate data consisting of six pairs of values of the two variables x and y. To estimate a y-value for a given x-value, assuming the data is approximately linear, we must first find the equation of the regression line, ŷ = a + bx. To do this, we first calculate the slope, b, using the formula below:

b = S_xy / S_xx = (nΣxy − (Σx)(Σy)) / (nΣx² − (Σx)²).

This requires the sums Σxy, Σx, Σy, and Σx², and by putting our data into a table with columns for the product xy and for x², we can easily calculate these as shown below.

π‘₯𝑦π‘₯𝑦π‘₯
2322506529
92421681
2425600576
1513195225
72114749
129108144
Sumο„šπ‘₯=90ο„šπ‘¦=114ο„šπ‘₯𝑦=1772ο„šπ‘₯=1604

Substituting the necessary sums into our formula for b gives

b = (6 × 1772 − 90 × 114) / (6 × 1604 − 90²) = (10632 − 10260) / (9624 − 8100) = 372/1524 ≈ 0.2440945 = 0.244 (to three decimal places).

The y-intercept is a = ȳ − b·x̄, and we will use our value for b to calculate this. However, first we must find the means of the x- and the y-values. These are

x̄ = (Σx)/n = 90/6 = 15, ȳ = (Σy)/n = 114/6 = 19.

With these values, we then have π‘Ž=19βˆ’0.2440945Γ—15β‰ˆ15.3386.

The regression line is, therefore, ̂𝑦=15.3386+0.2440945π‘₯. Now, if we substitute π‘₯=13, we find ̂𝑦=15.3386+0.2440945Γ—13β‰ˆ18.512.

Hence, to the nearest integer, when π‘₯=13, we estimate 𝑦=19.
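The whole estimate can be reproduced in a short Python sketch (the function name is our own) that fits the line and then evaluates it at x = 13:

```python
def least_squares_predict(xs, ys, x_new):
    """Fit y-hat = a + b*x by least squares, then evaluate at x_new."""
    n = len(xs)
    b = (n * sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys)) / (
        n * sum(x * x for x in xs) - sum(xs) ** 2
    )
    a = sum(ys) / n - b * sum(xs) / n
    return a + b * x_new

xs = [23, 9, 24, 15, 7, 12]
ys = [22, 24, 25, 13, 21, 9]
print(round(least_squares_predict(xs, ys, 13)))  # 19
```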

In this example, we estimated a value of the dependent variable 𝑦 for a value of π‘₯ that was within our range of known values. This is called interpolation and the following definition clarifies this.

Definition: Interpolation and Extrapolation

Interpolation: Estimating or predicting a value of the dependent variable from within the range of known values of the independent variable.

Extrapolation: Estimating or predicting a value of the dependent variable from outside the range of known values of the independent variable.

Extrapolation should be used with the utmost caution, if at all. The behavior of the variables may change outside the known range of the data, leading to errors. Therefore, extrapolation should be avoided where possible.

We complete this explainer by recalling some of the key points covered.

Key Points

  • The least squares regression line is a linear model for bivariate data sets consisting of n data points (x_i, y_i), where x is the independent, or explanatory, variable and y is the dependent, or response, variable.
  • The least squares regression line is the line for which the sum of the squares of the distances of the data points from the line is a minimum. The equation of the line is ŷ = a + bx, where b = S_xy / S_xx = (nΣxy − (Σx)(Σy)) / (nΣx² − (Σx)²) and a = ȳ − b·x̄, with x̄ = (Σx)/n, the mean of x, and ȳ = (Σy)/n, the mean of y, and where S_xy = Σxy − (Σx)(Σy)/n and S_xx = Σx² − (Σx)²/n.
  • The slope, b, may also be written as b = r·(s_y/s_x), where r is the correlation coefficient and s_x and s_y are the standard deviations of x and y, respectively.
  • We may use the least squares regression line to estimate or predict values of the dependent variable using interpolation, that is, using x-values within the known range. Extrapolation, that is, using values outside this range to estimate or predict, is not advisable, as results may be erroneous.
