Video Transcript
In this video, we’ll learn how to find and use the equation of the least squares
regression line. The term regression was first used by an English statistician,
Sir Francis Galton, in the Victorian era when looking at the relationship between
the heights of parents and their children. He found that the children of taller
parents grew up to be slightly shorter than their parents, whereas the children of
shorter parents grew up to be taller than their parents. He called this effect
regression towards mediocrity; that is, the heights tended, or regressed, towards the mean.
Now we use regression analysis to identify and analyze relationships between
variables. The method of least squares regression allows us to determine the line
of best fit for a set of bivariate data. And in this video, we’ll learn how to find
the line of least squares regression using formulae for the coefficients in the
equation of the line. We recall that bivariate data is data collected on two
quantitative, that is, numerical, variables, where the observations are paired for
each subject. Say, for example, 𝑥 is equal to height and 𝑦 is equal to weight. If we
have 𝑛 people in our sample, then our data set will consist of 𝑛 pairs of
measurements, each pair relating to one person. So, for example, 𝑥 one would be the
height of person number one, and 𝑦 one would be the weight of person number one.
Now suppose both a scatter plot and a correlation coefficient indicate that the
variables height and weight are linearly related. That is, as one variable
increases, the other increases linearly or decreases linearly. Our next step will be
to try and model this relationship with the line that fits our data best. That is,
we want to find the straight line, 𝑦 is equal to 𝑎 plus 𝑏𝑥, whose overall distance
from our data points is as small as possible. The vertical distance between the
point 𝑥𝑖, 𝑦𝑖 and the line is 𝑦𝑖 minus 𝑦 hat, where 𝑦 hat is the 𝑦-value on the line
directly above or below the point, that is, the value the line predicts at 𝑥𝑖. This
distance for each point is called the residual or the error. The least squares
regression line, which we often see written with a hat above the 𝑦, minimizes the
sum of the squared errors, hence the phrase least squares.
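
As a rough illustration of what "minimizing the sum of the squared errors" means, the sketch below computes that sum for three candidate lines. The four data points are made up for illustration and do not come from the video, and the function name is hypothetical.

```python
def sum_squared_residuals(xs, ys, a, b):
    """Return the sum of squared residuals for the line y_hat = a + b*x."""
    total = 0.0
    for x, y in zip(xs, ys):
        y_hat = a + b * x       # value predicted by the line at x
        residual = y - y_hat    # vertical distance from the point to the line
        total += residual ** 2
    return total


# Made-up illustrative data (an assumption, not taken from the video).
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]

# Two arbitrary candidate lines and the least squares line for this data,
# which works out to y_hat = 0 + 1.9x using the formulae for a and b.
sse_first = sum_squared_residuals(xs, ys, a=0.0, b=2.0)
sse_second = sum_squared_residuals(xs, ys, a=1.0, b=1.5)
sse_least_squares = sum_squared_residuals(xs, ys, a=0.0, b=1.9)

print(sse_first, sse_second, sse_least_squares)
```

For this data, the least squares line gives the smallest of the three sums, as we would expect.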
So how do we find the least squares regression line? If 𝑦 hat is equal to 𝑎 plus 𝑏𝑥
is the least squares regression line for a set of bivariate data with variables 𝑥
and 𝑦, then the slope 𝑏 is given by 𝑆 subscript 𝑥𝑦 over 𝑆 subscript 𝑥𝑥, where 𝑆
subscript 𝑥𝑦 is the sum of the products 𝑥𝑦 minus the sum of the 𝑥’s multiplied by
the sum of the 𝑦’s divided by 𝑛. And 𝑆 subscript 𝑥𝑥 is the sum of the squares of the
𝑥’s minus the sum of the 𝑥’s all squared divided by 𝑛. And 𝑛 is the number of data
pairs. So given a set of bivariate data, this is how we find the slope of our line,
𝑏. The 𝑦-intercept 𝑎 of our line is given by 𝑦 bar, which is the mean of the 𝑦’s,
minus 𝑏 times the mean of the 𝑥’s, where we recall that the mean of 𝑦 is the sum of
the 𝑦-values divided by 𝑛 and similarly for 𝑥.
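
Written out in symbols, the formulae just described read as follows. This is a transcription of the spoken description, so the exact on-screen layout is an assumption.

```latex
\begin{align*}
\hat{y} &= a + bx, \\
b &= \frac{S_{xy}}{S_{xx}}, \qquad a = \bar{y} - b\,\bar{x}, \\
S_{xy} &= \sum xy - \frac{\left(\sum x\right)\left(\sum y\right)}{n}, \qquad
S_{xx} = \sum x^{2} - \frac{\left(\sum x\right)^{2}}{n}, \\
\bar{x} &= \frac{\sum x}{n}, \qquad \bar{y} = \frac{\sum y}{n}.
\end{align*}
```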
We can see from our formulae that in order to find the 𝑦-intercept 𝑎, we first need
to find the slope 𝑏 and the mean of the 𝑥-values and the mean of the 𝑦-values. You
may see some of these expressions written in slightly different but equivalent
forms. So let’s just make a note of some of these. We can also write the slope 𝑏 as
𝑟 multiplied by 𝑠 subscript 𝑦 over 𝑠 subscript 𝑥, where 𝑟 is Pearson’s correlation
coefficient, 𝑠 subscript 𝑦 is the standard deviation of 𝑦, and 𝑠 subscript 𝑥 is the
standard deviation of 𝑥. If we substitute our expressions for 𝑆 subscript 𝑥𝑦 and 𝑆
subscript 𝑥𝑥 into our formula for 𝑏, we find the expression for 𝑏 as shown. And in
fact, we’ll use this in our examples. You may also see 𝑆 subscript 𝑥𝑦 and 𝑆
subscript 𝑥𝑥 written as shown.
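
For reference, the equivalent forms can be written as below. The first two lines follow directly from the spoken description and from multiplying the numerator and denominator of 𝑆 subscript 𝑥𝑦 over 𝑆 subscript 𝑥𝑥 by 𝑛; the second is the form used in the worked examples. The last line is one common way of writing 𝑆 subscript 𝑥𝑦 and 𝑆 subscript 𝑥𝑥, and it is an assumption that this is the form shown on screen.

```latex
\begin{align*}
b &= r\,\frac{s_{y}}{s_{x}}, \\
b &= \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}
          {n\sum x^{2} - \left(\sum x\right)^{2}}, \\
S_{xy} &= \sum (x - \bar{x})(y - \bar{y}), \qquad
S_{xx} = \sum (x - \bar{x})^{2}.
\end{align*}
```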
So now that we have the formulae for our coefficients 𝑎 and 𝑏, let’s look at an
example where we see how to find the slope 𝑏 of the regression line from summary
statistics.
For a given data set, the sum of the 𝑥-values is 47, the sum of the 𝑦-values is
45.75, the sum of the squares of the 𝑥’s is 329, the sum of the squares of the 𝑦’s
is 389.3125, the sum of the products 𝑥𝑦 is 310.25, and 𝑛 is equal to eight.
Calculate the value of the regression coefficient 𝑏 in the least squares regression
model 𝑦 is equal to 𝑎 plus 𝑏𝑥. Give your answer correct to three decimal places.
We’re given the summary statistics for a data set. We have the sum of the 𝑥-values,
the sum of the 𝑦-values, the sum of the squares of the 𝑥’s, the sum of the squares
of the 𝑦’s, and the sum of the products 𝑥𝑦. And we know that our data set consists
of 𝑛 is equal to eight bivariate data pairs. We’re asked to find the coefficient 𝑏,
that is, the slope of the line, in the least squares regression model 𝑦 is equal to
𝑎 plus 𝑏𝑥. We use the formula shown to calculate 𝑏, and we begin by writing out our
summary statistics.
Since we’re given 𝑛, the sum of the products 𝑥𝑦, the sum of the 𝑥’s, the sum of the
𝑦’s, and the sum of the squares of the 𝑥’s, the only thing left to find to be able
to use the formula is the sum of the 𝑥’s all squared. And since the sum of the 𝑥’s
is 47, the sum of the 𝑥’s all squared is 47 squared; that is 2209. Substituting our
values into the formula then, we have eight, which is 𝑛, multiplied by 310.25,
that’s the sum of the products, minus the sum of the 𝑥’s, which is 47, multiplied by
the sum of the 𝑦’s, which is 45.75, all over eight multiplied by 329 minus 2209.
Evaluating our products, this gives us 2482 minus 2150.25 divided by 2632 minus
2209. Our numerator evaluates to 331.75 and our denominator to 423, and dividing one
by the other gives approximately 0.78428. That’s to five decimal places.
Hence, to three decimal places, the regression coefficient 𝑏 is equal to 0.784.
And although we’re not actually asked to find the 𝑦-intercept 𝑎 or the equation of
the line, we can calculate 𝑎 quite quickly and therefore the equation of the line.
Recall that 𝑎 is given by 𝑦 bar minus 𝑏 times 𝑥 bar, where 𝑦 bar and 𝑥 bar are the
means of 𝑦 and 𝑥, respectively. The mean of the 𝑦-values is the sum of the 𝑦-values
divided by 𝑛; that’s 45.75 divided by eight. That is 5.71875. Similarly, the mean of
the 𝑥’s is the sum of the 𝑥’s over 𝑛. That’s 47 over eight, and that’s 5.875.
Substituting these values into our formula for 𝑎 then, we have 𝑎 is equal to
5.71875, that’s the mean of the 𝑦’s, minus 0.78428, which is 𝑏 to five decimal
places for accuracy, multiplied by 5.875, which is the mean of the 𝑥’s. And so the
𝑦-intercept 𝑎 is equal to 1.111 to three decimal places.
And so the equation of the line of least squares regression for the data set is 𝑦
is equal to 1.111 plus 0.784𝑥, where we’ve calculated our coefficients to three
decimal places.
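
The calculation in this example can be checked with a short Python sketch. The function name and parameter names are hypothetical; the numbers are the summary statistics given in the question.

```python
def slope_and_intercept(n, sum_x, sum_y, sum_xy, sum_x2):
    """Return (a, b) for the least squares line y_hat = a + b*x
    computed from summary statistics."""
    # Slope: b = (n*sum(xy) - sum(x)*sum(y)) / (n*sum(x^2) - (sum(x))^2)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept: a = y_bar - b * x_bar
    a = sum_y / n - b * (sum_x / n)
    return a, b


# Summary statistics from the example (the sum of y squared is not needed).
a, b = slope_and_intercept(n=8, sum_x=47, sum_y=45.75, sum_xy=310.25, sum_x2=329)

print(round(b, 3), round(a, 3))  # 0.784 1.111
```

Note that the sum of the squares of the 𝑦’s is given in the question but is not used by either formula.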
In this example, we were given the summary statistics for a data set. And in our
next example, we’ll see how to find the line of least squares regression from the
data itself.
The scatter plot shows a set of data for which the linear regression model appears
appropriate. The data used to produce this scatter plot is given in the table shown.
Calculate the equation of the least squares regression line of 𝑦 on 𝑥, rounding the
regression coefficients to the nearest thousandth.
The equation of the line of least squares regression is 𝑦 hat is equal to 𝑎 plus
𝑏𝑥, where 𝑦 hat is the predicted value of 𝑦 for each value of 𝑥, 𝑎 is the
𝑦-intercept, and 𝑏 is the slope of the line. To find the equation of the line, we
first find the slope 𝑏, which is given by the formula shown. We then use this value
for 𝑏 to find the 𝑦-intercept 𝑎, which is given by the mean of 𝑦 minus 𝑏 multiplied
by the mean of 𝑥, where we recall that the mean of the 𝑦-values is given by the sum
of the 𝑦-values divided by the number of data pairs 𝑛 and similarly for the mean of
𝑥. In fact, in our case, we have eight data pairs so that 𝑛 is equal to eight. So
let’s make a note of this.
Now, to find the coefficients 𝑎 and 𝑏, we’re going to need the various sums shown
in the formulae. And in order to calculate these sums, we begin by expanding our
table to include a row for the product 𝑥𝑦 and another for the squares of the
𝑥-values. In the first cell of our new row for the product 𝑥𝑦, we have the product
of the first 𝑥-value 0.5 with the first 𝑦-value 9.25, and that’s 4.625. So we put
this in the first cell of our new row. Our second new entry will be the second
𝑥-value, that’s one, multiplied by the second 𝑦-value 7.6. That is 7.6. And this
goes into the second cell for the new row of products. And we can fill in the
remaining products 𝑥𝑦, as shown.
The first entry in our second new row is the first 𝑥-value squared. That is 0.5
squared, which is 0.25. And so this goes in the first cell of our second new row.
Our second 𝑥-value squared is one squared, which is one. And we can fill in the
remainder of the 𝑥 squared values in our second new row as shown. Now remember,
we’re trying to find the sums in our formulae, so our next step is to sum each of
the rows. And if we introduce a new column for our sums, then, for example, the sum
of our 𝑥-values is 18. And this is the first entry in our new column. Summing our
𝑦-values gives us 45.1. The sum of the products is 78.05. And the sum of the
squares of the 𝑥’s is 51.
So now we can use these values to calculate the slope 𝑏 of our line. Substituting
into our formula then, we have eight, which is 𝑛, multiplied by 78.05, the sum of
our products, minus 18, which is the sum of the 𝑥’s, multiplied by 45.1, the sum of
the 𝑦’s, over 𝑛, eight, times 51, which is the sum of the squares of the 𝑥’s, minus
18 squared, which is the sum of the 𝑥’s all squared. Evaluating our products gives
us 624.4 minus 811.8 all divided by 408 minus 324. And typing this carefully into
our calculator, we find 𝑏 is approximately equal to negative 2.23095. To three
decimal places, that is, to the nearest thousandth, that’s negative 2.231.
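
The table-building steps and the slope calculation above can be sketched in Python as follows. Since the full table is not reproduced in this transcript, the function is demonstrated on a small made-up data set (an assumption, chosen so that the points lie exactly on a line), and the slope from the example is then checked using the sums quoted above. All names are hypothetical.

```python
def fit_from_data(xs, ys):
    """Return (a, b) for y_hat = a + b*x by building the extra table rows."""
    n = len(xs)
    xy_row = [x * y for x, y in zip(xs, ys)]   # new row of products xy
    x2_row = [x * x for x in xs]               # new row of squares x^2
    sum_x, sum_y = sum(xs), sum(ys)            # the "sum" column
    sum_xy, sum_x2 = sum(xy_row), sum(x2_row)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * (sum_x / n)
    return a, b


# Made-up data lying exactly on y = 11 - 2x (not the table from the video).
a_demo, b_demo = fit_from_data([1, 2, 3, 4], [9, 7, 5, 3])
print(a_demo, b_demo)  # 11.0 -2.0

# Slope for the example, using the sums quoted in the transcript.
b_example = (8 * 78.05 - 18 * 45.1) / (8 * 51 - 18 ** 2)
print(round(b_example, 3))  # -2.231
```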
We can see from the data points on our scatter plot that as the 𝑥-values of the data
points increase, the 𝑦-values of the data points decrease. And this is confirmed by
the fact that our coefficient 𝑏 is negative, that is, negative 2.231. And now making
some space so that we can calculate our 𝑦-intercept 𝑎, we see from our formula that
we’re first going to need to calculate the mean of the 𝑥-values and the mean of the
𝑦-values. The mean of the 𝑦-values is 45.1 divided by eight. That is 5.6375. The
mean of the 𝑥-values is 18 divided by eight, which is 2.25.
And so, making some space again, we can use these to calculate our coefficient 𝑎.
And we have 𝑎 is equal to 5.6375 minus negative 2.23095, which is 𝑏 to five decimal
places for accuracy, multiplied by 2.25. This evaluates to approximately 10.65714,
which is 10.657 to three decimal places, that is, to the nearest thousandth. The
equation of the least squares regression line of 𝑦 on 𝑥 for this data then is 𝑦 hat
is equal to 10.657 minus 2.231𝑥, all to the nearest thousandth. Note that we write 𝑦
with a hat on to indicate that this is a predicted value for 𝑦 from the line
calculated with the given data. You’ll often see this written simply as 𝑦 is equal
to 𝑎 plus 𝑏𝑥.
Now, so far, we’ve not been given any definition of what the variables 𝑥 and 𝑦 refer
to. But when considering real-life variables in the context of regression, if
possible, we first establish which of our variables is the dependent and which is
the independent variable. Recall that independent variables are variables we may
control or change. We believe they have a direct effect on a dependent variable.
Another name for independent variables is explanatory variables, and they’re often
labeled 𝑥. Dependent variables, on the other hand, are variables that are being
tested and are dependent on one or more independent variables. Since they respond to
changes in the independent variable or variables, they’re often called response
variables, and they’re often labeled 𝑦.
In our next example, we’ll calculate the coefficients for the least squares
regression line for real-life data. And so we’ll need to begin by determining which
is the dependent and which is the independent variable.
Using the information in the table, find the regression line 𝑦 hat is equal to 𝑎
plus 𝑏𝑥. Round 𝑎 and 𝑏 to three decimal places.
Since we want to find the regression line, we begin by determining which of our
variables is the dependent and which is the independent variable. We might expect
that the amount of summer crop produced in kilograms is dependent on the amount of
land it’s produced on. And so we specify that production in kilograms is the
dependent variable 𝑦, whereas cultivated land measured in feddan is the independent
variable 𝑥. And note that a feddan is a unit of area measuring just over one acre.
To find the regression line, we must find the slope 𝑏 and the 𝑦-intercept 𝑎. And to
find these values, we use the two formulae shown. We first calculate the slope 𝑏
since we’ll need this to calculate the 𝑦-intercept 𝑎. And we see from our formula
for 𝑏 that we’re going to need to find various sums, that is, the sum of the
products 𝑥𝑦, the sum of the 𝑥-values, the sum of the 𝑦-values, the sum of the
squared 𝑥-values, and we’ll also need the sum of the 𝑥’s all squared. And to find
the value for 𝑎, we’re going to need the mean of the 𝑦-values, that is, the sum of
the 𝑦-values divided by 𝑛, which is the number of data pairs, and similarly for the
mean of the 𝑥-values.
In our data set, we have 10 pairs of data so that 𝑛 is equal to 10. And we make a
note of this before we start making our calculations. Our next step is to find the
sums. And to find the sum of our products 𝑥𝑦 and our 𝑥 squared values, we introduce
two new rows to our table. To calculate the products 𝑥𝑦, taking our first 𝑥 and our
first 𝑦, we have 126 multiplied by 160. That is 20160. And this goes into the first
cell of our first new row. Our second product is our second 𝑥-value multiplied by
our second 𝑦-value. That is 13 multiplied by 40, which is 520. And this goes into
our second cell in the first new row. We can then complete this row with the
products as shown.
The first element in our second new row is the first 𝑥-value squared, that is, 126
squared, which is 15876. And this goes into the first cell of our second new row.
Our second 𝑥-value squared is 13 squared, which is 169. And this goes into the
second cell of our second new row. And we continue in this way to complete the row.
Our next step is to find the sum for each of the rows. So we introduce a new column.
The sum of the 𝑥-values is 967. The sum of the 𝑦-values is 1880. The sum of the
products 𝑥𝑦 is 189320. And the sum of the squares of the 𝑥’s is 130977. So now with
all our sums, we’re in a position to calculate 𝑏.
Substituting our sums into the formula for 𝑏 with 𝑛 is equal to 10, we have 10 times
189320, that’s the sum of the products 𝑥𝑦, minus 967, which is the sum of the 𝑥’s,
multiplied by 1880, which is the sum of the 𝑦’s, all divided by 10, which is 𝑛,
multiplied by the sum of the squared 𝑥-values, which is 130977, minus 967 squared.
That’s the sum of the 𝑥’s all squared. And evaluating our numerator and denominator,
we have 75240 divided by 374681. And this evaluates to approximately 0.20081. To
three decimal places then, we have 𝑏 is equal to 0.201.
Now to find the 𝑦-intercept 𝑎, we need to find the means of the 𝑦-values and the
𝑥-values. The mean of the 𝑦’s is the sum of all the 𝑦-values divided by 𝑛. That’s
1880 divided by 10, and that’s 188. Similarly, the mean of the 𝑥-values is the sum
of the 𝑥’s divided by 𝑛. And that’s 967 divided by 10, which is 96.7. So now we can
use these values together with our slope 𝑏, where we’ll use the value of 𝑏 to five
decimal places for accuracy, to calculate the 𝑦-intercept 𝑎. Evaluating this gives
us 𝑎 is equal to 168.58167 and so on. That is 168.582 to three decimal places. The
line of least squares regression then for this data to three decimal places is 𝑦 hat
is equal to 168.582 plus 0.201𝑥.
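
The reason for carrying 𝑏 to five decimal places "for accuracy" can be seen in a short Python sketch using the sums from this example. The variable names are hypothetical. With this data, rounding 𝑏 to three decimal places before computing 𝑎 changes the intercept in the second decimal place.

```python
# Sums quoted in the example, with n = 10 data pairs.
n, sum_x, sum_y, sum_xy, sum_x2 = 10, 967, 1880, 189320, 130977

x_bar = sum_x / n   # 96.7
y_bar = sum_y / n   # 188.0

# Slope at full floating-point precision.
b_full = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

# Intercept computed with b at three different precisions.
a_from_full = y_bar - b_full * x_bar
a_from_5dp = y_bar - round(b_full, 5) * x_bar   # as done in the video
a_from_3dp = y_bar - round(b_full, 3) * x_bar   # b rounded too early

print(round(b_full, 3))        # 0.201
print(round(a_from_full, 3))   # 168.582
print(round(a_from_5dp, 3))    # 168.582
print(round(a_from_3dp, 3))    # 168.563
```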
We can interpret the slope as follows: for every additional unit of land, that is,
for every additional feddan, we expect the production of the summer crop to increase
by approximately 0.2 kilograms.
Once we have our line of regression, we can use this to estimate values for the
dependent variable 𝑦 for particular values of the independent variable 𝑥. However,
if we do this, we must be very careful to restrict ourselves to 𝑥-values within the
range of the given data. Let’s consider how this might work using the variables
given in this example. Our dependent variable 𝑦 is crop production in kilograms, and
our independent variable 𝑥 is cultivated land measured in feddan. Our line of least
squares regression, which we’ve just calculated to three decimal places from the
given data, is 𝑦 is equal to 168.582 plus 0.201𝑥.
Now suppose we want to know how many kilograms of summer crop we would expect from
100 feddan of cultivated land. Substituting 𝑥 is equal to 100 in our equation gives
188.682 kilograms. That’s to three decimal places. Now it’s fine to use this value
of 𝑥 since it lies within the range of 𝑥 for the data, that is, between 13 and 180.
So we can use 𝑥 is 100 in the equation of the line to estimate the value for 𝑦, crop
production.
Now let’s look at an example of what might happen if we try and predict using an
𝑥-value outside the range of the data. Suppose we let 𝑥 equal zero. This means we’re
going to interpret the 𝑦-intercept. If we let 𝑥 equal zero in our equation, we find
𝑦 hat is equal to 168.582. But this tells us that with zero units of cultivated
land, crop production is estimated at approximately 169 kilograms, which is absurd
since if we have no land, we can’t produce any crops. This is an example of
extrapolation, where we try and predict outside the range of the given data.
Interpolation, on the other hand, is when we try and predict or estimate within the
range of the data. This example illustrates that extrapolation should only be used
with the utmost caution.
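
One way to keep track of this distinction when using the line is sketched below in Python. The coefficients and the 𝑥-range of 13 to 180 are the values quoted in this example; the function name and the idea of returning a flag are assumptions made for illustration.

```python
# Rounded coefficients and x-range quoted in the example.
A, B = 168.582, 0.201
X_MIN, X_MAX = 13, 180


def predict(x):
    """Return (y_hat, is_interpolation) for the fitted line y_hat = A + B*x.

    is_interpolation is True only when x lies within the range of the data.
    """
    y_hat = A + B * x
    return y_hat, X_MIN <= x <= X_MAX


y_100, ok_100 = predict(100)   # within the data range: interpolation
y_0, ok_0 = predict(0)         # outside the data range: extrapolation

print(round(y_100, 3), ok_100)  # 188.682 True
print(round(y_0, 3), ok_0)      # 168.582 False
```

The estimate for 𝑥 equal to zero is still computed, but the flag signals that it should be treated with caution.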
Let’s complete this video by recalling some of the key points we’ve covered. The
least squares regression line 𝑦 hat is equal to 𝑎 plus 𝑏𝑥 is a linear model for
bivariate data. The coefficient 𝑏, which is the slope of the line, and 𝑎, which is
the 𝑦-intercept, can be calculated using the formulae shown, where 𝑦 bar is the mean
of the 𝑦-values and 𝑥 bar is the mean of the 𝑥-values and 𝑛 is the number of data
pairs. We may use the regression model to estimate using 𝑥-values within the range
of the given data. And that’s called interpolation. However, using 𝑥-values outside
the range of the known data to estimate or predict, that’s extrapolation, is not
advisable.