A professor is attempting to identify trends among final exam scores. His class has a mixture of students, so he wonders if there is any relationship between age and final exam scores. One way for him to analyze the scores is by creating a diagram that relates the age of each student to the exam score received. In this exposition, we will examine one such diagram known as a scatter plot.
A scatter plot is a graph of plotted points that may show a relationship between two sets of data. If the relationship is from a linear model, or a model that is nearly linear, the professor can draw conclusions using his knowledge of linear functions. Figure 1 shows a sample scatter plot.
Notice this scatter plot does not indicate a linear relationship. The points do not appear to follow a trend. In other words, there does not appear to be a relationship between the age of the student and the score on the final exam.
Table 1 shows the number of cricket chirps in 15 seconds, for several different air temperatures, in degrees Fahrenheit. Plot this data, and determine whether the data appears to be linearly related.
Plotting this data, as depicted in Figure 2, suggests that there may be a trend. We can see from the trend in the data that the number of chirps increases as the temperature increases. The trend appears to be roughly linear, though certainly not perfectly so.
Once we recognize a need for a linear function to model that data, the natural follow-up question is “what is that linear function?” One way to approximate our linear function is to sketch the line that seems to best fit the data. Then we can extend the line until we can verify the $y$-intercept. We can approximate the slope of the line by extending it until we can estimate the $\frac{\text{rise}}{\text{run}}$.
Find a linear function that fits the data in Table 1 by “eyeballing” a line that seems to fit.
On a graph, we could try sketching a line. Using the starting and ending points of our hand-drawn line, points (0, 30) and (50, 90), this graph has a slope of $m = \frac{60}{50} = 1.2$ and a $y$-intercept at 30. This gives an equation of
$T(c) = 1.2c + 30$
where $c$ is the number of chirps in 15 seconds, and $T(c)$ is the temperature in degrees Fahrenheit. The resulting equation is represented in Figure 3.
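The eyeballed line can be checked numerically. The following is a minimal sketch that builds a line from the two sketched endpoints (0, 30) and (50, 90); the function names `line_through` and `temperature` are made up for this illustration.

```python
def line_through(p1, p2):
    """Return (slope, intercept) of the line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)      # rise over run
    intercept = y1 - slope * x1        # value of the line at x = 0
    return slope, intercept


# Endpoints of the hand-drawn line, taken from the text.
slope, intercept = line_through((0, 30), (50, 90))


def temperature(chirps):
    """Estimated temperature (degrees F) for a number of chirps in 15 seconds."""
    return slope * chirps + intercept


print(slope, intercept)
print(temperature(30))
```

Evaluating `temperature(30)` gives the model's estimate for 30 chirps in 15 seconds, which is about 66 degrees.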
This linear equation can then be used to approximate answers to various questions we might ask about the trend.
While the data for most examples does not fall perfectly on the line, the equation is our best guess as to how the relationship will behave outside of the values for which we have data. We use a process known as interpolation when we predict a value inside the domain and range of the data. The process of extrapolation is used when we predict a value outside the domain and range of the data.
Figure 4 compares the two processes for the cricket-chirp data addressed in Example 2. We can see that interpolation would occur if we used our model to predict temperature when the values for chirps are between 18.5 and 44. Extrapolation would occur if we used our model to predict temperature when the values for chirps are less than 18.5 or greater than 44.
There is a difference between making predictions inside the domain and range of values for which we have data and outside that domain and range. Predicting a value outside of the domain and range has its limitations. When our model no longer applies after a certain point, it is sometimes called model breakdown. For example, predicting a cost function for a period of two years may involve examining the data where the input is the time in years and the output is the cost. But if we try to extrapolate a cost when $x = 50$, that is, in 50 years, the model would not apply because we could not account for factors fifty years in the future.
Different methods of making predictions are used to analyze data.
The method of interpolation involves predicting a value inside the domain and/or range of the data.
The method of extrapolation involves predicting a value outside the domain and/or range of the data.
Model breakdown occurs at the point when the model no longer applies.
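The distinction between the two methods can be expressed as a simple range check. This sketch assumes the observed chirp counts span 18.5 to 44, the range quoted for the cricket-chirp data; the function name `prediction_type` is hypothetical.

```python
# Range of observed chirp counts quoted in the text for the cricket data.
CHIRP_MIN, CHIRP_MAX = 18.5, 44.0


def prediction_type(chirps, low=CHIRP_MIN, high=CHIRP_MAX):
    """Label a prediction by whether its input lies inside the observed range."""
    if low <= chirps <= high:
        return "interpolation"
    return "extrapolation"


for chirps in (10, 30, 50):
    print(chirps, prediction_type(chirps))
```

A check like this flags when a prediction relies on extrapolation, but it cannot say where model breakdown actually occurs; that still takes knowledge of the situation being modeled.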
Use the cricket data from Table 1 to answer the following questions:
Our model predicts the crickets would chirp 8.33 times in 15 seconds. While this might be possible, we have no reason to believe our model is valid outside the domain and range. In fact, generally crickets stop chirping altogether below around 50 degrees.
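The figure of 8.33 chirps is consistent with solving the eyeballed line through (0, 30) and (50, 90) for the number of chirps at 40 degrees. The 40-degree input is an assumption inferred from that arithmetic, and the name `chirps_for_temperature` is made up for this sketch.

```python
# Eyeballed line through the two sketched points (0, 30) and (50, 90).
slope = (90 - 30) / (50 - 0)
intercept = 30


def chirps_for_temperature(temp_f):
    """Solve temp_f = slope * chirps + intercept for chirps."""
    return (temp_f - intercept) / slope


# 40 degrees is an assumed input, inferred from the 8.33 figure in the text.
chirps_at_40 = chirps_for_temperature(40)
print(round(chirps_at_40, 2))
```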
While eyeballing a line works reasonably well, there are statistical techniques for fitting a line to data that minimize the differences between the line and data values. One such technique is called least squares regression and can be computed by many graphing calculators, spreadsheet software, statistical software, and many web-based calculators. Least squares regression is one means to determine the line that best fits the data, and here we will refer to this method as linear regression.
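For reference, the standard least squares criterion can be written out as follows. The symbols here are assumptions introduced for this sketch: the data points are $(x_i, y_i)$ for $i = 1, \dots, n$, the fitted line is $y = a + bx$, and $\bar{x}$ and $\bar{y}$ are the means of the inputs and outputs.

```latex
% Least squares chooses a and b to minimize the sum of squared vertical differences:
\min_{a,\,b} \; \sum_{i=1}^{n} \bigl( y_i - (a + b x_i) \bigr)^2

% The minimizing slope and intercept are:
b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
a = \bar{y} - b\,\bar{x}
```

Calculators and software evaluate these same expressions, so the results should agree up to rounding.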
Given data of input and corresponding outputs from a linear function, find the best fit line using linear regression.
Find the least squares regression line using the cricket-chirp data in Table 2.
Notice that this line is quite similar to the equation we “eyeballed” but should fit the data better. Notice also that using this equation would change our prediction for the temperature when hearing 30 chirps in 15 seconds from 66 degrees to the value given by the regression equation, rounded to 1 decimal place. The graph of the scatter plot with the least squares regression line is shown in Figure 6.
Will there ever be a case where two different lines will serve as the best fit for the data?
No. There is only one best fit line.
As we saw above with the cricket-chirp model, some data exhibit strong linear trends, but other data, like the final exam scores plotted by age, are clearly nonlinear. Most calculators and computer software can also provide us with the correlation coefficient, which is a measure of how closely the line fits the data. Many graphing calculators require the user to turn a “diagnostic on” selection to find the correlation coefficient, which mathematicians label as $r$. The correlation coefficient provides an easy way to get an idea of how close to a line the data falls.
We should compute the correlation coefficient only for data that follows a linear pattern or to determine the degree to which a data set is linear. If the data exhibits a nonlinear pattern, the correlation coefficient for a linear regression is meaningless. To get a sense for the relationship between the value of $r$ and the graph of the data, Figure 7 shows some large data sets with their correlation coefficients. Remember, for all plots, the horizontal axis shows the input and the vertical axis shows the output.
The correlation coefficient is a value, $r$, between $-1$ and 1.
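To see how the value behaves at the extremes, here is a minimal sketch that computes the Pearson correlation coefficient directly from its definition. The three small data sets are hypothetical, made up only to illustrate the range of possible values, and the function name `correlation` is an assumption.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson correlation coefficient r for paired samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data, made up for illustration only.
xs = [1, 2, 3, 4, 5]
rising = [3, 5, 7, 9, 11]    # exactly y = 2x + 1
falling = [10, 8, 6, 4, 2]   # exactly y = -2x + 12
noisy = [2, 9, 4, 8, 3]      # no clear trend

r_rising = correlation(xs, rising)
r_falling = correlation(xs, falling)
r_noisy = correlation(xs, noisy)
print(r_rising, r_falling, r_noisy)
```

Points lying exactly on an increasing line give a value of 1, points exactly on a decreasing line give $-1$, and data with no linear trend give a value near 0.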
Calculate the correlation coefficient for cricket-chirp data in Table 1.
Because the data appear to follow a linear pattern, we can use technology to calculate $r$. Enter the inputs and corresponding outputs and select Linear Regression. The calculator will also provide the correlation coefficient, $r$. For this data the value is very close to 1, which suggests a strong increasing linear relationship.
Note: For some calculators, the Diagnostics must be turned “on” in order to get the correlation coefficient when linear regression is performed: , then scroll to DIAGNOSTICSON.
Once we determine that a set of data is linear using the correlation coefficient, we can use the regression line to make predictions. As we learned above, a regression line is a line that is closest to the data in the scatter plot, which means that only one such line is a best fit for the data.
Gasoline consumption in the United States has been steadily increasing. Consumption data from 1994 to 2004 is shown in Table 3. Determine whether the trend is linear, and if so, find a model for the data. Use the model to predict the consumption in 2008.
|Year|1994|1995|1996|1997|1998|1999|2000|2001|2002|2003|2004|
|Consumption (billions of gallons)|113|116|118|119|123|125|126|128|131|133|136|
The scatter plot of the data, including the least squares regression line, is shown in Figure 8.
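We can introduce a new input variable, $t$, representing years since 1994.
The least squares regression equation is:
$C(t) = 113.318 + 2.209t$
Using technology, the correlation coefficient was calculated to be 0.9965, suggesting a very strong increasing linear trend.
Using this to predict consumption in 2008 ($t = 14$), we have
$C(14) = 113.318 + 2.209(14) = 144.244$
The model predicts about 144.244 billion gallons of gasoline consumption in 2008.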
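The regression for the gasoline data can be reproduced without a calculator. This sketch assumes the eleven consumption values listed for Table 3 correspond to the years 1994 through 2004 in order, and it uses the closed-form least squares formulas rather than any particular calculator routine; all variable names are made up for illustration.

```python
from math import sqrt

# Consumption in billions of gallons, assumed to be one value per year, 1994-2004.
years = list(range(1994, 2005))
consumption = [113, 116, 118, 119, 123, 125, 126, 128, 131, 133, 136]

# New input variable: years since 1994.
t = [year - 1994 for year in years]

n = len(t)
mean_t = sum(t) / n
mean_c = sum(consumption) / n

# Sums of squared and cross deviations from the means.
s_tc = sum((ti - mean_t) * (ci - mean_c) for ti, ci in zip(t, consumption))
s_tt = sum((ti - mean_t) ** 2 for ti in t)
s_cc = sum((ci - mean_c) ** 2 for ci in consumption)

slope = s_tc / s_tt                    # least squares slope
intercept = mean_c - slope * mean_t    # least squares intercept
r = s_tc / sqrt(s_tt * s_cc)           # correlation coefficient

# Extrapolate to 2008, which is 14 years after 1994.
prediction_2008 = intercept + slope * (2008 - 1994)

print(round(intercept, 3), round(slope, 3))
print(round(r, 4))
print(round(prediction_2008, 3))
```

Because this code keeps the coefficients at full precision, its prediction can differ in the last decimal place from one computed with coefficients already rounded to three decimals. Note also that 2008 lies outside the 1994 to 2004 data, so this prediction is an extrapolation.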