Lesson

When we want to analyse the fit of a linear model to a bivariate set of data, we start by analysing the value of the correlation coefficient.

However, we might find we have a strong value for $r$ but, on looking at the data more closely, realise that the relationship is not actually linear but curved.

The scatter plots below illustrate this idea. Both sets of data have a strong correlation, with similar lines of best fit.

On the left, the data points appear to be scattered randomly above and below the least-squares regression line. This randomness is expected when the linear model is suitable for the data.

However, on the right, the scatter plot shows a distinct pattern in the arrangement of the data points: they start below the line of best fit, move above the line, then return below it. Any pattern such as this suggests that the linear model is not appropriate.

**Linear data**

**Non-linear data**

To help us recognise any patterns and determine the suitability of a linear model, we can use a tool called a residual plot.

Residuals are the vertical distances from each data point to a line. When your calculator determines the least-squares regression line, it is minimising the residuals (actually the sum of the squares of the residuals) to choose the optimal coefficients for the line of best fit.
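To make the idea of "minimising the sum of the squares of the residuals" concrete, here is a minimal Python sketch. The data values and the comparison line $y=1.5x+1$ are made up for illustration, and the function names are hypothetical; a calculator performs an equivalent calculation internally.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_residuals(xs, ys, slope, intercept):
    """Add up the squared vertical distances from each point to the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Illustrative data only (not taken from the lesson).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

best_slope, best_intercept = least_squares_line(xs, ys)
best_total = sum_squared_residuals(xs, ys, best_slope, best_intercept)

# Any other line, such as y = 1.5x + 1, gives a larger total.
other_total = sum_squared_residuals(xs, ys, 1.5, 1.0)

print(best_total < other_total)  # True
```

Trying other slopes and intercepts in the last call should never produce a total smaller than `best_total`.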

The scatter plots below show how the residuals are short when the line of best fit is chosen appropriately, and longer for a line that is a poor fit to the data.

**Good fit (least-squares regression line)**

**Poor fit**

To explore further, use this applet to move the points and show how the residuals are measured vertically from the least-squares regression line. Then switch to the "Residuals" to see how these residuals can be converted to a residual plot.

Summary

Line of best fit - The line which most closely models a set of bivariate data.

Least-squares regression - A technique for finding the line of best fit, which would then be called the **least-squares regression line**. This technique involves minimising the sum of the squares of the residuals, which is best done with technology.

Residual - The vertical distance between a data point and the line of best fit.

Residual plot - A graph that displays the residual for each point, rather than the actual data points.

Calculating residuals

$\text{Residual}=\text{Actual value}-\text{Predicted value}$

$\text{Residual}=y-\hat{y}$

Remember that the predicted value, $\hat{y}$, is obtained from the equation of the least-squares regression line.

A **positive** residual means the actual data point is **above** the least-squares regression line and a **negative** residual means the actual data point is **below** the line.

Using the above relationships between the residual, actual value and predicted values, we are able to calculate any one of these values if we know the other two.

For instance, if the predicted value is $22$ and the actual value is $19$, then we can calculate the residual:

$\text{residual}=y-\hat{y}$

$=19-22$

$=-3$

If the residual is equal to $5$ and the predicted value is $18$, then we can calculate the actual value, with some rearranging to solve the equation:

$\text{residual}=y-\hat{y}$

$5=y-18$

$y=5+18$

$=23$

Similarly, if the residual is equal to $-7$ and the actual value is $4$, then we can calculate the predicted value (without knowing the equation of the least-squares regression line):

$\text{residual}=y-\hat{y}$

$-7=4-\hat{y}$

$\hat{y}=4+7$

$=11$
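The three rearrangements above can be checked with a short Python sketch. The function names are hypothetical and chosen only for this illustration.

```python
def residual(actual, predicted):
    """residual = actual value - predicted value"""
    return actual - predicted


def actual_from(residual_value, predicted):
    """Rearranged: actual value = residual + predicted value"""
    return residual_value + predicted


def predicted_from(residual_value, actual):
    """Rearranged: predicted value = actual value - residual"""
    return actual - residual_value


print(residual(19, 22))       # -3
print(actual_from(5, 18))     # 23
print(predicted_from(-7, 4))  # 11
```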

The following table shows a set of data $\left(x,y\right)$ and the predicted $\hat{y}$ values based on a least-squares regression line. Complete the table by finding the residuals.

$x$ | $1$ | $3$ | $5$ | $7$ | $9$ |
---|---|---|---|---|---|
$y$ | $22.7$ | $22.3$ | $24.2$ | $21.8$ | $21.5$ |
$\hat{y}$ | $25.2$ | $23.4$ | $21.6$ | $19.8$ | $18$ |
Residual | | | | | |

The residual plot for a set of data is shown below.

Which of these scatter plots shows the original data set?


Worked example

The table shows a company's profit $P$ (in millions of dollars) for total monthly sales $S$. The equation $P=0.4S-10$ is being used to model the data.

**(a)** Complete the table with the predicted profit and residuals, based on the linear model.

$S$ | $P$ | $\hat{P}$ | $P-\hat{P}$ |
---|---|---|---|
$30$ | $-8$ | | |
$80$ | $24$ | | |
$50$ | $12$ | | |
$100$ | $23$ | | |
$60$ | $17$ | | |
$70$ | $23$ | | |
$90$ | $24$ | | |
$40$ | $3$ | | |

**Think:** Calculate the predicted value of $P$ and the residual for each of the given $S$ values.

**Do:** The residual is calculated using the formula $\text{residual}=y-\hat{y}$, which here is $P-\hat{P}$.

- Calculate the predicted values $\hat{P}$:
  - substitute each value of $S$ into the equation of the model, $P=0.4S-10$.
- Calculate the residual values:
  - subtract the predicted value $\hat{P}$ from the corresponding actual value of $P$.

The required substitutions and calculations for the first row are:

$\hat{P}=0.4\times30-10$

$=2$

$\text{residual}=-8-2$

$=-10$

The remaining values are shown in the completed table:

Sales $S$ | Profit $P$ | Predicted profit $\hat{P}$ | Residual $P-\hat{P}$ |
---|---|---|---|
$30$ | $-8$ | $2$ | $-10$ |
$80$ | $24$ | $22$ | $2$ |
$50$ | $12$ | $10$ | $2$ |
$100$ | $23$ | $30$ | $-7$ |
$60$ | $17$ | $14$ | $3$ |
$70$ | $23$ | $18$ | $5$ |
$90$ | $24$ | $26$ | $-2$ |
$40$ | $3$ | $6$ | $-3$ |
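The completed table can be reproduced with a few lines of Python. This is only a sketch: the variable and function names are hypothetical, and the rounding is there to hide tiny floating-point errors rather than because the lesson requires it.

```python
# Data from the table in part (a).
sales = [30, 80, 50, 100, 60, 70, 90, 40]
profit = [-8, 24, 12, 23, 17, 23, 24, 3]


def predicted_profit(s):
    """Predicted profit from the model P = 0.4S - 10."""
    return round(0.4 * s - 10, 2)


rows = []
for s, p in zip(sales, profit):
    p_hat = predicted_profit(s)
    # residual = actual - predicted
    rows.append((s, p, p_hat, round(p - p_hat, 2)))

for s, p, p_hat, res in rows:
    print(s, p, p_hat, res)
```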

**(b)** Construct a residual plot for the data in part (a).

**Think:** Each value of $S$ and the corresponding residual value will make up the coordinates for each point on the residual plot.

**Do:** Construct the graph, choosing appropriate scales and labelling the axes. Take care to place each point accurately.

**(c)** Is this model a good fit for the data? Justify your answer.

**Think:** If the linear model is a good fit, the residual plot should show a random scattering of points above and below $0$, with no obvious pattern.

**Do:** No, a linear model is not a good fit for this data as there is a pattern present in the residual plot.
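One way to see the pattern without drawing the graph is to list the residuals in order of increasing $S$ and look at their signs. The sketch below does this in Python; the helper name `sign` is hypothetical, and a sign check like this is only a rough aid, not a replacement for inspecting the plot.

```python
# Sales values and residuals from the completed table in part (a).
sales = [30, 80, 50, 100, 60, 70, 90, 40]
residuals = [-10, 2, 2, -7, 3, 5, -2, -3]


def sign(value):
    """Return '+', '-' or '0' depending on the sign of the residual."""
    if value > 0:
        return "+"
    if value < 0:
        return "-"
    return "0"


# Sort the points from left to right, as they would appear on the plot.
ordered = sorted(zip(sales, residuals))
signs = "".join(sign(r) for _, r in ordered)

print(ordered[0], ordered[-1])  # (30, -10) (100, -7)
print(signs)                    # --++++--
```

The residuals are negative at both ends and positive in the middle, which is the curved pattern that makes the linear model a poor fit.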

Calculating residuals and constructing the residual plot manually for a large set of data is tedious, so we can use our CAS calculator to do this for us.

Select the brand of calculator you use below to work through an example of using a calculator to generate a residual plot.

Casio Classpad

How to use the CASIO Classpad to complete the following tasks regarding creating residual plots.

Consider the data set given below:

$x$ | $2$ | $4$ | $5$ | $7$ | $11$ | $15$ | $16$ | $19$ | $22$ | $25$ |
---|---|---|---|---|---|---|---|---|---|---|
$y$ | $1.5$ | $5.8$ | $6.9$ | $13.2$ | $20.0$ | $34.5$ | $34.7$ | $41.0$ | $49.2$ | $55.1$ |

Use your calculator to generate the residual plot associated with the least squares regression line for the data.

TI Nspire

How to use the TI Nspire to complete the following tasks regarding creating residual plots.

Consider the data set given below:

$x$ | $2$ | $4$ | $5$ | $7$ | $11$ | $15$ | $16$ | $19$ | $22$ | $25$ |
---|---|---|---|---|---|---|---|---|---|---|
$y$ | $1.5$ | $5.8$ | $6.9$ | $13.2$ | $20.0$ | $34.5$ | $34.7$ | $41.0$ | $49.2$ | $55.1$ |

Use your calculator to generate the residual plot associated with the least squares regression line for the data.
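If a CAS calculator is not at hand, the same residuals can be generated with a short Python sketch. This is not the calculator's own procedure: it simply applies the standard least-squares formulas to the data set above, and all variable names are hypothetical.

```python
# Data set from the calculator example.
xs = [2, 4, 5, 7, 11, 15, 16, 19, 22, 25]
ys = [1.5, 5.8, 6.9, 13.2, 20.0, 34.5, 34.7, 41.0, 49.2, 55.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Standard least-squares formulas for the slope and intercept.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# residual = actual - predicted, paired with x for plotting.
residual_points = [(x, y - (slope * x + intercept)) for x, y in zip(xs, ys)]

for x, res in residual_points:
    print(x, round(res, 2))
```

Plotting each `(x, residual)` pair gives the residual plot. For a least-squares regression line the residuals always sum to zero (up to rounding), which is a useful check.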

Once we've plotted our residuals against the independent variable, we want to analyse the plot for the suitability of using a linear regression model.

Analysing residuals

If the linear model is suitable:

- The residuals are randomly scattered above and below the horizontal axis
- No clustering of the residuals
- Residuals are relatively small in size

If the linear model is NOT suitable:

- The residual plot will show a clear pattern and/or
- The residuals are relatively large in size

Here are some examples where the residual plot indicates that a linear model is suitable or not.

**Linear model is suitable**

**Linear model is NOT suitable**

If we take a look at the image below, we see on the left a scatterplot and a linear regression line fitted to some data. On the right we see the residual plot for the data.

Were we to only look at the scatterplot and the strong correlation ($r=0.994$), we'd assume a linear model was appropriate. But when we examine the residual plot, there is certainly a pattern evident in the residuals (in this case, a parabolic pattern) and so we might need to rethink what sort of model might best suit this data.
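The same effect can be reproduced with made-up data. The sketch below uses the hypothetical data set $y=x^2$ for $x=1$ to $10$ (not the data in the image) and shows that the correlation coefficient is close to $1$ even though the residuals follow a clear curved pattern.

```python
import math

# Hypothetical curved data: y = x^2 (not the data from the image).
xs = list(range(1, 11))
ys = [x ** 2 for x in xs]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)

# Correlation coefficient and least-squares regression line.
r = sxy / math.sqrt(sxx * syy)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# residual = actual - predicted
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
signs = "".join("+" if res > 0 else "-" for res in residuals)

print(round(r, 3))
print(signs)  # ++------++
```

The residuals are positive at both ends and negative in the middle, so a strong value of $r$ alone does not confirm that a linear model is suitable.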

The table below shows the residual values after a least-squares regression line has been fitted to a set of data.

$x$ | $12$ | $20$ | $10$ | $18$ | $9$ | $20$ |
---|---|---|---|---|---|---|
Residual | $-4$ | $-2$ | $5$ | $2$ | $3$ | $-1$ |

Create a residual plot for this data set.

Which of the following best describes the suitability of a linear model for this data set?

A. A linear model is suitable because there is a distinct pattern in the residual plot.

B. A linear model is not suitable because there is no pattern in the residual plot.

C. A linear model is not suitable because there is a distinct pattern in the residual plot.

D. A linear model is suitable because there is no pattern in the residual plot.

The table shows a company's costs $y$ (in millions) in week $x$. The equation $y=5x+12$ is being used to model the data.

Complete the table of residuals:

$x$ | $y$ | Value generated by model | Residual |
---|---|---|---|
$1$ | $22$ | | |
$2$ | $25$ | | |
$4$ | $33$ | | |
$6$ | $39$ | | |
$9$ | $53$ | | |
$12$ | $69$ | | |
$14$ | $81$ | | |
$17$ | $99$ | | |

Plot the residuals on the scatter plot.

Is this model a good fit for the data?

A. Yes

B. No

use a residual plot to assess the appropriateness of fitting a linear model to the data