As a first measure of fit, we can consider the sum of squares of the residuals:\[SSE=\sum_i \hat{\epsilon}^2_i=\sum_i (y_i-\hat{y}_i)^2\] We will also be interested in the variability (up to a constant factor) of the predicted values \(\hat{y}_i\) and the response values (or labels) \(y_i\). These are called the regression sum of squares and total sum of squares respectively: \[SSR=\sum_i (\hat{y}_i-\overline{\hat{y}})^2\] \[SST = \sum_i (y_i-\overline{y})^2\] The SSR in particular can be simplified a little, but before we do that, we prove the following lemma:

We have

- \(\sum_i \hat{\epsilon}_i=0\)
- \(\sum_i \hat{y}_i\hat{\epsilon}_i=0\)
- \(\overline{y}=\overline{\hat{y}}\)

For the first claim, we note that \[ \sum_i \hat{\epsilon}_i = \sum_i (y_i-\hat{\beta}_0-\hat{\beta}_1 x_i)= n(\overline{y}-\hat{\beta}_0-\hat{\beta}_1\overline{x})=0\]
by definition of \(\hat{\beta}_0\). For the second claim, we first show that \(\sum_i x_i\hat{\epsilon}_i=0\).

This turns out to be equivalent to \(\frac{\partial}{\partial\beta_1} L\bigg((x_1,\ldots, x_n)\,\vert\, (\beta_0,\beta_1)\bigg)=0\), which we know to be true from last week. Concluding that \(\sum_i \hat{y}_i\hat{\epsilon}_i=0\) is rather straightforward now, since \[\sum_i(\hat{\beta}_0+\hat{\beta}_1x_i)\hat{\epsilon}_i=\hat{\beta}_0\sum_i\hat{\epsilon}_i+\hat{\beta}_1\sum_ix_i\hat{\epsilon}_i=0+0=0\] The last claim follows from the first by dividing by \(n\).
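The two identities in the lemma are easy to confirm numerically. Below is a minimal sketch (the data are made up, and the fit uses the closed-form least-squares estimates from last week, \(\hat{\beta}_1=S_{xy}/S_{xx}\), \(\hat{\beta}_0=\overline{y}-\hat{\beta}_1\overline{x}\)):

```python
import random

random.seed(0)
n = 20
x = [random.gauss(0, 1) for _ in range(n)]
y = [2.0 + 3.0 * xi + random.gauss(0, 1) for xi in x]

# Closed-form least-squares estimates: b1 = Sxy/Sxx, b0 = ybar - b1*xbar
xbar = sum(x) / n
ybar = sum(y) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
     / sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar

y_hat = [b0 + b1 * xi for xi in x]
resid = [yi - yh for yi, yh in zip(y, y_hat)]

print(abs(sum(resid)) < 1e-9)                                  # sum of residuals is 0
print(abs(sum(yh * e for yh, e in zip(y_hat, resid))) < 1e-9)  # sum of y_hat_i * eps_i is 0
```

Both checks print `True`: the identities hold exactly in the algebra, so the sums vanish up to floating-point rounding.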


We have \[SSR=\sum_i (\hat{y}_i-\overline{y})^2\]

We have \[SST=SSE+SSR\]

Clearly \[SST=\sum (y_i-\hat{y}_i+\hat{y}_i-\overline{y})^2=\sum_i (y_i-\hat{y}_i)^2+2\sum(y_i-\hat{y}_i)(\hat{y}_i-\overline{y})+\sum(\hat{y}_i-\overline{y})^2\]
The first term in this sum is SSE and the last is SSR by the lemma. Finally, the middle term is zero, as it equals \[2\sum_i \hat{\epsilon}_i(\hat{y}_i-\overline{y})=2\sum_i \hat{\epsilon}_i\hat{y}_i-2\,\overline{y}\sum_i \hat{\epsilon}_i=0-0=0\]
by the lemma again.
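The decomposition \(SST=SSE+SSR\) can also be confirmed numerically; here is a minimal sketch on made-up data, again using the closed-form fit from last week:

```python
import random

random.seed(1)
n = 30
x = [random.uniform(0, 10) for _ in range(n)]
y = [1.0 + 0.5 * xi + random.gauss(0, 2) for xi in x]

# Least-squares fit
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
     / sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
y_hat = [b0 + b1 * xi for xi in x]

# The three sums of squares
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
ssr = sum((yh - ybar) ** 2 for yh in y_hat)
sst = sum((yi - ybar) ** 2 for yi in y)

print(abs(sst - (sse + ssr)) < 1e-9)  # SST = SSE + SSR
```

This prints `True`; note the identity relies on the fitted values coming from a least-squares fit with an intercept — for an arbitrary prediction rule the cross term need not vanish.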

This yields yet another important measure of how well a linear regression model fits the data:
The coefficient of determination is the value \[R^2=\frac{SSR}{SST}\]

The \(R^2\)-coefficient has the following properties:

- \(0\le R^2\le 1\)
- \(R^2=1-\frac{SSE}{SST}\)
- \(R^2=1\) iff the predictions coincide with the responses
- \(R^2=0\) iff \(\hat{\beta_1}=0\)

These are all easily verified: the second follows from the decomposition \(SST=SSE+SSR\), which also gives the first since \(SSE,SSR\ge 0\); for the last two, note that \(SSE=0\) exactly when \(\hat{y}_i=y_i\) for all \(i\), and \(SSR=0\) exactly when every \(\hat{y}_i\) equals \(\overline{y}\), i.e. \(\hat{\beta}_1=0\).
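A quick numerical illustration of the first two properties (a sketch with made-up data; the fit is again the closed-form one from last week):

```python
import random

random.seed(3)
n = 25
x = [random.uniform(-5, 5) for _ in range(n)]
y = [4.0 - 1.5 * xi + random.gauss(0, 1) for xi in x]

# Least-squares fit
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
     / sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
y_hat = [b0 + b1 * xi for xi in x]

sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
ssr = sum((yh - ybar) ** 2 for yh in y_hat)
sst = sum((yi - ybar) ** 2 for yi in y)
r2 = ssr / sst

print(0.0 <= r2 <= 1.0)                    # first property
print(abs(r2 - (1.0 - sse / sst)) < 1e-9)  # second property
```

Both checks print `True`. Repeating the experiment with the noise term removed makes every \(\hat{y}_i=y_i\) and drives \(R^2\) to \(1\), in line with the third property.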

We now wish to investigate the properties of these quantities as estimators. That is, after a choice of parameters \((\beta_0,\beta_1)\) we consider the random variables \[Y_i=\beta_0+\beta_1 x_i +\epsilon_i \text{ where } \epsilon_i\sim N(0,\sigma^2) \text{ independently}\]
Simply replacing the \(y_i\)'s with the \(Y_i\)'s now turns SSE, SSR, SST and \(R^2\) into random variables. Unfortunately, it is not true that SSE forms an unbiased estimator for \(\sigma^2\) (compare with the bias of the sample variance estimator). However, we do have:
Let \(s^2=\frac{1}{n-2}SSE\). Then \(\mathbb{E}[s^2]=\sigma^2\), i.e. \(s^2\) is an unbiased estimator for the variance.
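Unbiasedness of \(s^2\) can be illustrated with a small Monte Carlo experiment: fix the \(x_i\) and \(\sigma^2\), simulate the \(Y_i\) many times, and average the resulting values of \(s^2\). The sketch below (made-up parameters, closed-form fit from last week) checks that the average lands close to the true \(\sigma^2=4\):

```python
import random

random.seed(2)
n, sigma2, reps = 10, 4.0, 20000
x = [float(i) for i in range(n)]
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)

total = 0.0
for _ in range(reps):
    # Simulate Y_i = b0 + b1*x_i + eps_i with eps_i ~ N(0, sigma2)
    y = [1.0 + 2.0 * xi + random.gauss(0, sigma2 ** 0.5) for xi in x]
    ybar = sum(y) / n
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    total += sse / (n - 2)  # s^2 for this simulated dataset

print(abs(total / reps - sigma2) < 0.2)  # Monte Carlo mean of s^2 is close to sigma^2
```

This prints `True`; replacing the divisor \(n-2\) by \(n\) in the loop drags the average down towards \(\sigma^2\frac{n-2}{n}\), exhibiting the bias discussed above.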

The factor \(\frac{1}{n-2}\) has an intuitive interpretation (this requires a little B24 knowledge): if we wish to consider the subspace of *residuals* \((\hat{\epsilon}_i)_{i=1}^n=(y_i-\hat{y}_i)_{i=1}^n\) in \(\mathbb{R}^n\) for any choice of labels \(y_i \in \mathbb{R}\), then by the lemma above this subspace satisfies two linear equations, \(\sum_i \hat{\epsilon}_i=0\) and \(\sum_i x_i \hat{\epsilon}_i=0\). This space is thus \((n-2)\)-dimensional. In general, to make an estimator unbiased, one should divide by the 'dimension of the subspace it spans', also referred to as the degrees of freedom.

Created by Louis de Thanhoffer de Volcsey with thanks to Oliver Capon