A simple approach to this problem would be to ask around and see what other people pay for a comparable apartment. For the sake of simplicity, let's assume you have gathered data \((s_i,p_i)\) for a number of comparable places, where \(s_i\) denotes the square footage of the place and \(p_i\) the price of the rent. You could then plot these points in \(\mathbb{R}^2\), and it could look something like this: Clearly, it seems like there is a linear relation between the \(s_i\) and \(p_i\). You thus suppose there is a line that describes this behavior, given by \(L(s)=\hat{\beta}_0+\hat{\beta}_1 s\). A good estimate for the appropriate rent you should pay would then be \(L(s_0)\), where \(s_0\) is the square footage of your place! This is exactly what linear regression allows you to do.

It is a very interesting observation that this problem at first seems in no way to fit in the statistical framework: there are no probabilities or statistical models to be found here. In fact, the problem is purely geometric (fitting a line through points). It makes sense to first see if you can't develop some math to solve it simply using the geometry (and calculus) of the plane.

To this end, we assume given points \((x_i,y_i)\in \mathbb{R}^2\) and wish to find the line \(L\) which minimizes a certain value that will stand for the total error between the points and the line. Recall that a line has the form \(y=\beta_0+\beta_1x\). For a point \((x_i,y_i)\), the error will be \(\epsilon_{i}=y_i-\beta_0-\beta_1x_i\). If we want to measure the total error over all points, it makes sense to consider \(\epsilon\) as a vector with \(n\) components and to compute its magnitude (i.e. its norm) \[\vert \vert \epsilon\vert \vert =\sqrt{\sum_i \epsilon^2_i}=\sqrt{\sum_i (y_i-\beta_0-\beta_1x_i)^2}\] In other words, the solution to our geometric problem is a pair of values \(\hat{\beta}_0,\hat{\beta}_1\) such that \[(\hat{\beta}_0,\hat{\beta}_1)=\operatorname{argmin}_{\beta_0,\beta_1}\vert \vert \epsilon\vert \vert\] To find these values, recall that it suffices to view \(\vert\vert\epsilon\vert\vert\) as a function of \(\beta_0,\beta_1\) and to find the values where the gradient is zero:
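To make the objective concrete, here is a minimal sketch of the quantity we are about to minimize. The notes contain no code, so the use of Python with NumPy is an assumption on my part:

```python
import numpy as np

def residual_norm(beta0, beta1, x, y):
    """Norm of the error vector ||eps|| for a candidate line y = beta0 + beta1 * x."""
    eps = y - beta0 - beta1 * x       # eps_i = y_i - beta0 - beta1 * x_i
    return np.sqrt(np.sum(eps ** 2))  # ||eps|| = sqrt(sum_i eps_i^2)
```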

Let \[F: (\beta_0,\beta_1)\longmapsto \sum_i (y_i-\beta_0-\beta_1x_i)^2\] (since the square root is monotone, minimizing \(\vert\vert\epsilon\vert\vert\) is the same as minimizing \(F=\vert\vert\epsilon\vert\vert^2\), which is easier to differentiate).
Then \(\nabla F(\beta_0,\beta_1)=(0,0)\) iff \[(\beta_0,\beta_1)=\left(\overline{y}-\beta_1\overline{x},\;\frac{\sum_i (x_i-\overline{x})(y_i-\overline{y})}{\sum_i (x_i-\overline{x})^2}\right)=\left(\overline{y}-\beta_1\overline{x},\;\frac{\overline{xy}-\overline{x}\,\overline{y}}{\overline{x^2}-\overline{x}^2}\right)\]
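As a quick sanity check, the closed-form solution above translates directly into code; a sketch under the same NumPy assumption as before:

```python
import numpy as np

def fit_line(x, y):
    """Closed-form least-squares estimates (beta0_hat, beta1_hat)."""
    x_bar, y_bar = x.mean(), y.mean()
    # beta1_hat = sum (x_i - x_bar)(y_i - y_bar) / sum (x_i - x_bar)^2
    beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    beta0 = y_bar - beta1 * x_bar  # beta0_hat = y_bar - beta1_hat * x_bar
    return beta0, beta1
```

For comparison, `np.polyfit(x, y, 1)` computes the same two coefficients (returned highest degree first).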

In other words, rephrasing this problem geometrically, the above result shows:
For points \((x_i,y_i) \in \mathbb{R}^2\), the linear regression is given by the line \[\hat{\beta}_0+\hat{\beta}_1 x\]
Where
\[\hat{\beta}_1=\frac{\overline{xy}-\overline{x}\,\overline{y}}{\overline{x^2}-\overline{x}^2} \text{ and } \hat{\beta}_0=\overline{y}-\hat{\beta}_1\overline{x}\]

This is rather satisfying for a few reasons, not least of which is the fact that there is a nice statistical interpretation of these numbers. First, the expression for \(\hat{\beta}_1\) can be written as \[\hat{\beta}_1=\frac{c^2}{s^2}\] where \(c^2\) is the sample covariance of the \(x_i\) and \(y_i\) and \(s^2\) is the sample variance of the \(x_i\). \(\hat{\beta}_0\) in turn is the sample mean of the data \(y_i-\hat{\beta}_1x_i\) (another easy way to remember these formulas).
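A quick numerical check of this interpretation, on hypothetical rent data of my own invention (note that `np.cov` and `np.var` with `ddof=1` both use the \(n-1\) convention, so the normalizing factors cancel in the ratio):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(500, 2000, size=50)              # hypothetical square footages
y = 1.5 * x + 300 + rng.normal(0, 100, size=50)  # hypothetical rents

cov_xy = np.cov(x, y)[0, 1]      # sample covariance of x and y
var_x = np.var(x, ddof=1)        # sample variance of x
beta1 = cov_xy / var_x           # matches the closed-form slope
beta0 = np.mean(y - beta1 * x)   # sample mean of the data y_i - beta1_hat * x_i
print(beta0, beta1)
```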

More importantly, we showed in class that \[(\hat{\beta}_0,\hat{\beta}_1)=\operatorname{argmax}_{\beta_0,\beta_1}\prod_i\frac{1}{\sqrt{2\pi\sigma^2}}\exp\bigg(-\frac{1}{2\sigma^2}(y_i-\beta_0-\beta_1 x_i)^2\bigg)\] In other words, if for any choice of parameters \(\beta_0,\beta_1\) we consider the random variable \[\epsilon:\mathbb{R}^2\longrightarrow\mathbb{R}:(x,y)\longmapsto y-\beta_0-\beta_1 x\] and assume it is normally distributed with mean \(0\) and variance \(\sigma^2\), then the least-squares estimates are exactly the maximum likelihood estimates. We further wish to investigate the properties of these various quantities as estimators.
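One way to convince yourself of this equivalence numerically is to minimize the negative log-likelihood directly and compare with the closed-form estimates. A sketch assuming SciPy is available, with \(\sigma\) held fixed and hypothetical data:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.7 * x + rng.normal(0, 1, size=100)  # hypothetical data

def neg_log_likelihood(params, sigma=1.0):
    beta0, beta1 = params
    eps = y - beta0 - beta1 * x
    # Up to additive constants, -log L = (1 / (2 sigma^2)) * sum_i eps_i^2
    return np.sum(eps ** 2) / (2 * sigma ** 2)

mle = minimize(neg_log_likelihood, x0=[0.0, 0.0]).x

# Compare with the closed-form least-squares solution:
x_bar, y_bar = x.mean(), y.mean()
beta1_ls = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_ls = y_bar - beta1_ls * x_bar
print(mle, (beta0_ls, beta1_ls))  # the two should agree up to solver tolerance
```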

For any choice of parameters \((\beta_0,\beta_1)\), if the data are generated as \(Y_i=\beta_0+\beta_1 x_i+\epsilon_i\) with independent errors of mean \(0\) and variance \(\sigma^2\), then (see the simulation sketch after this list):

- \(\mathbb{E}[\hat{\beta}_0]=\beta_0\) (i.e. \(\hat{\beta}_0\) is an unbiased estimator of the characteristic \(\beta_0\))
- \(\mathbb{E}[\hat{\beta}_1]=\beta_1\) (i.e. \(\hat{\beta}_1\) is an unbiased estimator of the characteristic \(\beta_1\))
- \(Var[\hat{\beta}_0]=\sigma^2\bigg(\frac{1}{n}+\frac{\overline{x}^2}{\sum_i (x_i-\overline{x})^2} \bigg)\)
- \(Var[\hat{\beta}_1]=\frac{\sigma^2}{\sum_i (x_i-\overline{x})^2}\)
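These claims can be checked empirically by simulating many datasets from the model and comparing the sample mean and variance of the estimates with the formulas above; a sketch with hypothetical parameter values:

```python
import numpy as np

rng = np.random.default_rng(2)
beta0_true, beta1_true, sigma = 1.0, 2.0, 0.5
x = np.linspace(0, 1, 30)                  # fixed design points
sxx = np.sum((x - x.mean()) ** 2)

estimates = []
for _ in range(10_000):
    y = beta0_true + beta1_true * x + rng.normal(0, sigma, size=x.size)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    b0 = y.mean() - b1 * x.mean()
    estimates.append((b0, b1))
est = np.array(estimates)

print(est.mean(axis=0))   # approx (beta0_true, beta1_true): unbiasedness
print(est.var(axis=0))    # compare with the two variance formulas:
print(sigma**2 * (1 / x.size + x.mean()**2 / sxx), sigma**2 / sxx)
```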

To show the second claim, write \(\hat{\beta}_1=\sum_i w_i Y_i\) with weights \(w_i=\frac{x_i-\overline{x}}{\sum_j (x_j-\overline{x})^2}\), and note that \(\sum_i w_i=0\) and \(\sum_i w_i x_i=1\). Then \[\mathbb{E}[\hat{\beta}_1]=\sum_i w_i \mathbb{E}[Y_i]=\sum_i w_i(\beta_0+\beta_1 x_i)=\beta_0\sum_i w_i+\beta_1\sum_i w_i x_i=\beta_1 \]
The first claim follows from the second: since \(\hat{\beta}_0=\overline{Y}-\hat{\beta}_1\overline{x}\), we get \(\mathbb{E}[\hat{\beta}_0]=\mathbb{E}[\overline{Y}]-\beta_1\overline{x}=\beta_0+\beta_1\overline{x}-\beta_1\overline{x}=\beta_0\).

To compute the variance of \(\hat{\beta}_1\), we use the independence of the \(Y_i\) and the fact that \(\sum_i w_i^2=\frac{1}{\sum_i (x_i-\overline{x})^2}\): \[Var[\hat{\beta}_1]=Var\Big[\sum_i w_i Y_i\Big]=\sum_i w_i^2 Var[Y_i]=\sigma^2\sum_i w_i^2=\frac{\sigma^2}{\sum_i (x_i-\overline{x})^2}\] We leave the computation of \(Var[\hat{\beta}_0]\) as a (long) exercise.


created by Louis de Thanhoffer de Volcsey with thanks to Oliver Capon