Week 3

This week, we discussed Chapter 5 of the textbook, which introduces the definition of statistical inference.


We began with a discussion of the final (and most important) example of a distribution: the normal distribution. Recall that this is the distribution whose density is given by \[ f_{\mu,\sigma}(x)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}} \] In order to motivate this distribution, we needed a little more terminology:
Let \( X:(S,P)\longrightarrow T\) be a random variable. A random sample is a sequence of random variables \(X_1,\ldots, X_n\) such that
  • \(X_i\) and \(X_j\) are independent for \(i\neq j\)
  • \(P_{X_i}= P_X\) for any \(i\)
    I went on to explain how random samples always exist by considering the RVs \[ X_i:(S^n,P^n)\longrightarrow T: (s_1,\ldots, s_n)\mapsto X(s_i)\] (in more human language: if \(S\) consisted of 100 numbered balls and \(X\) was the RV that records a ball's number, then \(X_{20}\), out of a random sample of size 100, would be the RV that draws 100 balls and records the number of the 20th).
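    As a quick sanity check, here is a minimal Python sketch of this construction for the 100-ball example (numpy, the seed, and all names are my choices for illustration):
```python
import numpy as np

rng = np.random.default_rng(0)

S = np.arange(1, 101)          # sample space: 100 numbered balls

def X(s):
    return s                   # X records the number on the ball

def random_sample(n):
    # One point (s_1, ..., s_n) of S^n under the product measure P^n;
    # the independent coordinates correspond to drawing with replacement.
    s = rng.choice(S, size=n, replace=True)
    return X(s)                # X applied coordinate-wise

draws = random_sample(100)
print(draws[19])               # the value of X_20 on this outcome
```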
    Knowing what a random sample is allows us to describe the coveted normal distribution very elegantly:
    (central limit theorem) Let \(X\) be an RV with mean \(\mu\) and variance \(\sigma^2\). For each \(n\), let \(X_1,\ldots, X_n\) be a random sample. Consider the random variable \(L_n=\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^n X_i-\mu\right)\). Then \(L_n\) converges in distribution to a normal distribution. That is, \[\lim_{n\to \infty} f_{L_n}(x)=f_{0,\sigma}(x)\] where \(f_{0,\sigma}\) is the density function of a normal distribution with mean 0 and variance \(\sigma^2\).
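    To see the theorem in action, here is a small simulation sketch (numpy, the fair-die RV, and the sample sizes are my choices for illustration, not from the text):
```python
import numpy as np

rng = np.random.default_rng(1)

# X = one roll of a fair die: mu = 3.5, sigma^2 = 35/12
mu, sigma2 = 3.5, 35 / 12

n, reps = 1_000, 100_000
rolls = rng.integers(1, 7, size=(reps, n))    # reps realizations of a random sample X_1, ..., X_n
L = np.sqrt(n) * (rolls.mean(axis=1) - mu)    # L_n = sqrt(n) * ((1/n) * sum X_i - mu)

print(L.mean(), L.var())      # close to 0 and sigma^2 = 2.9166...
print(np.mean(L <= 0))        # close to 0.5, as for a centered normal
```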

    We then went on to discuss statistical inference: statistics is about trying to make inferences in a specific setting. We are given a set \(S\) (the sample space) together with a subset \(\Delta\) of data, and use this to define a probability measure \(P\) on \(S\):\[(S,\Delta)\Longrightarrow (S,P)\Longrightarrow \textrm{ inference }\] Example 5.2.1 yielded some examples of typical inferences we make in the field of statistics (illustrated in the sketch after this list):
    • predict a typical value (when will the machine typically break down?)
    • find a typical set with a high probability (after how long have 95% of the machines broken down?)
    • determine whether an event is likely to happen (how likely is it that the machine breaks down after 7 years?)
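    Purely as an illustration, here is a Python sketch answering all three questions from simulated breakdown times (the exponential lifetime model, its 5-year scale, and the sample size are my assumptions, not from the text):
```python
import numpy as np

rng = np.random.default_rng(1)
lifetimes = rng.exponential(scale=5.0, size=1_000)  # hypothetical breakdown times (years)

print(lifetimes.mean())              # a typical value: average time to breakdown
print(np.quantile(lifetimes, 0.95))  # time by which 95% of machines have broken down
print(np.mean(lifetimes > 7.0))      # how likely a machine is to last past 7 years
```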
    To formalize the idea of using data to deduce a probability measure, we introduced statistical models:
    A statistical model \(\Omega\Longrightarrow S\) consists of a family of probability measures \(\{P_\theta\}_{\theta\in \Omega}\) such that \(\theta\longmapsto P_\theta\) is injective. Whenever necessary, we also endow it with a chosen value \(\theta \in \Omega\), the true parameter.
    I gave an example expanding on 5.3.1: if \(S\) consists of 100 balls, some white and some black, then we can consider the number of black ones as an unknown parameter \(\theta \in \{1,\ldots , 100\}\). We can then look at the RV \(X: S^2\longrightarrow \{w,b\}^2\) which draws out 2 balls and records the resulting colors. The probability distribution on \(\{w,b\}^2\) becomes:
    \[\begin{array}{c|cccc} \textrm{Outcome} & (w,w) & (w,b) & (b,w) & (b,b)\\ \hline \textrm{Prob.} & \frac{(100-\theta)(99-\theta)}{9900} & \frac{(100-\theta)\theta}{9900} & \frac{\theta(100-\theta)}{9900} & \frac{\theta(\theta-1)}{9900} \end{array}\]
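    A short Python sketch of this model (the choice \(\theta=30\) is an arbitrary illustration) computing the table and checking that the four probabilities sum to 1:
```python
from fractions import Fraction

def two_draw_distribution(theta, N=100):
    """Distribution of the colors when drawing 2 of N balls without
    replacement, theta of them black and N - theta white."""
    w, b = N - theta, theta
    total = N * (N - 1)                      # = 9900 for N = 100
    return {
        ("w", "w"): Fraction(w * (w - 1), total),
        ("w", "b"): Fraction(w * b, total),
        ("b", "w"): Fraction(b * w, total),
        ("b", "b"): Fraction(b * (b - 1), total),
    }

dist = two_draw_distribution(theta=30)
print(dist)
print(sum(dist.values()))                    # sanity check: sums to 1
```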
    Chapter 5 deals with the simplest form of inference, descriptive statistics:
    Let \(\Delta \subset S\). The descriptive probability \(P_\theta\) on \(S\) is constructed as follows (a code sketch follows the list):
    1. define \(P_\Delta(x)=\frac{1}{\vert \Delta \vert }\) for each \(x \in \Delta\);
    2. let \(P_\theta={P_\Delta}_X\), where \(X:\Delta\longrightarrow S\) is the inclusion.
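    In code, a minimal sketch of this construction (treating \(S\) as a finite set and \(\Delta\) as a list of distinct data points; all names are my own):
```python
def descriptive_probability(delta, S):
    """Pushforward of the uniform measure on delta along the inclusion
    delta -> S: mass 1/|delta| on each data point, 0 elsewhere."""
    delta = set(delta)
    return {x: (1 / len(delta) if x in delta else 0.0) for x in S}

P = descriptive_probability([2, 3, 5, 7], S=range(1, 11))
print(P[5], P[6])   # 0.25 0.0
```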
    Given a statistical concept and a dataset \(\Delta\), we call the sample version the result of applying the corresponding concept to the above definition of \(P_\theta\). For example,
    1. since any element has probability either \(\frac{1}{\vert \Delta\vert}\) or 0, depending on whether it lies in \(\Delta\), the sample density \(f_\Delta\) for a dataset \(\Delta\subset S\) is given by \[f_\Delta(x)=\frac{1}{\vert \Delta\vert} \cdot 1_\Delta(x)\]
    2. This also means that the cdf satisfies \[F_\Delta(x)=P_\theta(y \le x)=\frac{\vert \{y \in \Delta \mid y \le x\}\vert}{\vert \Delta \vert}\]
    3. We can then use the sample cdf to compute the sample quantiles \(\min\{x \mid p\le F_\Delta(x)\}\) simply by arranging all datapoints \(x_1\le x_2\le \ldots \le x_n\), in which case the \(p\)-th quantile would be \(x_{\lceil np\rceil}\) (where \(\lceil x\rceil\) denotes the smallest integer at least \(x\))
    4. the sample mean (assuming \(S\) is finite as well) is given by \[\mu=\sum_{x\in S} x\cdot P_\theta(x)=\sum_{x\in \Delta}x\cdot \frac{1}{\vert \Delta \vert}=\frac{1}{\vert \Delta \vert }\sum_{x\in \Delta} x \]
    5. We make a final remark: it would seem plausible that the sample variance be defined as \[\frac{1}{\vert \Delta \vert}\sum_{x\in \Delta} (x-\mu)^2\] However, one usually divides by \(\vert \Delta\vert -1\) instead. Later we will explain the rationale behind this, but I gave a very broad-strokes idea in class: essentially, you are computing the distances from the \(x\)-values to a \(\mu\) that was itself computed from the same data, so the deviations \(x-\mu\) satisfy one linear constraint (they sum to 0) and effectively live in one dimension lower; as such it makes sense to divide by \(\vert \Delta\vert -1\) instead. (See the sketch below.)
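    To tie these sample versions together, here is a minimal numpy sketch (the toy dataset and names are my own) computing the sample cdf, a sample quantile, the sample mean, and both variance conventions:
```python
import numpy as np

delta = np.array([2.0, 3.0, 5.0, 7.0, 11.0])   # a toy dataset Delta
n = len(delta)

F = lambda x: np.mean(delta <= x)              # sample cdf F_Delta
print(F(5.0))                                  # 3/5 = 0.6

p = 0.5
xs = np.sort(delta)
print(xs[int(np.ceil(n * p)) - 1])             # p-th sample quantile x_{ceil(np)}, 1-indexed

mu = delta.mean()                              # sample mean
print(mu)                                      # 5.6
print(np.var(delta))                           # divides by |Delta|      (ddof=0)
print(np.var(delta, ddof=1))                   # divides by |Delta| - 1  (ddof=1)
```
    numpy's ddof parameter switches between the two conventions; ddof=1 is Bessel's correction, which makes the sample variance an unbiased estimator.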