06 May, 2022
What is a statistical model?
What is a statistical model?
Mathematical model
Statistical model
What is a statistical model?
A random variable is one whose values depend on a set of random events and are described by a probability distribution
What is a statistical model?
embodies a data generation process along with the distributional assumptions underlying this generation
incorporates uncertainty
\(response = model + error\)
What is the purpose of statistical modelling?
How do we estimate model parameters? - \(y_i \sim{} \beta_0 + \beta_1 x_i\)
What criterion do we use to assess best fit?
If we assume \(y_i\) is drawn from a normal (gaussian) distribution…
y | x |
---|---|
9.64 | 1 |
3.79 | 2 |
11.00 | 3 |
27.88 | 4 |
32.84 | 5 |
32.56 | 6 |
37.84 | 7 |
29.86 | 8 |
45.05 | 9 |
47.65 | 10 |
\[ y_i = \beta_0\times 1 + \beta_1\times x_i \]
\[ \underbrace{\begin{bmatrix} y_1\\ y_2\\ y_3\\ ...\\ y_i \end{bmatrix}}_{Y} = \underbrace{\begin{bmatrix} 1&x_1\\ 1&x_2\\ 1&x_3\\ ...\\ 1&x_i \end{bmatrix}}_{\boldsymbol{X}} \underbrace{\vphantom{\begin{bmatrix} y_1\\ y_2\\ y_3\\ ...\\ y_i \end{bmatrix}}\begin{bmatrix} \beta_0&\beta \end{bmatrix}}_{\boldsymbol{\beta}} \]
\[ \begin{align} Y =& \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\\ \boldsymbol{\varepsilon} =& Y - \boldsymbol{X}\boldsymbol{\beta} \end{align} \]
Minimize resid sum of squares (\(\boldsymbol{\varepsilon}^T\boldsymbol{\varepsilon}\))
\[ \tiny \begin{bmatrix} \varepsilon_1&\varepsilon_2&\varepsilon_3&...&\varepsilon_n \end{bmatrix} \begin{bmatrix} \varepsilon_1\\ \varepsilon_2\\ \varepsilon_3\\ ...\\ \varepsilon_n\\ \end{bmatrix} = \begin{bmatrix} \varepsilon_1\times\varepsilon_1 + \varepsilon_2\times\varepsilon_2 + \varepsilon_3\times\varepsilon_3 + ... + \varepsilon_n\times\varepsilon_n \end{bmatrix} \]
\[ \begin{align} \boldsymbol{\varepsilon}^T\boldsymbol{\varepsilon} =& (Y - \boldsymbol{X}\boldsymbol{\beta})^T(Y - \boldsymbol{X}\boldsymbol{\beta})\\ \boldsymbol{\beta} =& (\boldsymbol{X}^T\boldsymbol{X})^{-1} \boldsymbol{X}^T Y \end{align} \]
## Generate the model matrix X = model.matrix(~1 + x, data = dat) ## Solve for beta beta = solve(t(X) %*% X) %*% t(X) %*% dat$y beta
## [,1] ## (Intercept) 2.650667 ## x 4.574606
Provided data (and residuals):
\(f(x\mid\mu, \sigma^2) = \frac{1}{\sqrt{2\sigma^2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}\)
What do we do, if the data do not satisfy the assumptions?
\[ \begin{align} y_i =& \beta_0 + \beta_1 x_i + \varepsilon_i\\ OR\\ y_i \sim{}& N(\mu_i, \sigma^2)\\ \mu_i =& \beta_0 + \beta_1 x_i \end{align} \]
Type | Example | Distribution | Range |
---|---|---|---|
Measurements | length, weight | Gaussian | real, \(-\infty\le x\ge\infty\) |
logNormal | real, \(0< x\ge\infty\) | ||
Gamma | real, \(0< x\ge\infty\) | ||
Counts | Abundance | Poisson | discrete, \(0\ge x\le\infty\) |
Negative Binomial | discrete, \(0\ge x\le\infty\) | ||
Binary | Presence/Absence | Binomial | discrete, \(x=0,1\) |
Proportions | Ratio | Binomial | discrete, \(0\ge x\le n\) |
Percentages | Percent cover | Binomial | real, \(0\le x\ge 1\) |
Beta | real, \(0< x >1\) | ||
Categorical | Severity | Ordinal | discrete, \(x=0,1\) |
What about density (count / area)?
Type | Example | Distribution | Range |
---|---|---|---|
Measurements | length, weight | Gaussian | real, \(-\infty\le x\ge\infty\) |
logNormal | real, \(0< x\ge\infty\) | ||
Gamma | real, \(0< x\ge\infty\) | ||
Counts | Abundance | Poisson | discrete, \(0\ge x\le\infty\) |
Negative Binomial | discrete, \(0\ge x\le\infty\) | ||
Binary | Presence/Absence | Binomial | discrete, \(x=0,1\) |
Proportions | Ratio | Binomial | discrete, \(0\ge x\le n\) |
Percentages | Percent cover | Binomial | real, \(0\le x\ge 1\) |
Beta | real, \(0< x> 1\) | ||
Categorical | Severity | Ordinal | discrete, \(x=0,1\) |
zero-bound variables with large var.
\(f(x\mid s, a) = \frac{1}{(s^a\Gamma(a))} x^(a-1) e^{-(x/s)}\)
\(a=shape, s=scale\)
\(\mu = as, \sigma^2 = as^2\)
Type | Example | Distribution | Range |
---|---|---|---|
Measurements | length, weight | Gaussian | real, \(-\infty\le x\ge\infty\) |
logNormal | real, \(0< x\ge\infty\) | ||
Gamma | real, \(0< x\ge\infty\) | ||
Counts | Abundance | Poisson | discrete, \(0\ge x\le\infty\) |
Negative Binomial | discrete, \(0\ge x\le\infty\) | ||
Binary | Presence/Absence | Binomial | discrete, \(x=0,1\) |
Proportions | Ratio | Binomial | discrete, \(0\ge x\le n\) |
Percentages | Percent cover | Binomial | real, \(0\le x\ge 1\) |
Beta | real, \(0< x> 1\) | ||
Categorical | Severity | Ordinal | discrete, \(x=0,1\) |
What about abundance?
Count data
\(f(x\mid \lambda) = \frac{e^{-\lambda}\lambda^x}{x!}\)
\(\mu = \sigma^2 = \lambda = df\)
\(\theta~(dispersion) = \frac{\sigma^2}{\mu} = 1\)
Count data
\(f(x\mid \mu, \omega) = \frac{\Gamma(x + \omega)}{\Gamma(\omega)x!}\times\frac{\mu^x\omega^\omega}{(\mu + \omega)^{\mu+\omega}}\)
\(\theta~(dispersion) = 1/\omega\)
\(\omega = -\frac{\mu^2}{\mu - \sigma^2} \hspace{1cm} \theta = 0~(when~\omega=\infty)\)
Type | Example | Distribution | Range |
---|---|---|---|
Measurements | length, weight | Gaussian | real, \(-\infty\le x\ge\infty\) |
logNormal | real, \(0< x\ge\infty\) | ||
Gamma | real, \(0< x\ge\infty\) | ||
Counts | Abundance | Poisson | discrete, \(0\ge x\le\infty\) |
Negative Binomial | discrete, \(0\ge x\le\infty\) | ||
Binary | Presence/Absence | Binomial | discrete, \(x=0,1\) |
Proportions | Ratio | Binomial | discrete, \(0\ge x\le n\) |
Percentages | Percent cover | Binomial | real, \(0\le x\ge 1\) |
Beta | real, \(0< x> 1\) | ||
Categorical | Severity | Ordinal | discrete, \(x=0,1\) |
What about binary and proportions?
Proportions or Presence/absence
\(f(x\mid n, p) = \binom{n}{p}p^x(1-p)^{n-x}\)
\(\mu = np, \sigma^2 = np(1-p)\)
for presence/absence \(n=1\)
Type | Example | Distribution | Range |
---|---|---|---|
Measurements | length, weight | Gaussian | real, \(-\infty\le x\ge\infty\) |
logNormal | real, \(0< x\ge\infty\) | ||
Gamma | real, \(0< x\ge\infty\) | ||
Counts | Abundance | Poisson | discrete, \(0\ge x\le\infty\) |
Negative Binomial | discrete, \(0\ge x\le\infty\) | ||
Binary | Presence/Absence | Binomial | discrete, \(x=0,1\) |
Proportions | Ratio | Binomial | discrete, \(0\ge x\le n\) |
Percentages | Percent cover | Binomial | real, \(0\le x\ge 1\) |
Beta | real, \(0< x> 1\) | ||
Categorical | Severity | Ordinal | discrete, \(x=0,1\) |
What about percentages?
Continuous between 0 and 1
\(f(x\mid a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} x^{a-1} (1-x)^{b-1}\)
\(\mu = \frac{a}{a+b}, \sigma^2 = \frac{ab}{(a + b)^2.(a + b + 1)}\)
\[Y = \beta_0 + \beta_1x_1~+~...~+~\beta_px_p~+~e\] \[\underbrace{g(\mu)}_{Link~~function} = \underbrace{\beta_0 + \beta_1x_1~+~...~+~\beta_px_p}_{Systematic}\]
Linear model is just a special case
\[Y = \beta_0 + \beta_1x_1~+~...~+~\beta_px_p~+~e\] \[\underbrace{g(\mu)}_{Link~~function} = \underbrace{\beta_0 + \beta_1x_1~+~...~+~\beta_px_p}_{Systematic}\]
Response variable | Probability distribution | Link function | Model name |
---|---|---|---|
Continuous measurements | Gaussian | identity: \(\mu\) | Linear regression |
Gamma | inverse: \(\frac{1}{\mu}\) | Gamma regression | |
log: \(log(\mu)\) | |||
Counts | Poisson | log: \(log(\mu)\) | Poisson regression |
log-linear model | |||
Negative binomial | log: \(log(\mu)\) | Negative binomial regression | |
Quasi-poisson | log: \(log(\mu)\) | Poisson regression | |
Binary, proportions | Binomial | logit: | Logistic regression |
\(log\left(\frac{\pi}{1-\pi}\right)\) | |||
probit: \(\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\alpha+\beta.X} exp\left(-\frac{1}{2}Z^2\right)dZ\) | Probit regression | ||
Complimentary log-log: \(log(-log(1-\pi))\) | Logistic regression | ||
Percentages | Beta | logit: \(log\left(\frac{\pi}{1-\pi}\right)\) | Beta regression |
Gaussian: \[ P(x_i|\mu, \sigma) = \frac{1}{\sqrt{\sigma^2 2\pi}} exp^{-\frac{1}{2}((x_i-\mu)/\sigma)^2} \]
Likelihood:
\[ L(X|\theta) = \prod^n_{i=1} f'(x_i|\theta) \]
Product of the likelihood of each observation given parameters.
Log-likelihood:
\[ LL(X|\theta) = \sum^n_{i=1} log(f'(x_i|\theta)) \]
## mu sigma LL ## 4252 27.87879 14.24242 40.76342
Empirical mean and sd:
## [1] "Mean= 27.811 SD= 15.0303407590558"
Nelder-Mead algorithm
optim(par=c(20,12), LL.gaus, x=dat$y)
## $par ## [1] 27.81143 14.25771 ## ## $value ## [1] 40.76329 ## ## $counts ## function gradient ## 53 NA ## ## $convergence ## [1] 0 ## ## $message ## NULL
Newton-Raphson method
nlm(LL.gaus, c(20,12), x=dat$y, hessian=TRUE, gradtol=1e-03)
## $minimum ## [1] 40.76329 ## ## $estimate ## [1] 27.81205 14.25769 ## ## $gradient ## [1] 5.249938e-05 -1.312170e-04 ## ## $hessian ## [,1] [,2] ## [1,] 4.919280e-02 -1.686155e-05 ## [2,] -1.686155e-05 9.836416e-02 ## ## $code ## [1] 1 ## ## $iterations ## [1] 9
optim(par=c(20,12), LL.gaus, x=dat$y, method='SANN')
## $par ## [1] 27.82430 14.25626 ## ## $value ## [1] 40.7633 ## ## $counts ## function gradient ## 10000 NA ## ## $convergence ## [1] 0 ## ## $message ## NULL
\[ \boldsymbol{\beta} = (\boldsymbol{X}^T\boldsymbol{W}\boldsymbol{X}^{-1})\boldsymbol{X}^T\boldsymbol{W}\boldsymbol{z} \]
glm(y~1, data=dat)
## ## Call: glm(formula = y ~ 1, data = dat) ## ## Coefficients: ## (Intercept) ## 27.81 ## ## Degrees of Freedom: 9 Total (i.e. Null); 9 Residual ## Null Deviance: 2033 ## Residual Deviance: 2033 AIC: 85.53