Generalised linear models

1 Preparations

Load the necessary libraries

library(tidyverse)   # for data wrangling and plotting
library(gganimate)   # for animations
library(gridExtra)   # for additional plotting routines
library(ggfortify)   # for regression diagnostics (autoplot)
library(DHARMa)      # for simulated residuals
library(performance) # for model diagnostics
library(see)         # for model diagnostics
library(glmmTMB)     # for GLM(M)s
library(gmodels)     # for additional statistics
library(MASS)        # for negative binomial GLMs

2 Preamble
Statistics is a branch of mathematics. Therefore it is inevitable that this tutorial WILL visit numerous mathematical concepts and present mathematical nomenclature. In an attempt to contain the dread felt by readers who have long since seen off the torturous high school mathematics lessons, I will try to introduce these elements as gently as possible and will attempt to maintain a consistent use of symbology throughout. Keep in mind that although it is possible to fit linear models without understanding the underlying mathematics, a more solid grounding will help ensure that models are fit and interpreted correctly.
3 Statistical models
Before we get into the details of statistical models, it is important that we establish the purpose of statistical models. Broadly speaking, the purposes of statistical models are to:
- estimate (quantify) the effects (associations, relationships, impacts) of various potential influences on a system. For example, to quantify the magnitude of change in total fish biomass resulting from a ban on fishing activity in an area. Another example might be estimating the rate at which clutch size increases for every unit increase in body size in a species of bird.
- test inferences about specific a priori hypotheses. These are probabilistic conclusions about the strength of evidence in support of a hypothesis. For example, whether there is “significant” evidence that fishing exclusion zones increase fish biomass.
- guide decision making and prediction about future outcomes under different scenarios. For example, forecasting fish biomass increases in an area slated to become a new fishing exclusion zone.
It should be noted that different statistical procedures differ in the degree to which they support the above purposes and so it is important to match the statistical procedure to the research question and data context. For example, linear models (the subject of this tutorial) are intentionally low dimensional representations of a complex system. By low dimensional I mean that a linear model should be restricted to just a small number of parameters (or dimensions, such as the effects of fishing exclusion) out of an almost infinite number of effects (dimensions) that could affect fish biomass. In so doing, the focus is on estimating the effect of these potential influences over and above (that is, marginalising) all other influences. Whilst such an approach is very effective for estimation, inference testing and predicting the relative change in a response, the low dimensionality makes such an approach unsuitable for absolute prediction (of, say, the biomass of fish at a specific location at a specific point in time).
On the other hand, machine learning models trained on a very large set of cases and potential influences can often predict new cases with very high accuracy (since they incorporate many dimensions), yet are often unable to attribute which exact influences drive the predictions and by how much. That is, these approaches are very good at prediction, but unsuitable for estimation and inference testing.
4 Overview of linear models
Correlation and regression are techniques used to examine associations and relationships between continuous variables collected on the same set of sampling or experimental units. Specifically, correlation is used to investigate the degree to which variables change or vary together (covary). In correlation, there is no distinction between dependent (response) and independent (predictor) variables and there is no attempt to prescribe or interpret the causality of the association.
For example, there may be an association between arm and leg length in humans, whereby individuals with longer arms generally have longer legs. Neither variable directly causes the change in the other. Rather, they are both influenced by other variables to which they both have similar responses. Hence correlations apply mainly to survey designs where each variable is measured rather than specifically set or manipulated by the investigator.
Regression is used to investigate the nature of a relationship between variables in which the magnitude and changes in one or more variables (known as the independent, predictors or covariates) are assumed to be directly responsible for the magnitude and changes in another variable (dependent or response variable). Regression analyses apply to both survey and experimental designs. Whilst for experimental designs, the direction of causality is established and dictated by the experiment, for surveys the direction of causality is somewhat discretionary and based on prior knowledge or intuition.
For example, although it is possible that ambient temperature affects the growth rate of a species of plant, the reverse is not logical. As an example of regression, we could experimentally investigate the relationship between algal cover on rocks and molluscan grazer density by directly manipulating the density of snails in specifically controlled plots and measuring the cover of algae therein. Any established relationship must be driven by snail density, as this was the controlled variable. Alternatively, the relationship could be investigated via a field survey in which the density of snails and cover of algae could be measured from random locations across a rock platform. In this case, the direction of causality (or indeed the assumption of causality) may be more difficult to defend.
As stated above, one of the main goals of fitting a statistical model is to use a sample of observed data to gain some insights into the presence or nature of relationships between a response and predictors in a defined population. This is achieved by first proposing one or more (deterministic) mathematical representations of possible relationships and then challenging those representations with data.
For example, we might propose that an increase in a predictor (\(x\)) will be associated with a linear increase in a response (\(y\)).
The general mathematical equation for a straight line is:
\[ y_i = \beta_0 + \beta x_i \]
where:
- \(y_i\) represents a vector (think single variable) of response values (the subscript \(i\) is a way to indicate that there are multiple values),
- \(x_i\) represents the value of the \(ith\) observed predictor value.
- \(\beta_0\) represents a single y-intercept (expected value of \(y\) when \(x\)=0),
- \(\beta\) represents the slope of the line (the rate of change in \(y\) associated with a one unit change in \(x\)).
The above expression can be re-written as: \[ y_i = \beta_0\times 1 + \beta_1\times x_i \]
which we can visualize in vector/matrix form as: \[ \underbrace{\begin{bmatrix} y_1\\ y_2\\ y_3\\ ...\\ y_i \end{bmatrix}}_{Y} = \underbrace{\begin{bmatrix} 1&x_1\\ 1&x_2\\ 1&x_3\\ ...\\ 1&x_i \end{bmatrix}}_{\boldsymbol{X}} \underbrace{\begin{bmatrix} \beta_0\\ \beta_1 \end{bmatrix}}_{\boldsymbol{\beta}} \]
So given a sample of observed response (\(y\)) and predictor (\(x\)) values, we want to be able to estimate the two unknowns (\(\beta_0\) and \(\beta\)) in the above mathematical expression. If we assume that there is no uncertainty in the data, we can easily calculate the necessary value of \(\beta_0\) and \(\beta\) using any two data pairs and solving simultaneous equations.
\[ \begin{align} 7 =& \beta_0 + \beta 1~~and~~12 = \beta_0 + \beta 2\\ \beta_0 =& 7 - \beta ~~(\text{isolate }\beta_0\text{ from the first equation})\\ \beta =& 5 ~~(\text{solve for }\beta)\\ \beta_0 =& 2 ~~(\text{substitute }\beta\text{ and solve for }\beta_0)\\ \end{align} \]
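For illustration, the same pair of simultaneous equations can be solved numerically in R using base R's solve(); this is just a sketch using the data pairs (1, 7) and (2, 12) from above.

## two equations in two unknowns: y = b0 + b1 * x for the pairs (1, 7) and (2, 12)
X <- matrix(c(1, 1,
              1, 2), nrow = 2, byrow = TRUE)
y <- c(7, 12)
solve(X, y)   # returns b0 = 2 and b1 = 5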
The above model is a deterministic or mathematical model - it suggests complete confidence in the relationship. As already indicated, the point of statistical modelling is to formulate the equation in the presence of a sample of data drawn from the actual population. The models themselves are always a low-dimensional representation of reality. They cannot hope to capture all the complexity contained in the system and nor do they want to. They just want to provide insights into a very small fraction of this complexity.
Since a set of observed responses is likely to be the result of a large (if not infinite) number of interacting processes, it is highly unlikely that all observed responses will respond to the predictor in exactly the same way. Therefore, all data contain noise. In modelling, we want to see if there is any detectable signal amongst all the noise.
Statistical models include both a deterministic component (a mathematical expression that relates a response to one or more predictors) as well as a stochastic component (representing uncertainty). For example, the following figures highlight the distinction between a purely deterministic (mathematical) model (left hand figure) and a statistical model (right hand figure).
The statistical model captures the stochastic component (\(\varepsilon\)) which represents the degree to which each observed value differs from what would be expected in the absence of any noise (uncertainty). These differences between observed and expected values are called residuals.
So one way to write out this linear model would be:
\[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \]
Again, in vector and matrix form, this would be:
\[ \underbrace{\begin{bmatrix} y_1\\ y_2\\ y_3\\ ...\\ y_i \end{bmatrix}}_{Y} = \underbrace{\begin{bmatrix} 1&x_1\\ 1&x_2\\ 1&x_3\\ ...\\ 1&x_i \end{bmatrix}}_{\boldsymbol{X}} \underbrace{\begin{bmatrix} \beta_0\\ \beta_1 \end{bmatrix}}_{\boldsymbol{\beta}} + \underbrace{\begin{bmatrix} \varepsilon_1\\ \varepsilon_2\\ \varepsilon_3\\ ...\\ \varepsilon_i \end{bmatrix}}_{\boldsymbol{\varepsilon}} \]
- \(Y\) represents a column vector of the observed data
- \(\boldsymbol{X}\) represents what is called the model matrix. In this case, it comprises a column of 1s (for the intercept) and a column containing the observed predictor values. In matrix algebra, matrices are multiplied, so the column of ones ensures that the intercept term has something to be multiplied by.
- \(\boldsymbol{\beta}\) represents a vector of parameters to be estimated. In this case, there are two (\(\beta_0\) and \(\beta_1\)).
- \(\boldsymbol{\varepsilon}\) represents a column vector of residuals.
This could all be summarised as:
\[ Y = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \]
To help with naming conventions, I will present the above matrix notation again, yet with more general labels:
\[ \underbrace{\begin{pmatrix} 0\\1\\2\\4\\7\\10 \end{pmatrix}}_\text{Response values} = \underbrace{\begin{pmatrix} 1&1\\1&2\\1&3\\1&4\\1&5\\1&6 \end{pmatrix}}_\text{Model matrix} \times \underbrace{\begin{pmatrix} \beta_0\\\beta_1 \end{pmatrix}}_\text{Parameter vector} + \underbrace{\begin{pmatrix} \varepsilon_1\\\varepsilon_2\\\varepsilon_3\\\varepsilon_4\\\varepsilon_5\\\varepsilon_6 \end{pmatrix}}_\text{Residual vector} \]
The model matrix (also known as the design matrix) contains the “weights” for each parameter for each of the sampling units. The above matrix notation also isolates and thus highlights the “unknown” regression parameters (\(\beta_0\) and \(\beta_1\)) from the observed data from which they are estimated.
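In R, the model matrix implied by a model formula can be inspected directly with model.matrix(); a quick sketch using some arbitrary predictor values:

## the model matrix for a simple linear regression formula: a column of 1s
## (the intercept) and a column containing the predictor values
model.matrix(~x, data = data.frame(x = 1:6))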
4.1 Categorical predictors
The model structure for linear models containing a single categorical predictor variable (known as a factor) with two or more treatment levels (groups) is similar in form to the multiple linear regression model (listed immediately above) with the overall mean (\(\mu\)) replacing the y-intercept (\(\beta_0\)). The factor levels (groups) are represented in the model by binary (containing only 0s and 1s, see table below) indicator (or 'dummy') variables and their associated estimable parameters (\(\beta_1,~\beta_2,~...\)).
For a data set comprising of \(p\) groups and \(n\) replicates within each group, the linear model is: \[y_{ij} = \mu + \beta_1(dummy_1)_{ij} + \beta_2(dummy_2)_{ij} + .... + \varepsilon_{ij}\] where \(j\) represents the treatment levels (from 1 to \(p\)) and \(i\) represents the set of replicates (from 1 to \(n\)) within the \(j^{th}\) group. Hence, \(y_{ij}\) represents the \(i^{th}\) observation of the response variable within the \(j^{th}\) group and \((dummy_1)_{ij}\) represents the dummy code for the \(i^{th}\) replicate within the \(j^{th}\) group of the first dummy variable (first treatment level).
Raw data
Y | Group |
---|---|
2 | G1 |
3 | G1 |
4 | G1 |
6 | G2 |
7 | G2 |
8 | G2 |
10 | G3 |
11 | G3 |
12 | G3 |
Dummy coded data
y | \(dummy_1\) | \(dummy_2\) | \(dummy_3\) |
---|---|---|---|
2 | 1 | 0 | 0 |
3 | 1 | 0 | 0 |
4 | 1 | 0 | 0 |
6 | 0 | 1 | 0 |
7 | 0 | 1 | 0 |
8 | 0 | 1 | 0 |
10 | 0 | 0 | 1 |
11 | 0 | 0 | 1 |
12 | 0 | 0 | 1 |
The dummy variable for a particular treatment level contains all 0s except in the rows that correspond to observations that received that treatment level. The table above illustrates the dummy coding for a single factor with three levels ('G1', 'G2', 'G3'), each with three replicates. Note that a linear model of the form:
\[y_{ij}=\beta_0+\beta_1(dummy_1)_{ij} + \beta_2(dummy_2)_{ij} +\beta_3(dummy_3)_{ij} + \varepsilon_{ij}\]
is over-parameterized (as we are attempting to estimate four parameters from three populations; see the tutorial on estimation).
An effects model for a factor with \(p\) groups, will have \(p+1\) parameters (the overall mean \(\mu\) plus the \(p\) \(\beta\) parameters), and thus the linear effects model is considered to be over-parameterized. Given that \(\beta_j=\mu_j - \mu\), it is only possible to estimate \(p-1\) orthogonal (independent) parameters. For example, once \(\mu\) and \(p-1\) of the effects parameters have been estimated, the final effects parameter is no longer free to vary and therefore cannot be independently estimated. Likewise, if the full linear model contains as many dummy variables as there are treatment groups, then it too is over-parameterized.
In order to obtain parameter estimates, the model must be reduced to a total of \(p\) parameters. Over-parameterization can be resolved by one of two alternative parameterizations:
means parameterization - removing one of the parameters from the effects model (either the overall mean (\(\mu\)) or one of the treatment effects (\(\beta_j\)) parameters - a procedure rarely used in a frequentist framework in biology). When it is the overall mean that is removed, then each of the regression parameters represent the mean of a group.
\[y_{ij} = \beta_j+\varepsilon_{ij}, \hspace{2cm}\text{where}~j= 1:\text{number of levels of the factor}\] \[\begin{pmatrix} 2\\3\\4\\6\\7\\8\\10\\11\\12 \end{pmatrix} = \begin{pmatrix} 1&0&0\\1&0&0\\1&0&0\\0&1&0\\0&1&0\\0&1&0\\0&0&1\\0&0&1\\0&0&1 \end{pmatrix} \times \begin{pmatrix} \beta_1\\\beta_2\\\beta_3 \end{pmatrix} + \begin{pmatrix} \varepsilon_1\\\varepsilon_2\\\varepsilon_3\\\varepsilon_4\\\varepsilon_5\\\varepsilon_6\\\varepsilon_7\\\varepsilon_8\\\varepsilon_9 \end{pmatrix} \]
effects parameterization - generating a new set (\(p-1\)) of effects parameters (\(\beta_{j}\), where \(j\) represents the set of orthogonal parameters from 1 to \(p-1\)), each of which represents a linear combination of groups rather than a single group effect. That is, each \(\beta\) can include varying contributions from any number of the groups and is not restricted to a single group contrast (\(\beta_j=\mu_j-\mu\)). For example, one of the parameters might represent the difference in means between two groups or the difference in means between one group and the average of two other groups. \[y_{ij} = \beta_0+\beta_j+\varepsilon_{ij}, \hspace{2cm}\text{where}~j= 1:(p-1)\] In the effects parameterization, \(\beta_0\) typically represents the mean of one of the groups (a reference group) and each of the \(\beta_j\) effects represents the difference between a subsequent group mean and the reference group.
\[\begin{pmatrix} 2\\3\\4\\6\\7\\8\\10\\11\\12 \end{pmatrix} = \begin{pmatrix} 1&0&0\\1&0&0\\1&0&0\\1&1&0\\1&1&0\\1&1&0\\1&0&1\\1&0&1\\1&0&1 \end{pmatrix} \times \begin{pmatrix} \beta_0\\\beta_1\\\beta_2 \end{pmatrix} + \begin{pmatrix} \varepsilon_1\\\varepsilon_2\\\varepsilon_3\\\varepsilon_4\\\varepsilon_5\\\varepsilon_6\\\varepsilon_7\\\varepsilon_8\\\varepsilon_9 \end{pmatrix} \]
The reduced number of effects parameters are defined through the use of a matrix of contrast coefficients. Note, the new set of effects parameters should incorporate the overall relational effects of each of the groups equally such that each group maintains an equal contribution to the overall model fit. The contrast coefficients are responsible for determining how the model is re-parameterized into an orthogonal model matrix.
A number of pre-fabricated contrast matrices exist (to yield commonly used model matrices), each of which estimates a different set of specific comparisons between treatment combinations. We will explore each of these with Motivating example 3 in subsequent sections on fitting linear models.
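As a brief taste, the following sketch shows some of the contrast matrices built into R for a three-level factor (treatment contrasts being R's default for unordered factors), along with the resulting model matrix for the example data above:

contr.treatment(3)  # each level contrasted against the first (reference) level - R's default
contr.sum(3)        # sum-to-zero (deviation) contrasts
contr.helmert(3)    # each level contrasted against the mean of the preceding levels
## model matrix for a three-level factor (G1, G2, G3) with three replicates per level
model.matrix(~Group, data = data.frame(Group = gl(3, 3, labels = c("G1", "G2", "G3"))))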
5 Example data
This tutorial will blend theoretical discussions with actual calculations and model fits. I believe that by bridging the divide between theory and application, we all gain better understanding. The applied components of this tutorial will be motivated by numerous fabricated data sets. The advantage of simulated data over real data is that with simulated data, we know the ‘truth’ and can therefore gauge the accuracy of estimates.
The motivating examples are:
- Example 1 - simulated samples drawn from a Gaussian (normal) distribution reminiscent of data collected on measurements (such as body mass)
- Example 2 - simulated Gaussian samples drawn from three different populations representing three different treatment levels (e.g. body masses of three different species)
- Example 3 - simulated samples drawn from a Poisson distribution reminiscent of count data (such as number of individuals of a species within quadrats)
- Example 4 - simulated samples drawn from a Negative Binomial distribution reminiscent of over-dispersed count data (such as number of individuals of a species that tends to aggregate in groups)
- Example 5 - simulated samples drawn from a Bernoulli (binomial with \(n = 1\)) distribution reminiscent of binary data (such as the presence/absence of a species within sites)
- Example 6 - simulated samples drawn from a Binomial distribution reminiscent of proportional data (such as counts of a particular taxa out of a total number of individuals)
Let's formally simulate the data illustrated above. The underlying process dictates that, on average, a one unit change in the predictor (x) will be associated with a five unit change in the response (y), and when the predictor has a value of 0, the response will typically be 2. Hence, the response (y) will be related to the predictor (x) via the following:
\[ y = 2 + 5x \]
This is a deterministic model; it has no uncertainty. In order to simulate actual data, we need to add some random noise. We will assume that the residuals are drawn from a Gaussian distribution with a mean of zero and a standard deviation of 4. The predictor will comprise 10 uniformly distributed integer values between 1 and 10. We will round the response to two decimal places.
For repeatability, a seed will be employed on the random number generator. Note, the smaller the dataset, the less it is likely to represent the underlying deterministic equation, so we should keep this in mind when we look at how closely our estimated parameters approximate the ‘true’ values. Hence, the seed has been chosen to yield data that maintain a general trend that is consistent with the defining parameters.
set.seed(234)
dat <- data.frame(x = 1:10) |>
mutate(y = round(2 + 5*x + rnorm(n = 10, mean = 0, sd = 4), digits = 2))
dat
x y
1 1 9.64
2 2 3.79
3 3 11.00
4 4 27.88
5 5 32.84
6 6 32.56
7 7 37.84
8 8 29.86
9 9 45.05
10 10 47.65
We will use these data in two ways: firstly, to estimate the mean and variance of the response (y) ignoring the predictor (x), and secondly, to estimate the relationship between the response and the predictor.
For the former, we know that the mean and variance of the response (y) can be calculated as:
\[ \begin{align} \bar{y} =& \frac{1}{n}\sum^n_{i=1}y_i\\ var(y) =& \frac{1}{n}\sum^n_{i=1}(y-\bar{y})^2\\ sd(y) =& \sqrt{var(y)} \end{align} \]
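In R, these quantities can be obtained directly (noting that var() and sd() use the sample, \(n-1\), denominator and so will differ slightly from the \(1/n\) form above):

mean(dat$y)  # sample mean of the response
var(dat$y)   # sample variance (n - 1 denominator)
sd(dat$y)    # sample standard deviation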
As previously described, categorical predictors are transformed into dummy codes prior to the fitting of the linear model. We will simulate a small data set with a single categorical predictor comprising a control and two treatment levels (‘medium’, ‘high’). To simplify things we will assume a Gaussian distribution; however, most of the modelling steps would be the same regardless of the chosen distribution.
The data will be drawn from three Gaussian distributions with a standard deviation of 4 and means of 20, 15 and 10. We will draw a total of 12 observations, four from each of the three populations.
set.seed(123)
beta_0 <- 20
beta <- c(-5, -10)
sigma <- 4
n <- 12
x <- gl(3, 4, 12, labels = c('control', 'medium', 'high'))
y <- (model.matrix(~x) %*% c(beta_0, beta)) + rnorm(12, 0, sigma)
dat2 <- data.frame(x = x, y = y)
dat2
x y
1 control 17.758097
2 control 19.079290
3 control 26.234833
4 control 20.282034
5 medium 15.517151
6 medium 21.860260
7 medium 16.843665
8 medium 9.939755
9 high 7.252589
10 high 8.217352
11 high 14.896327
12 high 11.439255
The Poisson distribution is only parameterized by a single parameter (\(\lambda\)) which represents both the mean and variance. Furthermore, Poisson data can only be positive integers.
Unlike a simple trend between Gaussian (or uniformly) distributed variables, modelling against a Poisson distribution (via a log link) shifts the scale to logarithms. This needs to be taken into account when we simulate the data. The parameters that we use to simulate the underlying processes need to either be on a logarithmic scale, or else be converted to a logarithmic scale prior to using them to generate the random data.
Moreover, for any model that involves a non-identity link function (such as a logarithmic link function for Poisson models), ‘slope’ is only constant on the scale of the link function. When it is back transformed onto the natural scale (scale of the data), it takes on a different meaning and interpretation.
We will choose \(\beta_0\) to represent a value of 1 when \(x=0\). As for the ‘effect’ of the predictor on the response, let's say that for every one unit increase in the predictor the response increases by 40% (on the natural scale). Hence, on the log scale, the slope will be \(log(1.4)=\) 0.3364722.
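The corresponding simulation code is not reproduced here, but a minimal sketch consistent with the description might look like the following (the seed and the object name dat3 are assumptions):

set.seed(123)                              # assumed seed, for reproducibility only
beta <- log(c(1, 1.4))                     # intercept of 1 and a 40% increase per unit x, on the log scale
n <- 10
dat3 <- data.frame(x = seq(from = 1, to = 10, len = n)) |>   # `dat3` is an assumed name
    mutate(
        lambda = exp(beta[1] + beta[2] * x),   # expected count (back-transformed from the log scale)
        y = rpois(n, lambda = lambda)          # Poisson noise
    )
dat3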
In theory, count data should follow a Poisson distribution and therefore have properties like mean equal to variance (e.g. \(\textnormal{Dispersion}=\frac{\sigma^2}{\mu}=1\)). However, as simple linear models are low dimensional representations of a system, it is often unlikely that such a simple model can capture all the variability in the response (counts). For example, if we were modelling the abundance of a species of intertidal snail within quadrats in relation to water depth, it is highly unlikely that water depth alone drives snail abundance. There are countless other influences that the model has not accounted for. As a result, the observed data might be more variable than a Poisson (of a particular mean) would expect and in such cases, the model is over-dispersed (more variance than expected).
Models that fail to account for over-dispersion under-estimate variability (and thus over-state precision), resulting in inflated confidence in outcomes (elevated Type I errors).
There are numerous causes of over-dispersed count data (one of which is alluded to above). These are:
- additional sources of variability not being accounted for in the model (see above)
- when the items being counted aggregate together. Although the underlying items may have been generated by a Poisson process, the items clump together. When the items are counted, they are more likely to be found in either relatively low or relatively high numbers - hence the data are more varied than would be expected from their overall mean.
- imperfect detection resulting in excessive zeros. Again the underlying items may have been generated by a Poisson process, however detecting and counting the items might not be completely straight forward (particularly for more cryptic items). Hence, the researcher may have recorded no individuals in a quadrat and yet there was one or more present, they were just not obvious and were not detected. That is, layered over the Poisson process is another process that determines the detectability. So while the Poisson might expect a certain proportion of zeros, the observed data might have a substantially higher proportion of zeros - and thus higher variance.
This example will generate data that is drawn from a negative binomial distribution so as to broadly represent any one of the above causes.
We will choose \(\beta_0\) to represent a value of 1 when \(x=0\). As for the ‘effect’ of the predictor on the response, let's say that for every one unit increase in the predictor the response increases by 40% (on the natural scale). Hence, on the log scale, the slope will be \(log(1.4)=\) 0.3364722. Finally, the dispersion (size) parameter will be 10.
set.seed(234)
beta <- c(1, 1.40)
beta <- log(beta)
n <- 10
size <- 10
dat4 <- data.frame(x = seq(from = 1, to = 10, len = n)) |>
mutate(
mu = exp(beta[1] + beta[2] * x),
y = rnbinom(n, size = size, mu = mu)
)
dat4
x mu y
1 1 1.400000 0
2 2 1.960000 3
3 3 2.744000 7
4 4 3.841600 3
5 5 5.378240 5
6 6 7.529536 9
7 7 10.541350 13
8 8 14.757891 10
9 9 20.661047 17
10 10 28.925465 26
Binary data (presence/absence, dead/alive, yes/no, heads/tails, etc) pose unique challenges for linear modeling. Linear regression, designed for continuous outcomes, may not be directly applicable to binary responses. The nature of binary data violates assumptions of normality and homoscedasticity, which are fundamental to linear regression. Furthermore, linear models may predict probabilities outside the [0, 1] range, leading to unrealistic predictions.
This example will generate data that is drawn from a Bernoulli distribution so as to broadly represent presence/absence data.
We will choose \(\beta_0\) to represent the odds of a value of 1 when \(x=0\), equal to \(0.02\). This is equivalent to a probability of \(y\) being 1 when \(x=0\) of \(\frac{0.02}{1+0.02}=0.0196\). That is, at low \(x\), the response is likely to be close to 0. For every one unit increase in \(x\), we will stipulate a 2 times increase in the odds that the expected response is equal to 1.
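The corresponding simulation code is not reproduced here; a minimal sketch consistent with the description might be (the seed and the object name dat5 are assumptions):

set.seed(123)                              # assumed seed
beta <- log(c(0.02, 2))                    # odds of 0.02 at x = 0 and a doubling of odds per unit x, on the log-odds scale
n <- 10
dat5 <- data.frame(x = seq(from = 1, to = 10, len = n)) |>   # `dat5` is an assumed name
    mutate(y = rbinom(n, size = 1, prob = plogis(beta[1] + beta[2] * x)))  # Bernoulli = binomial with a single trial
dat5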
Similar to binary data, proportional (binomial) data tend to violate normality and homogeneity of variance (particularly as mean proportions approach either 0% or 100%).
This example will generate data that is drawn from a binomial distribution so as to broadly represent proportion data.
We will choose \(\beta_0\) to represent the odds of a particular trial (e.g. an individual) being of a particular type (e.g. species 1) when \(x=0\), and to be equal to \(0.02\). This is equivalent to a probability of \(y\) being of the focal type when \(x=0\) of \(\frac{0.02}{1+0.02}=0.0196\). That is, at low \(x\), the probability that an individual is taxa 1 is likely to be close to 0. For every one unit increase in \(x\), we will stipulate a 2.5 times increase in the odds that the expected response is equal to 1.
For this example, we will also convert the counts into proportions (\(y2\)) by division with the number of trials (\(5\)).
set.seed(123)
beta <- c(0.02, 2.5)
beta <- log(beta)
n <- 10
trials <- 5
dat6 <- data.frame(x = seq(from = 1, to = 10, len = n)) |>
mutate(
count = as.numeric(rbinom(n, size = trials, prob = plogis(beta[1] + beta[2] * x))),
total = trials,
y = count/total
)
dat6
x count total y
1 1 0 5 0.0
2 2 1 5 0.2
3 3 1 5 0.2
4 4 4 5 0.8
5 5 2 5 0.4
6 6 5 5 1.0
7 7 5 5 1.0
8 8 4 5 0.8
9 9 5 5 1.0
10 10 5 5 1.0
6 Exploratory data analysis
Statistical models utilize data and the inherent statistical properties of distributions to discern patterns, relationships, and trends, enabling the extraction of meaningful insights, predictions, or inferences about the phenomena under investigation. To do so, statistical models make assumptions about the likely distributions from which the data were collected. Consequently, the reliability and validity of any statistical model depend upon adherence to these underlying assumptions.
Exploratory Data Analysis (EDA) and assumption checking therefore play pivotal roles in the process of statistical analysis, offering essential tools to glean insights, assess the reliability of statistical methods, and ensure the validity of conclusions drawn from data. EDA involves visually and statistically examining datasets to understand their underlying patterns, distributions, and potential outliers. These initial steps provide an intuitive understanding of the data’s structure and guide subsequent analyses. By scrutinizing assumptions, such as normality, homoscedasticity, and independence, researchers can identify potential limitations or violations that may impact the accuracy and reliability of their findings.
Exploratory Data Analysis within the context of ecological statistical models usually comprises a set of targeted graphical summaries. These are not to be considered definitive diagnostics of the model assumptions, but rather a first pass to assess obvious violations prior to the fitting of models. More definitive diagnostics can only be achieved after a model has been fit.
In addition to graphical summaries, there are numerous statistical tests to help explore possible violations of various statistical assumptions. These tests are less commonly used in ecology since they are often more sensitive to departures from the ideal than the models they are intended to safeguard.
Simple classic regression models are often the easiest models to fit and interpret and as such often represent a standard by which other alternate models are gauged. As you will see later in this tutorial, such models can actually be fit using closed form (exact solution) matrix algebra that can be performed by hand. Nevertheless, and perhaps as a result, they also impose some of the strictest assumptions. Although these collective assumptions are specific to gaussian models, they do provide a good introduction to model assumptions in general, so we will use them to motivate the wider discussion.
Simple (gaussian) linear models (represented below) make the following assumptions:
The data depicted above were generated with R code along the lines of the sketch shown below.
The observations represent
- single observations drawn from 10 normal populations
- each population had a standard deviation of 4
- the mean of each population varied linearly according to the value of x (\(2 + 5x\))
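A minimal simulation consistent with this description (essentially the same process used to generate dat earlier; the object name dat_sim is purely illustrative):

set.seed(234)
dat_sim <- data.frame(x = 1:10) |>
    mutate(y = 2 + 5 * x + rnorm(n = 10, mean = 0, sd = 4))  # one draw from each of the 10 normal populations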
- normality: the residuals (and thus observations) must be drawn from populations that are normally distributed. The right hand figure underlays the fictitious normally distributed populations from which the observed values have been sampled.
Estimation and inference testing in linear regression assumes that the response is normally distributed in each of the populations. In this case, the populations are all possible measurements that could be collected at each level of \(x\) - hence there are 16 populations. Typically however, we only collect a single observation from each population (as is also the case here). How then can we evaluate whether each of these populations is likely to have been normal?
For a given response, the population distributions should follow much the same distribution shapes. Therefore provided the single samples from each population are unbiased representations of those populations, a boxplot of all observations should reflect the population distributions.
The two figures above show the relationships between the individual population distributions and the overall distribution. The left hand figure shows a distribution drawn from single representatives of each of the 16 populations. Since the 16 individual populations were normally distributed, the distribution of the 16 observations is also normal.
By contrast, the right hand figure shows 16 log-normally distributed populations and the resulting distribution of 16 single observations drawn from these populations. The overall boxplot mirrors each of the individual population distributions.
Whilst traditionally, non-normal data would typically be normalised via a scale transformation (such as a logarithmic transformation), these days it is arguably more appropriate to attempt to match the data to a more suitable distribution (see later in this tutorial).
You may have noticed that we have only explored the distribution of the response (y-axis). What about the distribution of the predictor (independent, x-axis) variable, does it matter? The distribution assumption applies to the residuals (which are purely in the direction of the y-axis). Indeed technically, it is assumed that there is no uncertainty associated with the predictor variable. They are assumed to be set and thus there is no error associated with the values observed. Whilst this might not always be reasonable, it is an assumption.
Given that the predictor values are expected to be set rather than measured, we actually assume that they are uniformly distributed. In practice, the exact distribution of predictor values is not that important provided it is reasonably symmetrical and no outliers (unusually small or large values) are created as a result of the distribution.
As with exploring the distribution of the response variable, boxplots, histograms and density plots can be useful means of exploring the distribution of predictor variable(s). When such diagnostics reveal distributional issues, scale transformations (such as logarithmic transformations) are appropriate.
- homogeneity of variance: the residuals (and thus observations) must be drawn from populations that are equally varied. The model as shown only estimates a single variance (\(\sigma^2\)) parameter - it is assumed that this is a good overall representation of all underlying populations. The right hand figure underlays the fictitious normally distributed and equally varied populations from which the observations have been sampled.
Moreover, since the expected values (obtained by solving the deterministic component of the model) and the variance must be estimated from the same data, they need to be independent (not related to one another).
Simple linear regression also assumes that each of the populations are equally varied. Actually, it is the prospect of a relationship between the mean and variance of y-values across x-values that is of the greatest concern. Strictly the assumption is that the distribution of y values at each x value are equally varied and that there is no relationship between mean and variance.
However, as we only have a single y-value for each x-value, it is difficult to directly determine whether the assumption of homogeneity of variance is likely to have been violated (mean of one value is meaningless and variability can’t be assessed from a single value). The figure below depicts the ideal (and almost never realistic) situation in which (left hand figure) the populations are all equally varied. The middle figure simulates drawing a single observation from each of the populations. When the populations are equally varied, the spread of observed values around the trend line is fairly even - that is, there is no trend in the spread of values along the line.
If we then plot the residuals (difference between observed values and those predicted by the trendline) against the predicted values, there is a definite lack of pattern. This lack of pattern is indicative of a lack of issues with homogeneity of variance.
If we now contrast the above to a situation where the population variance is related to the mean (unequal variance), we see that the observations drawn from these populations are not evenly distributed along the trendline (they get more spread out as the mean predicted value increases). This pattern is emphasized in the residual plot which displays a characteristic “wedge”-shape pattern.
Hence looking at the spread of values around a trendline on a scatterplot of \(y\) against \(x\) is a useful way of identifying gross violations of homogeneity of variance. Residual plots provide an even better diagnostic. The presence of a wedge shape is indicative that the population mean and variance are related.
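As a quick illustration (jumping slightly ahead to the lm() function used later in this tutorial), a residual plot of this kind can be produced from a fitted model; this is only a sketch using the dat example from above:

fit <- lm(y ~ x, data = dat)                       # fit the simple linear regression
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")   # a wedge shape here would suggest unequal variance
abline(h = 0, lty = 2)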
- linearity: the underlying relationships must be simple linear trends, since the line of best fit through the data (of which the slope is estimated) is linear. The right hand figure depicts a linear trend through the underlying populations.
It is important to disclose the meaning of the word “linear” in the term “linear regression”. Technically, it refers to a linear combination of regression coefficients. For example, the following are examples of linear models:
- \(y_i = \beta_0 + \beta_1 x_i\)
- \(y_i = \beta_0 + \beta_1 x_i + \beta_2 z_i\)
- \(y_i = \beta_0 + \beta_1 x_i + \beta_2 x^2_i\)
All the coefficients (\(\beta_0\), \(\beta_1\), \(\beta_2\)) are linear terms. Note that the last of the above examples, is a linear model, however it describes a non-linear trend. Contrast the above models with the following non-linear model:
- \(y_i = \beta_0 + x_i^{\beta_1}\)
In that case, the coefficients are not linear combinations (one of them is a power term).
That said, a simple linear regression usually fits a straight (linear) line through the data. Therefore, prior to fitting such a model, it is necessary to establish whether this really is the most sensible way of describing the relationship. That is, does the relationship appear to be linearly related or could some other non-linear function describe the relationship better. Scatterplots and residual plots are useful diagnostics.
To see how a residual plot could be useful, consider the following. The first row of figures illustrates the residuals resulting from data drawn from a linear trend. The residuals are effectively random noise. By contrast, the second row shows the residuals resulting from data drawn from a non-linear relationship that has nevertheless been modelled as a linear trend. There is still a clear pattern remaining in the residuals.
The above might be an obvious and somewhat overly contrived example, yet it does illustrate the point - that a pattern in the residuals could point to a mis-specified model.
If non-linearity does exist (as in the second case above), then fitting a straight line through what is obviously not a straight relationship is likely to poorly represent the true nature of the relationship. There are numerous causes of non-linearity:
- underlying distributional issues can result in non-linearity. For example, if we are assuming a gaussian distribution and the data are non-normal, often the relationships will appear non-linear. Addressing the distributional issues can therefore resolve the linearity issues
- the underlying relationship might truly be non-linear in which case this should be reflected in some way by the model formula. If the model formula fails to describe the non-linear trend, then problems will persist.
- the model proposed is missing an important covariate that might help standardise the data in a way that results in linearity
- independence: the residuals (and thus observations) must be independently drawn from the populations. That is, the correlation between all the observations is assumed to be 0 (off-diagonals in the covariance matrix). More practically, there should be no pattern to the correlations between observations.
Random sampling and random treatment assignment are experimental design elements that are intended to mitigate many types of sampling biases that cause dependencies between observations. Nevertheless, there are aspects of sampling designs that are either logistically difficult to randomise or in some cases not logically possible. For example, the residuals from observations sampled closer together in space and time will likely be more similar to one another than those of observations more spaced apart. Since neither space nor time can be randomised, data collected from sampling designs that involve sampling over space and/or time need to be assessed for spatial and temporal dependencies. These concepts will be explored in the context of introducing susceptible designs in a later tutorial.
The above is only a very brief overview of the model assumptions that apply to just one specific model (simple linear gaussian regression). For the remainder of this section, we will graphically explore the motivating example data sets so as to gain insights into what distributional assumptions might be most valid, and thus help guide modelling choices. Similarly, for subsequent tutorials in this series (that introduce progressively more complex models), all associated assumptions will be explored and detailed.
Conclusions
- there are no obvious violations of the linear regression model assumptions
- we can now fit the suggested model
- full confirmation about the model’s goodness of fit should be reserved until after exploring the additional diagnostics that are only available after fitting the model.
Conclusions
- the spread of noise in each group seems reasonably similar
- more importantly, there does not seem to be a relationship between the mean (as approximated by the position of the boxplots along the y-axis) and the variance (as approximated by the spread of the boxplots).
- that is, the size of the boxplots do not vary with the elevation of the boxplots.
Linearity is not an issue for categorical predictors since it is effectively fitting separate lines between pairs of points (and a line between two points can only ever be linear)….
Conclusions
- no evidence of non-normality
- no evidence of non-homogeneity of variance
Conclusions
- the spread of noise does not look random along the line of best fit.
- homogeneity of variance is difficult to assess in the presence of distributional issues (such as non-normality in this case) as they can result in non-linearity (apparent here)
Conclusions
- there are obvious violations of the linear regression model assumptions
- we should consider a different model that does not assume normality
Conclusions
- the spread of noise does not look random along the line of best fit.
- homogeneity of variance is difficult to assess in the presence of distributional issues (such as non-normality in this case) as they can result in non-linearity (apparent here)
Conclusions
- there are obvious violations of the linear regression model assumptions
- we should consider a different model that does not assume normality
Conclusions
- there are obvious violations of the linear regression model assumptions
- we should consider a different model that does not assume normality
Conclusions
- although there is no evidence of non-linearity from this small data set, it is worth noting that the line of best fit does extend outside the logical response range [0, 1] within the range of observed \(x\) values. That is, a simple linear model would predict proportions higher than 100% at high values of \(x\)
Conclusions
- there are obvious violations of the linear regression model assumptions
- we should consider a different model that does not assume normality
7 Data standardisation
Data centering involves subtracting the mean of the data from each observation, thereby resulting in the value of zero being in the center of the data; the spacing and order of the data are unaffected. We can take this further and standardise (scale) the data such that it has a mean of zero (as with centering) and a standard deviation of one. Again, the spacing and order of the data are unaffected.
Consider the three figures below. They each present the same data, however the data in the middle figure has been centered (it now has a mean of 0) and the data in the right hand figure has been standardized. Note that in each case, the order and spacing of the data has not changed, it is just that the number line has been shifted to the right (and squashed in the case of the standardised data).
So what is the point I hear you ask. Data standardisation can greatly enhance the performance and stability of statistical models:
improves numerical stability - many statistical models involve mathematical operations that can be very sensitive to large numbers or scale imbalances. Statistical models (particularly those that involve optimisation algorithms) work far better (and are more likely to converge) when all data are on a standardised scale.
allows effect sizes to be compared - effect parameters are usually some form of slope which is calculated as the change in \(y\) divided by the change in \(x\). The magnitude of the slope is therefore determined by the scale of both \(y\) and \(x\). If a model has multiple predictors, each of which are on vastly different scales, then it is very difficult to assess which has the largest effect on \(y\) - since any differences in relative effects will be masked by differences in their \(x\) scales. For example, one predictor might have a slope twice as large as another either because its effect on \(y\) is twice as much or because the scale of its values is twice that of the other (or some combination of the two reasons).
enhances interpretation - recall that the y-intercept is the expected value of \(y\) when \(x\) is equal to zero. In many applications, an x value of zero is outside the range of the sampled data (as well as that of the population). For example, in the example of bird body mass and its effect on clutch size, a bird body mass of zero is not going to be within the domain of collectable data. A bird that has no mass does not exist. In such cases, the y-intercept has no practical interpretation.
If however, the data are centered (or standardised), an \(x\) value of zero corresponds to the mean of the original \(x\) values. Hence, the y-intercept is then interpreted as the expected value of \(y\) at the average \(x\) - now a meaningful interpretation.
mitigates collinearity in multiplicative models - when a model contains multiple continuous predictors along with their interactions, unless the data are centered (or standardised), the main effects will be correlated with the interactions and thus very difficult to interpret correctly.
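In R, centering and standardising can both be performed with the scale() function; a small sketch on an arbitrary vector:

vals <- c(2, 5, 6, 9, 13)
vals - mean(vals)                          # centered: mean of zero, spacing and order unchanged
scale(vals, center = TRUE, scale = TRUE)   # standardised: mean of zero, standard deviation of one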
8 Ordinary Least Squares (OLS)
The following tabbed content will work through fitting OLS to each of the example data sets. Only the first two example data sets (Example 1 - Gaussian and Example 2 - categorical) satisfy the assumptions of OLS. The other examples are included primarily so that we can see what the model diagnostics look like when one or more assumptions are violated.
For Example 1 - Gaussian, we will fit the model three ways (on raw predictor values, on centered predictor values and on standardised predictor values). The purpose of this is to illustrate how these choices impact on the diagnostics and interpretation of the model.
The Ordinary Least Squares (OLS) estimate of a parameter (e.g. the mean, \(\mu\)) is the value of this parameter that minimizes the sum of the squared residuals (e.g. \((y_i-\mu)^2\)). That is, it minimizes the total difference between the observed values (\(y_i\)) and those expected (\(\mu\)).
In this section, I am bypassing the usually essential assumption checking and diagnostic steps so as to focus on the underpinning mechanisms of parameter estimation.
We could calculate the residual sum of squares (RSS) for a large number of combinations of potential parameter values in order to find the ‘optimum’ parameter values (those that minimize RSS).
mu <- seq(15,40,len=1000)
rss <- vector('numeric', length(mu))
for (i in 1:length(mu)) {
rss[i] <- sum((dat$y - mu[i])^2)
}
mu[which(rss == min(rss))]
[1] 27.81281
This minimum is very close to that calculated with the simple formula (as applied with the mean() function).
Now this is all fine and well, but what if we need to estimate multiple parameters? Let's see how we could estimate the mean via matrix algebra as a stepping stone towards solving more complex problems.
We start by expressing the problem as a linear equation.
\[ y_i = \beta_0 + \varepsilon_i \]
Here, we are saying that our observed values can be considered to be equal to the mean of the values plus the individual deviations (residuals) of each of the values from the mean. Note, in the context of a linear model, this is just an intercept only model.
Recall that the typical linear model might look like:
\[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \]
Which is the same as: \[ y_i = \beta_0\times 1 + \beta_1\times x_i + \varepsilon_i \]
Returning to our intercept only model then: \[ y_i = \beta_0\times 1 + \varepsilon_i \]
And in matrix form, this would be: \[ \underbrace{\begin{bmatrix} y_1\\ y_2\\ y_3\\ ...\\ y_n \end{bmatrix}}_{Y} = \underbrace{\begin{bmatrix} 1\\ 1\\ 1\\ ...\\ 1 \end{bmatrix}}_{\boldsymbol{X}} \underbrace{\vphantom{\begin{bmatrix} y_1\\ y_2\\ y_3\\ ...\\ y_n \end{bmatrix}}\begin{bmatrix} \beta_0 \end{bmatrix}}_{\boldsymbol{\beta}} + \underbrace{\begin{bmatrix} \varepsilon_1\\ \varepsilon_2\\ \varepsilon_3\\ ...\\ \varepsilon_n \end{bmatrix}}_{\boldsymbol{\varepsilon}} \]
- \(Y\) represents a column vector of the observed data
- \(\boldsymbol{X}\) represents what is called the model matrix. In this case, that is just a single column matrix of 1s. The ones indicate that there is an intercept. In matrix algebra, matrices are multiplied, so a column of ones just ensures that the thing being multiplied does exist.
- \(\boldsymbol{\beta}\) represents a vector of parameters to be estimated. In this case, there is only one.
- \(\boldsymbol{\varepsilon}\) represents a column vector of residuals.
This could all be summarised as:
\[ Y = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \]
If we assume that the residuals (\(\varepsilon\)) are (independently and identically distributed, iid) all independently drawn from the same normal distribution with a mean of 0 and a constant variance, then there is a rather elegant solution for solving for \(\boldsymbol{\beta}\).
Recall that OLS estimates the value of the unknown parameters (\(\boldsymbol{\beta}\)) as the values that minimize the sum of the squared residuals. So lets start by expressing the above equation in terms of \(\boldsymbol{\varepsilon}\).
\[ \boldsymbol{\varepsilon} = Y - \boldsymbol{X}\boldsymbol{\beta} \]
Via matrix algebra, we can calculate the sum of squared residuals as \(\boldsymbol{\varepsilon}^T\boldsymbol{\varepsilon}\), where \(\boldsymbol{\varepsilon}^T\) is the transpose of \(\boldsymbol{\varepsilon}\), so \(\boldsymbol{\varepsilon}^T\boldsymbol{\varepsilon}\) is just:
\[ \begin{bmatrix} \varepsilon_1 & \varepsilon_2 & \varepsilon_3 &...& \varepsilon_n \end{bmatrix} \begin{bmatrix} \varepsilon_1\\ \varepsilon_2\\ \varepsilon_3\\ ...\\ \varepsilon_n\\ \end{bmatrix} = \begin{bmatrix} \varepsilon_1\times\varepsilon_1 + \varepsilon_2\times\varepsilon_2 + \varepsilon_3\times\varepsilon_3 + ... + \varepsilon_n\times\varepsilon_n \end{bmatrix} \]
Based on all this, we could re-write \(\boldsymbol{\varepsilon}^T\boldsymbol{\varepsilon}\) as:
\[ \boldsymbol{\varepsilon}^T\boldsymbol{\varepsilon} = (Y - \boldsymbol{X}\boldsymbol{\beta})^T(Y - \boldsymbol{X}\boldsymbol{\beta}) \]
If we then take the derivatives of the above with respect to \(\boldsymbol{\beta}\), we can obtain the values of \(\boldsymbol{\beta}\) that minimize the sum of squared residuals. Ultimately, we arrive at:
\[ \boldsymbol{\beta} = \underbrace{(\boldsymbol{X}^T\boldsymbol{X})^{-1}}_{\frac{1}{n}} \underbrace{\boldsymbol{X}^T Y}_{\sum^n_{i=1}y_i} \]
where \((\boldsymbol{X}^T\boldsymbol{X})^{-1}\) is the inverse of \((\boldsymbol{X}^T\boldsymbol{X})\). In matrix algebra, matrices are never divided, only multiplied. Therefore, to divide by \((\boldsymbol{X}^T\boldsymbol{X})\), it is necessary to multiply by its inverse, \((\boldsymbol{X}^T\boldsymbol{X})^{-1}\), in the same way that dividing by \(n\) is the same as multiplying by the inverse of \(n\), which is of course \(\frac{1}{n}\).
In the above equation, we can see how the matrix form translates directly into the simple calculation of a mean. \((\boldsymbol{X}^T\boldsymbol{X})^{-1}\) equates to \(\frac{1}{n}\) and \(\boldsymbol{X}^T Y\) equates to \(\sum^n_{i=1}y_i\).
So to estimate \(\boldsymbol{\beta}\), all we have to do is perform the above matrix multiplications. Unfortunately, matrix inversion (e.g. calculating \((\boldsymbol{X}^T\boldsymbol{X})^{-1}\)) is not always straightforward (although in this case it is, because the matrix only has a single value).
In order to be able to invert a matrix, the following must be satisfied:
- the matrix must be square (have the same number of rows as columns)
- be full rank, that is the number of independent columns (the rank) must be equal to the number of columns.
If these are the case, then we can obtain the inverse by decomposing the square matrix into two parts:
- the inverse (i.e. what we are after)
- an identity matrix (a by-product that is typically thrown away)
In R, this can be achieved via the solve() (or qr.solve()) functions.
## Generate the single column matrix of 1s
X <- cbind(rep(1,length(dat$y)))
## perform the matrix multiplications to solve for beta
## beta = (X'X)^-1 X'Y
beta <- solve(t(X) %*% X) %*% t(X) %*% dat$y
beta
[,1]
[1,] 27.811
This is the same as the empirical mean that we approximated via the brute force method above.
Recall the motivating example 1 data as:
x y
1 1 9.64
2 2 3.79
3 3 11.00
4 4 27.88
5 5 32.84
6 6 32.56
7 7 37.84
8 8 29.86
9 9 45.05
10 10 47.65
So the statistical model might be:
\[ \begin{align} y_i =& \beta_0 + \beta_1 x_i + \varepsilon_i\\ Y =& \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \end{align} \]
which can be represented in matrix form as:
\[ \begin{bmatrix} 9.64\\ 3.79\\ 11\\ ...\\ \end{bmatrix} = \underbrace{\begin{bmatrix} 1&1\\ 1&2\\ 1&3\\ ...\\ \end{bmatrix}}_{\text{model matrix}} \begin{bmatrix} \beta_0\\ \beta_1 \end{bmatrix} + \begin{bmatrix} \varepsilon_1\\ \varepsilon_2\\ \varepsilon_3\\ ...\\ \end{bmatrix} \]
So, the response is represented by a vector and this is related to the model matrix (a matrix representing the predictor(s)), a vector of parameters to estimate (\(\beta_0\): the intercept and \(\beta_1\): the slope) and a vector of residuals (random noise).
Recall that in order to estimate \(\boldsymbol{\beta}\), we solve the following:
\[ \boldsymbol{\beta} = (\boldsymbol{X}^T\boldsymbol{X})^{-1} \boldsymbol{X}^T Y \]
\[ y_i = \beta_0+\beta_1\times x_i+\varepsilon_i \hspace{1cm}\varepsilon_i\sim{}\mathcal{N}(0, \sigma^2) \]
This whole process can be more conveniently wrapped into a simple function within R called lm().
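A minimal sketch comparing the matrix solution with lm() for the motivating example 1 data (output not shown here):

## model (design) matrix: a column of 1s (intercept) and the observed x values
X <- model.matrix(~x, data = dat)
## OLS solution: beta = (X'X)^-1 X'Y
solve(t(X) %*% X) %*% t(X) %*% dat$y
## the same estimates via lm()
coef(lm(y ~ x, data = dat))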
Having fit the model, it is vital that we review a range of regression diagnostics so as to assess the goodness of fit of the model. This step provides a more definitive assessment of the model assumptions.
When supplied with a fitted linear model, the autoplot() function generates a set of standard diagnostic regression plots:
- the top left plot is a residual plot: Ideally, there should not be any patterns in the residual plot. Of particular concern would be if there was a wedge (funnel) shape (indicative of unequal variances) or some curved shape (indicative of having fit a linear trend through non-linear data).
- the top right plot is a Q-Q normal plot: This figure plots the quantiles of the residuals against the quantiles that would be expected if the data were drawn from a normal distribution. Ideally, the points should be close to the dashed line. Deviations from this line at the tails signify potential non-normality.
- the middle left plot is a standardised residual plot. This has a similar interpretation to the residual plot.
- the middle right plot displays the Cook’s distance values associated with each observation. Cook’s distance is a measure of the influence of each point on the estimated parameter values. This is a combination of residuals (a measure of outlierness along the y-axis) and leverage (a measure of outlierness along the x-axis). Essentially, it estimates variations in regression coefficients after removing each observation (one by one). Ideally, all the values should be under 0.8. Numbers above the bars indicate the observation number - this can be useful to identify which observations are influential.
- the bottom left plot displays the trend between residuals and leverage. In the event of large Cook’s distance, this can help identify whether the issues are due to residuals or leverage.
- the bottom right plot displays the trend between Cook’s distance and leverage.
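A sketch of how such a panel might be requested (assuming the fitted model is stored in dat.lm; which = 1:6 requests all six plots described above):
library(ggfortify) # provides the autoplot() method for lm objects
autoplot(dat.lm, which = 1:6)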
Statistical models assume that all observations have the same influence (weight) on the estimates. Outliers (unusually small or large observations) are problematic as they tend to have higher influence on estimates. They tend to “pull” the estimates (that is bias the estimates) towards themselves.
There are numerous measures of observation influence.
- residuals - the deviations (along the y-axis) between observed and expected values. Observations that have relatively large residuals (an order of magnitude greater than the others) can be viewed as potentially overly influential.
- leverage - the deviations along the x-axis. That is, how unusual is each predictor observation. For experimental designs, this is usually not an issue since the observed values of predictors are under the control of the experiment (we set what values of the predictor we wish to explore a response to). However, for natural experiments (e.g. surveys), the experimenter has less control over the range of predictor values observed.
- Cook’s distance - this is a metric of influence that combines together residuals and leverage. The values can range from 0 (typical influence) to 1 (highly influential). Values greater than 0.8 should be viewed as overly influential and scrutinised further.
Cook’s distance (cook.d) and leverage (hat) are displayed in tabular form.
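The following sketch shows calls that could produce tables like those below (influence.measures() is the base R routine behind the first table; fortify() is one way to obtain the per-observation measures in the second, although the exact call used here is an assumption):
## dfbetas, dffits, covariance ratios, Cook's distance and hat (leverage) values
influence.measures(dat.lm)
## per-observation .hat, .cooksd, .fitted, .resid and .stdresid values
fortify(dat.lm)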
Influence measures of
lm(formula = y ~ x, data = dat) :
dfb.1_ dfb.x dffit cov.r cook.d hat inf
1 0.3312 -0.2802 0.332 1.881 0.06131 0.345 *
2 -0.9249 0.7304 -0.945 0.905 0.36812 0.248
3 -0.4086 0.2881 -0.439 1.243 0.09745 0.176
4 0.3878 -0.2187 0.473 1.008 0.10468 0.127
5 0.2680 -0.0756 0.441 0.945 0.08939 0.103
6 0.0409 0.0231 0.135 1.393 0.01012 0.103
7 0.0000 0.0923 0.199 1.387 0.02186 0.127
8 0.2080 -0.5868 -0.894 0.672 0.29732 0.176
9 -0.0483 0.0954 0.123 1.715 0.00865 0.248
10 0.0505 -0.0855 -0.101 1.984 0.00586 0.345 *
y x .hat .sigma .cooksd .fitted .resid .stdresid
1 9.64 1 0.3454545 6.522628 0.061314588 7.225273 2.4147273 0.4820270
2 3.79 2 0.2484848 5.623285 0.368122762 11.799879 -8.0098788 -1.4922110
3 11.00 3 0.1757576 6.229844 0.097452765 16.374485 -5.3744848 -0.9560542
4 27.88 4 0.1272727 5.996167 0.104682723 20.949091 6.9309091 1.1981856
5 32.84 5 0.1030303 5.940710 0.089394162 25.523697 7.3163030 1.2476017
6 32.56 6 0.1030303 6.546155 0.010120337 30.098303 2.4616970 0.4197772
7 37.84 7 0.1272727 6.494260 0.021858264 34.672909 3.1670909 0.5475130
8 29.86 8 0.1757576 5.342608 0.297318395 39.247515 -9.3875152 -1.6699226
9 45.05 9 0.2484848 6.597780 0.008650710 43.822121 1.2278788 0.2287493
10 47.65 10 0.3454545 6.610265 0.005863429 48.396727 -0.7467273 -0.1490614
In this case, no observations would be considered overly influential.
The performance package provides a similar set of visual diagnostics:
- the top left plot is a “Posterior predictive check” plot. This plot features the density of the observed response (thick green line) along with a collection of random realisations predicted from the model. The point of this diagnostic is to see whether the predictions are broadly consistent with the input data. If the densities of the realisations and observed data are not well aligned, it might imply that the model is not a good fit.
- the top right and middle left plots are residual and standardised residual plots. Ideally, these should show no patterns in the residuals.
- the middle right plot is an influence plot that plots residuals against leverage with contours to help identify large Cook’s D values. Points should not fall outside the 0.8 contours (if present).
- the bottom left plot is a form of Q-Q plot that plots the distribution (density) of the residuals from observed data (y-axis) against those based on an exact normal (or other nominated) distribution. Ideally, the points should all lie along the green horizontal line.
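A sketch of the corresponding calls (assuming the fitted model is dat.lm; the threshold is chosen here to match the output below, so the exact call used in the original is an assumption):
library(performance)
library(see) # required for plotting the check_model() diagnostics
check_model(dat.lm)
check_outliers(dat.lm, method = "cook", threshold = list(cook = 0.8))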
OK: No outliers detected.
- Based on the following method and threshold: cook (0.8).
- For variable: (Whole model)
Conclusions:
- there are no alarming signs from any of the diagnostics
Although an exploration of model residuals can provide important insights into the goodness of fit and conformity of a model to the underlying assumptions, diagnosing issues is a bit of a fine art. This difficulty is exacerbated when the residuals are calculated from some models (e.g. logistic regression models) where it is difficult to separate out nefarious patterns from the specific patterns expected of that particular model.
The DHARMa package generates standardised residuals via simulation and uses these as the basis of a range of tools to diagnose common modelling issues including outliers, heterogeneity, over-dispersion and autocorrelation.
New observations simulated from the fitted model are used to calculate a cumulative distribution function (CDF) that describes the probability profile of each observation. Thereafter, the residual of an observation is calculated as the value of the CDF that corresponds to the actual observed value:
- a value of 0 indicates that the observed value was less than all simulated values
- a value of 1 indicates that the observed value was greater than all simulated values
- a value of 0.5 indicates that half of the simulated values were greater than the observed value. For a correctly specified model, these quantile residuals should be uniformly distributed.
This approach ensures that all residuals have the same interpretation irrespective of the model and distribution selected.
Exploring DHARMa residuals begins with running the simulateResiduals() function. If this is done with plot=TRUE, a pair of diagnostic plots with a range of diagnostic tests will be provided as side effects.
- the left hand plot is a Q-Q plot (ideally all points should be close to the red line).
- the KS (Kolmogorov-Smirnov) test tests whether the residuals (in this case simulated) are likely to have been drawn from the nominated distribution (in this case Gaussian).
- the Dispersion test tests whether the standard deviation of the data is equal to that of the simulated data
- the Outlier test tests for the prevalence of outliers (when observed values are outside the simulated range)
- the right hand plot is a residual plot. Ideally, there should be no patterns in the residuals. To help identify any patterns, quantile trends are overlayed. Ideally, there should be a flat black line at each of the quantiles of 0.25, 0.5 and 0.75.
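A sketch of generating the simulated residuals (the object name dat.resid is chosen to match the name appearing in some of the test output below):
library(DHARMa)
## simulate new observations from the fitted model and compute scaled (quantile) residuals
dat.resid <- simulateResiduals(dat.lm, plot = TRUE)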
Alternatively, once the simulated residuals have been generated, individual tests and plots can be applied to the simulated residuals. This includes:
- a Q-Q plot (same as above)
- a plot of simulated dispersion values (histogram) as well as the standardised observed dispersion (red line). Dispersion is the ratio of variance to mean and on a non-standardized scale should ideally be 1. Higher values indicate overdispersion. More about this issue will be discussed later on in the tutorial. For the plot, the red line should fall well within the range of the underlying histogram. If the red line is off to the right, it suggests that the model might be over-dispersed. If the red line is to the left, it suggests under-dispersion.
- histogram of outliers. The red line (if present) indicates the presence of outliers.
The figures are accompanied by formal statistical tests. These tests should be reviewed, however, be aware that they are typically more sensitive than the assumptions they seek to explore. So while it is important to review them in order to identify potential issues, significant tests need not be considered indicators of complete model failure.
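The individual tests reported below might be obtained with calls such as these (a sketch; testResiduals() bundles the uniformity, dispersion and outlier tests, and testQuantiles() performs the qgam-based quantile test):
testResiduals(dat.resid) # uniformity (KS), dispersion and outlier tests
testUniformity(dat.resid)
testDispersion(dat.resid)
testOutliers(dat.resid)
testQuantiles(dat.resid) # tests the location of the 0.25, 0.5 and 0.75 quantiles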
$uniformity
Exact one-sample Kolmogorov-Smirnov test
data: simulationOutput$scaledResiduals
D = 0.152, p-value = 0.949
alternative hypothesis: two-sided
$dispersion
DHARMa nonparametric dispersion test via sd of residuals fitted vs.
simulated
data: simulationOutput
dispersion = 0.90609, p-value = 0.952
alternative hypothesis: two.sided
$outliers
DHARMa outlier test based on exact binomial test with approximate
expectations
data: simulationOutput
outliers at both margin(s) = 0, observations = 10, p-value = 1
alternative hypothesis: true probability of success is not equal to 0.007968127
95 percent confidence interval:
0.0000000 0.3084971
sample estimates:
frequency of outliers (expected: 0.00796812749003984 )
0
- uniformity
Exact one-sample Kolmogorov-Smirnov test
data: simulationOutput$scaledResiduals
D = 0.152, p-value = 0.949
alternative hypothesis: two-sided
- dispersion
DHARMa nonparametric dispersion test via sd of residuals fitted vs.
simulated
data: simulationOutput
dispersion = 0.90609, p-value = 0.952
alternative hypothesis: two.sided
- outliers
DHARMa outlier test based on exact binomial test with approximate
expectations
data: dat.resid
outliers at both margin(s) = 0, observations = 10, p-value = 1
alternative hypothesis: true probability of success is not equal to 0.007968127
95 percent confidence interval:
0.0000000 0.3084971
sample estimates:
frequency of outliers (expected: 0.00796812749003984 )
0
- quantiles
Test for location of quantiles via qgam
data: simulationOutput
p-value = 0.8891
alternative hypothesis: both
Since the diagnostics do not reveal any issues, we are free to explore the model summaries.
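For example (a sketch, assuming the fitted model is dat.lm):
summary(dat.lm) # coefficient table, R-squared and F-statistic
confint(dat.lm) # 95% confidence intervals for the coefficients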
Call:
lm(formula = y ~ x, data = dat)
Residuals:
Min 1Q Median 3Q Max
-9.387 -4.218 1.821 2.991 7.316
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.6507 4.2299 0.627 0.548349
x 4.5746 0.6817 6.710 0.000151 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.192 on 8 degrees of freedom
Multiple R-squared: 0.8491, Adjusted R-squared: 0.8303
F-statistic: 45.03 on 1 and 8 DF, p-value: 0.0001511
2.5 % 97.5 %
(Intercept) -7.103503 12.404836
x 3.002579 6.146633
Interpretation:
- when \(x=0\), \(y\) is expected to be 2.651 (the y-intercept).
- for every one unit change in \(x\), the expected value of \(y\) increases by 4.575 (the slope)
- the slope could be as low as 3.003 or as high as 6.147 (95% confidence interval of the slope)
- as the above interval does not include the value of 0, the linear relationship (slope) can be considered significant
- we would reject the null hypothesis of no relationship (p-value for slope is less than 0.05)
- 84.914% of the variance in \(y\) is explained by its linear relationship with \(x\) (R squared value)
\[ y_i = \beta_0+\beta_1\times (x_i - \bar{x}) +\varepsilon_i \hspace{1cm}\varepsilon_i\sim{}\mathcal{N}(0, \sigma^2) \]
This whole process can be more conveniently wrapped into a simple function within R called lm().
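A sketch of fitting the centred-predictor model (the object name dat1b.lm is an assumption; the formula matches the lm(formula = y ~ scale(x, scale = FALSE), data = dat) call echoed below):
## centre (but do not scale) the predictor within the model formula
dat1b.lm <- lm(y ~ scale(x, scale = FALSE), data = dat)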
Having fit the model, it is vital that we review a range of regression diagnostics so as to assess the goodness of fit of the model. This step helps establish whether the model assumptions are reasonably satisfied.
When supplied with a fitted linear model, the autoplot() function generates a set of standard diagnostic regression plots:
- the top left plot is a residual plot: Ideally, there should not be any patterns in the residual plot. Of particular concern would be if there was a wedge (funnel) shape (indicative of unequal variances) or some curved shape (indicative of having fit a linear trend through non-linear data).
- the top right plot is a Q-Q normal plot: This figure plots the quantiles of the residuals against the quantiles that would be expected if the data were drawn from a normal distribution. Ideally, the points should be close to the dashed line. Deviations from this line at the tails signify potential non-normality.
- the middle left plot is a standardised residual plot. This has a similar interpretation to the residual plot.
- the middle right plot displays the Cook’s distance values associated with each observation. Cook’s distance is a measure of the influence of each point on the estimated parameter values. This is a combination of residuals (a measure of outlierness along the y-axis) and leverage (a measure of outlierness along the x-axis). Essentially, it estimates variations in regression coefficients after removing each observation (one by one). Ideally, all the values should be under 0.8. Numbers above the bars indicate the observation number - this can be useful to identify which observations are influential.
- the bottom left plot displays the trend between residuals and leverage. In the event of large Cook’s distance, this can help identify whether the issues are due to residuals or leverage.
- the bottom right plot displays the trend between Cook’s distance and leverage.
Statistical models assume that all observations have the same influence (weight) on the estimates. Outliers (unusually small or large observations) are problematic as they tend to have higher influence on estimates. They tend to “pull” the estimates (that is bias the estimates) towards themselves.
There are numerous measures of observation influence.
- residuals - the deviations (along the y-axis) between observed and expected values. Observations that have relatively large residuals (an order of magnitude greater than the others) can be viewed as potentially overly influential.
- leverage - the deviations along the x-axis. That is, how unusual is each predictor observation. For experimental designs, this is usually not an issue since the observed values of predictors are under the control of the experiment (we set what values of the predictor we wish to explore a response to). However, for natural experiments (e.g. surveys), the experimenter has less control over the range of predictor values observed.
- Cook’s distance - this is a metric of influence that combines together residuals and leverage. The values can range from 0 (typical influence) to 1 (highly influential). Values greater than 0.8 should be viewed as overly influential and scrutinised further.
Cook’s distance (cook.d) and leverage (hat) are displayed in tabular form.
Influence measures of
lm(formula = y ~ scale(x, scale = FALSE), data = dat) :
dfb.1_ dfb.ss.F dffit cov.r cook.d hat inf
1 0.1789 -0.2802 0.332 1.881 0.06131 0.345 *
2 -0.5994 0.7304 -0.945 0.905 0.36812 0.248
3 -0.3310 0.2881 -0.439 1.243 0.09745 0.176
4 0.4188 -0.2187 0.473 1.008 0.10468 0.127
5 0.4342 -0.0756 0.441 0.945 0.08939 0.103
6 0.1326 0.0231 0.135 1.393 0.01012 0.103
7 0.1767 0.0923 0.199 1.387 0.02186 0.127
8 -0.6741 -0.5868 -0.894 0.672 0.29732 0.176
9 0.0783 0.0954 0.123 1.715 0.00865 0.248
10 -0.0546 -0.0855 -0.101 1.984 0.00586 0.345 *
y scale(x, scale = FALSE) .hat .sigma .cooksd .fitted
1 9.64 -4.5 0.3454545 6.522628 0.061314588 7.225273
2 3.79 -3.5 0.2484848 5.623285 0.368122762 11.799879
3 11.00 -2.5 0.1757576 6.229844 0.097452765 16.374485
4 27.88 -1.5 0.1272727 5.996167 0.104682723 20.949091
5 32.84 -0.5 0.1030303 5.940710 0.089394162 25.523697
6 32.56 0.5 0.1030303 6.546155 0.010120337 30.098303
7 37.84 1.5 0.1272727 6.494260 0.021858264 34.672909
8 29.86 2.5 0.1757576 5.342608 0.297318395 39.247515
9 45.05 3.5 0.2484848 6.597780 0.008650710 43.822121
10 47.65 4.5 0.3454545 6.610265 0.005863429 48.396727
.resid .stdresid
1 2.4147273 0.4820270
2 -8.0098788 -1.4922110
3 -5.3744848 -0.9560542
4 6.9309091 1.1981856
5 7.3163030 1.2476017
6 2.4616970 0.4197772
7 3.1670909 0.5475130
8 -9.3875152 -1.6699226
9 1.2278788 0.2287493
10 -0.7467273 -0.1490614
In this case, no observations would be considered overly influential.
The performance package provides a similar set of visual diagnostics:
- the top left plot is a “Posterior predictive check” plot. This plot features the density of the observed response (thick green line) along with a collection of random realisations predicted from the model. The point of this diagnostic is to see whether the predictions are broadly consistent with the input data. If the densities of the realisations and observed data are not well aligned, it might imply that the model is not a good fit.
- the top right and middle left plots are residual and standardised residual plots. Ideally, these should show no patterns in the residuals.
- the middle right plot is an influence plot that plots residuals against leverage with contours to help identify large Cook’s D values. Points should not fall outside the 0.8 contours (if present).
- the bottom left plot is a form of Q-Q plot that plots the distribution (density) of the residuals from observed data (y-axis) against those based on an exact normal (or other nominated) distribution. Ideally, the points should all lie along the green horizontal line.
OK: No outliers detected.
- Based on the following method and threshold: cook (0.8).
- For variable: (Whole model)
Conclusions:
- there are no alarming signs from any of the diagnostics
Although an exploration of model residuals can provide important insights into the goodness of fit and conformity of a model to the underlying assumptions, diagnosing issues is a bit of a fine art. This difficulty is exacerbated when the residuals are calculated from some models (e.g. logistic regression models) where it is difficult to separate out nefarious patterns from the specific patterns expected of that particular model.
The DHARMa package generates standardised residuals via simulation and uses these as the basis of a range of tools to diagnose common modelling issues including outliers, heterogeneity, over-dispersion and autocorrelation.
New observations simulated from the fitted model are used to calculate a cumulative distribution function (CDF) that describes the probability profile of each observation. Thereafter, the residual of an observation is calculated as the value of the CDF that corresponds to the actual observed value:
- a value of 0 indicates that the observed value was less than all simulated values
- a value of 1 indicates that the observed value was greater than all simulated values
- a value of 0.5 indicates that half of the simulated values were greater than the observed value. For a correctly specified model, these quantile residuals should be uniformly distributed.
This approach ensures that all residuals have the same interpretation irrespective of the model and distribution selected.
Exploring DHARMa residuals begins with running the simulateResiduals() function. If this is done with plot=TRUE, a pair of diagnostic plots with a range of diagnostic tests will be provided as side effects.
- the left hand plot is a Q-Q plot (ideally all points should be close to the red line).
- the KS (Kolmogorov-Smirnov) test tests whether the residuals (in this case simulated) are likely to have been drawn from the nominated distribution (in this case Gaussian).
- the Dispersion test tests whether the standard deviation of the data is equal to that of the simulated data
- the Outlier test tests for the prevalence of outliers (when observed values are outside the simulated range)
- the right hand plot is a residual plot. Ideally, there should be no patterns in the residuals. To help identify any patterns, quantile trends are overlayed. Ideally, there should be a flat black line at each of the quantiles of 0.25, 0.5 and 0.75.
Alternatively, once the simulated residuals have been generated, individual tests and plots can be applied to the simulated residuals. This includes:
- a Q-Q plot (same as above)
- a plot of simulated dispersion values (histogram) as well as the standardised observed dispersion (red line). Dispersion is the ratio of variance to mean and on a non-standardized scale should ideally be 1. Higher values indicate overdispersion. More about this issue will be discussed later on in the tutorial. For the plot, the red line should fall well within the range of the underlying histogram. If the red line is off to the right, it suggests that the model might be over-dispersed. If the red line is to the left, it suggests under-dispersion.
- histogram of outliers. The red line (if present) indicates the presence of outliers.
The figures are accompanied by formal statistical tests. These tests should be reviewed, however, be aware that they are typically more sensitive than the assumptions they seek to explore. So while it is important to review them in order to identify potential issues, significant tests need not be considered indicators of complete model failure.
$uniformity
Exact one-sample Kolmogorov-Smirnov test
data: simulationOutput$scaledResiduals
D = 0.152, p-value = 0.949
alternative hypothesis: two-sided
$dispersion
DHARMa nonparametric dispersion test via sd of residuals fitted vs.
simulated
data: simulationOutput
dispersion = 0.90609, p-value = 0.952
alternative hypothesis: two.sided
$outliers
DHARMa outlier test based on exact binomial test with approximate
expectations
data: simulationOutput
outliers at both margin(s) = 0, observations = 10, p-value = 1
alternative hypothesis: true probability of success is not equal to 0.007968127
95 percent confidence interval:
0.0000000 0.3084971
sample estimates:
frequency of outliers (expected: 0.00796812749003984 )
0
- uniformity
Exact one-sample Kolmogorov-Smirnov test
data: simulationOutput$scaledResiduals
D = 0.152, p-value = 0.949
alternative hypothesis: two-sided
- dispersion
DHARMa nonparametric dispersion test via sd of residuals fitted vs.
simulated
data: simulationOutput
dispersion = 0.90609, p-value = 0.952
alternative hypothesis: two.sided
- outliers
DHARMa outlier test based on exact binomial test with approximate
expectations
data: dat.resid
outliers at both margin(s) = 0, observations = 10, p-value = 1
alternative hypothesis: true probability of success is not equal to 0.007968127
95 percent confidence interval:
0.0000000 0.3084971
sample estimates:
frequency of outliers (expected: 0.00796812749003984 )
0
- quantiles
Test for location of quantiles via qgam
data: simulationOutput
p-value = 0.8891
alternative hypothesis: both
Since the diagnostics do not reveal any issues, we are free to explore the model summaries.
Call:
lm(formula = y ~ scale(x, scale = FALSE), data = dat)
Residuals:
Min 1Q Median 3Q Max
-9.387 -4.218 1.821 2.991 7.316
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.8110 1.9581 14.20 5.88e-07 ***
scale(x, scale = FALSE) 4.5746 0.6817 6.71 0.000151 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.192 on 8 degrees of freedom
Multiple R-squared: 0.8491, Adjusted R-squared: 0.8303
F-statistic: 45.03 on 1 and 8 DF, p-value: 0.0001511
2.5 % 97.5 %
(Intercept) 23.295697 32.326303
scale(x, scale = FALSE) 3.002579 6.146633
Interpretation:
- when \(x=0\) (i.e. the average \(x\) value, since the predictor is centered), \(y\) is expected to be 27.811 (the y-intercept).
- for every one unit change in \(x\), the expected value of \(y\) increases by 4.575 (the slope)
- the slope could be as low as 3.003 or as high as 6.147 (95% confidence interval of the slope)
- as the above interval does not include the value of 0, the linear relationship (slope) can be considered significant
- we would reject the null hypothesis of no relationship (p-value for slope is less than 0.05)
- 84.914% of the variance in \(y\) is explained by its linear relationship with \(x\) (R squared value)
\[ y_i = \beta_0+\beta_1\times (x_i - \bar{x})/\sigma_{x} +\varepsilon_i \hspace{1cm}\varepsilon_i\sim{}\mathcal{N}(0, \sigma^2) \]
This whole process can be more conveniently wrapped into a simple function within R called lm().
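A sketch of fitting the standardised-predictor model (the object name dat1c.lm is an assumption; the formula matches the lm(formula = y ~ scale(x), data = dat) call echoed below):
## centre and scale (standardise) the predictor within the model formula
dat1c.lm <- lm(y ~ scale(x), data = dat)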
Having fit the model, it is vital that we review a range of regression diagnostics so as to assess the goodness of fit of the model. This step helps establish whether the model assumptions are reasonably satisfied.
When supplied with a fitted linear model, the autoplot() function generates a set of standard diagnostic regression plots:
- the top left plot is a residual plot: Ideally, there should not be any patterns in the residual plot. Of particular concern would be if there was a wedge (funnel) shape (indicative of unequal variances) or some curved shape (indicative of having fit a linear trend through non-linear data).
- the top right plot is a Q-Q normal plot: This figure plots the quantiles of the residuals against the quantiles that would be expected if the data were drawn from a normal distribution. Ideally, the points should be close to the dashed line. Deviations from this line at the tails signify potential non-normality.
- the middle left plot is a standardised residual plot. This has a similar interpretation to the residual plot.
- the middle right plot displays the Cook’s distance values associated with each observation. Cook’s distance is a measure of the influence of each point on the estimated parameter values. This is a combination of residuals (a measure of outlierness along the y-axis) and leverage (a measure of outlierness along the x-axis). Essentially, it estimates variations in regression coefficients after removing each observation (one by one). Ideally, all the values should be under 0.8. Numbers above the bars indicate the observation number - this can be useful to identify which observations are influential.
- the bottom left plot displays the trend between residuals and leverage. In the event of large Cook’s distance, this can help identify whether the issues are due to residuals or leverage.
- the bottom right plot displays the trend between Cook’s distance and leverage.
Statistical models assume that all observations have the same influence (weight) on the estimates. Outliers (unusually small or large observations) are problematic as they tend to have higher influence on estimates. They tend to “pull” the estimates (that is bias the estimates) towards themselves.
There are numerous measures of observation influence.
- residuals - the deviations (along the y-axis) between observed and expected values. Observations that have relatively large residuals (an order of magnitude greater than the others) can be viewed as potentially overly influential.
- leverage - the deviations along the x-axis. That is, how unusual is each predictor observation. For experimental designs, this is usually not an issue since the observed values of predictors are under the control of the experiment (we set what values of the predictor we wish to explore a response to). However, for natural experiments (e.g. surveys), the experimenter has less control over the range of predictor values observed.
- Cook’s distance - this is a metric of influence that combines together residuals and leverage. The values can range from 0 (typical influence) to 1 (highly influential). Values greater than 0.8 should be viewed as overly influential and scrutinised further.
Cook’s distance (cook.d) and leverage (hat) are displayed in tabular form.
Influence measures of
lm(formula = y ~ scale(x), data = dat) :
dfb.1_ dfb.sc.. dffit cov.r cook.d hat inf
1 0.1789 -0.2802 0.332 1.881 0.06131 0.345 *
2 -0.5994 0.7304 -0.945 0.905 0.36812 0.248
3 -0.3310 0.2881 -0.439 1.243 0.09745 0.176
4 0.4188 -0.2187 0.473 1.008 0.10468 0.127
5 0.4342 -0.0756 0.441 0.945 0.08939 0.103
6 0.1326 0.0231 0.135 1.393 0.01012 0.103
7 0.1767 0.0923 0.199 1.387 0.02186 0.127
8 -0.6741 -0.5868 -0.894 0.672 0.29732 0.176
9 0.0783 0.0954 0.123 1.715 0.00865 0.248
10 -0.0546 -0.0855 -0.101 1.984 0.00586 0.345 *
y scale(x) .hat .sigma .cooksd .fitted .resid
1 9.64 -1.4863011 0.3454545 6.522628 0.061314588 7.225273 2.4147273
2 3.79 -1.1560120 0.2484848 5.623285 0.368122762 11.799879 -8.0098788
3 11.00 -0.8257228 0.1757576 6.229844 0.097452765 16.374485 -5.3744848
4 27.88 -0.4954337 0.1272727 5.996167 0.104682723 20.949091 6.9309091
5 32.84 -0.1651446 0.1030303 5.940710 0.089394162 25.523697 7.3163030
6 32.56 0.1651446 0.1030303 6.546155 0.010120337 30.098303 2.4616970
7 37.84 0.4954337 0.1272727 6.494260 0.021858264 34.672909 3.1670909
8 29.86 0.8257228 0.1757576 5.342608 0.297318395 39.247515 -9.3875152
9 45.05 1.1560120 0.2484848 6.597780 0.008650710 43.822121 1.2278788
10 47.65 1.4863011 0.3454545 6.610265 0.005863429 48.396727 -0.7467273
.stdresid
1 0.4820270
2 -1.4922110
3 -0.9560542
4 1.1981856
5 1.2476017
6 0.4197772
7 0.5475130
8 -1.6699226
9 0.2287493
10 -0.1490614
In this case, no observations would be considered overly influential.
The performance package provides a similar set of visual diagnostics:
- the top left plot is a “Posterior predictive check” plot. This plot features the density of the observed response (thick green line) along with a collection of random realisations predicted from the model. The point of this diagnostic is to see whether the predictions are broadly consistent with the input data. If the densities of the realisations and observed data are not well aligned, it might imply that the model is not a good fit.
- the top right and middle left plots are residual and standardised residual plots. Ideally, these should show no patterns in the residuals.
- the middle right plot is an influence plot that plots residuals against leverage with contours to help identify large Cook’s D values. Points should not fall outside the 0.8 contours (if present).
- the bottom left plot is a form of Q-Q plot that plots the distribution (density) of the residuals from observed data (y-axis) against those based on an exact normal (or other nominated) distribution. Ideally, the points should all lie along the green horizontal line.
OK: No outliers detected.
- Based on the following method and threshold: cook (0.8).
- For variable: (Whole model)
Conclusions:
- there are no alarming signs from any of the diagnostics
Although an exploration of model residuals can provide important insights into the goodness of fit and conformity of a model to the underlying assumptions, diagnosing issues is a bit of a fine art. This difficulty is exacerbated when the residuals are calculated from some models (e.g. logistic regression models) where it is difficult to separate out nefarious patterns from the specific patterns expected of that particular model.
The DHARMa package generates standardised residuals via simulation and uses these as the basis of a range of tools to diagnose common modelling issues including outliers, heterogeneity, over-dispersion and autocorrelation.
New observations simulated from the fitted model are used to calculate a cumulative distribution function (CDF) that describes the probability profile of each observation. Thereafter, the residual of an observation is calculated as the value of the CDF that corresponds to the actual observed value:
- a value of 0 indicates that the observed value was less than all simulated values
- a value of 1 indicates that the observed value was greater than all simulated values
- a value of 0.5 indicates that half of the simulated values were greater than the observed value. For a correctly specified model, these quantile residuals should be uniformly distributed.
This approach ensures that all residuals have the same interpretation irrespective of the model and distribution selected.
Exploring DHARMa residuals begins with running the simulateResiduals() function. If this is done with plot=TRUE, a pair of diagnostic plots with a range of diagnostic tests will be provided as side effects.
- the left hand plot is a Q-Q plot (ideally all points should be close to the red line).
- the KS (Kolmogorov-Smirnov) test tests whether the residuals (in this case simulated) are likely to have been drawn from the nominated distribution (in this case Gaussian).
- the Dispersion test tests whether the standard deviation of the data is equal to that of the simulated data
- the Outlier test tests for the prevalence of outliers (when observed values are outside the simulated range)
- the right hand plot is a residual plot. Ideally, there should be no patterns in the residuals. To help identify any patterns, quantile trends are overlayed. Ideally, there should be a flat black line at each of the quantiles of 0.25, 0.5 and 0.75.
Alternatively, once the simulated residuals have been generated, individual tests and plots can be applied to the simulated residuals. This includes:
- a Q-Q plot (same as above)
- a plot of simulated dispersion values (histogram) as well as the standardised observed dispersion (red line). Dispersion is the ratio of variance to mean and on a non-standardized scale should ideally be 1. Higher values indicate overdispersion. More about this issue will be discussed later on in the tutorial. For the plot, the red line should fall well within the range of the underlying histogram. If the red line is off to the right, it suggests that the model might be over-dispersed. If the red line is to the left, it suggests under-dispersion.
- histogram of outliers. The red line (if present) indicates the presence of outliers.
The figures are accompanied by formal statistical tests. These tests should be reviewed, however, be aware that they are typically more sensitive than the assumptions they seek to explore. So while it is important to review them in order to identify potential issues, significant tests need not be considered indicators of complete model failure.
$uniformity
Exact one-sample Kolmogorov-Smirnov test
data: simulationOutput$scaledResiduals
D = 0.152, p-value = 0.949
alternative hypothesis: two-sided
$dispersion
DHARMa nonparametric dispersion test via sd of residuals fitted vs.
simulated
data: simulationOutput
dispersion = 0.90609, p-value = 0.952
alternative hypothesis: two.sided
$outliers
DHARMa outlier test based on exact binomial test with approximate
expectations
data: simulationOutput
outliers at both margin(s) = 0, observations = 10, p-value = 1
alternative hypothesis: true probability of success is not equal to 0.007968127
95 percent confidence interval:
0.0000000 0.3084971
sample estimates:
frequency of outliers (expected: 0.00796812749003984 )
0
- uniformity
Exact one-sample Kolmogorov-Smirnov test
data: simulationOutput$scaledResiduals
D = 0.152, p-value = 0.949
alternative hypothesis: two-sided
- dispersion
DHARMa nonparametric dispersion test via sd of residuals fitted vs.
simulated
data: simulationOutput
dispersion = 0.90609, p-value = 0.952
alternative hypothesis: two.sided
- outliers
DHARMa outlier test based on exact binomial test with approximate
expectations
data: dat.resid
outliers at both margin(s) = 0, observations = 10, p-value = 1
alternative hypothesis: true probability of success is not equal to 0.007968127
95 percent confidence interval:
0.0000000 0.3084971
sample estimates:
frequency of outliers (expected: 0.00796812749003984 )
0
- quantiles
Test for location of quantiles via qgam
data: simulationOutput
p-value = 0.8891
alternative hypothesis: both
Since the diagnostics do not reveal any issues, we are free to explore the model summaries.
Call:
lm(formula = y ~ scale(x), data = dat)
Residuals:
Min 1Q Median 3Q Max
-9.387 -4.218 1.821 2.991 7.316
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.811 1.958 14.20 5.88e-07 ***
scale(x) 13.850 2.064 6.71 0.000151 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.192 on 8 degrees of freedom
Multiple R-squared: 0.8491, Adjusted R-squared: 0.8303
F-statistic: 45.03 on 1 and 8 DF, p-value: 0.0001511
2.5 % 97.5 %
(Intercept) 23.29570 32.32630
scale(x) 9.09076 18.60986
Interpretation:
- when \(x=0\) (i.e. the average \(x\) value, since the predictor has been standardised), \(y\) is expected to be 27.811 (the y-intercept).
- for every one unit change in the standardised \(x\) (i.e. one standard deviation of \(x\), a span of approximately 34% of the range of \(x\)), the expected value of \(y\) increases by 13.85 (the slope)
- the slope could be as low as 9.091 or as high as 18.61 (95% confidence interval of the slope)
- as the above interval does not include the value of 0, the linear relationship (slope) can be considered significant
- we would reject the null hypothesis of no relationship (p-value for slope is less than 0.05)
- 84.914% of the variance in \(y\) is explained by its linear relationship with \(x\) (R squared value)
We can also compare the performance of each of these alternative models.
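A sketch using the compare_performance() function from the performance package (the model object names are taken from the table below):
compare_performance(dat.lm, dat1b.lm, dat1c.lm, rank = TRUE)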
# Comparison of Model Performance Indices
Name | Model | R2 | R2 (adj.) | RMSE | Sigma | AIC weights | AICc weights | BIC weights | Performance-Score
-------------------------------------------------------------------------------------------------------------------
dat1c.lm | lm | 0.849 | 0.830 | 5.538 | 6.192 | 0.333 | 0.333 | 0.333 | 100.00%
dat.lm | lm | 0.849 | 0.830 | 5.538 | 6.192 | 0.333 | 0.333 | 0.333 | 14.29%
dat1b.lm | lm | 0.849 | 0.830 | 5.538 | 6.192 | 0.333 | 0.333 | 0.333 | 0.00%
For this demonstration, we will fit a range of models that vary only in their contrast parameterisations so that we can appreciate the ways in which they are similar and different. We will start with treatment contrasts, as these are the default contrasts applied to an unordered categorical predictor.
Treatment contrasts are contrasts in which each of the treatment group means is compared to the mean of a `control' or reference group. This approach to over-parameterisation is computationally identical to fitting \(p-1\) dummy variables via multiple linear regression. However, due to the interpretation of the parameters (groups compared to a control) and the fact that treatment effects are not orthogonal to the intercept, the interpretation of treatment contrasts (and thus dummy regression) is really only meaningful for situations where there is clearly a single group (control) to which the other groups can be compared. For treatment contrasts, the intercept (\(\beta_0\)) estimates the mean of the first (control) group, and the remaining parameters (\(\beta_1, \ldots, \beta_{p-1}\)) estimate the differences between each of the other groups and this reference group.
Parameter | Estimates | Null hypothesis |
---|---|---|
\(\beta_0\) | mean of first (control) group (\(\mu_1\)) | H\(_0\): \(\beta_0=\mu_1=0\) |
\(\beta_1\) | mean of group 2 minus mean of first (control) group (\(\mu_2 - \mu_1\)) | H\(_0\): \(\beta_1=\mu_2 - \mu_1=0\) |
\(\beta_2\) | mean of group 3 minus mean of first (control) group (\(\mu_3 - \mu_1\)) | H\(_0\): \(\beta_2=\mu_3 - \mu_1=0\) |
… |
The dummy codes can be viewed using the model.matrix() function. I am also specifically defining "contr.treatment" to indicate the contrasts to use; however, as treatment contrasts are the default, this is not strictly necessary.
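A sketch of the call (contrasts.arg is optional here since treatment contrasts are the default for unordered factors):
model.matrix(~x, data = dat2, contrasts.arg = list(x = "contr.treatment"))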
(Intercept) xmedium xhigh
1 1 0 0
2 1 0 0
3 1 0 0
4 1 0 0
5 1 1 0
6 1 1 0
7 1 1 0
8 1 1 0
9 1 0 1
10 1 0 1
11 1 0 1
12 1 0 1
attr(,"assign")
[1] 0 1 1
attr(,"contrasts")
attr(,"contrasts")$x
[1] "contr.treatment"
To fit a model with treatment contrasts
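For example (the object name dat2a.lm is an assumption, chosen to be consistent with the model comparison table later in this section):
dat2a.lm <- lm(y ~ x, data = dat2)
dat2a.lm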
Call:
lm(formula = y ~ x, data = dat2)
Coefficients:
(Intercept) xmedium xhigh
20.839 -4.798 -10.387
Interpretation
- the mean of the first group is 20.8385636
- the difference between the means of the first and second group (i.e. the effect of the second group over the first) is -4.7983559
- the difference between the means of the first and third group (i.e. the effect of the third group over the first) is -10.3871828
Means parameterisation literally estimates the group means. Whilst this might initially seem the most obvious approach, since it estimates the mean and standard error of each treatment group, it does not include any effects (differences) and has therefore traditionally not been seen as all that useful. In particular, the resulting hypothesis tests (that each group mean equals zero) are not particularly useful.
Parameter | Estimates | Null hypothesis |
---|---|---|
\(\beta_0\) | mean of first group (\(\mu_1\)) | H\(_0\): \(\mu=\mu_1=0\) |
\(\beta_1\) | mean of second group (\(\mu_2\)) | H\(_0\): \(\mu_2=0\) |
\(\beta_2\) | mean of third group (\(\mu_3\)) | H\(_0\): \(\mu_3=0\) |
… |
Note that for means parameterisation, it is simply a matter of dropping the intercept (-1).
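A sketch of the corresponding call:
model.matrix(~-1 + x, data = dat2)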
xcontrol xmedium xhigh
1 1 0 0
2 1 0 0
3 1 0 0
4 1 0 0
5 0 1 0
6 0 1 0
7 0 1 0
8 0 1 0
9 0 0 1
10 0 0 1
11 0 0 1
12 0 0 1
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$x
[1] "contr.treatment"
To fit a model with a means parameterisation
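For example (the object name dat2d.lm is purely illustrative and does not appear elsewhere in this tutorial):
dat2d.lm <- lm(y ~ -1 + x, data = dat2)
dat2d.lm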
Call:
lm(formula = y ~ -1 + x, data = dat2)
Coefficients:
xcontrol xmedium xhigh
20.84 16.04 10.45
Interpretation
- the mean of the first group is 20.8385636
- the mean of the second group is 16.0402077
- the mean of the third group is 10.4513808
This technique constrains the sum of the unconstrained treatment effects (\(\beta\)) to zero. In this model parameterization, the intercept estimates the average treatment effect and the remaining (\(\beta_p\)) estimate the differences between each of the treatment means and the average treatment mean. Unlike treatment contrasts, these contrasts are genuinely orthogonal.
Parameter | Estimates | Null hypothesis |
---|---|---|
\(\beta_0\) | mean of all groups (\(\mu_{1:p}/p\)) | H\(_0\): \(\beta_0=\mu_{1:p}/p=0\) |
\(\beta_1\) | mean of group 1 minus mean of group means (\(\mu_1-(\mu_{1:p}/p)\)) | H\(_0\): \(\beta_1=\mu_1-(\mu_{1:p}/p)=0\) |
\(\beta_2\) | mean of group 2 minus mean of group means (\(\mu_2-(\mu_{1:p}/p)\)) | H\(_0\): \(\beta_2=\mu_2-(\mu_{1:p}/p)=0\) |
… |
- You might notice that the last group is completely omitted.
(Intercept) x1 x2
1 1 1 0
2 1 1 0
3 1 1 0
4 1 1 0
5 1 0 1
6 1 0 1
7 1 0 1
8 1 0 1
9 1 -1 -1
10 1 -1 -1
11 1 -1 -1
12 1 -1 -1
attr(,"assign")
[1] 0 1 1
attr(,"contrasts")
attr(,"contrasts")$x
[1] "contr.sum"
To fit a model with sum to zero contrasts
Call:
lm(formula = y ~ x, data = dat2, contrasts = list(x = "contr.sum"))
Coefficients:
(Intercept) x1 x2
15.7767 5.0618 0.2635
Interpretation
- the mean of the groups is 15.7767174
- the difference between the mean of the first group and the grand mean (mean of all groups), i.e. the effect of the first group relative to the average, is 5.0618462
- the difference between the mean of the second group and the grand mean (mean of all groups), i.e. the effect of the second group relative to the average, is 0.2634903
In Helmert contrasts, the intercept estimates the average treatment effect (the mean of the group means) and each of the remaining \(p-1\) parameters estimates the difference between the mean of the first \(k+1\) groups and the mean of the first \(k\) groups. Helmert contrasts are really only useful if the group levels are ordered.
Parameter | Estimates | Null hypothesis |
---|---|---|
\(\beta_0\) | mean of all groups (\(\mu_{1:p}/p\)) | H\(_0\): \(\mu=\mu_{1:p}/p=0\) |
\(\beta_1\) | mean of groups 1 and 2 minus mean of group 1 (\(\mu_{1,2}-(\mu_{1})\)) | H\(_0\): \(\beta_1=\mu_{1,2}-(\mu_{1})=0\) |
\(\beta_2\) | mean of groups 1, 2 and 3 minus mean of groups 1 and 2 (\(\mu_{1,2,3}-(\mu_{1,2})\)) | H\(_0\): \(\beta_2=\mu_{1,2,3}-(\mu_{1,2})=0\) |
… |
- Note that, unlike treatment contrasts, no single group acts as the reference; each column contrasts a group against the mean of the preceding groups.
(Intercept) x1 x2
1 1 -1 -1
2 1 -1 -1
3 1 -1 -1
4 1 -1 -1
5 1 1 -1
6 1 1 -1
7 1 1 -1
8 1 1 -1
9 1 0 2
10 1 0 2
11 1 0 2
12 1 0 2
attr(,"assign")
[1] 0 1 1
attr(,"contrasts")
attr(,"contrasts")$x
[1] "contr.helmert"
To fit a model with Helmert contrasts
Call:
lm(formula = y ~ x, data = dat2, contrasts = list(x = "contr.helmert"))
Coefficients:
(Intercept) x1 x2
15.777 -2.399 -2.663
Interpretation
- the mean of the groups is 15.7767174
- the difference between the mean of the first and second groups and the mean of the first group is -2.3991779
- the difference between the mean of all three groups and the mean of the first and second groups is -2.663
Polynomial contrasts generate orthogonal polynomial trends (such as linear, quadratic and cubic). This is equivalent to fitting a multiple linear regression (or polynomial regression) with orthogonal parameters. \(p-1\) order polynomials are defined. Polynomial contrasts are obviously only sensible for factors whose levels can be conceptualised as representing an ordered continuum. Unless otherwise specified, the intervals between levels are assumed to be equally spaced. The first, second and third order polynomials respectively represent a straight line (linear), curves with a single major change in direction (quadratic) and curves with two major changes in direction (cubic).
Parameter | Estimates | Null hypothesis |
---|---|---|
\(\beta_0\) | mean of all groups (\(\mu_{1:p}/p\)) | H\(_0\): \(\beta_0=0\) |
\(\beta_1\) | the first order polynomial (linear) coefficient | H\(_0\): \(\beta_1=0\) |
\(\beta_2\) | the second order polynomial (quadratic) coefficient | H\(_0\): \(\beta_2=0\) |
… |
- Note that the columns of the model matrix represent orthogonal linear and quadratic trends rather than individual groups.
(Intercept) x.L x.Q
1 1 -7.071068e-01 0.4082483
2 1 -7.071068e-01 0.4082483
3 1 -7.071068e-01 0.4082483
4 1 -7.071068e-01 0.4082483
5 1 -7.850462e-17 -0.8164966
6 1 -7.850462e-17 -0.8164966
7 1 -7.850462e-17 -0.8164966
8 1 -7.850462e-17 -0.8164966
9 1 7.071068e-01 0.4082483
10 1 7.071068e-01 0.4082483
11 1 7.071068e-01 0.4082483
12 1 7.071068e-01 0.4082483
attr(,"assign")
[1] 0 1 1
attr(,"contrasts")
attr(,"contrasts")$x
[1] "contr.poly"
To fit a model with polynomial contrasts
Call:
lm(formula = y ~ x, data = dat2, contrasts = list(x = "contr.poly"))
Coefficients:
(Intercept) x.L x.Q
15.7767 -7.3448 -0.3227
Interpretation
- the mean of the groups is 15.7767174
- the linear trend coefficient is -7.3448474; \(y\) declines across the ordered levels of the group (\(x\)), that is, there is a negative relationship.
- the quadratic curvature coefficient is -0.3227084. Since this value is negative, it suggests that the gradient of the trend (the slope) declines with \(x\).
In addition to the `prefabricated’ sets of comparisons illustrated above, it is possible to define other contrast combinations that are specifically suited to a particular experimental design and set of research questions. Contrasts are defined by constructing a contrast matrix according to the following rules:
- groups to be included and excluded in a specific contrasts (comparison) are represented by non-zero and zero coefficients respectively
- groups to be opposed (contrasted) to one another should have coefficients with opposing signs
- the full set of contrasts must comprise exactly \(p-1\) contrasts, where \(p\) is the number of groups; however, it is usually sufficient to define fewer than \(p-1\) contrasts of interest and let R generate the remaining contrasts
- within a given contrast, the positive coefficients should sum to 1 and the negative coefficients should sum to -1 (so that the coefficients sum to zero), to ensure that the resulting estimates can be sensibly interpreted as differences between group means
- all the contrasts must be orthogonal (independent of one another)
Let's say we were interested in comparing the mean of group 1 with the average of groups 2 and 3. We can specifically define this one contrast and let R define the other orthogonal contrast(s) (which might have no interpretation at all).
# define potential contrast matrix for comparing group G1 with the average of groups G2 and G3
library(gmodels)
contrasts(dat2$x) <- make.contrasts(rbind(c(1, -0.5, -0.5)))
contrasts(dat2$x)
C1 C2
control 0.6666667 5.551115e-17
medium -0.3333333 -7.071068e-01
high -0.3333333 7.071068e-01
(Intercept) xC1 xC2
1 1 0.6666667 5.551115e-17
2 1 0.6666667 5.551115e-17
3 1 0.6666667 5.551115e-17
4 1 0.6666667 5.551115e-17
5 1 -0.3333333 -7.071068e-01
6 1 -0.3333333 -7.071068e-01
7 1 -0.3333333 -7.071068e-01
8 1 -0.3333333 -7.071068e-01
9 1 -0.3333333 7.071068e-01
10 1 -0.3333333 7.071068e-01
11 1 -0.3333333 7.071068e-01
12 1 -0.3333333 7.071068e-01
attr(,"assign")
[1] 0 1 1
attr(,"contrasts")
attr(,"contrasts")$x
C1 C2
control 0.6666667 5.551115e-17
medium -0.3333333 -7.071068e-01
high -0.3333333 7.071068e-01
To fit a model with user defined contrasts
dat2f.lm <- lm(y ~ x,
data = dat2,
contrasts = list(x = make.contrasts(rbind(c(1, -0.5, -0.5))))
)
dat2f.lm
Call:
lm(formula = y ~ x, data = dat2, contrasts = list(x = make.contrasts(rbind(c(1,
-0.5, -0.5)))))
Coefficients:
(Intercept) xC1 xC2
15.777 7.593 -3.952
Interpretation
- the mean of the groups is 15.7767174
- the difference between the first group and the average of the second and third groups is 7.5927693
- since we did not define a second contrast, R has just added one that satisfies orthogonality - this one obviously has no interpretation.
Whilst the type of contrasts impacts the interpretation of model estimates, it rarely has an impact on the goodness of fit of the model (unless the sample sizes within groups vary substantially).
Following indices with missing values are not used for ranking: AIC_wt,
AICc_wt, BIC_wt
# Comparison of Model Performance Indices
Name | Model | R2 | R2 (adj.) | RMSE | Sigma | AIC weights | AICc weights | BIC weights | Performance-Score
-------------------------------------------------------------------------------------------------------------------
dat2b.lm | lm | 0.955 | 0.940 | 3.535 | 4.082 | 0.333 | 0.333 | 0.333 | 100.00%
dat2c.lm | lm | 0.590 | 0.499 | 3.535 | 4.082 | 0.333 | 0.333 | 0.333 | 37.50%
dat2a.lm | lm | 0.590 | 0.499 | 3.535 | 4.082 | 0.333 | 0.333 | 0.333 | 0.00%
Exploratory data analysis suggested that a simple linear model of the form
\[ y_i = \beta_0+\beta_1\times x_i+\varepsilon_i \hspace{1cm}\varepsilon_i\sim{}\mathcal{N}(0, \sigma^2) \]
might not be appropriate for the motivating example 3 data. To see how this manifests in the fitted model and diagnostics, let's fit the model via OLS anyway.
\[ y_i = \beta_0+\beta_1\times x_i+\varepsilon_i \hspace{1cm}\varepsilon_i\sim{}\mathcal{N}(0, \sigma^2) \]
And now to explore the regression diagnostics.
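A sketch of fitting and diagnosing this model (the object name dat3.lm is an assumption):
dat3.lm <- lm(y ~ x, data = dat3)
autoplot(dat3.lm, which = 1:6)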
Conclusions:
- there is a clear non-linear pattern remaining in the residual plot indicative of a persistent non-linear trend
- there is some evidence that the tails of the Q-Q plot are starting to diverge from the line. This usually implies that the nominated distribution (in this case Gaussian) is not capturing the underlying distribution appropriately.
- the 10th observation has a relatively large Cook's D value (approximately 0.5, the largest of all the observations), suggesting that it may be overly influential
Influence measures of
lm(formula = y ~ x, data = dat3) :
dfb.1_ dfb.x dffit cov.r cook.d hat inf
1 0.69973 -0.59205 0.70238 1.553 2.49e-01 0.345
2 0.36136 -0.28537 0.36916 1.550 7.35e-02 0.248
3 -0.08015 0.05651 -0.08608 1.569 4.21e-03 0.176
4 -0.00629 0.00355 -0.00767 1.496 3.36e-05 0.127
5 -0.00884 0.00249 -0.01455 1.455 1.21e-04 0.103
6 -0.22861 -0.12895 -0.75191 0.502 1.90e-01 0.103
7 0.00000 -0.19332 -0.41761 1.092 8.51e-02 0.127
8 0.07626 -0.21509 -0.32761 1.379 5.72e-02 0.176
9 -0.31623 0.62431 0.80763 1.058 2.91e-01 0.248
10 -0.54136 0.91610 1.08681 1.146 5.11e-01 0.345
y x .hat .sigma .cooksd .fitted .resid .stdresid
1 1 1 0.3454545 5.462498 2.486948e-01 -3.2727273 4.2727273 0.97078434
2 3 2 0.2484848 5.651781 7.354214e-02 -0.1454545 3.1454545 0.66696345
3 2 3 0.1757576 5.801405 4.213192e-03 2.9818182 -0.9818182 -0.19878842
4 6 4 0.1272727 5.815619 3.359699e-05 6.1090909 -0.1090909 -0.02146529
5 9 5 0.1030303 5.815022 1.208692e-04 9.2363636 -0.2363636 -0.04587534
6 3 6 0.1030303 4.456374 1.896896e-01 12.3636364 -9.3636364 -1.81736913
7 10 7 0.1272727 5.374765 8.511611e-02 15.4909091 -5.4909091 -1.08041965
8 15 8 0.1757576 5.617331 5.721763e-02 18.6181818 -3.6181818 -0.73257214
9 28 9 0.2484848 5.136837 2.907776e-01 21.7454545 6.2545455 1.32621634
10 31 10 0.3454545 5.062545 5.114353e-01 24.8727273 6.1272727 1.39214605
Conclusions:
- the 10th observation has a relatively large Cook's D value (approximately 0.5, the largest of all the observations), suggesting that it may be overly influential
OK: No outliers detected.
- Based on the following method and threshold: cook (0.8).
- For variable: (Whole model)
Conclusions:
- the observed density and simulated densities in the Posterior Predictive Check plot are inconsistent
- the peaks are not aligned
- there is clear evidence of a non-linear trend in the residuals plot
- the 10th observation has high Cook’s D value
- dots do not fall along the Q-Q plot line (particularly the one in the tails)
Since the model diagnostics indicated violations of the assumptions, we will not explore the model summaries.
Exploratory data analysis suggested that a simple linear model of the form
\[ y_i = \beta_0+\beta_1\times x_i+\varepsilon_i \hspace{1cm}\varepsilon_i\sim{}\mathcal{N}(0, \sigma^2) \]
might not be appropriate for the motivating example 4 data. To see how this manifests in the fitted model and diagnostics, let's fit the model via OLS anyway.
\[ y_i = \beta_0+\beta_1\times x_i+\varepsilon_i \hspace{1cm}\varepsilon_i\sim{}\mathcal{N}(0, \sigma^2) \]
And now to explore the regression diagnostics.
Conclusions:
- there is a clear non-linear pattern remaining in the residual plot indicative of a persistent non-linear trend
- there is evidence that the tails of the Q-Q plot have diverged from the line. This usually implies that the nominated distribution (in this case Gaussian) is not capturing the underlying distribution appropriately.
- the 10th observation has a large (>0.8) Cook’s D value suggesting that it was overly influential
Influence measures of
lm(formula = y ~ x, data = dat4) :
dfb.1_ dfb.x dffit cov.r cook.d hat inf
1 0.2577 -0.2180 0.2587 1.925 0.037551 0.345 *
2 0.3092 -0.2442 0.3159 1.597 0.054662 0.248
3 0.4650 -0.3278 0.4993 1.163 0.122089 0.176
4 -0.2610 0.1472 -0.3181 1.239 0.052606 0.127
5 -0.1896 0.0535 -0.3119 1.159 0.049580 0.103
6 -0.0419 -0.0236 -0.1378 1.390 0.010602 0.103
7 0.0000 0.0117 0.0253 1.495 0.000365 0.127
8 0.1886 -0.5319 -0.8102 0.764 0.260507 0.176
9 0.0260 -0.0513 -0.0663 1.731 0.002510 0.248
10 -1.1603 1.9635 2.3294 0.327 1.255963 0.345 *
y x .hat .sigma .cooksd .fitted .resid .stdresid
1 0 1 0.3454545 3.787196 0.0375508913 -1.090909 1.0909091 0.37722422
2 3 2 0.2484848 3.741534 0.0546616885 1.218182 1.7818182 0.57501005
3 7 3 0.1757576 3.537286 0.1220892204 3.527273 3.4727273 1.07009938
4 3 4 0.1272727 3.644957 0.0526058342 5.836364 -2.8363636 -0.84938298
5 5 5 0.1030303 3.609271 0.0495801803 8.145455 -3.1454545 -0.92912778
6 9 6 0.1030303 3.776988 0.0106021970 10.454545 -1.4545455 -0.42965447
7 13 7 0.1272727 3.820138 0.0003653183 12.763636 0.2363636 0.07078191
8 10 8 0.1757576 3.184752 0.2605067570 15.072727 -5.0727273 -1.56312946
9 17 9 0.2484848 3.817707 0.0025099755 17.381818 -0.3818182 -0.12321644
10 26 10 0.3454545 2.432094 1.2559625765 19.690909 6.3090909 2.18161342
Conclusions:
- the 10th observation has a large (>0.8) Cook’s D value suggesting that it was overly influential
1 outlier detected: case 10.
- Based on the following method and threshold: cook (0.8).
- For variable: (Whole model).
Conclusions:
- the observed density and simulated densities in the Posterior Predictive Check plot are inconsistent
- the peaks are not aligned
- there is clear evidence of a non-linear trend in the residuals plot
- the 10th observation has high Cook’s D value
- dots do not fall along the Q-Q plot line (particularly those in the tails)
Since the model diagnostics indicated violations of the assumptions, we will not explore the model summaries.
Exploratory data analysis suggested that a simple linear model of the form
\[ y_i = \beta_0+\beta_1\times x_i+\varepsilon_i \hspace{1cm}\varepsilon_i\sim{}\mathcal{N}(0, \sigma^2) \]
might not be appropriate for the motivating example 5 data. To see how this manifests in the fitted model and diagnostics, let’s fit the model via OLS anyway.
\[ y_i = \beta_0+\beta_1\times x_i+\varepsilon_i \hspace{1cm}\varepsilon_i\sim{}\mathcal{N}(0, \sigma^2) \]
And now to explore the regression diagnostics.
Conclusions:
- the regular residual plot is not interpretable for binary data. It will just show two angled clouds of points (one associated with the 0 values, the other associated with the 1’s).
- there is a clear non-linear pattern remaining in the residual plot, indicative of a persistent non-linear trend
- there is evidence that the tails of the Q-Q plot have diverged from the line. This usually implies that the nominated distribution (in this case Gaussian) is not capturing the underlying distribution appropriately.
Influence measures of
lm(formula = y ~ x, data = dat5) :
dfb.1_ dfb.x dffit cov.r cook.d hat inf
1 0.0000 0.0000 1.37e-16 1.995 1.06e-32 0.345 *
2 -0.2401 0.1896 -2.45e-01 1.651 3.35e-02 0.248
3 -0.3631 0.2560 -3.90e-01 1.305 7.88e-02 0.176
4 0.7379 -0.4162 8.99e-01 0.466 2.58e-01 0.127
5 -0.3911 0.1103 -6.43e-01 0.635 1.56e-01 0.103
6 0.1066 0.0602 3.51e-01 1.095 6.10e-02 0.103
7 0.0000 0.1063 2.30e-01 1.353 2.86e-02 0.127
8 -0.0217 0.0611 9.31e-02 1.566 4.93e-03 0.176
9 0.0476 -0.0939 -1.21e-01 1.716 8.38e-03 0.248
10 0.2534 -0.4289 -5.09e-01 1.743 1.38e-01 0.345
y x .hat .sigma .cooksd .fitted .resid .stdresid
1 0 1 0.3454545 0.3651484 1.064866e-32 8.326673e-17 5.551115e-17 0.0000000
2 0 2 0.2484848 0.3604912 3.352163e-02 1.333333e-01 -1.333333e-01 -0.4502943
3 0 3 0.1757576 0.3478626 7.884330e-02 2.666667e-01 -2.666667e-01 -0.8599394
4 1 4 0.1272727 0.2727724 2.578125e-01 4.000000e-01 6.000000e-01 1.8803495
5 0 5 0.1030303 0.2967000 1.561098e-01 5.333333e-01 -5.333333e-01 -1.6486803
6 1 6 0.1030303 0.3400545 6.098038e-02 6.666667e-01 3.333333e-01 1.0304252
7 1 7 0.1272727 0.3560698 2.864583e-02 8.000000e-01 2.000000e-01 0.6267832
8 1 8 0.1757576 0.3640921 4.927706e-03 9.333333e-01 6.666667e-02 0.2149849
9 1 9 0.2484848 0.3639897 8.380407e-03 1.066667e+00 -6.666667e-02 -0.2251472
10 1 10 0.3454545 0.3529917 1.382275e-01 1.200000e+00 -2.000000e-01 -0.7237469
Conclusions:
- there are no obvious issues with the influence values
OK: No outliers detected.
- Based on the following method and threshold: cook (0.8).
- For variable: (Whole model)
Conclusions:
- the observed density and simulated densities in the Posterior Predictive Check plot are inconsistent
- the observed data show two peaks (associated with 0’s and 1’s)
- the posterior predictions show a single peak between 0 and 1
- there is clear evidence of a non-linear trend in the standardised residuals plot
- dots do not fall along the Q-Q plot line (particularly those in the tails)
Since the model diagnostics indicated violations of the assumptions, we will not explore the model summaries
Exploratory data analysis suggested that a simple linear model of the form
\[ y_i = \beta_0+\beta_1\times x_i+\varepsilon_i \hspace{1cm}\varepsilon_i\sim{}\mathcal{N}(0, \sigma^2) \]
might not be appropriate for the motivating example 6 data. To see how this manifests in the fitted model and diagnostics, let’s fit the model via OLS anyway.
\[ y_i = \beta_0+\beta_1\times x_i+\varepsilon_i \hspace{1cm}\varepsilon_i\sim{}\mathcal{N}(0, \sigma^2) \]
And now to explore the regression diagnostics.
Conclusions:
- the regular residual plot is not interpretable for binary data. It will just show two angled clouds of points (one associated with the 0 values, the other associated with the 1’s).
- there is a clear non-linear pattern remaining in the residual plot, indicative of a persistent non-linear trend
- there is evidence that the tails of the Q-Q plot have diverged from the line. This usually implies that the nominated distribution (in this case Gaussian) is not capturing the underlying distribution appropriately.
Influence measures of
lm(formula = y ~ x, data = dat6) :
dfb.1_ dfb.x dffit cov.r cook.d hat inf
1 -0.5258 0.4449 -0.528 1.725 0.14801 0.345
2 -0.1195 0.0944 -0.122 1.716 0.00847 0.248
3 -0.3417 0.2409 -0.367 1.333 0.07057 0.176
4 0.6158 -0.3473 0.750 0.622 0.20736 0.127
5 -0.1876 0.0529 -0.308 1.164 0.04862 0.103
6 0.1743 0.0983 0.573 0.734 0.13326 0.103
7 0.0000 0.1695 0.366 1.169 0.06771 0.127
8 0.0677 -0.1908 -0.291 1.419 0.04570 0.176
9 0.0450 -0.0888 -0.115 1.718 0.00750 0.248
10 0.3208 -0.5429 -0.644 1.613 0.21314 0.345
y x .hat .sigma .cooksd .fitted .resid .stdresid
1 0.0 1 0.3454545 0.2165384 0.148014601 0.1272727 -0.12727273 -0.7489309
2 0.2 2 0.2484848 0.2238334 0.008468314 0.2412121 -0.04121212 -0.2263249
3 0.2 3 0.1757576 0.2150630 0.070573026 0.3551515 -0.15515152 -0.8135885
4 0.8 4 0.1272727 0.1802776 0.207356771 0.4690909 0.33090909 1.6863422
5 0.4 5 0.1030303 0.2123412 0.048615863 0.5830303 -0.18303030 -0.9200478
6 1.0 6 0.1030303 0.1892068 0.133261324 0.6969697 0.30303030 1.5232580
7 1.0 7 0.1272727 0.2111195 0.067708333 0.8109091 0.18909091 0.9636241
8 0.8 8 0.1757576 0.2184552 0.045697585 0.9248485 -0.12484848 -0.6546845
9 1.0 9 0.2484848 0.2239157 0.007501344 1.0387879 -0.03878788 -0.2130117
10 1.0 10 0.3454545 0.2129163 0.213141026 1.1527273 -0.15272727 -0.8987170
Conclusions:
- there are no obvious issues with the influence values
OK: No outliers detected.
- Based on the following method and threshold: cook (0.8).
- For variable: (Whole model)
Conclusions:
- the observed density and simulated densities in the Posterior Predictive Check plot are inconsistent
- the observed data show two peaks (associated with 0’s and 1’s)
- the posterior predictions show a single peak between 0 and 1
- there is clear evidence of a non-linear trend in the standardised residuals plot
- dots do not fall along the Q-Q plot line (particularly those in the tails)
Since the model diagnostics indicated violations of the assumptions, we will not explore the model summaries
9 Generalised linear models
9.1 Summary
The callout below contains a lot of detail that some might find excessive and unnecessary if you only need to know how to fit a Generalised Linear Model (GLM). Whilst it is useful to have a more in-depth understanding of the concepts and mathematics behind statistical routines in preparation for fitting models, the following summary might suffice for those seeking a quicker overview.
The essential points are:
GLM’s take the form:
\[ \begin{align} Y \sim{}& D(\boldsymbol{\eta}, ...)\\ g(\boldsymbol{\eta}) =& \boldsymbol{X}\boldsymbol{\beta}\\ \end{align} \]
where \(D\) represents the nominated distribution to model the data against and \(g()\) represents a link function that is used to map the scale of the expected values [\(-\infty, \infty\)] to the scale of data under the nominated distribution.
there is no closed form solution to find the values of the parameters that best fit the data.
rather than finding a solution by minimising the residuals, GLM’s operate on likelihood. Likelihood is the probability of obtaining the observed data for a specific combination of parameter values; fitting proceeds by maximising the (log) likelihood of the parameters given the observed data.
Maximum Likelihood Estimation (MLE) is the process of finding the specific combination of parameter values that maximise the (log) likelihood of observing the data.
there are numerous optimisation algorithms designed to efficiently explore the entire possible parameter space to find the MLE.
the role of many R GLM model fitting routines is to define the appropriate likelihood function for the specified model and data and to pass this on to an optimisation routine.
If all you desire is an overview, you can now skip down to the section on Fitting GLM’s.
Recall from the previous section that the simple linear model can be written as: \[ \begin{align} y_i =& \beta_0 + \beta_1 x_i + \varepsilon_i\\ Y =& \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \end{align} \]
This could be re-written as:
\[ \begin{align} y_i \sim{}& N(\mu_i, \sigma^2I)\\ \mu_i =& \beta_0 + \beta_1 x_i\\ \end{align} \]
indicating that the response (\(y\)) is distributed as a Gaussian (normal), the mean of which is determined by the linear relationship with \(x\), and that there is a constant variance (\(\sigma^2\)) - all observations are independently and identically distributed (iid), that is, independently drawn from a distribution with the same variance.
The \(I\) signifies an identity matrix (a square matrix whose diagonals are all one).
More generally, the above can be written in vector/matrix form as: \[ \begin{align} Y \sim{}& N(\boldsymbol{\mu}, \sigma^2I)\\ \boldsymbol{\mu} =& \boldsymbol{X}\boldsymbol{\beta}\\ \end{align} \]
If we suspect that the residuals are not identically distributed (for example, if the variance is not constant across observations), then we could alter (generalise) the above to:
\[ \begin{align} Y \sim{}& N(\boldsymbol{\mu}, \sigma^2\boldsymbol{W}^{-1})\\ \boldsymbol{\mu} =& \boldsymbol{X}\boldsymbol{\beta}\\ \end{align} \]
where \(\boldsymbol{W}\) is a diagonal matrix of positive weights.
This allows us to generalise to other (exponential) families (such as Binomial, Poisson, Negative Binomial, Gamma etc). For example, if our response (\(Y\)) were count data, we might consider a Poisson.
\[ \begin{align} Y \sim{}& Pois(\boldsymbol{\lambda})\\ \boldsymbol{\lambda} =& e^{\boldsymbol{X}\boldsymbol{\beta}}\\ &\text{OR equivalently}\\ log(\boldsymbol{\lambda}) =& \boldsymbol{X}\boldsymbol{\beta}\\ \end{align} \]
The Poisson distribution (\(P(x|\lambda) = e^{-\lambda}\lambda^x/x!\)) is parameterised by a single parameter (\(\lambda\)) that represents both the mean and variance. Poisson data are bound at the lower end by zero (negative values are not defined) - it is not possible to count fewer than 0 things. To map between the expected values (which must remain positive) and the \([-\infty, \infty]\) scale of the linear predictor, we use a link function (in the case of the Poisson, this is a log link). Hence the above can be generalized even further to:
\[ \begin{align} Y \sim{}& D(\boldsymbol{\eta}, ...)\\ g(\boldsymbol{\eta}) =& \boldsymbol{X}\boldsymbol{\beta}\\ \end{align} \]
where \(D\) is the nominated family, \(g\) is the link function and \(...\) represents any additional parameters required by the nominated distribution (such as \(\sigma^2\boldsymbol{W}^{-1}\) in the case of the Gaussian distribution).
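In R, the family objects used by GLM fitting routines bundle the nominated distribution with its link function; the linkfun and linkinv components implement \(g()\) and its inverse. A quick illustration (not part of the tutorial’s own code):
fam <- poisson(link = "log")
fam$linkfun(10)        # g(mu): log(10) = 2.302585
fam$linkinv(2.302585)  # g^{-1}(eta): back to (approximately) 10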
9.2 Maximum likelihood
Ordinary Least Squares provides an elegant solution for when the data satisfy certain assumptions (normality, homogeneity, independence, etc), yet for many other situations, it is not appropriate.
For the following demonstrations, we will use the data from Motivating example 1.
As with OLS, let’s start with the motivating example 1 data.
A more general alternative to OLS is Maximum Likelihood (ML).
Likelihood is a measure of how probable (likely) it is that a set of observations follow (or were drawn from) a specific distribution. For example, we could evaluate the likelihood that the observations came from a normal (Gaussian) distribution with a specific mean and standard deviation.
Before we can understand likelihood, we must first remind ourselves of a couple of things about probability.
- for any continuous distribution, the probability of obtaining a specific value (that is, that a random value (\(X\)) is exactly equal to a particular reference value (\(x\))) is infinitely small (\(Pr(X=x)=0\)). We can only directly estimate probabilities of obtaining values less than (or greater than) nominated reference points (quantiles).
- it is possible to calculate the probability that \(X\) falls between two reference points (\(Q1\) and \(Q2\)). This is just the probability of \(X\) being less than \(Q2\) minus the probability of \(X\) being less than \(Q1\) (\(Pr(Q1 < X \le Q2)\)).
So although we can’t estimate \(Pr(X=x)\) directly from the distribution’s (cumulative) distribution function (\(f(x)\)), we can approximate it by calculating the probability that \(X\) is in the interval \([x, x+\Delta]\):
\[ \frac{Pr(x < X \le x + \Delta)}{\Delta} \]
The smaller \(\Delta\) (so long as it is larger than 0), the more accurate the estimate. This becomes a simple calculus problem: the derivative (\(f'(x)\)) of the (cumulative) distribution function is the probability density function (PDF). The PDF allows us to approximate the probability of obtaining a single value from a distribution.
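A quick numerical illustration of this idea (for a standard normal at \(x=1\); the values are arbitrary): the difference quotient of the cumulative probabilities converges on the density returned by the PDF.
delta <- 0.001
(pnorm(1 + delta/2) - pnorm(1 - delta/2)) / delta   # approx 0.24197
dnorm(1)                                            # 0.24197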
Provided the data are all iid (independently and identically distributed), and thus from the same distribution, the probability (likelihood) that a set of values (\(X\)) comes from a specific distribution (described by its parameters, \(\theta\)) can be calculated as the product of their individual probabilities.
\[ L(X|\theta) = \prod^n_{i=1} f'(x_i|\theta) \]
The products of probability densities can soon become very small and this can lead to computation and rounding issues. Hence it is more usual to work with the logarithms of likelihood. The log laws indicate that the log of a product is the same as the sum of the individual logs (\(log(A\times B) = log(A) + log(B)\)).
\[ LL(X|\theta) = \sum^n_{i=1} log(f'(x_i|\theta)) \]
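As a quick illustration of why the log scale is preferred (illustrative values only, not the tutorial’s data):
set.seed(1)
z <- rnorm(1000)              # 1000 values from a standard normal
prod(dnorm(z))                # product of densities underflows to 0
sum(dnorm(z, log = TRUE))     # sum of log densities remains finite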
The PDF of a Gaussian distribution is:
\[ P(x\mid\mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \]
Returning to our example, in order to estimate the optimum values for the parameters (\(\mu\) and \(\sigma\)) given our data (\(x\)), we would maximise the following:
\[ \begin{align} LL(x_1, x_2, x_3, ..., x_n\mid\mu, \sigma^2) =& \sum^n_{i=1}\ln\left(\frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{x_i-\mu}{\sigma}\right)^2}\right)\\ =& -\frac{n}{2}\ln(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum^n_{i=1}(x_i-\mu)^2 \end{align} \]
Note optimization routines usually attempt to minimize rather than maximize. Nevertheless, finding the maximum of a log-likelihood (what we are after) is the same as finding the minimum of the negative log-likelihood.
For the examples and routines that follow, we will write our own code from first principles as well as use existing built in R functions. Although the code that we write could be written to find maximums, the inbuilt functions typically optimise on minimums. As a result, to enable us to compare routines, we will work with minimizing negative log-likelihood.
So, let’s now write a function for calculating the negative log-likelihood for our data given a Gaussian distribution based on the formula above.
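The function itself sits in a concealed chunk; a minimal sketch of what it presumably looks like (the name LL.gaus and the argument order - a parameter vector theta = c(mu, sigma) followed by the data x - are inferred from how it is called below):
LL.gaus <- function(theta, x) {
  mu <- theta[1]      # candidate mean
  sigma <- theta[2]   # candidate standard deviation
  ## negative log-likelihood: the negative sum of the log Gaussian densities
  -sum(dnorm(x, mean = mu, sd = sigma, log = TRUE))
}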
In a similar manner to the brute force approach we used to approximate OLS earlier, let’s use a brute force approach to explore the partial negative log-likelihood profile for the mean and then approximate the mean (we will fix the standard deviation at 1). We refer to it as a partial profile because it holds the other parameter constant.
mu <- seq(15, 40, len = 1000)          # candidate values for the mean
theta <- cbind(mu = mu, sigma = 1)     # hold sigma fixed at 1
B <- apply(theta, 1, LL.gaus, dat$y)   # negative log-likelihood for each candidate
theta[which(B == min(B)), ]            # candidate with the lowest negative log-likelihood
mu sigma
27.81281 1.00000
Again, this estimation is very close to the ‘true’ value.
9.3 Optimization
Optimization algorithms are essentially search algorithms. They are attempting to find a single point in multidimensional space. There are numerous types of algorithms, each of which offers different advantages and disadvantages under different circumstances. For example, some assume that the underlying likelihood is very smooth (changes very gradually) and gain efficiencies from being able to locate minima via differentiation. Others are less efficient, yet more accurate when the likelihood is not smooth or there are multiple local minima.
Lets use an analogy to gain a better appreciation of the problem and solutions. Imagine we had a big block of land and we wanted to install a well from which to draw water from an underground aquifer. Although there are no physical restrictions on where the well can be positioned, we are keen on it being as shallow as possible (perhaps because it is cheaper to drill a shallow well).
The depth from the land surface down to the aquifer is not constant over space and we want to be able to put our well in the shallowest point. Although we do not know the true underground topography, we can drill narrow pilot holes and accurately measure the depth down to the aquifer at any point in space.
To put this another way, although we do not know what the underground profile looks like throughout the entire spatial domain (all possible latitude/longitude values), we can estimate its value (depth) at any point (single latitude/longitude).
To find the optimum location for our well, we need a search algorithm. One that is able to find the latitude/longitude associated with the minimum depth. We will showcase a few different options and try to describe the advantages and disadvantages of each. For example, in our well analogy, if the depth profile was very smooth (undulated very slowly), we might be able to use approximations to the curvature of the undulations to find where any minimums are. On the other hand, if the profile is not smooth (perhaps there are underground caves or other abrupt underground geological features), such approximations may be very inaccurate and more exhaustive searching (such as a grid of pilot holes) may be required.
Just like with the underground aquifer, although a (negative) log-likelihood function has an unknown profile in multidimensional space (one dimension for each parameter to estimate), we can evaluate it for any combination of parameters.
One conceptually simple way of searching for the minimum of a function is to evaluate the function for a large number of parameter combinations (perhaps in a grid). For example, to drill a pilot hole every 100m in a grid. If the grid is fine enough, it will locate the minimum (maximum) no matter what the function’s profile is. However, the finer the grid, the more effort is required - lower efficiency.
The following code chunk evaluates and plots the negative log-likelihood for a full (\(100\times 100\)) grid of parameter combinations. Negative log-likelihood is represented as a colour along the green to white spectrum. The blue lines represent major contours in the profile. The optimum parameters (those associated with the minimum negative log-likelihood and thus maximum log-likelihood) are indicated by the black solid point.
mu = seq(15,40,len=100)
sigma=seq(10,20,len=100)
theta = expand.grid(mu=mu, sigma=sigma)
theta$LL=apply(theta, 1, LL.gaus, x=dat$y)
ggplot(data=theta,aes(y=mu, x=sigma, fill=LL)) +
geom_tile(show.legend=FALSE) + geom_contour(aes(z=LL)) +
geom_point(data=theta[which(theta$LL==min(theta$LL)),], aes(y=mu, x=sigma), fill='black') +
scale_fill_gradientn(colors=terrain.colors(12)) +
scale_x_continuous(expand=c(0,0)) +
scale_y_continuous(expand=c(0,0))
Warning: The following aesthetics were dropped during statistical transformation: fill.
ℹ This can happen when ggplot fails to infer the correct grouping structure in
the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
variable into a factor?
mu sigma LL
4252 27.87879 14.24242 40.76342
The optimum solution yielded an estimated mean of 27.8787879 and sigma of 14.2424242. These values are very similar to the empirical calculations.
The brute force approach is a useful illustration, but it is practically limited to just one or two parameters since the entire search space within the domain needs to be evaluated. This is essentially an exhaustive search algorithm. Furthermore, the accuracy is dependent on the resolution of the grid. The finer the grid, the more accurate the result, however it does require \(n^p\) function evaluations (where \(n\) is the number of grid increments and \(p\) is the number of parameters).
Of course, we can always use a relatively coarse search grid and, having identified the ‘optimum’ parameter configuration within this grid, apply a finer resolution grid over a narrower search domain. This is analogous to starting with a 100x100m drilling grid and then centring a 1x1m drilling grid around the ‘best’ location.
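A sketch of this two-stage idea (using the LL.gaus function and dat data from above; the grid sizes are arbitrary):
## stage 1: coarse grid over the full search domain
theta.coarse <- expand.grid(mu = seq(15, 40, len = 20), sigma = seq(10, 20, len = 20))
theta.coarse$LL <- apply(theta.coarse, 1, LL.gaus, x = dat$y)
best <- theta.coarse[which.min(theta.coarse$LL), ]
## stage 2: fine grid centred on the coarse optimum
theta.fine <- expand.grid(mu = seq(best$mu - 1, best$mu + 1, len = 20),
                          sigma = seq(best$sigma - 1, best$sigma + 1, len = 20))
theta.fine$LL <- apply(theta.fine, 1, LL.gaus, x = dat$y)
theta.fine[which.min(theta.fine$LL), ]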
A simplex is a geometric shape in multidimensional space that has one more vertex than there are parameter dimensions (\(p+1\)). In this case, we are estimating two parameters, therefore the simplex is a triangle (three vertices). The most common form of simplex optimisation is the Nelder-Mead algorithm. As with most optimisation algorithms, the Nelder-Mead algorithm is a search algorithm that aims to find a point in multidimensional space as efficiently as possible.
Optimisation routines work on a wide variety of functions (not just likelihood functions). To keep the terminology about what is being optimised general, the function to be optimised is often referred to as a loss or objective function. A loss function is any function that evaluates the performance (or fit) of a set of events (i.e. data). The further the loss is from zero, the worse the fit (hence the desire to find the minimum).
The Nelder-Mead algorithm can be described as (keep in mind that it is a minimisation rather than maximisation):
1. Start with some initial estimate values - e.g. a set of \(p+1\) vertices, each holding a value for every parameter.
2. Identify the vertex with the highest (i.e. worst) loss (negative log-likelihood).
3. Reflect the worst vertex through the centroid of the other vertices.
- 3a. If the reflected point is better (lower negative log-likelihood) than the second worst point, but not better than the best point, replace the worst point with the reflected point.
- 3b. If instead the reflected point is the best vertex so far, expand this point by doubling the reflection distance.
- 3b.1. If this expanded point is better than the reflected point, replace the worst point with the expanded point.
- 3b.2. Otherwise, replace the worst point with the reflected point.
- 3c. If neither 3a nor 3b applies (the reflected point is not better than the second worst point), contract the point by halving the reflection distance.
- 3c.1. If this contracted point is better than the worst point, replace the worst point with the contracted point.
- 3c.2. Otherwise, shrink the entire simplex (contract all vertices towards the best vertex).
4. Repeat Steps 2-3 until either the maximum number of iterations has been reached or the change in loss between two successive iterations is below a certain threshold.
Clearly, the more iterations are performed, the more accurate the estimates, and yet the longer the search will take.
In the well analogy, the simplex represents three points of the search triangle. Reflecting the triangle allows us to move away from the direction we know to be deeper. We expand the triangle in order to explore a new direction and contract the triangle to narrow our search area.
Code for illustrating the process is listed as details below. This code is a modification of the code presented in https://github.com/nicolaivicol/nelder-mead-R/blob/master/optimNM.R
get.optim.NM <- function(X.vert, params.init, objective.fn, iter.max = 250, abs.tol = 0.0001, x,
                         control = list(fnscale = 1, refl = 1, expan = 2, contr = -0.5, shrink = 0.5))
{
# input dimension
X.len <- length(params.init)
# initialize controls before iterations of searching
iter <- 0; not.converged <- 1; not.max.iter <- 1
X.optim <- params.init; f_X.optim <- control$fnscale*objective.fn(X.optim, x=x)
# while loop, iterations
while (not.converged & not.max.iter)
{
# get values at vertices
f_X.vert <- control$fnscale*apply(X = X.vert, MARGIN = 1, FUN = objective.fn, x)
# order ascending X.vert and f(X.vert), by f(X.vert)
X.order <- sort(f_X.vert, index.return = TRUE)$ix
X.vert <- X.vert[X.order, ]
f_X.vert <- f_X.vert[X.order]
# get centroid (mean on each dimension) of all points except the worst
X.centr <- apply(X = X.vert[1:X.len, ], MARGIN = 2, FUN = mean)
# get reflected point
X.refl <- X.centr + control$refl*(X.centr - X.vert[X.len+1, ])
f_X.refl <- control$fnscale*objective.fn(X.refl,x)
if ((f_X.vert[1] <= f_X.refl) & (f_X.refl < f_X.vert[X.len]))
{
# if the reflected point is better than the second worst, but not better than the best...
# ... then obtain a new simplex by replacing the worst point with the reflected point
X.vert[X.len+1, ] <- X.refl
} else if (f_X.refl < f_X.vert[1]) {
# if the reflected point is the best point so far
# ... then compute the expanded point
X.expan <- X.centr + control$expan*(X.centr - X.vert[X.len+1, ])
f_X.expan <- control$fnscale*objective.fn(X.expan,x)
# ... if the expanded point is better than the reflected point
if (f_X.expan < f_X.refl)
{
# ... then obtain a new simplex by replacing the worst point with the expanded point
X.vert[X.len+1, ] <- X.expan
} else {
# ... else obtain a new simplex by replacing the worst point with the reflected point
X.vert[X.len+1, ] <- X.refl
}
} else {
# ... reflected point is not better than second worst
# ... then compute the contracted point
X.contr <- X.centr + control$contr*(X.centr - X.vert[X.len+1, ])
f_X.contr <- control$fnscale*objective.fn(X.contr,x)
# ... if the contracted point is better than the worst point
if (f_X.contr < f_X.vert[X.len+1])
{
# ... then obtain a new simplex by replacing the worst point with the contracted point
X.vert[X.len+1, ] <- X.contr
} else {
# ... shrink the simplex: X = X1 + coef.shrink(X-X1)
X.vert <- sweep(control$shrink*sweep(X.vert, 2, X.vert[1, ], FUN = "-"), 2, X.vert[1, ], FUN="+")
}
}
# get values at vertices
f_X.vert <- control$fnscale*apply(X = X.vert, MARGIN = 1, FUN = objective.fn, x)
# order asc X.vert and f(X.vert)
X.order <- sort(f_X.vert, index.return = TRUE)$ix
X.vert <- X.vert[X.order, ]
f_X.vert <- f_X.vert[X.order]
# update controls
iter <- iter + 1
not.max.iter <- (iter < iter.max)*1
not.converged <- (abs(control$fnscale*objective.fn(X.vert[X.len, ],x)- control$fnscale*objective.fn(X.vert[1, ],x)) > abs.tol)*1
X.optim <- X.vert[1, ]; f_X.optim <- control$fnscale*objective.fn(X.optim,x)
}
return(list(X.optim=X.optim, f_X.optim=f_X.optim, X.vert=X.vert, iter=iter))
}
We can illustrate the iterative process by plotting the outcome of a limited number of iterations - 15 in the code below. In this illustration, the filled-in triangle represents the current simplex. Previous simplexes remain as open triangles.
## Starting values at the center of the vertices
params.init <- c(mu = 20, sigma = 12)
d.mu <- 0.5
d.sigma <- 0.3
simplex <- rbind(Vertex1 = params.init + c(d.mu,d.sigma),
Vertex2 = params.init + c(-d.mu, d.sigma),
Vertex3 = params.init + c(-d.mu, -d.sigma)) |>
data.frame()
base.plot <- ggplot(data = theta, aes(y = mu, x = sigma)) +
geom_tile(aes(fill = LL), show.legend = FALSE) +
geom_contour(aes(z = LL)) +
scale_fill_gradientn(colors = terrain.colors(12)) +
scale_x_continuous(expand = c(0,0)) +
scale_y_continuous(expand = c(0,0))
newdata <- vector('list', 15)
a <- get.optim.NM(X.vert = as.matrix(simplex), params.init, objective.fn = LL.gaus, x = dat$y, iter.max = 1, control = list(fnscale = 1, refl = 1, expan = 2, contr = -0.5, shrink = 0.5))
newdata[[1]] <- data.frame(a$X.vert)
for (i in 2:15) {
a <- get.optim.NM(X.vert = a$X.vert, params.init = a$X.optim, objective.fn = LL.gaus, x = dat$y, iter.max = 1, control = list(fnscale = 1, refl = 1, expan = 2, contr = -0.5, shrink = 0.5))
newdata[[i]] <- data.frame(a$X.vert)
}
newdata <- bind_rows(newdata, .id = 'Iter') |>
mutate(Iter=factor(Iter, levels = unique(Iter)))
g <- base.plot +
geom_polygon(data = newdata, aes(y = mu, x = sigma, group = Iter),
color = 'black', fill = 'grey40') +
transition_manual(Iter) +
shadow_trail(distance = 0.1, alpha = 0.4, color = 'grey40', fill = NA) +
labs(title = 'Iter: {current_frame}')
ga <- animate(g, fps = 20, duration = 15)
#ga=animate(g, renderer=av_renderer())
anim_save('simplexAnim.gif', animation = ga, path = '../resources/', renderer =av_renderer())
Having seen this illustration, we could allow the Nelder-Mead simplex optimization to iterate more thoroughly. We will now instruct the routine to iterate a maximum of 250 times. Along with setting a maximum number of iterations, most optimizations also have a stopping trigger based around the extent of improvement between iterations. This convergence tolerance defines a threshold difference below which two successive iteration outcomes are considered the same. The lower the value, the more accuracy is demanded.
It is also a good idea to repeat the iterations again from multiple starting configurations to ensure that any single optimization has not just settled on a local minimum (maximum).
get.optim.NM(X.vert = as.matrix(simplex), params.init, objective.fn = LL.gaus,
x = dat$y, iter.max = 250, abs.tol = 1e-08,
control = list(fnscale = 1, refl = 1, expan = 2, contr = -0.5, shrink = 0.5))
$X.optim
mu sigma
27.81238 14.25925
$f_X.optim
sigma
40.76329
$X.vert
mu sigma
Vertex3 27.81238 14.25925
Vertex2 27.80955 14.25881
Vertex1 27.81327 14.25858
$iter
[1] 32
get.optim.NM(X.vert = as.matrix(simplex - c(-0.5, 0.2)), params.init, objective.fn = LL.gaus,
x = dat$y, iter.max = 250, abs.tol = 1e-08,
control = list(fnscale = 1, refl = 1, expan = 2, contr = -0.5, shrink = 0.5))
$X.optim
mu sigma
27.81122 14.25812
$f_X.optim
sigma
40.76329
$X.vert
mu sigma
Vertex1 27.81122 14.25812
Vertex2 27.80970 14.25939
Vertex3 27.81149 14.26005
$iter
[1] 38
The two sets of estimated parameters (listed as X.optim) are very similar (converged) and are very similar to those calculated empirically.
R has an inbuilt function (optim()) that is an interface to numerous optimization algorithms, and the default algorithm is Nelder-Mead. As with other optimizations, optim() defaults to minimizing. To force it to maximize (if our likelihood function returned log-likelihood rather than negative log-likelihood), we can indicate that fnscale is -1. Other important optimization control parameters include:
- maxit - the maximum number of iterations to perform (500 for Nelder-Mead).
- abstol - the absolute convergence tolerance. The lower the tolerance, the smaller the change in optimized value (log-likelihood) required before convergence is reached.
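The output below is consistent with a call along these lines (a sketch - the original chunk is concealed, and the starting values and control settings are assumptions):
## additional arguments (here x = dat$y) are passed through to fn
optim(par = c(20, 12), fn = LL.gaus, x = dat$y, method = "Nelder-Mead")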
$par
[1] 27.81143 14.25771
$value
[1] 40.76329
$counts
function gradient
53 NA
$convergence
[1] 0
$message
NULL
We can see that the negative log-likelihood calculation was performed 53 times before convergence. Whilst this is not the same as the number of iterations, it does provide an estimate of the total computational load.
If we compare this to the brute force approach (which required \(100\times 100=10,000\) evaluations), the Nelder-Mead simplex approach is a substantial improvement.
If the profile of a function is smooth enough, and the function itself is doubly differentiable (can be differentiated into first and second order derivatives), then we can make use of a group of algorithms based on a root-finding algorithm devised by Isaac Newton. In mathematical contexts, a root is the value of the parameters (\(x\)) when the function equals zero (\(f(x)=0\)).
A simplified version of Newton’s method, the Newton-Raphson method, shows that the root (\(x_{n+1}\)) of a function (\(f(x_n)\)) can be approximated by iteratively solving:
\[ x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)} \]
where \(x_n\) is the initial parameter estimate, \(x_{n+1}\) is the improved parameter estimate, \(f(x_n)\) is the value of the function at \(x_n\) and \(f'(x_n)\) is the first order derivative (the slope) of the function at \(x_n\).
Before we use this approach to estimate maximum likelihood estimates, let’s see how it works with a polynomial function (e.g. \(y=0.1x^4 + x^3 - x^2 + 10x + 30\)).
In the following animation, we will use the Newton-Raphson method to estimate the value of \(x\) when \(y=0\). The animation will go through five iterations. For each iteration, the red line represents the slope associated with the initial estimate of \(x\). The point where this line intersects with the dashed line (\(y=0\)) is the updated estimate for \(x\). We can see that by the fifth iteration, the estimated \(x\) has begun to stabilise (converge) on a value of approximately \(-1.98\).
NR <- function(f, x, return.grid = FALSE) {
  ## if given a function, extract its body expression so that D() can differentiate it
  if (is.function(f)) f <- body(f)[[2]]
  f_x <- eval(f)             # value of the function at x
  f._x <- eval(D(f, 'x'))    # value of the first derivative at x
  if (return.grid) {
    list(x1 = x - f_x/f._x, f_x = f_x, f._x = f._x)
  } else {
    list(x1 = x - f_x/f._x)
  }
}
optim.NR <- function(f, x0, abs.tol=1.0E-6, iter.max=250, return.grid=FALSE) {
iter <- 0; not.converged <- 1; not.max.iter <- 1
fgrid <- list()
while (not.converged & not.max.iter) {
nr <- NR(f, x = x0, return.grid)
x1 <- nr$x1
iter <- iter + 1
not.max.iter <- (iter < iter.max)*1
not.converged <- (abs(x0-x1) > abs.tol)*1
if (return.grid) fgrid[[iter]] <- c(x0 = x0, x1 = x1, f_x = nr$f_x, f._x = nr$f._x)
x0 <- x1
}
list(X.optim = x0, iter = iter, grid = do.call('rbind', fgrid))
}
f <- function(x) {0.1*x^4 + x^3 - x^2 + 10*x + 30}
a <- optim.NR(f, x0 = 3, abs.tol = 0.001, return.grid = TRUE)
x <- seq(-4, 4, len = 100)
dat1 <- data.frame(x = x, y = f(x))
gdat <- a$grid %>% as.data.frame |>
mutate(Iter = 1:n(),
x = ifelse(x0>0, x0+1, x0-1),
y = f_x + (x-x0)*f._x)
g <- ggplot() + geom_line(data = dat1, aes(y = y, x = x)) +
geom_hline(yintercept = 0, linetype = 'dashed') +
geom_segment(data = gdat, aes(x = x, y = y, xend = x1, yend = 0), color = 'red') +
geom_segment(data = gdat, aes(x = x0, xend = x0, y = 0, yend = f_x), linetype = 'dashed')+
geom_text(data = gdat |> filter(Iter<4), aes(x = x0, y = -5, label = "x[0]"), parse = TRUE) +
geom_text(data = gdat %>% filter(Iter<4), aes(x = x1, y = -5, label = "x[1]"), parse = TRUE) +
geom_point(data = gdat, aes(x = x0, y = f_x)) +
geom_text(data = gdat |> filter(Iter<4), aes(x = x0, y = f_x, label = "f(x)"),
parse = TRUE, nudge_x = 0.1, nudge_y = -5, hjust = 0, vjust = 0) +
geom_text(data = gdat |> filter(Iter<4), aes(x = x0+0.3+(x1-x0)/2, y = (f_x/2)-5, label = "f*minute*(x)"),
parse = TRUE, color = 'red') +
geom_text(data = gdat, aes(x = -4, y = 140, label = paste0("Iter == ", Iter)), parse = TRUE, hjust = 0) +
geom_text(data = gdat, aes(x = -4, y = 120, label = paste0("x[1] == ", round(x1, 3))), parse = TRUE, hjust = 0) +
transition_manual(Iter)
ga <- animate(g, fps = 20, duration = 5)
anim_save('NMAnim1.gif', animation = ga, path = '../resources')
If we allow our implementation of the Newton-Raphson method to run from an initial guess of 3 until it converges:
We see that it takes just 6 iterations to converge.
In the above, we estimated the root (value of \(x\) when \(f(x)=0\)) of a function. If instead, we want to find the value of \(x\) when the function is at its minimum (e.g. optimisation), then we want to find the root of the first derivative (\(f'(x)\)) of the function - that is, we want to find the value of \(x\) where the slope of \(f(x)\) is 0.
Again, in order to illustrate the principles of what we are trying to achieve, we must digress from our actual example. Rather than try to find the root of a function, we will now try to find the root of the derivative of a function (so as to find the minimum of a function). So we are now shifting our focus away from the profile of the function and onto the profile of the derivative of the function (since the derivative of a function is a slope profile).
The left hand side of the following figure represents the profile of the function (\(y = 0.001x^3 - 0.001x^2 - 0.3x + 5\)). The right hand side represents the derivative (\(0.003x^2 - 0.002x - 0.3\)) of that function.
f <- function(x) {0.001*x^3 - 0.001*x^2 -0.3*x + 5}
f1 <- D(body(f)[[2]], 'x')
x <- seq(0, 20, len = 100)
dat1 <- data.frame(x = x, y = f(x))
g1 <- ggplot() +
geom_line(data = dat1, aes(y = y, x = x))
g2 <- ggplot() +
geom_line(data = data.frame(x = x, y = eval(f1, envir = list(x = x))), aes(y = y, x = x)) +
geom_hline(yintercept = 0, linetype = 'dashed')
grid.arrange(g1, g2, nrow = 1)
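The output below comes from applying our Newton-Raphson routine to the derivative rather than to the function itself; the chunk is concealed, but a call along these lines, using the derivative expression f1 defined above, is consistent with it:
## root-find on the derivative f1 (an expression) from x0 = 3;
## return.grid = TRUE keeps the iteration history used for the animation further down
a <- optim.NR(f1, x0 = 3, return.grid = TRUE)
a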
$X.optim
[1] 10.33889
$iter
[1] 6
$grid
x0 x1 f_x f._x
[1,] 3.00000 20.43750 -2.790000e-01 0.01600000
[2,] 20.43750 12.87523 9.121992e-01 0.12062500
[3,] 12.87523 10.59535 1.715639e-01 0.07525136
[4,] 10.59535 10.34209 1.559353e-02 0.06157209
[5,] 10.34209 10.33889 1.924166e-04 0.06005255
[6,] 10.33889 10.33889 3.079948e-08 0.06003333
If we animate the process:
dat1 <- data.frame(x, y = eval(f1, envir = list(x = x)))
gdat <- a$grid |>
as.data.frame() |>
mutate(Iter = 1:n(),
x = ifelse(x0>0,x0+1, x0-1),
y = f_x + (x-x0)*f._x)
g2 <- ggplot() +
geom_line(data = dat1, aes(y = y, x = x)) +
geom_hline(yintercept = 0, linetype = 'dashed') +
geom_segment(data = gdat, aes(x = x, y = y, xend = x1, yend = 0), color = 'red') +
geom_segment(data = gdat, aes(x = x0, xend = x0, y = 0, yend = f_x), linetype = 'dashed') +
geom_text(data = gdat |> filter(Iter<4), aes(x = x0, y = -0.05, label = "x[0]"), parse = TRUE) +
geom_text(data = gdat |> filter(Iter<4), aes(x = x1, y = -0.05, label = "x[1]"), parse = TRUE) +
geom_point(data = gdat, aes(x = x0, y = f_x)) +
geom_text(data = gdat |> filter(Iter<4), aes(x = x0, y = f_x, label = "f(x)"),
parse = TRUE, nudge_x = 0.1, nudge_y = -0.05, hjust = 0, vjust = 0) +
geom_text(data = gdat |> filter(Iter<4), aes(x = x0+0.3+(x1-x0)/2, y = (f_x/2)-0.05, label = "f*minute*(x)"),
parse = TRUE, color = 'red') +
geom_text(data = gdat, aes(x = 0, y = 0.8, label = paste0("Iter == ", Iter)),
parse = TRUE, hjust = 0) +
geom_text(data = gdat, aes(x = -0, y = 1, label = paste0("x[1] == ", round(x1, 3))), parse = TRUE, hjust = 0) +
transition_manual(Iter)
ga <- animate(g2, fps = 20, duration = 6)
anim_save('NMAnim2.gif', animation = ga, path='../resources')
Note that we are now looking for where the slope of the profile is equal to zero. This makes no distinction between a peak (maximum) and a valley (minimum), as both have a slope of zero. Provided the (negative) log-likelihood profile is unimodal (has either a single peak or valley), this should be OK. It can however be problematic if the profile has local minima and maxima. To minimise issues, it is best to select starting values (initial parameter values) that are likely to be reasonably close to the optimum values.
Ok, great. Now, how do we use this approach to optimise for multiple parameters?
Recall that the Newton-Raphson method for optimisation estimates the root by subtracting the ratio of the first order derivative of the (loss) function by the second order derivative of the function.
\[ x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)} \]
When there are multiple parameters to estimate, then there are multiple partial gradients (slopes).
Now that we are back to estimating the mean and variance of our example data, we have two parameters to estimate. Therefore we need:
- the partial derivative of the negative log-likelihood with respect to the mean parameter
- the partial derivative of the negative log-likelihood with respect to the sigma parameter
- the second order partial derivative with respect to mean of the first order derivative with respect to the mean
- the second order partial derivative with respect to sigma of the first order derivative with respect to the mean
- the second order partial derivative with respect to mean of the first order derivative with respect to the sigma
- the second order partial derivative with respect to sigma of the first order derivative with respect to the sigma
The second order partial derivatives form a square matrix called a Hessian matrix.
When there are multiple parameters to estimate, the above formula becomes:
\[ \boldsymbol{x_{n+1}} = \boldsymbol{x_n} - \frac{\boldsymbol{g_n}}{\boldsymbol{H_n}} \]
where \(\boldsymbol{g_n}\) is a vector of partial gradients (first order derivatives), \(\boldsymbol{H_n}\) is a matrix of second order partial derivatives and \(\boldsymbol{x_n}\) and \(\boldsymbol{x_{n+1}}\) are the previous and updated parameter estimates respectively.
Recall that matrices cannot be divided. Rather we must multiply by the matrix inverse. Hence the above equation becomes:
\[ \boldsymbol{x_{n+1}} = \boldsymbol{x_n} - (\boldsymbol{H_n})^{-1}\boldsymbol{g_n} \]
Recall also that inverting a matrix involves solving for the matrix that, when multiplied by the original, yields the identity matrix. In R, this can be achieved via either solve() or qr.solve().
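As a small illustration (arbitrary values), solve(H, g) returns \(\boldsymbol{H}^{-1}\boldsymbol{g}\) directly, without explicitly forming the inverse:
H <- matrix(c(2, 0.5, 0.5, 1), nrow = 2)   # a toy Hessian
g <- c(1, 2)                               # a toy gradient vector
solve(H, g)       # H^{-1} g in one step
solve(H) %*% g    # the same result via the explicit inverse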
On top of this, there is the need to calculate the first and second order derivatives of a function for which there is no analytical equation. Hence, we need to approximate the derivatives using finite differences. That is, we can estimate a derivative (the gradient at a specific point on a profile) by calculating the rise over run for a very small run (\(\Delta x\)):
\[ \frac{df(x)}{dx} \approx \frac{f(x + \Delta x/2) - f(x - \Delta x/2)}{\Delta x} \]
Now that all the pieces are in place, we can demonstrate this by iterating through a number of Newton-Raphson cycles to estimate the mean and sigma of our example 1 data.
params.init <- c(mu = 20, sigma = 12)
## The following function calculates the difference-quotient approximation of gradient
approx.grad <- function(theta, fn, eps = 1e-05) {
p <- length(theta)
nf <- length(fn(theta))
eps.mat <- diag(p) * (eps/2)
Gmat <- array(0, dim = c(nf, p))
for (i in 1:p) {
Gmat[,i] <- (fn(theta + eps.mat[,i]) - fn(theta - eps.mat[,i]))/eps
}
if (nf>1) Gmat else c(Gmat)
}
optim.NR <- function(params.init, fn, x, iter.max = 250, abs.tol = 1e-05, control = list(eps = 1e-06)) {
fnc <- function(bb) t(approx.grad(bb, fn))
eps <- control$eps
gradfn <- function(x) approx.grad(x, fnc, eps)
iter <- 0; not.converged <- 1; not.max.iter <- 1
newpar <- params.init
oldpar <- params.init - 1
while (not.converged & not.max.iter) {
oldpar <- newpar
newpar <- oldpar - solve(gradfn(oldpar), t(fnc(oldpar)))
iter <- iter + 1
not.max.iter <- (iter < iter.max)*1
not.converged <- (abs(fn(oldpar) - fn(newpar))>abs.tol)*1
}
list(iter=iter, final = as.vector(newpar), LL = fn(as.vector(newpar)),
gradient = fnc(newpar), hessian = gradfn(newpar))
}
#optim.NR(params.init, fn=function(t) LL.gaus(t, x=y), iter.max=1)
mu <- seq(15, 40, len = 100)
sigma <- seq(10, 20, len = 100)
theta <- expand.grid(mu = mu, sigma = sigma)
theta$LL <- apply(theta, 1, LL.gaus, x = dat$y)
base.plot <- ggplot(data = theta, aes(y = mu, x = sigma)) +
geom_tile(aes(fill = LL), show.legend = FALSE) +
geom_contour(aes(z = LL)) +
scale_fill_gradientn(colors = terrain.colors(12)) +
scale_x_continuous(expand = c(0,0)) +
scale_y_continuous(expand = c(0,0))
newdata <- vector('list', 10)
newdata[[1]] <- data.frame(t(setNames(params.init, c('mu', 'sigma'))))
p.i <- params.init
for (i in 2:10) {
a <- optim.NR(params.init = p.i, fn = function(t) LL.gaus(t, x = dat$y), iter.max = 1)
newdata[[i]] <- data.frame(t(setNames(a$final,c('mu','sigma'))))
p.i <- as.vector(a$final)
}
newdata <- bind_rows(newdata, .id='Iter') |>
mutate(Iter = factor(Iter, levels = unique(Iter)), nIter = as.numeric(Iter))
g <- base.plot +
geom_point(data = newdata, aes(y = mu, x = sigma, group = Iter), color = 'black') +
geom_path(data = newdata, aes(y = mu, x = sigma), color = 'black') +
transition_reveal(nIter) +
labs(title = "Iteration: {frame_along}")
ga=animate(g, nframes = 10, fps = 1)
anim_save('NMAnim3.gif', animation = ga, path = '../resources')
Now let’s allow the routine (optim.NR(), defined in the concealed code above) to run to convergence.
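A call along these lines (a sketch; the chunk itself is concealed, and storing the result in an object called a is an assumption) is consistent with the output below:
a <- optim.NR(params.init, fn = function(t) LL.gaus(t, x = dat$y))
a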
$iter
[1] 5
$final
[1] 27.81098 14.25903
$LL
[1] 40.76329
$gradient
[,1] [,2]
[1,] -1.04734e-06 -5.478284e-07
$hessian
[,1] [,2]
[1,] 0.0490274488 0.0000000
[2,] 0.0007105427 0.0980549
Again, we see that these estimates (final in the output) are very similar to the empirical calculations. Furthermore, notice that convergence took only 5 iterations.
The inverse of the Hessian matrix is the variance-covariance matrix. Therefore, we can also generate estimates of the standard errors of the parameter estimates:
\[ SE = \sqrt{diag(\boldsymbol{H}^{-1})} \]
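Using the object a from the sketch above (an assumption), this amounts to:
sqrt(diag(solve(a$hessian)))   # standard errors from the inverse of the Hessian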
[1] 4.516275 3.193488
R has an inbuilt routine (nlm()) that performs the Newton-like method for optimisation.
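The output below is consistent with a call along these lines (a sketch; the starting values are an assumption, and hessian = TRUE requests the Hessian matrix as well):
nlm(LL.gaus, c(20, 12), x = dat$y, hessian = TRUE)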
$minimum
[1] 40.76329
$estimate
[1] 27.81205 14.25769
$gradient
[1] 5.249938e-05 -1.312170e-04
$hessian
[,1] [,2]
[1,] 4.919280e-02 -1.686155e-05
[2,] -1.686155e-05 9.836416e-02
$code
[1] 1
$iterations
[1] 9
H <- nlm(LL.gaus, params.init, x = dat$y, hessian = TRUE, gradtol = 1e-06*2)$hessian
sqrt(diag(solve(H)))
[1] 4.509099 3.189208
When the profile of a function is not smooth, derivative methods can struggle to locate a minimum efficiently (if at all) and when the profile has multiple local minima, both derivative and simplex methods can fail to converge on the “true” parameter values. Stochastic global methods add a random (stochastic) component to increase the potential of finding the global minimum when there are multiple local minima.
If we return briefly to our well analogy, we might appreciate that there could be multiple underground caves and depending on where our initial pilot hole is drilled, the more efficient search algorithms might quickly hone in on a point that is locally shallow yet not the shallowest location in the entire landscape. Of course repeating the search from multiple random locations could help alleviate this issue, yet it might still be difficult to discover the shallowest point if this is associated with a very abrupt (rather than gradual) underground geological feature.
The classic stochastic global method is called simulated annealing or the Metropolis algorithm and it works as follows:
- Start with some initial estimate values (one for each parameter) and calculate the loss function (i.e. the negative log-likelihood)
- Pick a new set of parameter estimates close to the previous - e.g. jump a small distance in multidimensional space - and again calculate the loss function.
- If the value of the loss function is better (lower), accept the new parameters and return to Step 2.
- If the value of the loss function is not better,
- Calculate the difference (\(\Delta L = L_{new} - L_{old}\)) in loss between the new and old parameter estimates
- Pick a random number between 0 and 1
- Accept the new parameter estimates if the random number is less than \(e^{-\Delta L/k}\) (where \(k\) is a constant that regulates the acceptance propensity), otherwise retain the previous parameter estimates.
- Periodically (e.g. every 100 iterations), decrease \(k\) so as to reduce the acceptance propensity.
- Return to Step 2.
Rather than have a stopping rule based on convergence, simulated annealing continues until the maximum number of iterations has been performed. Along with the capacity (and encouragement) to occasionally move ‘uphill’ (towards higher loss values), a large number of iterations increases the chances that the global minimum will be discovered even in the presence of multiple minima. Throughout the iterations, the algorithm keeps track of the parameters associated with the ‘best’ configuration.
There are variants of this algorithm that control the jump distance used to select the next point. By adaptively increasing and reducing the jump length following acceptance and non-acceptance respectively, these variants are able to further encourage wider exploration of the parameter space.
The following animation illustrates 2500 iterations (displayed in \(log_2\) increments). The red point indicates the current ‘best’ parameter estimates. Note that whilst the algorithm ‘discovered’ the ‘best’ solution after approximately 100 iterations, it continued to explore the profile thoroughly. If there had been other minima, it is likely to have discovered them as well.
optim.SANN <- function(params.init, fn, iter.max = 2500, jump = 0.05, k = 100) {
bestpar <- newpar <- oldpar <- params.init
iter <- 0; not.max.iter <- 1
inck <- k/10
while(not.max.iter) {
## jump to new parameter set
newpar <- oldpar + replicate(length(params.init), runif(1, -jump, jump))
deltaL <- fn(newpar)-fn(oldpar)
if (deltaL<0) { #improvement
oldpar <- newpar
} else {
rnd <- runif(1)
if (rnd <= exp(-deltaL/k)) oldpar <- newpar
}
if (fn(newpar)<fn(bestpar)) bestpar <- newpar
iter <- iter+1
if ((iter %% inck)==0) k <- k/10
not.max.iter <- (iter < iter.max)*1
}
list(iter = iter, final = as.vector(bestpar), LL = fn(bestpar), last = as.vector(oldpar), LL = fn(bestpar))
}
#optim.SANN(params.init=c(-0.25,1), fn=function(p) LL.gaus(p, x=y), iter.max=2500)
p.i <- params.init
iter.cnt <- 1
iters <- 2500
newdata <- vector('list', iters)
bestpar <- p.i
for (i in 1:iters) {
a <- optim.SANN(params.init = p.i, fn = function(t) LL.gaus(t, x = dat$y), iter.max = 1, jump = 0.5, k = 0.1)
if (LL.gaus(a$last, x = dat$y)<LL.gaus(bestpar, x = dat$y)) bestpar = a$last
newdata[[i]] <- data.frame(t(setNames(a$last, c('mu', 'sigma'))), iter = floor(log2(i))+1, t(setNames(bestpar, c('bestmu', 'bestsigma'))))
p.i <- as.vector(a$last)
}
newdata <- bind_rows(newdata, .id = 'Iter') |>
mutate(iter = factor(iter, levels = unique(iter)), nIter = as.numeric(as.character(iter)))
g <- base.plot +
geom_point(data = newdata, aes(y = mu, x = sigma, group = iter), color = 'black') +
geom_path(data = newdata, aes(y = mu, x = sigma), color = 'black') +
geom_point(data = newdata, aes(y = bestmu, x = bestsigma), color = 'red') +
transition_reveal(nIter) +
labs(title = "Iteration: {2^frame_along}")
ga <- animate(g, nframes = 12, fps = 1)
#ga
anim_save('SANNAnim1.gif', animation = ga, path = '../resources')
Allowing the simulated annealing to iterate 2500 times (with updating \(k\)):
optim.SANN(params.init = c(20, 12), fn = function(t) LL.gaus(t, x = dat$y), iter.max = 2500, jump = 0.5, k = 0.1)
$iter
[1] 2500
$final
[1] 27.76141 14.24703
$LL
[1] 40.76336
$last
[1] 27.91323 13.76679
$LL
[1] 40.76336
This is again very similar to the empirical estimates.
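As an aside, base R’s optim() also offers a simulated annealing method (method = "SANN"); a minimal sketch using the same negative log-likelihood (starting values are an assumption):
## for SANN, maxit sets the total number of function evaluations -
## there is no convergence-based stopping rule
optim(par = c(20, 12), fn = LL.gaus, x = dat$y, method = "SANN",
      control = list(maxit = 2500))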
9.4 Fitting GLM’s
\[ \begin{align} y_i \sim{}& N(\mu_i, \sigma^2)\\ \mu_i =& \beta_0 + \beta_1 x_i\\ \end{align} \]
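The output below is consistent with fitting this model via glmmTMB (the fitting chunk itself is concealed; the object name dat.glm is an assumption taken from the model comparison table further down):
dat.glm <- glmmTMB(y ~ x, data = dat, family = gaussian())   # Gaussian GLM, identity link
dat.glm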
Formula: y ~ x
Data: dat
AIC BIC logLik df.resid
68.61231 69.52006 -31.30615 7
Number of obs: 10
Dispersion estimate for gaussian family (sigma^2): 30.7
Fixed Effects:
Conditional model:
(Intercept) x
2.651 4.575
And to explore the diagnostics
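The diagnostics themselves come from concealed code; a sketch of the sort of calls involved (assuming the dat.glm object from the sketch above; performance::check_model() is used explicitly for the Poisson model later, and DHARMa’s simulated residuals are an alternative):
dat.glm |> performance::check_model()               # panel of diagnostic plots
dat.glm |> DHARMa::simulateResiduals(plot = TRUE)   # simulated (quantile) residuals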
Conclusions:
- these diagnostics are broadly acceptable given the small size of the dataset
Conclusions
- no obvious issues with these diagnostics
Family: gaussian ( identity )
Formula: y ~ x
Data: dat
AIC BIC logLik deviance df.resid
68.6 69.5 -31.3 62.6 7
Dispersion estimate for gaussian family (sigma^2): 30.7
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.6507 3.7833 0.701 0.484
x 4.5746 0.6097 7.503 6.26e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
2.5 % 97.5 % Estimate
(Intercept) -4.764542 10.065873 2.650666
x 3.379537 5.769675 4.574606
# R2 for Linear Regression
R2: 0.849
adj. R2: 0.806
Interpretation:
- when \(x=0\), \(y\) is expected to be 2.651 (the y-intercept).
- for every one unit change in \(x\), the expected value of \(y\) increases by 4.575 (the slope)
- the slope could be as low as 3.38 or as high as 5.77 (95% confidence interval of the slope)
- as the above interval does not include the value of 0, the linear relationship (slope) can be considered significant
- we would reject the null hypothesis of no relationship (p-value for slope is less than 0.05)
- 80.604% of the variance in \(y\) is explained by its linear relationship with \(x\) (R squared value)
\[ \begin{align} y_i \sim{}& N(\mu_i, \sigma^2)\\ \mu_i =& \beta_0 + \beta_1 (x_i - \bar{x})\\ \end{align} \]
Formula: y ~ scale(x, scale = FALSE)
Data: dat
AIC BIC logLik df.resid
68.61231 69.52006 -31.30615 7
Number of obs: 10
Dispersion estimate for gaussian family (sigma^2): 30.7
Fixed Effects:
Conditional model:
(Intercept) scale(x, scale = FALSE)
27.811 4.575
And to explore the diagnostics
Conclusions:
- these diagnostics are broadly acceptable given the small size of the dataset
Conclusions
- no obvious issues with these diagnostics
Family: gaussian ( identity )
Formula: y ~ scale(x, scale = FALSE)
Data: dat
AIC BIC logLik deviance df.resid
68.6 69.5 -31.3 62.6 7
Dispersion estimate for gaussian family (sigma^2): 30.7
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 27.8110 1.7513 15.880 < 2e-16 ***
scale(x, scale = FALSE) 4.5746 0.6097 7.503 6.26e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
2.5 % 97.5 % Estimate
(Intercept) 24.378410 31.243560 27.810985
scale(x, scale = FALSE) 3.379534 5.769673 4.574603
# R2 for Linear Regression
R2: 0.849
adj. R2: 0.806
Interpretation:
- when \(x=0\) (e.g. the average \(x\) value since centered), \(y\) is expected to be 27.811 (the y-intercept).
- for every one unit change in \(x\), the expected value of \(y\) increases by 4.575 (the slope)
- the slope could be as low as 3.38 or as high as 5.77 (95% confidence interval of the slope)
- as the above interval does not include the value of 0, the linear relationship (slope) can be considered significant
- we would reject the null hypothesis of no relationship (p-value for slope is less than 0.05)
- 80.604% of the variance in \(y\) is explained by its linear relationship with \(x\) (R squared value)
\[ \begin{align} y_i \sim{}& N(\mu_i, \sigma^2)\\ \mu_i =& \beta_0 + \beta_1 (x_i - \bar{x})/\sigma_{x}\\ \end{align} \]
Formula: y ~ scale(x)
Data: dat
AIC BIC logLik df.resid
68.61231 69.52006 -31.30615 7
Number of obs: 10
Dispersion estimate for gaussian family (sigma^2): 30.7
Fixed Effects:
Conditional model:
(Intercept) scale(x)
27.81 13.85
And to explore the diagnostics
Conclusions:
- these diagnostics are broadly acceptable given the small size of the dataset
Conclusions
- no obvious issues with these diagnostics
Family: gaussian ( identity )
Formula: y ~ scale(x)
Data: dat
AIC BIC logLik deviance df.resid
68.6 69.5 -31.3 62.6 7
Dispersion estimate for gaussian family (sigma^2): 30.7
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 27.811 1.751 15.880 < 2e-16 ***
scale(x) 13.850 1.846 7.503 6.26e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
2.5 % 97.5 % Estimate
(Intercept) 24.37842 31.24357 27.81100
scale(x) 10.23206 17.46856 13.85031
# R2 for Linear Regression
R2: 0.849
adj. R2: 0.806
Interpretation:
- when \(x=0\) (e.g. the average \(x\) value since centered), \(y\) is expected to be 27.811 (the y-intercept).
- for every one unit change in \(x\) (which represents a span of approximately 34% of the range of \(x\) since standardised), the expected value of \(y\) increases by 13.85 (the slope)
- the slope could be as low as 10.232 or as high as 17.469 (95% confidence interval of the slope)
- as the above interval does not include the value of 0, the linear relationship (slope) can be considered significant
- we would reject the null hypothesis of no relationship (p-value for slope is less than 0.05)
- 80.604% of the variance in \(y\) is explained by its linear relationship with \(x\) (R squared value)
# Comparison of Model Performance Indices
Name | Model | R2 | R2 (adj.) | RMSE | Sigma | AIC weights | AICc weights | BIC weights | Performance-Score
----------------------------------------------------------------------------------------------------------------------
dat.glm | glmmTMB | 0.849 | 0.806 | 5.538 | 5.538 | 0.333 | 0.333 | 0.333 | 100.00%
dat1c.glm | glmmTMB | 0.849 | 0.806 | 5.538 | 5.538 | 0.333 | 0.333 | 0.333 | 85.72%
dat1b.glm | glmmTMB | 0.849 | 0.806 | 5.538 | 5.538 | 0.333 | 0.333 | 0.333 | 0.00%
Given that exploratory data analysis suggested that the assumptions were not likely to be met for a simple Gaussian model, and that this was subsequently confirmed when we explored the diagnostics resulting from the OLS model of these data, we will instead entertain a Poisson regression model. The model will be of the form:
\[ \begin{align} y_i \sim{}& Pois(\lambda_i)\\ log(\lambda_i) =& \beta_0 + \beta_1 x_i\\ \end{align} \]
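The output below is consistent with fitting this model via glmmTMB (the chunk itself is concealed; the object name dat3.glm matches the diagnostics code shown further down):
dat3.glm <- glmmTMB(y ~ x, data = dat3, family = poisson(link = "log"))   # Poisson GLM, log link
dat3.glm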
Formula: y ~ x
Data: dat3
AIC BIC logLik df.resid
50.35456 50.95973 -23.17728 8
Number of obs: 10
Fixed Effects:
Conditional model:
(Intercept) x
0.05136 0.34263
And to explore the diagnostics
## glmmTMB not fully supported yet, residual_type="simulated" would be better - if supported
dat3.glm |> performance::check_model()
NULL
Conclusions:
- these diagnostics are broadly acceptable given the small size of the dataset
Conclusions
- no obvious issues with these diagnostics
Family: poisson ( log )
Formula: y ~ x
Data: dat3
AIC BIC logLik deviance df.resid
50.4 51.0 -23.2 46.4 8
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.05136 0.35408 0.145 0.885
x 0.34263 0.04319 7.932 2.15e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
2.5 % 97.5 % Estimate
(Intercept) -0.6426377 0.7453489 0.0513556
x 0.2579690 0.4272885 0.3426288
# R2 for Generalized Linear Regression
Nagelkerke's R2: 1.000
Since this model used a log link, in order to return the estimates to the scale of the response, we need to backtransform by exponentiation.
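For example (a sketch, assuming the fitted model object dat3.glm as in the diagnostics code above), the estimates and confidence intervals can be back-transformed in one step:
exp(confint(dat3.glm))   # estimates and 95% intervals back-transformed to the response scale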
Importantly, if using confidence intervals for the purpose of inference testing:
- if using estimates on the link (log) scale, intervals that do not include 0 are “significant”
- if using estimates on the response scale, intervals that do not include 1 are “significant”
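As a minimal sketch (assuming the fitted object is dat3.glm, as above), the backtransformation simply exponentiates the fixed effects and their confidence intervals:
exp(fixef(dat3.glm)$cond) ## coefficients backtransformed to the response (count) scale
exp(confint(dat3.glm))    ## 95% confidence intervals (and estimates) on the response scale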
Interpretation:
- when \(x=0\), \(y\) is expected to be exp(0.051) = 1.053 (the y-intercept).
- for every one unit change in \(x\), the expected value of \(y\) increases by exp(0.343) = 1.409 (the slope)
- this is equivalent to a 40.865% increase in the response per one unit change in \(x\)
- the slope (on the link scale) could be as low as 0.258 or as high as 0.427 (95% confidence interval of the slope)
- as the above interval does not include the value of 0, the linear relationship (slope) can be considered significant
- the slope (on the response scale) could be as low as 1.294 or as high as 1.533 (95% confidence interval of the slope)
- as the above interval does not include the value of 1, the linear relationship (slope) can be considered significant
- we would reject the null hypothesis of no relationship (p-value for slope is less than 0.05)
- 99.982% of the variance in \(y\) is explained by its linear relationship with \(x\) (R squared value)
Given that exploratory data analysis suggested that the assumptions were not likely to be met for a simple gaussian model and that this was subsequently confirmed when we explored the diagnostics resulting from the OLS model of these data, we will instead entertain a poisson regression model. The model will be of the form:
\[ \begin{align} y_i \sim{}& Pois(\lambda_i)\\ log(\lambda_i) =& \beta_0 + \beta_1 (x_i - \bar{x})\\ \end{align} \]
dat3b.glm <- glmmTMB(y ~ scale(x, scale = FALSE), data = dat3, family = poisson(link = "log"))
dat3b.glm
Formula: y ~ scale(x, scale = FALSE)
Data: dat3
AIC BIC logLik df.resid
50.35456 50.95973 -23.17728 8
Number of obs: 10
Fixed Effects:
Conditional model:
(Intercept) scale(x, scale = FALSE)
1.9358 0.3426
And to explore the diagnostics
Conclusions:
- these diagnostics are broadly acceptable given the small nature of the data
Conclusions
- no obvious issues with these diagnostics
Family: poisson ( log )
Formula: y ~ scale(x, scale = FALSE)
Data: dat3
AIC BIC logLik deviance df.resid
50.4 51.0 -23.2 46.4 8
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.9358 0.1411 13.720 < 2e-16 ***
scale(x, scale = FALSE) 0.3426 0.0432 7.932 2.15e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
2.5 % 97.5 % Estimate
(Intercept) 1.6592778 2.2123498 1.9358138
scale(x, scale = FALSE) 0.2579681 0.4272894 0.3426288
# R2 for Generalized Linear Regression
Nagelkerke's R2: 1.000
Since this model used a log link, in order to return the estimates to the scale of the response, we need to backtransform by exponentiation.
Importantly, if using confidence intervals for the purpose of inference testing:
- if using estimates on the link (log) scale, intervals that do not include 0 are “significant”
- if using estimates on the response scale, intervals that do not include 1 are “significant”
Interpretation:
- when \(x=0\) (e.g. the average \(x\) value since centered), \(y\) is expected to be exp(1.936) = 6.93 (the y-intercept).
- for every one unit change in \(x\), the expected value of \(y\) increases by exp(0.343) = 1.409 (the slope)
- this is equivalent to a 40.865% increase in the response per one unit change in \(x\)
- the slope (on the link scale) could be as low as 0.258 or as high as 0.427 (95% confidence interval of the slope)
- as the above interval does not include the value of 0, the linear relationship (slope) can be considered significant
- the slope (on the response scale) could be as low as 1.294 or as high as 1.533 (95% confidence interval of the slope)
- as the above interval does not include the value of 1, the linear relationship (slope) can be considered significant
- we would reject the null hypothesis of no relationship (p-value for slope is less than 0.05)
- 99.982% of the variance in \(y\) is explained by its linear relationship with \(x\) (R squared value)
Given that exploratory data analysis suggested that the assumptions were not likely to be met for a simple gaussian model and that this was subsequently confirmed when we explored the diagnostics resulting from the OLS model of these data, we will instead entertain a poisson regression model. The model will be of the form:
\[ \begin{align} y_i \sim{}& Pois(\lambda_i)\\ log(\lambda_i) =& \beta_0 + \beta_1 (x_i - \bar{x})/\sigma_{x}\\ \end{align} \]
Formula: y ~ scale(x)
Data: dat3
AIC BIC logLik df.resid
50.35456 50.95973 -23.17728 8
Number of obs: 10
Fixed Effects:
Conditional model:
(Intercept) scale(x)
1.936 1.037
And to explore the diagnostics
Conclusions:
- these diagnostics are broadly acceptable given the small nature of the data
Conclusions
- no obvious issues with these diagnostics
Family: poisson ( log )
Formula: y ~ scale(x)
Data: dat3
AIC BIC logLik deviance df.resid
50.4 51.0 -23.2 46.4 8
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.9358 0.1411 13.720 < 2e-16 ***
scale(x) 1.0374 0.1308 7.932 2.15e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Conditional model:
(Intercept) scale(x)
1.936 1.037
# R2 for Generalized Linear Regression
Nagelkerke's R2: 1.000
Since this model used a log link, in order to return the estimates to the scale of the response, we need to backtransform by exponentiation.
Importantly, if using confidence intervals for the purpose of inference testing:
- if using estimates on the link (log) scale, intervals that do not include 0 are “significant”
- if using estimates on the response scale, intervals that do not include 1 are “significant”
Interpretation:
- when \(x=0\) (e.g. the average \(x\) value since centered), \(y\) is expected to be exp(1.936) = 6.93 (the y-intercept).
- for every one unit change in \(x\) (which represents a span of approximately 34% of the range of \(x\) since standardised), the expected value of \(y\) increases by exp(1.037) = 2.822 (the slope)
- this is equivalent to a 182.176% increase in the response per one unit change in (standardised) \(x\)
- the slope (on the link scale) could be as low as 0.781 or as high as 1.294 (95% confidence interval of the slope)
- as the above interval does not include the value of 0, the linear relationship (slope) can be considered significant
- the slope (on the response scale) could be as low as 2.184 or as high as 3.646 (95% confidence interval of the slope)
- as the above interval does not include the value of 1, the linear relationship (slope) can be considered significant
- we would reject the null hypothesis of no relationship (p-value for slope is less than 0.05)
- 99.982% of the variance in \(y\) is explained by its linear relationship with \(x\) (R squared value)
Given that exploratory data analysis suggested that the assumptions were not likely to be met for a simple gaussian model and that this was subsequently confirmed when we explored the diagnostics resulting from the OLS model of these data, we will instead entertain a Negative Binomial regression model. The model will be of the form:
\[ \begin{align} y_i \sim{}& NB(\lambda_i, \phi)\\ log(\lambda_i) =& \beta_0 + \beta_1 x_i\\ \end{align} \]
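The fitting call is not shown; a sketch consistent with the nbinom2 output below (the object name dat4.glm is an assumption for illustration) would be:
dat4.glm <- glmmTMB(y ~ x, data = dat4, family = nbinom2(link = "log"))
dat4.glm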
Warning in finalizeTMB(TMBStruc, obj, fit, h, data.tmb.old): Model convergence
problem; non-positive-definite Hessian matrix. See vignette('troubleshooting')
Warning in finalizeTMB(TMBStruc, obj, fit, h, data.tmb.old): Model convergence
problem; false convergence (8). See vignette('troubleshooting'),
help('diagnose')
Formula: y ~ x
Data: dat4
AIC BIC logLik df.resid
NA NA NA 7
Number of obs: 10
Dispersion parameter for nbinom2 family (): 5.36e+07
Fixed Effects:
Conditional model:
(Intercept) x
0.3913 0.2792
And to explore the diagnostics
Conclusions:
- these diagnostics are broadly acceptable given the small nature of the data
DHARMa nonparametric dispersion test via sd of residuals fitted vs.
simulated
data: simulationOutput
dispersion = 0.56515, p-value = 0.528
alternative hypothesis: two.sided
Conclusions
- no obvious issues with these diagnostics
Family: nbinom2 ( log )
Formula: y ~ x
Data: dat4
AIC BIC logLik deviance df.resid
NA NA NA NA 7
Dispersion parameter for nbinom2 family (): 5.36e+07
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.3913 0.3415 1.146 0.252
x 0.2792 0.0431 6.479 9.24e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
2.5 % 97.5 % Estimate
(Intercept) -0.2779453 1.060588 0.3913216
x 0.1947710 0.363723 0.2792470
Since this model used a log link, in order to return the estimates to the scale of the response, we need to backtransform by exponentiation.
Importantly, if using confidence intervals for the purpose of inference testing:
- if using estimates on the link (log) scale, intervals that do not include 0 are “significant”
- if using estimates on the response scale, intervals that do not include 1 are “significant”
Interpretation:
- when \(x=0\), \(y\) is expected to be exp(0.391) = 1.479 (the y-intercept).
- for every one unit change in \(x\), the expected value of \(y\) increases by exp(0.279) = 1.322 (the slope)
- this is equivalent to a 32.213% increase in the response per one unit change in \(x\)
- the slope (on the link scale) could be as low as 0.195 or as high as 0.364 (95% confidence interval of the slope)
- as the above interval does not include the value of 0, the linear relationship (slope) can be considered significant
- the slope (on the response scale) could be as low as 1.215 or as high as 1.439 (95% confidence interval of the slope)
- as the above interval does not include the value of 1, the linear relationship (slope) can be considered significant
- we would reject the null hypothesis of no relationship (p-value for slope is less than 0.05)
Given that exploratory data analysis suggested that the assumptions were not likely to be met for a simple gaussian model and that this was subsequently confirmed when we explored the diagnostics resulting from the OLS model of these data, we will instead entertain a Negative Binomial regression model. The model will be of the form:
\[ \begin{align} y_i \sim{}& NB(\lambda_i, \phi)\\ log(\lambda_i) =& \beta_0 + \beta_1 (x_i - \bar{x})\\ \end{align} \]
Warning in finalizeTMB(TMBStruc, obj, fit, h, data.tmb.old): Model convergence
problem; false convergence (8). See vignette('troubleshooting'),
help('diagnose')
Formula: y ~ scale(x, scale = FALSE)
Data: dat4
AIC BIC logLik df.resid
51.39412 52.30187 -22.69706 7
Number of obs: 10
Dispersion parameter for nbinom2 family (): 2.01e+08
Fixed Effects:
Conditional model:
(Intercept) scale(x, scale = FALSE)
1.9272 0.2792
And to explore the diagnostics
Conclusions:
- these diagnostics are broadly acceptable given the small nature of the data
Conclusions
- no obvious issues with these diagnostics
Family: nbinom2 ( log )
Formula: y ~ scale(x, scale = FALSE)
Data: dat4
AIC BIC logLik deviance df.resid
51.4 52.3 -22.7 45.4 7
Dispersion parameter for nbinom2 family (): 2.01e+08
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.9272 0.1362 14.151 < 2e-16 ***
scale(x, scale = FALSE) 0.2792 0.0431 6.479 9.24e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
2.5 % 97.5 % Estimate
(Intercept) 1.6602543 2.1941059 1.927180
scale(x, scale = FALSE) 0.1947701 0.3637238 0.279247
Since this model used a log link, in order to return the estimates to the scale of the response, we need to backtransform by exponentiation.
Importantly, if using confidence intervals for the purpose of inference testing:
- if using estimates on the link (log) scale, intervals that do not include 0 are “significant”
- if using estimates on the response scale, intervals that do not include 1 are “significant”
Interpretation:
- when \(x=0\) (e.g. the average \(x\) value since centered), \(y\) is expected to be exp(1.927) = 6.87 (the y-intercept).
- for every one unit change in \(x\), the expected value of \(y\) increases by exp(0.279) = 1.322 (the slope)
- this is equivalent to a 32.213% increase in the response per one unit change in \(x\)
- the slope (on the link scale) could be as low as 0.195 or as high as 0.364 (95% confidence interval of the slope)
- as the above interval does not include the value of 0, the linear relationship (slope) can be considered significant
- the slope (on the response scale) could be as low as 1.215 or as high as 1.439 (95% confidence interval of the slope)
- as the above interval does not include the value of 1, the linear relationship (slope) can be considered significant
- we would reject the null hypothesis of no relationship (p-value for slope is less than 0.05)
Given that exploratory data analysis suggested that the assumptions were not likely to be met for a simple gaussian model and that this was subsequently confirmed when we explored the diagnostics resulting from the OLS model of these data, we will instead entertain a Negative Binomial regression model. The model will be of the form:
\[ \begin{align} y_i \sim{}& NB(\lambda_i, \phi)\\ log(\lambda_i) =& \beta_0 + \beta_1 (x_i - \bar{x})/\sigma_{x}\\ \end{align} \]
Warning in finalizeTMB(TMBStruc, obj, fit, h, data.tmb.old): Model convergence
problem; non-positive-definite Hessian matrix. See vignette('troubleshooting')
Warning in finalizeTMB(TMBStruc, obj, fit, h, data.tmb.old): Model convergence
problem; false convergence (8). See vignette('troubleshooting'),
help('diagnose')
Formula: y ~ scale(x)
Data: dat4
AIC BIC logLik df.resid
NA NA NA 7
Number of obs: 10
Dispersion parameter for nbinom2 family (): 2.21e+08
Fixed Effects:
Conditional model:
(Intercept) scale(x)
1.9272 0.8455
And to explore the diagnostics
Conclusions:
- these diagnostics are broadly acceptable given the small nature of the data
Conclusions
- no obvious issues with these diagnostics
Family: nbinom2 ( log )
Formula: y ~ scale(x)
Data: dat4
AIC BIC logLik deviance df.resid
NA NA NA NA 7
Dispersion parameter for nbinom2 family (): 2.21e+08
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.9272 0.1362 14.151 < 2e-16 ***
scale(x) 0.8455 0.1305 6.479 9.25e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
2.5 % 97.5 % Estimate
(Intercept) 1.6602472 2.194100 1.9271735
scale(x) 0.5896881 1.101223 0.8454555
Since this model used a log link, in order to return the estimates to the scale of the response, we need to backtransform by exponentiation.
Importantly, if using confidence intervals for the purpose of inference testing:
- if using estimates on the link (log) scale, intervals that do not include 0 are “significant”
- if using estimates on the response scale, intervals that do not include 1 are “significant”
Interpretation:
- when \(x=0\) (e.g. the average \(x\) value since centered), \(y\) is expected to be exp(1.927) = 6.87 (the y-intercept).
- for every one unit change in \(x\) (which represents a span of approximately 34% of the range of \(x\) since standardised), the expected value of \(y\) increases by exp(0.845) = 2.329 (the slope)
- this is equivalent to a 132.904% increase in the response per one unit change in (standardised) \(x\)
- the slope (on the link scale) could be as low as 0.59 or as high as 1.101 (95% confidence interval of the slope)
- as the above interval does not include the value of 0, the linear relationship (slope) can be considered significant
- the slope (on the response scale) could be as low as 1.803 or as high as 3.008 (95% confidence interval of the slope)
- as the above interval does not include the value of 1, the linear relationship (slope) can be considered significant
- we would reject the null hypothesis of no relationship (p-value for slope is less than 0.05)
Given that exploratory data analysis suggested that the assumptions were not likely to be met for a simple gaussian model and that this was subsequently confirmed when we explored the diagnostics resulting from the OLS model of these data, we will instead entertain a Bernoulli (Binomial with number of trials = 1) regression model. The model will be of the form:
\[ \begin{align} y_i \sim{}& Bin(\pi_i, 1)\\ log(\frac{\pi_i}{1-\pi_i}) =& \beta_0 + \beta_1 x_i\\ \end{align} \]
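The fitting call is not shown; a sketch consistent with the output below (the object name dat5.glm is assumed, following the naming used for the centred fit later) would be:
dat5.glm <- glmmTMB(y ~ x, data = dat5, family = binomial(link = "logit"))
dat5.glm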
Formula: y ~ x
Data: dat5
AIC BIC logLik df.resid
9.01379 9.61896 -2.50690 8
Number of obs: 10
Fixed Effects:
Conditional model:
(Intercept) x
-5.825 1.295
And to explore the diagnostics
NULL
Conclusions:
- these diagnostics are broadly acceptable given the small nature of the data
DHARMa nonparametric dispersion test via sd of residuals fitted vs.
simulated
data: simulationOutput
dispersion = 1.3257, p-value = 0.76
alternative hypothesis: two.sided
Conclusions
- no obvious issues with these diagnostics
Family: binomial ( logit )
Formula: y ~ x
Data: dat5
AIC BIC logLik deviance df.resid
9.0 9.6 -2.5 5.0 8
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.8246 3.9861 -1.461 0.144
x 1.2954 0.8451 1.533 0.125
2.5 % 97.5 % Estimate
(Intercept) -13.6371807 1.987979 -5.824601
x -0.3609074 2.951782 1.295437
# R2 for Logistic Regression
Tjur's R2: 0.653
Since this model used a logit link (\(log(\frac{\pi}{1-\pi})\)), in order to return the estimates to the scale of the response, we can either:
- for the y-intercept:
- backtransform by exponentiation to odds ratio (\(\frac{\pi}{1-\pi}\)).
- backtransform to a probability (\(\pi\)) by applying the inverse logit (e.g. plogis()).
- for the slopes:
- backtransform by exponentiation to odds ratio.
Importantly, if using confidence intervals for the purpose of inference testing:
- if using estimates on the link (logit) scale, intervals that do not include 0 are “significant”
- if using estimates on the response scale, intervals that do not include 1 are “significant”
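A minimal sketch of these backtransformations (assuming the fitted object is named dat5.glm, as above):
cc <- fixef(dat5.glm)$cond ## coefficients on the logit (link) scale
exp(cc[1])                 ## intercept backtransformed to the odds of y = 1 vs y = 0 at x = 0
plogis(cc[1])              ## intercept backtransformed to the probability that y = 1 at x = 0
exp(cc[2])                 ## slope backtransformed to an odds ratio (multiplicative change in odds per unit x)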
Interpretation:
- when \(x=0\):
- the odds of \(y\) being 1 vs 0 is expected to be exp(-5.825) = 0.003 (the y-intercept).
- the probability of \(y\) being 1 is expected to be plogis(-5.825) = 0.003 (the y-intercept).
- for every one unit change in \(x\), the odds of \(y\) increases by exp(1.295) = 3.653 (the slope)
- this is equivalent to a 265.259% increase in the odds of the response being 1 per one unit change in \(x\)
- the slope (on the link scale) could be as low as -0.361 or as high as 2.952 (95% confidence interval of the slope)
- as the above interval includes the value of 0, the linear relationship (slope) cannot be considered significant
- the slope (on the response scale) could be as low as 0.697 or as high as 19.14 (95% confidence interval of the slope)
- as the above interval includes the value of 1, the linear relationship (slope) cannot be considered significant
- we would fail to reject the null hypothesis of no relationship (p-value for slope is greater than 0.05)
- 65.333% of the variance in \(y\) is explained by its linear relationship with \(x\) (R squared value)
Given that exploratory data analysis suggested that the assumptions were not likely to be met for a simple gaussian model and that this was subsequently confirmed when we explored the diagnostics resulting from the OLS model of these data, we will instead entertain a Bernoulli (Binomial with number of trials = 1) regression model. The model will be of the form:
\[ \begin{align} y_i \sim{}& Bin(\pi_i, 1)\\ log(\frac{\pi_i}{1-\pi_i}) =& \beta_0 + \beta_1 (x_i - \bar{x})\\ \end{align} \]
dat5b.glm <- glmmTMB(y ~ scale(x, scale = FALSE), data = dat5, family = binomial(link = "logit"))
dat5b.glm
Formula: y ~ scale(x, scale = FALSE)
Data: dat5
AIC BIC logLik df.resid
9.01379 9.61896 -2.50690 8
Number of obs: 10
Fixed Effects:
Conditional model:
(Intercept) scale(x, scale = FALSE)
1.300 1.295
And to explore the diagnostics
NULL
Conclusions:
- these diagnostics are broadly acceptable given the small nature of the data
Conclusions
- no obvious issues with these diagnostics
Family: binomial ( logit )
Formula: y ~ scale(x, scale = FALSE)
Data: dat5
AIC BIC logLik deviance df.resid
9.0 9.6 -2.5 5.0 8
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.3003 1.4106 0.922 0.357
scale(x, scale = FALSE) 1.2954 0.8451 1.533 0.125
2.5 % 97.5 % Estimate
(Intercept) -1.4643927 4.064993 1.300300
scale(x, scale = FALSE) -0.3609129 2.951788 1.295437
# R2 for Logistic Regression
Tjur's R2: 0.653
Since this model used a logit link (\(log(\frac{\pi}{1-\pi})\)), in order to return the estimates to the scale of the response, we can either:
- for the y-intercept:
- backtransform by exponentiation to odds ratio (\(\frac{\pi}{1-\pi}\)).
- backtransform to a probability (\(\pi\)) by applying the inverse logit (e.g. plogis()).
- for the slopes:
- backtransform by exponentiation to odds ratio.
Importantly, if using confidence intervals for the purpose of inference testing:
- if using estimates on the link (logit) scale, intervals that do not include 0 are “significant”
- if using estimates on the response scale, intervals that do not include 1 are “significant”
Interpretation:
- when \(x=0\) (e.g. the average \(x\) value since centered):
- the odds of \(y\) being 1 vs 0 is expected to be exp(1.3) = 3.67 (the y-intercept).
- the probability of \(y\) being 1 is expected to be plogis(1.3) = 0.786 (the y-intercept).
- for every one unit change in \(x\), the odds of \(y\) increases by exp(1.295) = 3.653 (the slope)
- this is equivalent to a 265.259% increase in the odds of the response being 1 per one unit change in \(x\)
- the slope (on the link scale) could be as low as -0.361 or as high as 2.952 (95% confidence interval of the slope)
- as the above interval includes the value of 0, the linear relationship (slope) cannot be considered significant
- the slope (on the response scale) could be as low as 0.697 or as high as 19.14 (95% confidence interval of the slope)
- as the above interval includes the value of 1, the linear relationship (slope) cannot be considered significant
- we would fail to reject the null hypothesis of no relationship (p-value for slope is greater than 0.05)
- 65.333% of the variance in \(y\) is explained by its linear relationship with \(x\) (R squared value)
Given that exploratory data analysis suggested that the assumptions were not likely to be met for a simple gaussian model and that this was subsequently confirmed when we explored the diagnostics resulting from the OLS model of these data, we will instead entertain a Bernoulli (Binomial with number of trials = 1) regression model. The model will be of the form:
\[ \begin{align} y_i \sim{}& Bin(\pi_i, 1)\\ log(\frac{\pi_i}{1-\pi_i}) =& \beta_0 + \beta_1 (x_i - \bar{x})/\sigma_{x}\\ \end{align} \]
Formula: y ~ scale(x)
Data: dat5
AIC BIC logLik df.resid
9.01379 9.61896 -2.50690 8
Number of obs: 10
Fixed Effects:
Conditional model:
(Intercept) scale(x)
1.300 3.922
And to explore the diagnostics
NULL
Conclusions:
- these diagnostics are broadly acceptable given the small nature of the data
Conclusions
- no obvious issues with these diagnostics
Family: binomial ( logit )
Formula: y ~ scale(x)
Data: dat5
AIC BIC logLik deviance df.resid
9.0 9.6 -2.5 5.0 8
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.300 1.411 0.922 0.357
scale(x) 3.922 2.559 1.533 0.125
2.5 % 97.5 % Estimate
(Intercept) -1.464392 4.064999 1.300304
scale(x) -1.092721 8.936984 3.922131
# R2 for Logistic Regression
Tjur's R2: 0.653
Since this model used a logit link (\(log(\frac{\pi}{1-\pi})\)), in order to return the estimates to the scale of the response, we can either:
- for the y-intercept:
- backtransform by exponentiation to odds ratio (\(\frac{\pi}{1-\pi}\)).
- backtransform to a probability (\(\pi\)) by applying the inverse logit (e.g. plogis()).
- for the slopes:
- backtransform by exponentiation to odds ratio.
Importantly, if using confidence intervals for the purpose of inference testing:
- if using estimates on the link (logit) scale, intervals that do not include 0 are “significant”
- if using estimates on the response scale, intervals that do not include 1 are “significant”
Interpretation:
- when \(x=0\) (e.g. the average \(x\) value since centered):
- the odds of \(y\) being 1 vs 0 is expected to be exp(1.3) = 3.67 (the y-intercept).
- the probability of \(y\) being 1 is expected to be plogis(1.3) = 0.786 (the y-intercept).
- for every one unit change in \(x\) (which represents a span of approximately 34% of the range of \(x\) since standardised), the odds of \(y\) increases by exp(3.922) = 50.508 (the slope)
- this is equivalent to a 4950.799% increase in the odds of the response being 1 per one unit change in (standardised) \(x\)
- the slope (on the link scale) could be as low as -1.093 or as high as 8.937 (95% confidence interval of the slope)
- as the above interval includes the value of 0, the linear relationship (slope) cannot be considered significant
- the slope (on the response scale) could be as low as 0.335 or as high as 7608.213 (95% confidence interval of the slope)
- as the above interval includes the value of 1, the linear relationship (slope) cannot be considered significant
- we would fail to reject the null hypothesis of no relationship (p-value for slope is greater than 0.05)
- 65.333% of the variance in \(y\) is explained by its linear relationship with \(x\) (R squared value)
Given that exploratory data analysis suggested that the assumptions were not likely to be met for a simple gaussian model and that this was subsequently confirmed when we explored the diagnostics resulting from the OLS model of these data, we will instead entertain a Binomial regression model. The model will be of the form:
\[ \begin{align} y_i \sim{}& Bin(\pi_i, n_i)\\ log(\frac{\pi_i}{1-\pi_i}) =& \beta_0 + \beta_1 x_i\\ \end{align} \]
dat6.glm <- glmmTMB(cbind(count, total - count) ~ x, data = dat6, family = binomial(link = "logit"))
dat6.glm
Formula: cbind(count, total - count) ~ x
Data: dat6
AIC BIC logLik df.resid
22.93434 23.53951 -9.46717 8
Number of obs: 10
Fixed Effects:
Conditional model:
(Intercept) x
-3.3295 0.8234
And to explore the diagnostics
NULL
Conclusions:
- these diagnostics are broadly acceptable given the small nature of the data
DHARMa nonparametric dispersion test via sd of residuals fitted vs.
simulated
data: simulationOutput
dispersion = 1.0614, p-value = 0.76
alternative hypothesis: two.sided
Conclusions
- no obvious issues with these diagnostics
Family: binomial ( logit )
Formula: cbind(count, total - count) ~ x
Data: dat6
AIC BIC logLik deviance df.resid
22.9 23.5 -9.5 18.9 8
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.3295 1.0441 -3.189 0.001429 **
x 0.8234 0.2246 3.667 0.000246 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
2.5 % 97.5 % Estimate
(Intercept) -5.3758818 -1.283040 -3.3294608
x 0.3832516 1.263571 0.8234113
NULL
Since this model used a logit link (\(log(\frac{\pi}{1-\pi})\)), in order to return the estimates to the scale of the response, we can either:
- for the y-intercept:
- backtransform by exponentiation to odds ratio (\(\frac{\pi}{1-\pi}\)).
- backtransform to a probability (\(\pi\)) by applying the inverse logit (e.g. plogis()).
- for the slopes:
- backtransform by exponentiation to odds ratio.
Importantly, if using confidence intervals for the purpose of inference testing:
- if using estimates on the link (logit) scale, intervals that do not include 0 are “significant”
- if using estimates on the response scale, intervals that do not include 1 are “significant”
Interpretation:
- when \(x=0\):
- the odds of \(y\) being 1 vs 0 is expected to be exp(-3.329) = 0.036 (the y-intercept).
- the probability of \(y\) being 1 is expected to be plogis(-3.329) = 0.035 (the y-intercept).
- for every one unit change in \(x\), the odds of \(y\) increases by exp(0.823) = 2.278 (the slope)
- this is equivalent to a 127.826% increase in the odds of the response being 1 per one unit change in \(x\)
- the slope (on the link scale) could be as low as 0.383 or as high as 1.264 (95% confidence interval of the slope)
- as the above interval does not include the value of 0, the linear relationship (slope) can be considered significant
- the slope (on the response scale) could be as low as 1.467 or as high as 3.538 (95% confidence interval of the slope)
- as the above interval does not include the value of 1, the linear relationship (slope) can be considered significant
- we would reject the null hypothesis of no relationship (p-value for slope is less than 0.05)
- an R squared value was not reported for this model
Given that exploratory data analysis suggested that the assumptions were not likely to be met for a simple gaussian model and that this was subsequently confirmed when we explored the diagnostics resulting from the OLS model of these data, we will instead entertain a Binomial regression model. The model will be of the form:
\[ \begin{align} y_i \sim{}& Bin(\pi_i, n_i)\\ log(\frac{\pi_i}{1-\pi_i}) =& \beta_0 + \beta_1 (x_i - \bar{x})\\ \end{align} \]
dat6b.glm <- glmmTMB(cbind(count, total - count) ~ scale(x, scale = FALSE), data = dat6, family = binomial(link = "logit"))
dat6b.glm
Formula: cbind(count, total - count) ~ scale(x, scale = FALSE)
Data: dat6
AIC BIC logLik df.resid
22.93434 23.53951 -9.46717 8
Number of obs: 10
Fixed Effects:
Conditional model:
(Intercept) scale(x, scale = FALSE)
1.1993 0.8234
And to explore the diagnostics
NULL
Conclusions:
- these diagnostics are broadly acceptable given the small nature of the data
Conclusions
- no obvious issues with these diagnostics
Family: binomial ( logit )
Formula: cbind(count, total - count) ~ scale(x, scale = FALSE)
Data: dat6
AIC BIC logLik deviance df.resid
22.9 23.5 -9.5 18.9 8
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.1993 0.5016 2.391 0.016801 *
scale(x, scale = FALSE) 0.8234 0.2246 3.667 0.000246 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
2.5 % 97.5 % Estimate
(Intercept) 0.2162207 2.182382 1.1993015
scale(x, scale = FALSE) 0.3832501 1.263573 0.8234114
NULL
Since this model used a logit link (\(log(\frac{\pi}{1-\pi})\)), in order to return the estimates to the scale of the response, we can either:
- for the y-intercept:
- backtransform by exponentiation to odds ratio (\(\frac{\pi}{1-\pi}\)).
- backtransform to a probability (\(\pi\)) by applying the inverse logit (e.g. plogis()).
- for the slopes:
- backtransform by exponentiation to odds ratio.
Importantly, if using confidence intervals for the purpose of inference testing:
- if using estimates on the link (logit) scale, intervals that do not include 0 are “significant”
- if using estimates on the response scale, intervals that do not include 1 are “significant”
Interpretation:
- when \(x=0\) (e.g. the average \(x\) value since centered):
- the odds of \(y\) being 1 vs 0 is expected to be exp(1.199) = 3.318 (the y-intercept).
- the probability of \(y\) being 1 is expected to be plogis(1.199) = 0.768 (the y-intercept).
- for every one unit change in \(x\), the odds of \(y\) increases by exp(0.823) = 2.278 (the slope)
- this is equivalent to a 127.826% increase in the odds of the response being 1 per one unit change in \(x\)
- the slope (on the link scale) could be as low as 0.383 or as high as 1.264 (95% confidence interval of the slope)
- as the above interval does not include the value of 0, the linear relationship (slope) can be considered significant
- the slope (on the response scale) could be as low as 1.467 or as high as 3.538 (95% confidence interval of the slope)
- as the above interval does not include the value of 1, the linear relationship (slope) can be considered significant
- we would reject the null hypothesis of no relationship (p-value for slope is less than 0.05)
- an R squared value was not reported for this model
Given that exploratory data analysis suggested that the assumptions were not likely to be met for a simple gaussian model and that this was subsequently confirmed when we explored the diagnostics resulting from the OLS model of these data, we will instead entertain a Binomial regression model. The model will be of the form:
\[ \begin{align} y_i \sim{}& Bin(\pi_i, n_i)\\ log(\frac{\pi_i}{1-\pi_i}) =& \beta_0 + \beta_1 (x_i - \bar{x})/\sigma_{x}\\ \end{align} \]
dat6c.glm <- glmmTMB(cbind(count, total - count) ~ scale(x), data = dat6, family = binomial(link = "logit"))
dat6c.glm
Formula: cbind(count, total - count) ~ scale(x)
Data: dat6
AIC BIC logLik df.resid
22.93434 23.53951 -9.46717 8
Number of obs: 10
Fixed Effects:
Conditional model:
(Intercept) scale(x)
1.199 2.493
And to explore the diagnostics
NULL
Conclusions:
- these diagnostics are broadly acceptable given the small nature of the data
Conclusions
- no obvious issues with these diagnostics
Family: binomial ( logit )
Formula: cbind(count, total - count) ~ scale(x)
Data: dat6
AIC BIC logLik deviance df.resid
22.9 23.5 -9.5 18.9 8
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.1993 0.5016 2.391 0.016801 *
scale(x) 2.4930 0.6799 3.667 0.000246 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
2.5 % 97.5 % Estimate
(Intercept) 0.2162206 2.182382 1.199301
scale(x) 1.1603468 3.825657 2.493002
NULL
Since this model used a logit link (\(log(\frac{\pi}{1-\pi})\)), in order to return the estimates to the scale of the response, we can either:
- for the y-intercept:
- backtransform by exponentiation to odds ratio (\(\frac{\pi}{1-\pi}\)).
- backtransform to a probability (\(\pi\)) by applying the inverse logit (e.g. plogis()).
- for the slopes:
- backtransform by exponentiation to odds ratio.
Importantly, if using confidence intervals for the purpose of inference testing:
- if using estimates on the link (logit) scale, intervals that do not include 0 are “significant”
- if using estimates on the response scale, intervals that do not include 1 are “significant”
Interpretation:
- when \(x=0\) (e.g. the average \(x\) value since centered):
- the odds of \(y\) being 1 vs 0 is expected to be exp(1.199) = 3.318 (the y-intercept).
- the probability of \(y\) being 1 is expected to be plogis(1.199) = 0.768 (the y-intercept).
- for every one unit change in \(x\) (which represents a span of approximately 34% of the range of \(x\) since standardised), the odds of \(y\) increases by exp(2.493) = 12.098 (the slope)
- this is equivalent to a 1109.753% increase in the odds of the response being 1 per one unit change in (standardised) \(x\)
- the slope (on the link scale) could be as low as 1.16 or as high as 3.826 (95% confidence interval of the slope)
- as the above interval does not include the value of 0, the linear relationship (slope) can be considered significant
- the slope (on the response scale) could be as low as 3.191 or as high as 45.863 (95% confidence interval of the slope)
- as the above interval does not include the value of 1, the linear relationship (slope) can be considered significant
- we would reject the null hypothesis of no relationship (p-value for slope is less than 0.05)
- an R squared value was not reported for this model
10 Iterative re-weighted least squares
Recall from the previous section that the simple linear model can be written as:
\[ \begin{align} y_i =& \beta_0 + \beta_1 x_i + \varepsilon_i\\ Y =& \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \end{align} \]
This could be re-written as:
\[ \begin{align} y_i \sim{}& N(\mu_i, \sigma^2I)\\ \mu_i =& \beta_0 + \beta_1 x_i\\ \end{align} \]
indicating that the response (\(y\)) follows a Gaussian (normal) distribution whose mean is determined by the linear relationship with \(x\), and that the variance (\(\sigma^2\)) is constant - that is, all observations are independently and identically distributed (iid): each is drawn independently from a distribution with the same variance.
The \(I\) signifies an identity matrix (a square matrix whose diagonals are all one).
More generally, the above can be written in vector/matrix form as:
\[ \begin{align} Y \sim{}& N(\boldsymbol{\mu}, \sigma^2I)\\ \boldsymbol{\mu} =& \boldsymbol{X}\boldsymbol{\beta}\\ \end{align} \]
If we suspect that the residuals are not independent and identically distributed (if for example they were Poisson distributed), then we could alter (generalise) the above to:
\[ \begin{align} Y \sim{}& N(\boldsymbol{\mu}, \sigma^2\boldsymbol{W}^{-1})\\ \boldsymbol{\mu} =& \boldsymbol{X}\boldsymbol{\beta}\\ \end{align} \]
where \(\boldsymbol{W}\) is a diagonal matrix with positive diagonal elements.
This allows us to generalise to other (exponential) families (such as Binomial, Poisson, Negative Binomial, Gamma etc). For example, if our response (\(Y\)) were count data, we might consider a Poisson.
\[ \begin{align} Y \sim{}& Pois(\boldsymbol{\lambda})\\ \boldsymbol{\lambda} =& e^{\boldsymbol{X}\boldsymbol{\beta}}\\ &\text{OR equivalently}\\ log(\boldsymbol{\lambda}) =& \boldsymbol{X}\boldsymbol{\beta}\\ \end{align} \]
The Poisson distribution (\(P(x|\lambda) = e^{-\lambda}\lambda^x/x!\)) is parameterised by a single parameter (\(\lambda\)) that represents both the mean and the variance. Poisson data are bound at the lower end by zero (negative values are not defined) - it is not possible to count fewer than 0 things. Because the Poisson family includes an exponential term, to map the expected values back onto the natural scale (the scale of the observed data) we use a link function (in the case of the Poisson, this is a log link). Hence the above can be generalized even further to:
\[ \begin{align} Y \sim{}& D(\boldsymbol{\eta}, ...)\\ g(\boldsymbol{\eta}) =& \boldsymbol{X}\boldsymbol{\beta}\\ \end{align} \]
where \(D\) is the nominated family, \(g\) is the link function and \(...\) represents any additional parameters required by the nominated distribution (such as \(\sigma^2\boldsymbol{W}^{-1}\) in the case of the Gaussian distribution).
In regular OLS regression we are trying to minimize the sum squares of residuals (\(||Y - \boldsymbol{X}\boldsymbol{\beta}||^2\)). This assumes that all residuals are independent and identically distributed as each will be given equal weight when summing. However, we may wish to weight some observations (residuals) more heavily than others. For example, a reasonably common situation is for the variance to be related to the mean (typically as the mean increases, so does the variance). We can partly compensate for this by weighting the residuals inversely proportional to the mean (\(w_i = 1/\mu_i\)).
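As a toy sketch of this idea only (not the approach used for the remainder of this section), the weighted residual sum of squares for an ordinary fit of the dat data could be computed as:
fit <- lm(y ~ x, data = dat)  ## ordinary (unweighted) least squares fit
w <- 1/abs(fitted(fit))       ## weights inversely proportional to the fitted mean (abs() guards against negative fits)
sum(w * resid(fit)^2)         ## weighted residual sum of squares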
Now, rather than trying to minimise the sum squares of residuals (\(||Y - \boldsymbol{X}\boldsymbol{\beta}||^2\)), we try to minimise the sum of weighted residuals (\(||Y - \boldsymbol{X}\boldsymbol{\beta}||^{\boldsymbol{p}}\)), where \(\boldsymbol{p}\) is a vector of powers.
This can also (and once we take into consideration the link function), be written as the sum square of weighted residuals (\((g(\boldsymbol{X}\boldsymbol{\beta}) - Y)^T\boldsymbol{W}(g(\boldsymbol{X}\boldsymbol{\beta}) - Y)\)), where \(\boldsymbol{W}\) is the diagonal of a weights matrix.
Rather than calculate the residuals on the response scale (by backtransforming the linear predictor \(\boldsymbol{X}\boldsymbol{\beta}\) to the same scale as the response \(Y\)), the residuals are typically calculated on the link scale after adjusting the response variable onto this scale.
The adjusted response (\(\boldsymbol{z}\)) is obtained by taking the expected value on the link scale and adding to it the difference between the observed and expected values (on the natural scale), re-mapped onto the link scale by dividing by the derivative of the expected value on the natural scale.
\[ \boldsymbol{z} = \boldsymbol{\eta} + \frac{Y-g(\boldsymbol{\eta})}{g'(\boldsymbol{\eta})} \]
where \(\eta\) is the estimated linear predictor (predicted value of \(y\) on the scale of the link function), \(g(\eta)\) is the inverse link of \(\eta\) and \(g'(\eta)\) is the derivative of \(g(\eta)\).
The weighted least squares formulation then becomes:
\[ \boldsymbol{\beta} = (\boldsymbol{X}^T\boldsymbol{W}\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{W}\boldsymbol{z} \]
where \(\boldsymbol{W}\) is the diagonal of a weights matrix:
\[ \boldsymbol{W} = diag\left(\frac{g'(\boldsymbol{\eta})^2}{var(g(\boldsymbol{\eta}))}\right) \]
Unfortunately, there is no closed form solution for the above, so it must be solved iteratively - hence the name iterative re-weighted least squares. A very simplistic implementation could proceed as follows:
- Start with some initial values for the parameters (\(\boldsymbol{\beta}\)) - typically 0 for each parameter (on the link scale and thus 1 on the response scale)
- Calculate the predicted value of the response given the current estimates of \(\boldsymbol{\beta}\) as well as the adjusted response (\(\boldsymbol{z}\)) and weights (\(\boldsymbol{W}\)).
- Update the estimated parameter values
- Repeat steps 2-3 until either the maximum number of iterations has occurred or a measure of the difference between successive iterations is below a set tolerance.
The following is a simple R implementation of this iterative re-weighted least squares algorithm:
optim.IRLS <- function(X, y, family, iter.max=25, abs.tol=1e-06) {
oldpar <- newpar <- rep(0, ncol(X))
iter <- 0; not.converged <- 1; not.max.iter <- 1
while (not.converged & not.max.iter) {
eta <- X %*% newpar # calculate (update) the linear predictor
g <- family()$linkinv(eta) # calculate (update) the linear predictor on the natural scale (inverse link)
g. <- family()$mu.eta(eta) # calculate (update) the derivative of linear predictor on natural scale
z <- eta + (y - g) / g. # calculate (update) the adjusted response
W <- as.vector(g.^2 / family()$variance(g)) # calculate (update) the weights
oldpar <- newpar
newpar <- solve(crossprod(X,W*X), crossprod(X, W*z), tol = 2*.Machine$double.eps)
iter <- iter + 1
not.max.iter <- (iter < iter.max)*1
not.converged <- (max(abs((1-newpar/oldpar)))>abs.tol)*1
}
list(B.optim = newpar, Iter = iter-1)
}
Note, when the family is Gaussian, the above is simplified to the OLS form:
\[ \begin{align} \boldsymbol{\beta} =& 0\\ \boldsymbol{\eta} =& \boldsymbol{X}\boldsymbol{\beta} = 0\\ \boldsymbol{z} =& \boldsymbol{\eta} + \frac{Y-g(\boldsymbol{\eta})}{g'(\boldsymbol{\eta})} = 0 + \frac{Y-0}{1} = Y\\ \boldsymbol{W} =& diag\left(\frac{g'(\boldsymbol{\eta})^2}{var(g(\boldsymbol{\eta}))}\right) = diag(1^2/1) = 1\\ \boldsymbol{\beta} =& (\boldsymbol{X}^T\boldsymbol{W}\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{W}\boldsymbol{z} = (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^TY \end{align} \]
Let's try it out, starting with an intercept only model - which is just estimating the mean response value (according to a Gaussian distribution).
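The call that produced the output below is not shown; a sketch consistent with the result (and with the Poisson call quoted further down) would be:
X <- cbind(rep(1, length(dat$y)))        ## intercept-only design matrix
optim.IRLS(X, dat$y, family = gaussian)  ## IRLS estimate of the mean
solve(crossprod(X), crossprod(X, dat$y)) ## closed-form OLS solution for comparison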
$B.optim
[,1]
[1,] 27.811
$Iter
[1] 1
This estimate is (not surprisingly) identical to the OLS estimate.
So what if we modified the response such that it was some positive integers and we nominated a Poisson distribution?
$B.optim
[,1]
[1,] 2.379546
$Iter
[1] 12
[,1]
[1,] 10.8
After 12 iterations, the algorithm converged on an estimate of 2.3795461 (from optim.IRLS(cbind(rep(1, length(dat3$y))), dat3$y, family = poisson)), which on the natural (response) scale is exp(2.3795461) = 10.8.
In R, there is an inbuilt function called glm() that, by default, fits generalized linear models via iterative re-weighted least squares.
Call: glm(formula = y ~ 1, family = poisson(link = "log"), data = dat3)
Coefficients:
(Intercept)
2.38
Degrees of Freedom: 9 Total (i.e. Null); 9 Residual
Null Deviance: 89.83
Residual Deviance: 89.83 AIC: 129.3
And now we will explore the full model:
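Again, the exact call is not shown; a sketch consistent with the output below (the intermediate object name dat3.irls is just for illustration) would be:
XX <- model.matrix(~x, data = dat3)                   ## intercept + slope design matrix
dat3.irls <- optim.IRLS(XX, dat3$y, family = poisson) ## IRLS fit of the Poisson model
dat3.irls                                             ## estimates on the link (log) scale
exp(dat3.irls$B.optim)                                ## estimates backtransformed to the response scale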
$B.optim
[,1]
(Intercept) 0.0513556
x 0.3426288
$Iter
[1] 24
[,1]
(Intercept) 1.052697
x 1.408646
Both estimates are very close to the parameters from which the simulated data were generated, and identical to those returned from the inbuilt glm() function.
\[ \begin{align} y_i \sim{}& N(\mu_i, \sigma^2)\\ \mu_i =& \beta_0 + \beta_1 x_i\\ \end{align} \]
Call: glm(formula = y ~ x, data = dat)
Coefficients:
(Intercept) x
2.651 4.575
Degrees of Freedom: 9 Total (i.e. Null); 8 Residual
Null Deviance: 2033
Residual Deviance: 306.7 AIC: 68.61
And to explore the diagnostics
OK: No outliers detected.
- Based on the following method and threshold: cook (0.8).
- For variable: (Whole model)
Conclusions:
- these diagnostics are broadly acceptable given the small nature of the data
Conclusions
- no obvious issues with these diagnostics
Call:
glm(formula = y ~ x, data = dat)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.6507 4.2299 0.627 0.548349
x 4.5746 0.6817 6.710 0.000151 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 38.34014)
Null deviance: 2033.20 on 9 degrees of freedom
Residual deviance: 306.72 on 8 degrees of freedom
AIC: 68.612
Number of Fisher Scoring iterations: 2
2.5 % 97.5 %
(Intercept) -5.639787 10.941121
x 3.238478 5.910734
R2: 0.849
Interpretation:
- when \(x=0\), \(y\) is expected to be 2.651 (the y-intercept).
- for every one unit change in \(x\), the expected value of \(y\) increases by 4.575 (the slope)
- the slope could be as low as 3.238 or as high as 5.911 (95% confidence interval of the slope)
- as the above interval does not include the value of 0, the linear relationship (slope) can be considered significant
- we would reject the null hypothesis of no relationship (p-value for slope is less than 0.05)
- 84.914% of the variance in \(y\) is explained by its linear relationship with \(x\) (R squared value)
\[ \begin{align} y_i \sim{}& N(\mu_i, \sigma^2)\\ \mu_i =& \beta_0 + \beta_1 (x_i - \bar{x})\\ \end{align} \]
Call: glm(formula = y ~ scale(x, scale = FALSE), data = dat)
Coefficients:
(Intercept) scale(x, scale = FALSE)
27.811 4.575
Degrees of Freedom: 9 Total (i.e. Null); 8 Residual
Null Deviance: 2033
Residual Deviance: 306.7 AIC: 68.61
And to explore the diagnostics
OK: No outliers detected.
- Based on the following method and threshold: cook (0.8).
- For variable: (Whole model)
Conclusions:
- these diagnostics are broadly acceptable given the small nature of the data
Conclusions
- no obvious issues with these diagnostics
Call:
glm(formula = y ~ scale(x, scale = FALSE), data = dat)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.8110 1.9581 14.20 5.88e-07 ***
scale(x, scale = FALSE) 4.5746 0.6817 6.71 0.000151 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 38.34014)
Null deviance: 2033.20 on 9 degrees of freedom
Residual deviance: 306.72 on 8 degrees of freedom
AIC: 68.612
Number of Fisher Scoring iterations: 2
2.5 % 97.5 %
(Intercept) 23.973266 31.648734
scale(x, scale = FALSE) 3.238478 5.910734
`r2()` does not support models of class `glm`.
NULL
Interpretation:
- when \(x=0\) (e.g. the average \(x\) value since centered), \(y\) is expected to be 27.811 (the y-intercept).
- for every one unit change in \(x\), the expected value of \(y\) increases by 4.575 (the slope)
- the slope could be as low as 3.238 or as high as 5.911 (95% confidence interval of the slope)
- as the above interval does not include the value of 0, the linear relationship (slope) can be considered significant
- we would reject the null hypothesis of no relationship (p-value for slope is less than 0.05)
- the R squared value could not be computed here (`r2()` does not support models of class `glm`)
\[ \begin{align} y_i \sim{}& N(\mu_i, \sigma^2)\\ \mu_i =& \beta_0 + \beta_1 (x_i - \bar{x})/\sigma_{x}\\ \end{align} \]
Call: glm(formula = y ~ scale(x), data = dat)
Coefficients:
(Intercept) scale(x)
27.81 13.85
Degrees of Freedom: 9 Total (i.e. Null); 8 Residual
Null Deviance: 2033
Residual Deviance: 306.7 AIC: 68.61
And to explore the diagnostics
OK: No outliers detected.
- Based on the following method and threshold: cook (0.8).
- For variable: (Whole model)
Conclusions:
- these diagnostics are broadly acceptable given the small nature of the data
Conclusions
- no obvious issues with these diagnostics
Call:
glm(formula = y ~ scale(x), data = dat)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.811 1.958 14.20 5.88e-07 ***
scale(x) 13.850 2.064 6.71 0.000151 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 38.34014)
Null deviance: 2033.20 on 9 degrees of freedom
Residual deviance: 306.72 on 8 degrees of freedom
AIC: 68.612
Number of Fisher Scoring iterations: 2
2.5 % 97.5 %
(Intercept) 23.97327 31.64873
scale(x) 9.80498 17.89563
`r2()` does not support models of class `glm`.
NULL
Interpretation:
- when \(x=0\) (e.g. the average \(x\) value since centered), \(y\) is expected to be 27.811 (the y-intercept).
- for every one unit change in \(x\) (which represents a span of approximately 34% of the range of \(x\) since standardised), the expected value of \(y\) increases by 13.85 (the slope)
- the slope could be as low as 9.805 or as high as 17.896 (95% confidence interval of the slope)
- as the above interval does not include the value of 0, the linear relationship (slope) can be considered significant
- we would reject the null hypothesis of no relationship (p-value for slope is less than 0.05)
- the R squared value could not be computed here (`r2()` does not support models of class `glm`)
Given that exploratory data analysis suggested that the assumptions were not likely to be met for a simple gaussian model and that this was subsequently confirmed when we explored the diagnostics resulting from the OLS model of these data, we will instead entertain a poisson regression model. The model will be of the form:
\[ \begin{align} y_i \sim{}& Pois(\lambda_i)\\ log(\lambda_i) =& \beta_0 + \beta_1 x_i\\ \end{align} \]
Call: glm(formula = y ~ x, family = poisson(link = "log"), data = dat3)
Coefficients:
(Intercept) x
0.05136 0.34263
Degrees of Freedom: 9 Total (i.e. Null); 8 Residual
Null Deviance: 89.83
Residual Deviance: 8.879 AIC: 50.35
And to explore the diagnostics
OK: No outliers detected.
- Based on the following method and threshold: cook (0.8).
- For variable: (Whole model)
Conclusions:
- these diagnostics are broadly acceptable given the small nature of the data
Conclusions
- no obvious issues with these diagnostics
Call:
glm(formula = y ~ x, family = poisson(link = "log"), data = dat3)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.05136 0.35408 0.145 0.885
x 0.34263 0.04319 7.932 2.15e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 89.8290 on 9 degrees of freedom
Residual deviance: 8.8787 on 8 degrees of freedom
AIC: 50.355
Number of Fisher Scoring iterations: 4
2.5 % 97.5 %
(Intercept) -0.6834618 0.7077688
x 0.2608235 0.4305322
# R2 for Generalized Linear Regression
Nagelkerke's R2: 1.000
Since this model used a log link, in order to return the estimates to the scale of the response, we need to backtransform by exponentiation.
Importantly, if using confidence intervals for the purpose of inference testing:
- if using estimates on the link (log) scale, intervals that do not include 0 are “significant”
- if using estimates on the response scale, intervals that do not include 1 are “significant”
Interpretation:
- when \(x=0\), \(y\) is expected to be exp(0.051) = 1.053 (the y-intercept).
- for every one unit change in \(x\), the expected value of \(y\) increases by exp(0.343) = 1.409 (the slope)
- this is equivalent to a 40.865% increase in the response per one unit change in \(x\)
- the slope (on the link scale) could be as low as 0.261 or as high as 0.431 (95% confidence interval of the slope)
- as the above interval does not include the value of 0, the linear relationship (slope) can be considered significant
- the slope (on the response scale) could be as low as 1.298 or as high as 1.538 (95% confidence interval of the slope)
- as the above interval does not include the value of 1, the linear relationship (slope) can be considered significant
- we would reject the null hypothesis of no relationship (p-value for slope is less than 0.05)
- 99.982% of the variance in \(y\) is explained by its linear relationship with \(x\) (R squared value)
Given that exploratory data analysis suggested that the assumptions were not likely to be met for a simple gaussian model and that this was subsequently confirmed when we explored the diagnostics resulting from the OLS model of these data, we will instead entertain a poisson regression model. The model will be of the form:
\[ \begin{align} y_i \sim{}& Pois(\lambda_i)\\ log(\lambda_i) =& \beta_0 + \beta_1 (x_i - \bar{x})\\ \end{align} \]
dat3b.glmw <- glm(y ~ scale(x, scale = FALSE), data = dat3, family = poisson(link = "log"))
dat3b.glmw
Call: glm(formula = y ~ scale(x, scale = FALSE), family = poisson(link = "log"),
data = dat3)
Coefficients:
(Intercept) scale(x, scale = FALSE)
1.9358 0.3426
Degrees of Freedom: 9 Total (i.e. Null); 8 Residual
Null Deviance: 89.83
Residual Deviance: 8.879 AIC: 50.35
And to explore the diagnostics
OK: No outliers detected.
- Based on the following method and threshold: cook (0.8).
- For variable: (Whole model)
Conclusions:
- these diagnostics are broadly acceptable given the small nature of the data
Conclusions
- no obvious issues with these diagnostics
Call:
glm(formula = y ~ scale(x, scale = FALSE), family = poisson(link = "log"),
data = dat3)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.93581 0.14109 13.720 < 2e-16 ***
scale(x, scale = FALSE) 0.34263 0.04319 7.932 2.15e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 89.8290 on 9 degrees of freedom
Residual deviance: 8.8787 on 8 degrees of freedom
AIC: 50.355
Number of Fisher Scoring iterations: 4
2.5 % 97.5 %
(Intercept) 1.6421890 2.1970557
scale(x, scale = FALSE) 0.2608235 0.4305322
# R2 for Generalized Linear Regression
Nagelkerke's R2: 1.000
Since this model used a log link, in order to return the estimates to the scale of the response, we need to backtransform by exponentiation.
Importantly, if using confidence intervals for the purpose of inference testing:
- if using estimates on the link (log) scale, intervals that do not include 0 are “significant”
- if using estimates on the response scale, intervals that do not include 1 are “significant”
Interpretation:
- when \(x=0\) (e.g. the average \(x\) value since centered), \(y\) is expected to be exp(1.936) = 6.93 (the y-intercept).
- for every one unit change in \(x\), the expected value of \(y\) increases by exp(0.343) = 1.409 (the slope)
- this is equivalent to a 40.865% increase in the response per one unit change in \(x\)
- the slope (on the link scale) could be as low as 0.261 or as high as 0.431 (95% confidence interval of the slope)
- as the above interval does not include the value of 0, the linear relationship (slope) can be considered significant
- the slope (on the response scale) could be as low as 1.298 or as high as 1.538 (95% confidence interval of the slope)
- as the above interval does not include the value of 1, the linear relationship (slope) can be considered significant
- we would reject the null hypothesis of no relationship (p-value for slope is less than 0.05)
- 99.982% of the variance in \(y\) is explained by its linear relationship with \(x\) (R squared value)
Given that exploratory data analysis suggested that the assumptions were not likely to be met for a simple gaussian model and that this was subsequently confirmed when we explored the diagnostics resulting from the OLS model of these data, we will instead entertain a poisson regression model. The model will be of the form:
\[ \begin{align} y_i \sim{}& Pois(\lambda_i)\\ log(\lambda_i) =& \beta_0 + \beta_1 (x_i - \bar{x})/\sigma_{x}\\ \end{align} \]
Call: glm(formula = y ~ scale(x), family = poisson(link = "log"), data = dat3)
Coefficients:
(Intercept) scale(x)
1.936 1.037
Degrees of Freedom: 9 Total (i.e. Null); 8 Residual
Null Deviance: 89.83
Residual Deviance: 8.879 AIC: 50.35
And to explore the diagnostics
OK: No outliers detected.
- Based on the following method and threshold: cook (0.8).
- For variable: (Whole model)
Conclusions:
- these diagnostics are broadly acceptable given the small nature of the data
Conclusions
- no obvious issues with these diagnostics
Call:
glm(formula = y ~ scale(x), family = poisson(link = "log"), data = dat3)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.9358 0.1411 13.720 < 2e-16 ***
scale(x) 1.0374 0.1308 7.932 2.15e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 89.8290 on 9 degrees of freedom
Residual deviance: 8.8787 on 8 degrees of freedom
AIC: 50.355
Number of Fisher Scoring iterations: 4
2.5 % 97.5 %
(Intercept) 1.6421890 2.197056
scale(x) 0.7896823 1.303501
# R2 for Generalized Linear Regression
Nagelkerke's R2: 1.000
Since this model used a log link, in order to return the estimates to the scale of the response, we need to backtransform by exponentiation.
Importantly, if using confidence intervals for the purpose of inference testing:
- if using estimates on the link (log) scale, intervals that do not include 0 are “significant”
- if using estimates on the response scale, intervals that do not include 1 are “significant”
Interpretation:
- when \(x=0\) (i.e. the average \(x\) value, since \(x\) was centered), \(y\) is expected to be exp(1.936) = 6.93 (the y-intercept).
- for every one unit change in \(x\) (which represents a span of approximately 34% of the range of \(x\) since standardised), the expected value of \(y\) increases by exp(1.037) = 2.822 (the slope)
- this is equivalent to a 182.176% increase in the response per one unit change in (standardised) \(x\)
- the slope (on the link scale) could be as low as 0.79 or as high as 1.304 (95% confidence interval of the slope)
- as the above interval does not include the value of 0, the linear relationship (slope) can be considered significant
- the slope (on the response scale) could be as low as 2.203 or as high as 3.682 (95% confidence interval of the slope)
- as the above interval does not include the value of 1, the linear relationship (slope) can be considered significant
- we would reject the null hypothesis of no relationship (p-value for slope is less than 0.05)
- 99.982% of the variance in \(y\) is explained by its linear relationship with \(x\) (R squared value)
Given that exploratory data analysis suggested that the assumptions were not likely to be met for a simple gaussian model and that this was subsequently confirmed when we explored the diagnostics resulting from the OLS model of these data, we will instead entertain a Negative Binomial regression model. The model will be of the form:
\[ \begin{align} y_i \sim{}& NB(\lambda_i, \phi)\\ log(\lambda_i) =& \beta_0 + \beta_1 x_i\\ \end{align} \]
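A negative binomial fit of this form can be obtained with MASS::glm.nb, which estimates the dispersion parameter (theta) alongside the regression coefficients. A sketch (the object name dat4.glm is an assumption; the call mirrors the one echoed in the output below):
dat4.glm <- MASS::glm.nb(y ~ x, data = dat4, link = log) # negative binomial GLM
dat4.glm # print the coefficients, deviances and AIC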
Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
control$trace > : iteration limit reached
Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
control$trace > : iteration limit reached
Call: MASS::glm.nb(formula = y ~ x, data = dat4, init.theta = 107501.5977,
link = log)
Coefficients:
(Intercept) x
0.3913 0.2792
Degrees of Freedom: 9 Total (i.e. Null); 8 Residual
Null Deviance: 59.77
Residual Deviance: 9.712 AIC: 51.39
And to explore the diagnostics
OK: No outliers detected.
- Based on the following method and threshold: cook (0.8).
- For variable: (Whole model)
Conclusions:
- these diagnostics are broadly acceptable given the small size of the dataset
DHARMa nonparametric dispersion test via sd of residuals fitted vs.
simulated
data: simulationOutput
dispersion = 0.5987, p-value = 0.56
alternative hypothesis: two.sided
Conclusions
- no obvious issues with these diagnostics; note, however, that the iteration-limit warnings (and the very large theta estimate reported in the summary below) suggest there is little or no overdispersion for the negative binomial to absorb, so the fit is essentially collapsing towards a Poisson
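The dispersion test reported above follows the DHARMa workflow of comparing observed residuals against residuals simulated from the fitted model; a minimal sketch (using the assumed object name from the sketch above):
simulationOutput <- simulateResiduals(fittedModel = dat4.glm, n = 250) # simulate residuals
plot(simulationOutput) # QQ plot and residual vs predicted panels
testDispersion(simulationOutput) # nonparametric dispersion test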
Call:
MASS::glm.nb(formula = y ~ x, data = dat4, init.theta = 107501.5977,
link = log)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.3913 0.3415 1.146 0.252
x 0.2792 0.0431 6.479 9.26e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(107501.6) family taken to be 1)
Null deviance: 59.7734 on 9 degrees of freedom
Residual deviance: 9.7121 on 8 degrees of freedom
AIC: 51.395
Number of Fisher Scoring iterations: 1
Theta: 107502
Std. Err.: 2840765
Warning while fitting theta: iteration limit reached
2 x log-likelihood: -45.395
2.5 % 97.5 %
(Intercept) -0.3177550 1.024170
x 0.1972806 0.366662
# R2 for Generalized Linear Regression
Nagelkerke's R2: 0.996
Since this model used a log link, in order to return the estimates to the scale of the response, we need to backtransform by exponentiation.
Importantly, if using confidence intervals for the purpose of inference testing:
- if using estimates on the link (log) scale, intervals that do not include 0 are “significant”
- if using estimates on the response scale, intervals that do not include 1 are “significant”
Interpretation:
- when \(x=0\), \(y\) is expected to be exp(0.391) = 1.479 (the y-intercept).
- for every one unit change in \(x\), the expected value of \(y\) increases by exp(0.279) = 1.322 (the slope)
- this is equivalent to a 32.213% increase in the response per one unit change in \(x\)
- the slope (on the link scale) could be as low as 0.197 or as high as 0.367 (95% confidence interval of the slope)
- as the above interval does not include the value of 0, the linear relationship (slope) can be considered significant
- the slope (on the response scale) could be as low as 1.218 or as high as 1.443 (95% confidence interval of the slope)
- as the above interval does not include the value of 1, the linear relationship (slope) can be considered significant
- we would reject the null hypothesis of no relationship (p-value for slope is less than 0.05)
- 99.583% of the variance in \(y\) is explained by its linear relationship with \(x\) (R squared value)
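When choosing between a Poisson and a negative binomial model, a useful preliminary check is whether a Poisson fit would be overdispersed (residual deviance well in excess of its residual degrees of freedom). A hedged sketch of two ways this might be assessed (the object name dat4.pois is an assumption):
dat4.pois <- glm(y ~ x, data = dat4, family = poisson(link = "log")) # candidate Poisson fit
deviance(dat4.pois) / df.residual(dat4.pois) # values well above 1 suggest overdispersion
performance::check_overdispersion(dat4.pois) # formal overdispersion check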
Given that exploratory data analysis suggested that the assumptions were not likely to be met for a simple gaussian model and that this was subsequently confirmed when we explored the diagnostics resulting from the OLS model of these data, we will instead entertain a Negative Binomial regression model, this time with a centered predictor. The model will be of the form:
\[ \begin{align} y_i \sim{}& NB(\lambda_i, \phi)\\ log(\lambda_i) =& \beta_0 + \beta_1 (x_i - \bar{x})\\ \end{align} \]
Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
control$trace > : iteration limit reached
Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
control$trace > : iteration limit reached
Call: MASS::glm.nb(formula = y ~ scale(x, scale = FALSE), data = dat4,
init.theta = 107501.5853, link = log)
Coefficients:
(Intercept) scale(x, scale = FALSE)
1.9272 0.2792
Degrees of Freedom: 9 Total (i.e. Null); 8 Residual
Null Deviance: 59.77
Residual Deviance: 9.712 AIC: 51.39
And to explore the diagnostics
OK: No outliers detected.
- Based on the following method and threshold: cook (0.8).
- For variable: (Whole model)
Conclusions:
- these diagnostics are broadly acceptable given the small size of the dataset
Conclusions
- no obvious issues with these diagnostics
Call:
MASS::glm.nb(formula = y ~ scale(x, scale = FALSE), data = dat4,
init.theta = 107501.5853, link = log)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.9272 0.1362 14.150 < 2e-16 ***
scale(x, scale = FALSE) 0.2792 0.0431 6.479 9.26e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(107501.6) family taken to be 1)
Null deviance: 59.7734 on 9 degrees of freedom
Residual deviance: 9.7121 on 8 degrees of freedom
AIC: 51.395
Number of Fisher Scoring iterations: 1
Theta: 107502
Std. Err.: 2840767
Warning while fitting theta: iteration limit reached
2 x log-likelihood: -45.395
2.5 % 97.5 %
(Intercept) 1.6442791 2.179941
scale(x, scale = FALSE) 0.1972806 0.366662
# R2 for Generalized Linear Regression
Nagelkerke's R2: 0.996
Since this model used a log link, in order to return the estimates to the scale of the response, we need to backtransform by exponentiation.
Importantly, if using confidence intervals for the purpose of inference testing:
- if using estimates on the link (log) scale, intervals that do not include 0 are “significant”
- if using estimates on the response scale, intervals that do not include 1 are “significant”
Interpretation:
- when \(x=0\) (i.e. the average \(x\) value, since \(x\) was centered), \(y\) is expected to be exp(1.927) = 6.87 (the y-intercept).
- for every one unit change in \(x\), the expected value of \(y\) increases by exp(0.279) = 1.322 (the slope)
- this is equivalent to a 32.213% increase in the response per one unit change in \(x\)
- the slope (on the link scale) could be as low as 0.197 or as high as 0.367 (95% confidence interval of the slope)
- as the above interval does not include the value of 0, the linear relationship (slope) can be considered significant
- the slope (on the response scale) could be as low as 1.218 or as high as 1.443 (95% confidence interval of the slope)
- as the above interval does not include the value of 1, the linear relationship (slope) can be considered significant
- we would reject the null hypothesis of no relationship (p-value for slope is less than 0.05)
- 99.583% of the variance in \(y\) is explained by its linear relationship with \(x\) (R squared value)
Given that exploratory data analysis suggested that the assumptions were not likely to be met for a simple gaussian model and that this was subsequently confirmed when we explored the diagnostics resulting from the OLS model of these data, we will instead entertain a Negative Binomial regression model, this time with a standardised predictor. The model will be of the form:
\[ \begin{align} y_i \sim{}& NB(\lambda_i, \phi)\\ log(\lambda_i) =& \beta_0 + \beta_1 (x_i - \bar{x})/\sigma_{x}\\ \end{align} \]
Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
control$trace > : iteration limit reached
Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
control$trace > : iteration limit reached
Call: MASS::glm.nb(formula = y ~ scale(x), data = dat4, init.theta = 107501.6077,
link = log)
Coefficients:
(Intercept) scale(x)
1.9272 0.8455
Degrees of Freedom: 9 Total (i.e. Null); 8 Residual
Null Deviance: 59.77
Residual Deviance: 9.712 AIC: 51.39
And to explore the diagnostics
OK: No outliers detected.
- Based on the following method and threshold: cook (0.8).
- For variable: (Whole model)
Conclusions:
- these diagnostics are broadly acceptable given the small size of the dataset
Conclusions
- no obvious issues with these diagnostics
Call:
MASS::glm.nb(formula = y ~ scale(x), data = dat4, init.theta = 107501.6077,
link = log)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.9272 0.1362 14.150 < 2e-16 ***
scale(x) 0.8455 0.1305 6.479 9.26e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(107501.6) family taken to be 1)
Null deviance: 59.7734 on 9 degrees of freedom
Residual deviance: 9.7121 on 8 degrees of freedom
AIC: 51.395
Number of Fisher Scoring iterations: 1
Theta: 107502
Std. Err.: 2840768
Warning while fitting theta: iteration limit reached
2 x log-likelihood: -45.395
2.5 % 97.5 %
(Intercept) 1.6442791 2.179941
scale(x) 0.5972968 1.110124
# R2 for Generalized Linear Regression
Nagelkerke's R2: 0.996
Since this model used a log link, in order to return the estimates to the scale of the response, we need to backtransform by exponentiation.
Importantly, if using confidence intervals for the purpose of inference testing:
- if using estimates on the link (log) scale, intervals that do not include 0 are “significant”
- if using estimates on the response scale, intervals that do not include 1 are “significant”
Interpretation:
- when \(x=0\) (i.e. the average \(x\) value, since \(x\) was centered), \(y\) is expected to be exp(1.927) = 6.87 (the y-intercept).
- for every one unit change in \(x\) (which represents a span of approximately 34% of the range of \(x\) since standardised), the expected value of \(y\) increases by exp(0.845) = 2.329 (the slope)
- this is equivalent to a 132.905% increase in the response per one unit change in (standardised) \(x\)
- the slope (on the link scale) could be as low as 0.597 or as high as 1.11 (95% confidence interval of the slope)
- as the above interval does not include the value of 0, the linear relationship (slope) can be considered significant
- the slope (on the response scale) could be as low as 1.817 or as high as 3.035 (95% confidence interval of the slope)
- as the above interval does not include the value of 1, the linear relationship (slope) can be considered significant
- we would reject the null hypothesis of no relationship (p-value for slope is less than 0.05)
- 99.583% of the variance in \(y\) is explained by its linear relationship with \(x\) (R squared value)
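To visualise any of these fits on the natural (response) scale, predictions can be back-transformed automatically via type = "response". A sketch using the standardised negative binomial model (the object name dat4.glms is an assumption; scale() stores its centering and scaling constants, so new data can be supplied on the raw scale of x):
dat4.glms <- MASS::glm.nb(y ~ scale(x), data = dat4, link = log) # standardised NB fit
newdata <- data.frame(x = seq(min(dat4$x), max(dat4$x), length.out = 100))
newdata$fit <- predict(dat4.glms, newdata = newdata, type = "response") # expected counts
head(newdata)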
Given that exploratory data analysis suggested that the assumptions were not likely to be met for a simple gaussian model and that this was subsequently confirmed when we explored the diagnostics resulting from the OLS model of these data, we will instead entertain a Bernoulli (Binomial with number of trials = 1) regression model. The model will be of the form:
\[ \begin{align} y_i \sim{}& Bin(\pi_i, 1)\\ log(\frac{\pi_i}{1-\pi_i}) =& \beta_0 + \beta_1 x_i\\ \end{align} \]
Call: glm(formula = y ~ x, family = binomial(link = "logit"), data = dat5)
Coefficients:
(Intercept) x
-5.825 1.295
Degrees of Freedom: 9 Total (i.e. Null); 8 Residual
Null Deviance: 13.46
Residual Deviance: 5.014 AIC: 9.014
And to explore the diagnostics
OK: No outliers detected.
- Based on the following method and threshold: cook (0.8).
- For variable: (Whole model)
Conclusions:
- these diagnostics are broadly acceptable given the small size of the dataset
DHARMa nonparametric dispersion test via sd of residuals fitted vs.
simulated
data: simulationOutput
dispersion = 1.3257, p-value = 0.76
alternative hypothesis: two.sided
Conclusions
- no obvious issues with these diagnostics
Call:
glm(formula = y ~ x, family = binomial(link = "logit"), data = dat5)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.825 3.986 -1.461 0.144
x 1.295 0.845 1.533 0.125
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 13.4602 on 9 degrees of freedom
Residual deviance: 5.0138 on 8 degrees of freedom
AIC: 9.0138
Number of Fisher Scoring iterations: 6
2.5 % 97.5 %
(Intercept) -19.3257223 -0.7627791
x 0.2736857 4.1921925
# R2 for Logistic Regression
Tjur's R2: 0.653
Since this model used a logit link (\(log(\frac{\pi}{1-\pi})\)), in order to return the estimates to the scale of the response, we can either:
- for the y-intercept:
- backtransform by exponentiation to the odds (\(\frac{\pi}{1-\pi}\)).
- backtransform to probability (\(\pi\)) via the inverse logit (e.g. plogis()), noting that \(\pi = \frac{odds}{1 + odds}\).
- for the slopes:
- backtransform by exponentiation to odds ratio.
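A quick sketch of these back-transformations in R (the object name dat5.glmw is an assumption; the call matches the one echoed in the output above):
dat5.glmw <- glm(y ~ x, data = dat5, family = binomial(link = "logit")) # Bernoulli GLM
exp(coef(dat5.glmw)) # intercept as odds, slope as an odds ratio
plogis(coef(dat5.glmw)[1]) # intercept back-transformed to a probability
exp(confint(dat5.glmw)) # profile likelihood intervals on the odds (ratio) scale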
Importantly, if using confidence intervals for the purpose of inference testing:
- if using estimates on the link (logit) scale, intervals that do not include 0 are “significant”
- if using estimates on the response scale, intervals that do not include 1 are “significant”
Interpretation:
- when \(x=0\):
- the odds of \(y\) being 1 vs 0 is expected to be exp(-5.825) = 0.003 (the y-intercept).
- the probability of \(y\) being 1 is expected to be plogis(-5.825) = 0.003 (the y-intercept).
- for every one unit change in \(x\), the odds of \(y\) increases by exp(1.295) = 3.653 (the slope)
- this is equivalent to a 265.259% increase in the odds of the response being 1 per one unit change in \(x\)
- the slope (on the link scale) could be as low as 0.274 or as high as 4.192 (95% confidence interval of the slope)
- as the above interval does not include the value of 0, the linear relationship (slope) can be considered significant
- the slope (on the response scale) could be as low as 1.315 or as high as 66.168 (95% confidence interval of the slope)
- as the above interval does not include the value of 1, the linear relationship (slope) can be considered significant
- note, however, that the Wald test would not reject the null hypothesis of no relationship (the p-value for the slope is 0.125, which exceeds 0.05); with so few observations, the Wald test and the profile likelihood confidence intervals can disagree
- 65.333% of the variance in \(y\) is explained by its linear relationship with \(x\) (R squared value)
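As an aside, Tjur's R2 reported above is simply the difference between the mean fitted probability for the observations that are 1 and the mean fitted probability for those that are 0; a sketch of computing it by hand (using the assumed object name from the sketch above):
fitted_p <- fitted(dat5.glmw) # fitted probabilities
mean(fitted_p[dat5$y == 1]) - mean(fitted_p[dat5$y == 0]) # Tjur's R2
performance::r2_tjur(dat5.glmw) # equivalent, via the performance package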
Given that exploratory data analysis suggested that the assumptions were not likely to be met for a simple gaussian model and that this was subsequently confirmed when we explored the diagnostics resulting from the OLS model of these data, we will instead entertain a Bernoulli (Binomial with number of trials = 1) regression model, this time with a centered predictor. The model will be of the form:
\[ \begin{align} y_i \sim{}& Bin(\pi_i, 1)\\ log(\frac{\pi_i}{1-\pi_i}) =& \beta_0 + \beta_1 (x_i - \bar{x})\\ \end{align} \]
dat5b.glmw <- glm(y ~ scale(x, scale = FALSE), data = dat5, family = binomial(link = "logit"))
dat5b.glmw
Call: glm(formula = y ~ scale(x, scale = FALSE), family = binomial(link = "logit"),
data = dat5)
Coefficients:
(Intercept) scale(x, scale = FALSE)
1.300 1.295
Degrees of Freedom: 9 Total (i.e. Null); 8 Residual
Null Deviance: 13.46
Residual Deviance: 5.014 AIC: 9.014
And to explore the diagnostics
OK: No outliers detected.
- Based on the following method and threshold: cook (0.8).
- For variable: (Whole model)
Conclusions:
- these diagnostics are broadly acceptable given the small size of the dataset
Conclusions
- no obvious issues with these diagnostics
Call:
glm(formula = y ~ scale(x, scale = FALSE), family = binomial(link = "logit"),
data = dat5)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.300 1.411 0.922 0.357
scale(x, scale = FALSE) 1.295 0.845 1.533 0.125
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 13.4602 on 9 degrees of freedom
Residual deviance: 5.0138 on 8 degrees of freedom
AIC: 9.0138
Number of Fisher Scoring iterations: 6
2.5 % 97.5 %
(Intercept) -0.9433976 5.567602
scale(x, scale = FALSE) 0.2736857 4.192193
# R2 for Logistic Regression
Tjur's R2: 0.653
Since this model used a logit link (\(log(\frac{\pi}{1-\pi})\)), in order to return the estimates to the scale of the response, we can either:
- for the y-intercept:
- backtransform by exponentiation to the odds (\(\frac{\pi}{1-\pi}\)).
- backtransform to probability (\(\pi\)) via the inverse logit (e.g. plogis()), noting that \(\pi = \frac{odds}{1 + odds}\).
- for the slopes:
- backtransform by exponentiation to odds ratio.
Importantly, if using confidence intervals for the purpose of inference testing:
- if using estimates on the link (logit) scale, intervals that do not include 0 are “significant”
- if using estimates on the response scale, intervals that do not include 1 are “significant”
Interpretation:
- when \(x=0\) (i.e. the average \(x\) value, since \(x\) was centered):
- the odds of \(y\) being 1 vs 0 is expected to be exp(1.3) = 3.67 (the y-intercept).
- the probability of \(y\) being 1 is expected to be plogis(1.3) = 0.786 (the y-intercept).
- for every one unit change in \(x\), the odds of \(y\) increases by exp(1.295) = 3.653 (the slope)
- this is equivalent to a 265.259% increase in the odds of the response being 1 per one unit change in \(x\)
- the slope (on the link scale) could be as low as 0.274 or as high as 4.192 (95% confidence interval of the slope)
- as the above interval does not include the value of 0, the linear relationship (slope) can be considered significant
- the slope (on the response scale) could be as low as 1.315 or as high as 66.168 (95% confidence interval of the slope)
- as the above interval does not include the value of 1, the linear relationship (slope) can be considered significant
- note, however, that the Wald test would not reject the null hypothesis of no relationship (the p-value for the slope is 0.125, which exceeds 0.05); with so few observations, the Wald test and the profile likelihood confidence intervals can disagree
- 65.333% of the variance in \(y\) is explained by its linear relationship with \(x\) (R squared value)
Given that exploratory data analysis suggested that the assumptions were not likely to be met for a simple gaussian model and that this was subsequently confirmed when we explored the diagnostics resulting from the OLS model of these data, we will instead entertain a Bernoulli (Binomial with number of trials = 1) regression model, this time with a standardised predictor. The model will be of the form:
\[ \begin{align} y_i \sim{}& Bin(\pi_i, 1)\\ log(\frac{\pi_i}{1-\pi_i}) =& \beta_0 + \beta_1 (x_i - \bar{x})/\sigma_{x}\\ \end{align} \]
Call: glm(formula = y ~ scale(x), family = binomial(link = "logit"),
data = dat5)
Coefficients:
(Intercept) scale(x)
1.300 3.922
Degrees of Freedom: 9 Total (i.e. Null); 8 Residual
Null Deviance: 13.46
Residual Deviance: 5.014 AIC: 9.014
And to explore the diagnostics
OK: No outliers detected.
- Based on the following method and threshold: cook (0.8).
- For variable: (Whole model)
Conclusions:
- these diagnostics are broadly acceptable given the small size of the dataset
Conclusions
- no obvious issues with these diagnostics
Call:
glm(formula = y ~ scale(x), family = binomial(link = "logit"),
data = dat5)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.300 1.410 0.922 0.357
scale(x) 3.922 2.558 1.533 0.125
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 13.4602 on 9 degrees of freedom
Residual deviance: 5.0138 on 8 degrees of freedom
AIC: 9.0138
Number of Fisher Scoring iterations: 6
2.5 % 97.5 %
(Intercept) -0.9433976 5.567602
scale(x) 0.8286247 12.692493
# R2 for Logistic Regression
Tjur's R2: 0.653
Since this model used a logit link (\(log(\frac{\pi}{1-\pi})\)), in order to return the estimates to the scale of the response, we can either:
- for the y-intercept:
- backtransform by exponentiation to the odds (\(\frac{\pi}{1-\pi}\)).
- backtransform to probability (\(\pi\)) via the inverse logit (e.g. plogis()), noting that \(\pi = \frac{odds}{1 + odds}\).
- for the slopes:
- backtransform by exponentiation to odds ratio.
Importantly, if using confidence intervals for the purpose of inference testing:
- if using estimates on the link (logit) scale, intervals that do not include 0 are “significant”
- if using estimates on the response scale, intervals that do not include 1 are “significant”
Interpretation:
- when \(x=0\) (i.e. the average \(x\) value, since \(x\) was centered):
- the odds of \(y\) being 1 vs 0 is expected to be exp(1.3) = 3.67 (the y-intercept).
- the probability of \(y\) being 1 is expected to be plogis(1.3) = 0.786 (the y-intercept).
- for every one unit change in \(x\) (which represents a span of approximately 34% of the range of \(x\) since standardised), the odds of \(y\) increases by exp(3.922) = 50.508 (the slope)
- this is equivalent to a 4950.794% increase in the odds of the response being 1 per one unit change in (standardised) \(x\)
- the slope (on the link scale) could be as low as 0.829 or as high as 12.692 (95% confidence interval of the slope)
- as the above interval does not include the value of 0, the linear relationship (slope) can be considered significant
- the slope (on the response scale) could be as low as 2.29 or as high as approximately \(3.25 \times 10^{5}\) (95% confidence interval of the slope)
- as the above interval does not include the value of 1, the linear relationship (slope) can be considered significant
- note, however, that the Wald test would not reject the null hypothesis of no relationship (the p-value for the slope is 0.125, which exceeds 0.05); with so few observations, the Wald test and the profile likelihood confidence intervals can disagree
- 65.333% of the variance in \(y\) is explained by its linear relationship with \(x\) (R squared value)
Given that exploratory data analysis suggested that the assumptions were not likely to be met for a simple gaussian model and that this was subsequently confirmed when we explored the diagnostics resulting from the OLS model of these data, we will instead entertain a Binomial regression model. The model will be of the form:
\[ \begin{align} y_i \sim{}& Bin(\pi_i, n)\\ log(\frac{\pi_i}{1-\pi_i}) =& \beta_0 + \beta_1 x_i\\ \end{align} \]
dat6.glmw <- glm(cbind(count, total - count) ~ x, data = dat6, family = binomial(link = "logit"))
dat6.glmw
Call: glm(formula = cbind(count, total - count) ~ x, family = binomial(link = "logit"),
data = dat6)
Coefficients:
(Intercept) x
-3.3295 0.8234
Degrees of Freedom: 9 Total (i.e. Null); 8 Residual
Null Deviance: 38.6
Residual Deviance: 9.669 AIC: 22.93
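As an aside, the same binomial model can be specified either with a two-column matrix of successes and failures (as above) or as an observed proportion weighted by the number of trials; both forms should yield identical coefficient estimates. A sketch (the object name dat6.glmp is an assumption):
dat6.glmp <- glm(count / total ~ x, weights = total, data = dat6,
    family = binomial(link = "logit")) # proportion + weights form
coef(dat6.glmw) # successes/failures form (fitted above)
coef(dat6.glmp) # identical estimates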
And to explore the diagnostics
OK: No outliers detected.
- Based on the following method and threshold: cook (0.9).
- For variable: (Whole model)
Conclusions:
- these diagnostics are broadly acceptable given the small size of the dataset
DHARMa nonparametric dispersion test via sd of residuals fitted vs.
simulated
data: simulationOutput
dispersion = 1.0614, p-value = 0.76
alternative hypothesis: two.sided
Conclusions
- no obvious issues with these diagnostics
Call:
glm(formula = cbind(count, total - count) ~ x, family = binomial(link = "logit"),
data = dat6)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.3295 1.0441 -3.189 0.001429 **
x 0.8234 0.2246 3.667 0.000246 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 38.5956 on 9 degrees of freedom
Residual deviance: 9.6688 on 8 degrees of freedom
AIC: 22.934
Number of Fisher Scoring iterations: 5
2.5 % 97.5 %
(Intercept) -5.7410373 -1.537283
x 0.4517374 1.355764
NULL
Since this model used a logit link (\(log(\frac{\pi}{1-\pi})\)), in order to return the estimates to the scale of the response, we can either:
- for the y-intercept:
- backtransform by exponentiation to the odds (\(\frac{\pi}{1-\pi}\)).
- backtransform to probability (\(\pi\)) via the inverse logit (e.g. plogis()), noting that \(\pi = \frac{odds}{1 + odds}\).
- for the slopes:
- backtransform by exponentiation to odds ratio.
Importantly, if using confidence intervals for the purpose of inference testing:
- if using estimates on the link (logit) scale, intervals that do not include 0 are “significant”
- if using estimates on the response scale, intervals that do not include 1 are “significant”
Interpretation:
- when \(x=0\):
- the odds of success (i.e. of any individual trial being a 1) are expected to be exp(-3.329) = 0.036 (the y-intercept).
- the probability of success is expected to be plogis(-3.329) = 0.035 (the y-intercept).
- for every one unit change in \(x\), the odds of success increase by exp(0.823) = 2.278 (the slope)
- this is equivalent to a 127.826% increase in the odds of success per one unit change in \(x\)
- the slope (on the link scale) could be as low as 0.452 or as high as 1.356 (95% confidence interval of the slope)
- as the above interval does not include the value of 0, the linear relationship (slope) can be considered significant
- the slope (on the response scale) could be as low as 1.571 or as high as 3.88 (95% confidence interval of the slope)
- as the above interval does not include the value of 1, the linear relationship (slope) can be considered significant
- we would reject the null hypothesis of no relationship (p-value for slope is less than 0.05)
- an R squared value is not reported for this model (the R squared computation above returned NULL)
Given that exploratory data analysis suggested that the assumptions were not likely to be met for a simple gaussian model and that this was subsequently confirmed when we explored the diagnostics resulting from the OLS model of these data, we will instead entertain a Binomial regression model, this time with a centered predictor. The model will be of the form:
\[ \begin{align} y_i \sim{}& Bin(\pi_i, n)\\ log(\frac{\pi_i}{1-\pi_i}) =& \beta_0 + \beta_1 (x_i - \bar{x})\\ \end{align} \]
dat6b.glmw <- glm(cbind(count, total - count) ~ scale(x, scale = FALSE), data = dat6, family = binomial(link = "logit"))
dat6b.glmw
Call: glm(formula = cbind(count, total - count) ~ scale(x, scale = FALSE),
family = binomial(link = "logit"), data = dat6)
Coefficients:
(Intercept) scale(x, scale = FALSE)
1.1993 0.8234
Degrees of Freedom: 9 Total (i.e. Null); 8 Residual
Null Deviance: 38.6
Residual Deviance: 9.669 AIC: 22.93
And to explore the diagnostics
OK: No outliers detected.
- Based on the following method and threshold: cook (0.9).
- For variable: (Whole model)
Conclusions:
- these diagnostics are broadly acceptable given the small size of the dataset
Conclusions
- no obvious issues with these diagnostics
Call:
glm(formula = cbind(count, total - count) ~ scale(x, scale = FALSE),
family = binomial(link = "logit"), data = dat6)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.1993 0.5016 2.391 0.016800 *
scale(x, scale = FALSE) 0.8234 0.2246 3.667 0.000246 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 38.5956 on 9 degrees of freedom
Residual deviance: 9.6688 on 8 degrees of freedom
AIC: 22.934
Number of Fisher Scoring iterations: 5
2.5 % 97.5 %
(Intercept) 0.3199488 2.343090
scale(x, scale = FALSE) 0.4517374 1.355764
NULL
Since this model used a logit link (\(log(\frac{\pi}{1-\pi})\)), in order to return the estimates to the scale of the response, we can either:
- for the y-intercept:
- backtransform by exponentiation to the odds (\(\frac{\pi}{1-\pi}\)).
- backtransform to probability (\(\pi\)) via the inverse logit (e.g. plogis()), noting that \(\pi = \frac{odds}{1 + odds}\).
- for the slopes:
- backtransform by exponentiation to odds ratio.
Importantly, if using confidence intervals for the purpose of inference testing:
- if using estimates on the link (logit) scale, intervals that do not include 0 are “significant”
- if using estimates on the response scale, intervals that do not include 1 are “significant”
Interpretation:
- when \(x=0\) (i.e. the average \(x\) value, since \(x\) was centered):
- the odds of success are expected to be exp(1.199) = 3.318 (the y-intercept).
- the probability of success is expected to be plogis(1.199) = 0.768 (the y-intercept).
- for every one unit change in \(x\), the odds of success increase by exp(0.823) = 2.278 (the slope)
- this is equivalent to a 127.826% increase in the odds of success per one unit change in \(x\)
- the slope (on the link scale) could be as low as 0.452 or as high as 1.356 (95% confidence interval of the slope)
- as the above interval does not include the value of 0, the linear relationship (slope) can be considered significant
- the slope (on the response scale) could be as low as 1.571 or as high as 3.88 (95% confidence interval of the slope)
- as the above interval does not include the value of 1, the linear relationship (slope) can be considered significant
- we would reject the null hypothesis of no relationship (p-value for slope is less than 0.05)
- an R squared value is not reported for this model (the R squared computation above returned NULL)
Given that exploratory data analysis suggested that the assumptions were not likely to be met for a simple gaussian model and that this was subsequently confirmed when we explored the diagnostics resulting from the OLS model of these data, we will instead entertain a Binomial regression model, this time with a standardised predictor. The model will be of the form:
\[ \begin{align} y_i \sim{}& Bin(\pi_i, n)\\ log(\frac{\pi_i}{1-\pi_i}) =& \beta_0 + \beta_1 (x_i - \bar{x})/\sigma_{x}\\ \end{align} \]
dat6c.glmw <- glm(cbind(count, total - count) ~ scale(x), data = dat6, family = binomial(link = "logit"))
dat6c.glmw
Call: glm(formula = cbind(count, total - count) ~ scale(x), family = binomial(link = "logit"),
data = dat6)
Coefficients:
(Intercept) scale(x)
1.199 2.493
Degrees of Freedom: 9 Total (i.e. Null); 8 Residual
Null Deviance: 38.6
Residual Deviance: 9.669 AIC: 22.93
And to explore the diagnostics
OK: No outliers detected.
- Based on the following method and threshold: cook (0.9).
- For variable: (Whole model)
Conclusions:
- these diagnostics are broadly acceptable given the small size of the dataset
Conclusions
- no obvious issues with these diagnostics
Call:
glm(formula = cbind(count, total - count) ~ scale(x), family = binomial(link = "logit"),
data = dat6)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.1993 0.5016 2.391 0.016800 *
scale(x) 2.4930 0.6799 3.667 0.000246 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 38.5956 on 9 degrees of freedom
Residual deviance: 9.6688 on 8 degrees of freedom
AIC: 22.934
Number of Fisher Scoring iterations: 5
2.5 % 97.5 %
(Intercept) 0.3199488 2.343090
scale(x) 1.3677029 4.104779
NULL
Since this model used a logit link (\(log(\frac{\pi}{1-\pi})\)), in order to return the estimates to the scale of the response, we can either:
- for the y-intercept:
- backtransform by exponentiation to the odds (\(\frac{\pi}{1-\pi}\)).
- backtransform to probability (\(\pi\)) via the inverse logit (e.g. plogis()), noting that \(\pi = \frac{odds}{1 + odds}\).
- for the slopes:
- backtransform by exponentiation to odds ratio.
Importantly, if using confidence intervals for the purpose of inference testing:
- if using estimates on the link (logit) scale, intervals that do not include 0 are “significant”
- if using estimates on the response scale, intervals that do not include 1 are “significant”
Interpretation:
- when \(x=0\) (i.e. the average \(x\) value, since \(x\) was centered):
- the odds of success are expected to be exp(1.199) = 3.318 (the y-intercept).
- the probability of success is expected to be plogis(1.199) = 0.768 (the y-intercept).
- for every one unit change in \(x\) (which represents a span of approximately 34% of the range of \(x\) since standardised), the odds of success increase by exp(2.493) = 12.098 (the slope)
- this is equivalent to a 1109.753% increase in the odds of success per one unit change in (standardised) \(x\)
- the slope (on the link scale) could be as low as 1.368 or as high as 4.105 (95% confidence interval of the slope)
- as the above interval does not include the value of 0, the linear relationship (slope) can be considered significant
- the slope (on the response scale) could be as low as 3.926 or as high as 60.629 (95% confidence interval of the slope)
- as the above interval does not include the value of 1, the linear relationship (slope) can be considered significant
- we would reject the null hypothesis of no relationship (p-value for slope is less than 0.05)
- an R squared value is not reported for this model (the R squared computation above returned NULL)