GLM from first principles (nearly)

30 July, 2023

By: Murray Logan

Preparations

Load the necessary libraries

library(tidyverse) #for data wrangling and plotting
library(gganimate) #for animations
library(gridExtra) #for additional plotting routines

Preamble

Statistics is a branch of mathematics. Therefore it is inevitable that this tutorial WILL visit numerous mathematical concepts and present mathematical nomenclature. In an attempt to contain the dread felt by readers who have long since seen off their torturous high school mathematics lessons, I will try to introduce these elements as gently as possible and will attempt to maintain a consistent use of symbology throughout. Keep in mind that although it is possible to fit linear models without understanding the underlying mathematics, a more solid grounding will help ensure that models are fit and interpreted correctly.

Linear models

One of the main goals of fitting a statistical model is to use a sample of observed data to gain some insight into the presence or nature of relationships between a response and predictors in a defined population. This is achieved by first proposing one or more (deterministic) mathematical representations of possible relationships and then challenging those representations with data.

For example, we might propose that an increase in a predictor (\(x\)) will be associated with a linear increase in a response (\(y\)).

The general mathematical equation for a straight line is:

\[ y_i = \beta_0 + \beta_1 x_i \]

where:

  • \(y_i\) represents a vector (think single variable) of response values (the subscript \(i\) indicates that there are multiple values),
  • \(x_i\) represents the \(i^{th}\) observed value of the predictor,
  • \(\beta_0\) represents a single y-intercept (the expected value of \(y\) when \(x=0\)),
  • \(\beta_1\) represents the slope of the line (the rate of change in \(y\) associated with a one unit change in \(x\)).

The above expression can be re-written as: \[ y_i = \beta_0\times 1 + \beta_1\times x_i \]

which we can visualize in vector/matrix form as: \[ \underbrace{\begin{bmatrix} y_1\\ y_2\\ y_3\\ ...\\ y_i \end{bmatrix}}_{Y} = \underbrace{\begin{bmatrix} 1&x_1\\ 1&x_2\\ 1&x_3\\ ...\\ 1&x_i \end{bmatrix}}_{\boldsymbol{X}} \underbrace{\vphantom{\begin{bmatrix} y_1\\ y_2\\ y_3\\ ...\\ y_i \end{bmatrix}}\begin{bmatrix} \beta_0\\ \beta_1 \end{bmatrix}}_{\boldsymbol{\beta}} \]

So given a sample of observed response (\(y\)) and predictor (\(x\)) values, we want to be able to estimate the two unknowns (\(\beta_0\) and \(\beta_1\)) in the above mathematical expression. If we assume that there is no uncertainty in the data, we can easily calculate the necessary values of \(\beta_0\) and \(\beta_1\) using any two data pairs and solving the simultaneous equations.

simultaneous equations solution \[ \begin{align} 7 =& \beta_0 + \beta_1\times 1, ~~~ 12 = \beta_0 + \beta_1\times 2\\ \beta_0 =& 7 - \beta_1 ~~(\text{isolate }\beta_0\text{ from the first equation})\\ 12 =& (7 - \beta_1) + 2\beta_1 ~~(\text{substitute into the second equation})\\ \beta_1 =& 5 ~~(\text{solve for }\beta_1)\\ \beta_0 =& 2 ~~(\text{substitute }\beta_1\text{ and solve for }\beta_0)\\ \end{align} \]
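As an aside, the same pair of simultaneous equations can be solved numerically in R with the solve() function (a small sketch using the two data pairs, (x=1, y=7) and (x=2, y=12), from above):

A = cbind(1, c(1, 2))   ## model matrix: a column of 1s and the two x values
b = c(7, 12)            ## the two observed responses
solve(A, b)             ## should return beta_0 = 2 and beta_1 = 5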

The above model is a deterministic or mathematical model - it suggests complete confidence in the relationship. As already indicated, the point of statistical modelling is to formulate the equation in the presence of a sample of data drawn from the actual population. The models themselves are always a low-dimensional representation of reality. They cannot hope to capture all the complexity contained in the system and nor do they want to. They just want to provide insights into a very small fraction of this complexity.

Since a set of observed responses is likely to be the result of a large (if not infinite) number of interacting processes, it is highly unlikely that all observed responses will respond to the predictor in exactly the same way. Therefore, all data contain noise. In modelling, we want to see if there is any detectable signal amongst all the noise.

Statistical models include both a deterministic component (a mathematical expression that relates a response to one or more predictors) as well as a stochastic component (representing uncertainty). For example, the following figures highlight the distinction between a purely deterministic (mathematical) model (left hand figure) and a statistical model (right hand figure).


The statistical model captures the stochastic component (\(\varepsilon\)) which represents the degree to which each observed value differs from what would be expected in the absence of any noise (uncertainty). These differences between observed and expected values are called residuals. \[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \]

Again, in vector and matrix form, this would be: \[ \underbrace{\begin{bmatrix} y_1\\ y_2\\ y_3\\ ...\\ y_i \end{bmatrix}}_{Y} = \underbrace{\begin{bmatrix} 1&x_1\\ 1&x_2\\ 1&x_3\\ ...\\ 1&x_i \end{bmatrix}}_{\boldsymbol{X}} \underbrace{\vphantom{\begin{bmatrix} y_1\\ y_2\\ y_3\\ ...\\ y_i \end{bmatrix}}\begin{bmatrix} \beta_0\\ \beta_1 \end{bmatrix}}_{\boldsymbol{\beta}} + \underbrace{\begin{bmatrix} \varepsilon_1\\ \varepsilon_2\\ \varepsilon_3\\ ...\\ \varepsilon_i \end{bmatrix}}_{\boldsymbol{\varepsilon}} \]

  • \(Y\) represents a column vector of the observed response values
  • \(\boldsymbol{X}\) represents what is called the model matrix. In this case it comprises two columns: a column of 1s (indicating that there is an intercept) and a column containing the observed predictor values.
  • \(\boldsymbol{\beta}\) represents a vector of parameters to be estimated. In this case, there are two (the intercept and the slope).
  • \(\boldsymbol{\varepsilon}\) represents a column vector of residuals.

This could all be summarised as: \[ Y = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \]

Example data

This tutorial will blend theoretical discussions with actual calculations and model fits. I believe that by bridging the divide between theory and application, we all gain better understanding. The applied components of this tutorial will be motivated by two fabricated data sets - one that simulates samples drawn from a Gaussian (normal) distribution and one that simulates samples drawn from a Poisson distribution.

The advantage of simulated data over real data is that with simulated data, we know the ‘truth’ and can therefore gauge the accuracy of estimates.

Motivating example 1

Let's formally simulate the data illustrated above. The underlying process dictates that, on average, a one unit change in the predictor (x) will be associated with a five unit change in the response (y), and that when the predictor has a value of 0, the response will typically be 2. Hence, the response (y) will be related to the predictor (x) via the following:

\[ y = 2 + 5x \]

This is a deterministic model; it has no uncertainty. In order to simulate actual data, we need to add some random noise. We will assume that the residuals are drawn from a Gaussian distribution with a mean of zero and a standard deviation of 4. The predictor will comprise the 10 integer values from 1 to 10. We will round the response to two decimal places.

For repeatability, a seed will be employed on the random number generator.
Note, the smaller the dataset, the less likely it is to represent the underlying deterministic equation, so we should keep this in mind when we look at how closely our estimated parameters approximate the ‘true’ values. Hence, the seed has been chosen to yield data that maintain a general trend consistent with the defining parameters.

set.seed(234)
dat = data.frame(x=1:10) %>%
    mutate(y = round(2 + 5*x + rnorm(10, 0, 4), 2))
dat
##     x     y
## 1   1  9.64
## 2   2  3.79
## 3   3 11.00
## 4   4 27.88
## 5   5 32.84
## 6   6 32.56
## 7   7 37.84
## 8   8 29.86
## 9   9 45.05
## 10 10 47.65
ggplot(data=dat) + geom_point(aes(y=y, x=x))

We will use these data in two ways: firstly, to estimate the mean and variance of the response (y) ignoring the predictor (x), and secondly, to estimate the relationship between the response and the predictor.

For the former, we know that the mean and variance of the response (y) can be calculated as:

\[ \begin{align} \bar{y} =& \frac{1}{n}\sum^n_{i=1}y_i\\ var(y) =& \frac{1}{n-1}\sum^n_{i=1}(y_i-\bar{y})^2\\ sd(y) =& \sqrt{var(y)} \end{align} \]

mean(dat$y)
## [1] 27.811
var(dat$y)
## [1] 225.9111
sd(dat$y)
## [1] 15.03034

Motivating example 2

The Poisson distribution is parameterized by a single parameter (\(\lambda\)) which represents both the mean and the variance. Furthermore, Poisson data can only be non-negative integers.

Unlike a simple trend between Gaussian (or uniform) variables, modelling against a Poisson distribution shifts the model onto a logarithmic scale (via the log link). This needs to be taken into account when we simulate the data. The parameters that we use to define the underlying process need to either be expressed on a logarithmic scale, or else converted to a logarithmic scale prior to using them for generating the random data.

Moreover, for any model that involves a non-identity link function (such as a logarithmic link function for Poisson models), ‘slope’ is only constant on the scale of the link function. When it is back transformed onto the natural scale (scale of the data), it takes on a different meaning and interpretation.

We will choose \(\beta_0\) such that the expected response is 2 when x=0. As for the ‘effect’ of the predictor on the response, let's say that for every one unit increase in the predictor the response increases by 20% (on the natural scale). Hence, on the log scale, the slope will be \(log(1.2)=\) 0.1823216.

set.seed(123)
beta = c(2, 1.20)
beta=log(beta)
n = 10
dat2 = data.frame(x=seq(1,10,len=n)) %>%
    mutate(y = rpois(n, lambda=exp(beta[1] + beta[2]*x)))
dat2
##     x  y
## 1   1  1
## 2   2  4
## 3   3  3
## 4   4  7
## 5   5  9
## 6   6  2
## 7   7  7
## 8   8 12
## 9   9 10
## 10 10 18
ggplot(data=dat2) + geom_point(aes(y=y, x=x)) 

OLS

Motivating example 1 (response only)

The Ordinary Least Squares (OLS) estimate of a parameter (e.g. the mean, \(\mu\)) is the value of that parameter that minimizes the sum of the squared residuals (e.g. \(\sum (y_i-\mu)^2\)). That is, it minimizes the total discrepancy between the observed values (\(y_i\)) and those expected (\(\mu\)). We could calculate the residual sum of squares (RSS) for a large number of candidate parameter values in order to find the ‘optimum’ parameter values (those that minimize RSS).

mu = seq(15,40,len=1000)
rss = vector('numeric', length(mu))
for (i in 1:length(mu)) {
    rss[i] = sum((dat$y - mu[i])^2)
}
mu[which(rss==min(rss))]
## [1] 27.81281
ggplot(data=NULL) + geom_line(aes(y=rss, x=mu))

This minimum is very close to that calculated with the simple formula (as applied with the mean() function).

Matrix algebra

Now this is all fine and well, but what if we need to estimate multiple parameters? Let's see how we could estimate the mean via matrix algebra as a stepping stone towards solving more complex problems.

We start by expressing the problem as a linear equation.

\[ y_i = \beta_0 + \varepsilon_i \]

Here, we are saying that our observed values can be considered to be equal to the mean of the values plus the individual deviations (residuals) of each of the values from the mean. Note, in the context of a linear model, this is just an intercept-only model.

Recall that the typical linear model might look like:

\[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \]

Which is the same as: \[ y_i = \beta_0\times 1 + \beta_1\times x_i + \varepsilon_i \]

Returning to our intercept only model then: \[ y_i = \beta_0\times 1 + \varepsilon_i \]

And in matrix form, this would be: \[ \underbrace{\begin{bmatrix} y_1\\ y_2\\ y_3\\ ...\\ y_n \end{bmatrix}}_{Y} = \underbrace{\begin{bmatrix} 1\\ 1\\ 1\\ ...\\ 1 \end{bmatrix}}_{\boldsymbol{X}} \underbrace{\vphantom{\begin{bmatrix} y_1\\ y_2\\ y_3\\ ...\\ y_n \end{bmatrix}}\begin{bmatrix} \beta_0 \end{bmatrix}}_{\boldsymbol{\beta}} + \underbrace{\begin{bmatrix} \varepsilon_1\\ \varepsilon_2\\ \varepsilon_3\\ ...\\ \varepsilon_n \end{bmatrix}}_{\boldsymbol{\varepsilon}} \]

  • \(Y\) represents a column vector of the observed data
  • \(\boldsymbol{X}\) represents what is called the model matrix. In this case, that is just a single column matrix of 1s. The ones indicate that there is an intercept. In matrix algebra, matrices are multiplied, so a column of ones just ensures that the thing being multiplied does exist.
  • \(\boldsymbol{\beta}\) represents a vector of parameters to be estimated. In this case, there is only one.
  • \(\boldsymbol{\varepsilon}\) represents a column vector of residuals.

This could all be summarised as: \[ Y = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \]

If we assume that the residuals (\(\varepsilon\)) are independently and identically distributed (iid) - all independently drawn from the same normal distribution with a mean of 0 and a constant variance - then there is a rather elegant solution for solving for \(\boldsymbol{\beta}\).

Recall that OLS estimates the value of the unknown parameters (\(\boldsymbol{\beta}\)) as the values that minimize the sum of the squared residuals. So let's start by expressing the above equation in terms of \(\boldsymbol{\varepsilon}\).

\[ \boldsymbol{\varepsilon} = Y - \boldsymbol{X}\boldsymbol{\beta} \]

Via matrix algebra, we can calculate the sum of squared residuals as \(\boldsymbol{\varepsilon}^T\boldsymbol{\varepsilon}\), where \(\boldsymbol{\varepsilon}^T\) (sometimes written \(\boldsymbol{\varepsilon}'\)) is the transpose of \(\boldsymbol{\varepsilon}\), so \(\boldsymbol{\varepsilon}^T\boldsymbol{\varepsilon}\) is just:

\[ \begin{bmatrix} \varepsilon_1 & \varepsilon_2 & \varepsilon_3 &...& \varepsilon_n \end{bmatrix} \begin{bmatrix} \varepsilon_1\\ \varepsilon_2\\ \varepsilon_3\\ ...\\ \varepsilon_n\\ \end{bmatrix} = \begin{bmatrix} \varepsilon_1\times\varepsilon_1 + \varepsilon_2\times\varepsilon_2 + \varepsilon_3\times\varepsilon_3 + ... + \varepsilon_n\times\varepsilon_n \end{bmatrix} \]
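As a quick sanity check of this matrix result (a sketch only, using an arbitrary residual vector rather than anything estimated from the example data):

eps = c(1.2, -0.5, 0.3)   ## an arbitrary vector of residuals
t(eps) %*% eps            ## a 1x1 matrix containing the sum of the squared residuals
sum(eps^2)                ## the same quantity calculated directly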

Based on all this, we could re-write \(\boldsymbol{\varepsilon}^T\boldsymbol{\varepsilon}\) as:

\[ \boldsymbol{\varepsilon}^T\boldsymbol{\varepsilon} = (Y - \boldsymbol{X}\boldsymbol{\beta})^T(Y - \boldsymbol{X}\boldsymbol{\beta}) \]

If we then take the derivative of the above with respect to \(\boldsymbol{\beta}\) and set it to zero, we obtain the value of \(\boldsymbol{\beta}\) that minimizes the sum of squared residuals.
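The intermediate steps are a standard derivation: expand the quadratic, differentiate with respect to \(\boldsymbol{\beta}\) and set the result to zero (yielding the so-called normal equations).

\[ \begin{align} \boldsymbol{\varepsilon}^T\boldsymbol{\varepsilon} =& Y^TY - 2\boldsymbol{\beta}^T\boldsymbol{X}^TY + \boldsymbol{\beta}^T\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{\beta}\\ \frac{\partial\boldsymbol{\varepsilon}^T\boldsymbol{\varepsilon}}{\partial\boldsymbol{\beta}} =& -2\boldsymbol{X}^TY + 2\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{\beta} = 0\\ \boldsymbol{X}^T\boldsymbol{X}\boldsymbol{\beta} =& \boldsymbol{X}^TY\\ \end{align} \]

Solving for \(\boldsymbol{\beta}\) then gives: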

\[ \boldsymbol{\beta} = \underbrace{(\boldsymbol{X}^T\boldsymbol{X})^{-1}}_{\frac{1}{n}} \underbrace{\boldsymbol{X}^T Y}_{\sum^n_{i=1}y_i} \]

where \((\boldsymbol{X}^T\boldsymbol{X})^{-1}\) is the inverse of \((\boldsymbol{X}^T\boldsymbol{X})\). In matrix algebra, matrices are never divided, only multiplied. Therefore, to divide by \((\boldsymbol{X}^T\boldsymbol{X})\), it is necessary to multiply by its inverse, \((\boldsymbol{X}^T\boldsymbol{X})^{-1}\), in the same way that dividing by \(n\) is the same as multiplying by the inverse of \(n\), which is of course \(\frac{1}{n}\).

In the above equation, we can see how the matrix form translates directly into the simple calculation of a mean. \((\boldsymbol{X}^T\boldsymbol{X})^{-1}\) equates to \(\frac{1}{n}\) and \(\boldsymbol{X}^T Y\) equates to \(\sum^n_{i=1}y_i\).

So to estimate \(\boldsymbol{\beta}\), all we have to do is perform the above matrix multiplications. Unfortunately, matrix inversion (e.g. calculating \((\boldsymbol{X}^T\boldsymbol{X})^{-1}\)) is not always straightforward (although in this case it is, because the matrix only has a single value).

In order to be able to invert a matrix, the following must be satisfied:

  • the matrix must be square (have the same number of rows as columns)
  • be full rank, that is the number of independent columns (the rank) must be equal to the number of columns.

If these conditions are met, then we can obtain the inverse by decomposing the square matrix into two parts:

  • the inverse (i.e. what we are after)
  • an identity matrix (a by-product that is typically discarded)

In R, this can be achieved via the solve() (or qr.solve()) functions.

## Generate the single column matrix of 1s
X = cbind(rep(1,length(dat$y)))
## perform the matrix multiplications to solve for beta
## beta = (X'X)^-1 X'Y
beta = solve(t(X) %*% X) %*% t(X) %*% dat$y
beta
##        [,1]
## [1,] 27.811

This is the same as the empirical mean we calculated earlier.

Linear regression via matrix algebra (example 1 data)

Recall the motivating example 1 data as:

dat
##     x     y
## 1   1  9.64
## 2   2  3.79
## 3   3 11.00
## 4   4 27.88
## 5   5 32.84
## 6   6 32.56
## 7   7 37.84
## 8   8 29.86
## 9   9 45.05
## 10 10 47.65

So the statistical model might be:

\[ \begin{align} y_i =& \beta_0 + \beta_1 x_i + \varepsilon_i\\ Y =& \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \end{align} \]

which can be represented in matrix form as:

\[ \begin{bmatrix} 9.64\\ 3.79\\ 11\\ ...\\ \end{bmatrix} = \underbrace{\begin{bmatrix} 1&1\\ 1&2\\ 1&3\\ ...\\ \end{bmatrix}}_{\text{model matrix}} \begin{bmatrix} \beta_0\\ \beta_1 \end{bmatrix} + \begin{bmatrix} \varepsilon_1\\ \varepsilon_2\\ \varepsilon_3\\ ...\\ \end{bmatrix} \]

So, the response is represented by a vector and this is related to the model matrix (a matrix representing the predictor(s)), a vector of parameters to estimate (\(\beta_0\): the intercept and \(\beta_1\): the slope) and a vector of residuals (random noise).

Recall that in order to estimate \(\boldsymbol{\beta}\), we solve the following:

\[ \boldsymbol{\beta} = (\boldsymbol{X}^T\boldsymbol{X})^{-1} \boldsymbol{X}^T Y \]

## Generate the model matrix
X = model.matrix(~x, data=dat)
## Solve for beta
beta = solve(t(X) %*% X) %*% t(X) %*% dat$y
beta
##                 [,1]
## (Intercept) 2.650667
## x           4.574606

This whole process, and more, is conveniently wrapped into a simple R function called lm().

lm(y~x, data=dat)
## 
## Call:
## lm(formula = y ~ x, data = dat)
## 
## Coefficients:
## (Intercept)            x  
##       2.651        4.575

Generalized linear models

Recall from the previous section that the simple linear model can be written as: \[ \begin{align} y_i =& \beta_0 + \beta_1 x_i + \varepsilon_i\\ Y =& \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \end{align} \]

This could be re-written as:

\[ \begin{align} y_i \sim{}& N(\mu_i, \sigma^2I)\\ \mu_i =& \beta_0 + \beta_1 x_i\\ \end{align} \]

indicating that the response (\(y\)) is distributed as a Gaussian (normal), the mean of which is determined by the linear relationship with \(x\), and that there is a constant variance (\(\sigma^2\)) - all observations are independently and identically distributed (iid), that is, independently drawn from a distribution with the same variance.

The \(I\) signifies an identity matrix (a square matrix whose diagonals are all one).

More generally, the above can be written in vector/matrix form as: \[ \begin{align} Y \sim{}& N(\boldsymbol{\mu}, \sigma^2I)\\ \boldsymbol{\mu} =& \boldsymbol{X}\boldsymbol{\beta}\\ \end{align} \]

If we suspect that the residuals are not identically distributed (for example, if the variance is not constant across observations, as would be the case for Poisson distributed data), then we could alter (generalise) the above to:

\[ \begin{align} Y \sim{}& N(\boldsymbol{\mu}, \sigma^2\boldsymbol{W}^{-1})\\ \boldsymbol{\mu} =& \boldsymbol{X}\boldsymbol{\beta}\\ \end{align} \]

where \(\boldsymbol{W}\) is a diagonal matrix of positive weights.

This allows us to generalise to other (exponential) families (such as Binomial, Poisson, Negative Binomial, Gamma etc). For example, if our response (\(Y\)) were count data, we might consider a Poisson.

\[ \begin{align} Y \sim{}& Pois(\boldsymbol{\lambda})\\ \boldsymbol{\lambda} =& e^{\boldsymbol{X}\boldsymbol{\beta}}\\ &\text{OR equivalently}\\ log(\boldsymbol{\lambda}) =& \boldsymbol{X}\boldsymbol{\beta}\\ \end{align} \]

The Poisson distribution (\(P(x|\lambda) = e^{-\lambda}\lambda^x/x!\)) is parameterised by a single parameter (\(\lambda\)) that represents both the mean and the variance. Poisson data are bounded at the lower end by zero (negative values are not defined) - it is not possible to count fewer than 0 things. The Poisson model includes an exponential term, therefore to map between the scale of the linear predictor and the natural scale (scale of the observed data), we use a link function (in the case of the Poisson, this link is a log link). Hence the above can be generalized even further to:

\[ \begin{align} Y \sim{}& D(\boldsymbol{\eta}, ...)\\ g(\boldsymbol{\eta}) =& \boldsymbol{X}\boldsymbol{\beta}\\ \end{align} \]

where \(D\) is the nominated family, \(g\) is the link function and \(...\) represents any additional parameters required by the nominated distribution (such as \(\sigma^2\boldsymbol{W}^{-1}\) in the case of the Gaussian distribution).
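To make this concrete, the Poisson model described above can be fitted to the simulated dat2 using R's glm() function (a sketch only; glm() handles the estimation via iteratively reweighted least squares, which is beyond the scope of this section). Exponentiating the coefficients back-transforms them onto the natural scale, and they should land reasonably close to the ‘true’ values (2 and 1.2) used to simulate the data.

fit.pois = glm(y ~ x, data = dat2, family = poisson(link = 'log'))
coef(fit.pois)        ## estimates on the log (link) scale
exp(coef(fit.pois))   ## back-transformed onto the natural (response) scale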

Maximum likelihood

Ordinary Least Squares provides an elegant solution for when the data satisfy certain assumptions (normality, homogeneity, independence, etc), yet for many other situations, it is not appropriate.

Motivating example 1

As with OLS, let's start with the motivating example 1 data

dat$y
##  [1]  9.64  3.79 11.00 27.88 32.84 32.56 37.84 29.86 45.05 47.65

A more general alternative to OLS is Maximum Likelihood (ML).

Likelihood is a measure of how probable (likely) it is that a set of observations was drawn from (follows) a specific distribution. For example, we could evaluate the likelihood that the observations came from a normal (Gaussian) distribution with a specific mean and standard deviation.

Before we can understand likelihood, we must first remind ourselves of a couple of things about probability.

  1. for any continuous distribution, the probability of obtaining a specific value (that is, that a random variable (\(X\)) is exactly equal to a particular reference value (\(x\))) is infinitely small (\(Pr(X=x)=0\)). We can only directly estimate probabilities of obtaining values less than (or greater than) nominated reference points (quantiles).
  2. it is possible to calculate the probability that \(X\) lies between two reference points (\(Q1\) and \(Q2\)). This is just the probability of \(X\) being less than \(Q2\) minus the probability of \(X\) being less than \(Q1\) (\(Pr(Q1 < X \le Q2) = Pr(X \le Q2) - Pr(X \le Q1)\)).

So although we can't estimate the probability \(Pr(X=x)\) directly from the distribution's (cumulative distribution) function (\(f(x)\)), we can approximate it by calculating the (scaled) probability that \(X\) is in the interval \([x, x+\Delta]\):

\[ \frac{Pr(x < X \le x + \Delta)}{\Delta} \]

The smaller \(\Delta\) (so long as it is larger than 0), the more accurate the estimate. This becomes a simple calculus problem: the derivative (\(f'(x)\)) of the (cumulative) distribution function is the probability density function (PDF). The PDF allows us to approximate the probability of obtaining a single value from a distribution.
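We can verify this numerically for a standard normal distribution (a minimal sketch; pnorm() is the cumulative distribution function and dnorm() the corresponding density):

delta = 0.0001
(pnorm(1 + delta) - pnorm(1))/delta   ## scaled probability of a tiny interval above x = 1
dnorm(1)                              ## the density at x = 1 - essentially the same value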

Provided the data are all iid (independently and identically distributed), and thus from the same distribution, the probability (likelihood) that a set of values (\(X\)) comes from a specific distribution (described by its parameters, \(\theta\)) can be calculated as the product of their individual probability densities.

\[ L(X|\theta) = \prod^n_{i=1} f'(x_i|\theta) \]

The product of probability densities can soon become very small and this can lead to computational and rounding issues. Hence it is more usual to work with the logarithm of the likelihood. The log laws indicate that the log of a product is the same as the sum of the individual logs (\(log(A\times B) = log(A) + log(B)\)).

\[ LL(X|\theta) = \sum^n_{i=1} log(f'(x_i|\theta)) \]
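To see why the log scale is preferred, the following sketch evaluates both forms for the example 1 data at an arbitrary pair of parameter values (a mean of 20 and a standard deviation of 12) - the product of densities is a vanishingly small number, whereas the sum of the log densities is far more numerically manageable.

prod(dnorm(dat$y, mean = 20, sd = 12))              ## likelihood - a very small number
sum(dnorm(dat$y, mean = 20, sd = 12, log = TRUE))   ## log-likelihood of the same data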

The PDF of a Gaussian distribution is:

\[ P(x|\mu, \sigma) = \frac{1}{\sqrt{\sigma^2 2\pi}} e^{-\frac{1}{2}((x-\mu)/\sigma)^2} \]

So in order to estimate the optimum values for the parameters (\(\mu\) and \(\sigma\)) given our data (\(x\)), we would maximize the corresponding log-likelihood. Returning to our example:

\[ \begin{align} LL(x_1, x_2, x_3, ..., x_n|\mu, \sigma^2) =& \sum^n_{i=1}ln\left(\frac{1}{\sqrt{\sigma^2 2\pi}} e^{-\frac{1}{2}((x_i-\mu)/\sigma)^2}\right)\\ =& -\frac{n}{2}ln(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum^n_{i=1}(x_i-\mu)^2 \end{align} \]

Note optimization routines usually attempt to minimize rather than maximize. Nevertheless, finding the maximum of a log-likelihood (what we are after) is the same as finding the minimum of the negative log-likelihood.

For the examples and routines that follow, we will write our own code from first principles as well as use existing built-in R functions. Although the code that we write could be written to find maxima, the inbuilt functions typically optimise by minimising. As a result, to enable us to compare routines, we will work with minimising the negative log-likelihood.

So, let's now write a function for calculating the negative log-likelihood for our data given a Gaussian distribution, based on the formula above.

LL.gaus = function(theta, x) {
    m=theta[1]
    s=theta[2]
    ll = -(length(x)/2)*(log(2*pi*s^2)) + (-1/(2*s^2)) * sum((x-m)^2)
    ##OR
    ## ll = sum(dnorm(x, mean=m, sd=s, log=TRUE))
    return(-ll)
}

In a similar manner to the brute force approach we used to approximate OLS earlier, let's use a brute force approach to explore the partial negative log-likelihood profile for the mean and then approximate the mean (we will fix the standard deviation at 1). We refer to it as a partial profile because the other parameter is held constant.

mu = seq(15,40,len=1000)
theta=cbind(mu=mu,sigma=1)
B=apply(theta, 1, LL.gaus, dat$y)
theta[which(B==min(B)),]
##       mu    sigma 
## 27.81281  1.00000
ggplot(data=NULL) + geom_line(aes(y=B, x=theta[,'mu']))

Again, this estimate is very close to the value calculated empirically (via mean()).

Optimization

Optimization algorithms are essentially search algorithms. They are attempting to find a single point in multidimensional space. There are numerous types of algorithms, each of which offers different advantages and disadvantages under different circumstances. For example, some assume that the underlying likelihood is very smooth (changes very gradually) and gain efficiencies by being able to locate minima via differentiation. Others are less efficient, yet more accurate when the likelihood is not smooth or there are multiple local minima.

Let's use an analogy to gain a better appreciation of the problem and solutions. Imagine we had a big block of land and we wanted to install a well from which to draw water from an underground aquifer. Although there are no physical restrictions on where the well can be positioned, we are keen on it being as shallow as possible (perhaps because it is cheaper to drill a shallow well).

The depth from the land surface down to the aquifer is not constant over space and we want to be able to put our well in the shallowest point. Although we do not know the true underground topography, we can drill narrow pilot holes and accurately measure the depth down to the aquifer at any point in space.

To put this another way, although we do not know what the underground profile looks like throughout the entire spatial domain (all possible latitude/longitude values), we can estimate its value (depth) at any point (a single latitude/longitude).

To find the optimum location for our well, we need a search algorithm - one that is able to find the latitude/longitude associated with the minimum depth. We will showcase a few different options and try to describe the advantages and disadvantages of each. For example, in our well analogy, if the depth profile was very smooth (undulated very slowly), we might be able to use approximations to the curvature of the undulations to find where any minima are. On the other hand, if the profile is not smooth (perhaps there are underground caves or other abrupt underground geological features), such approximations may be very inaccurate and more exhaustive searching (such as a grid of pilot holes) may be required.

Just like with the underground aquifer, although a (negative) log-likelihood function has an unknown profile in multidimensional space (one dimension for each parameter to estimate), we can evaluate it for any combination of parameters.

Brute force

One conceptually simple way of searching for the minimum of a function is to evaluate the function for a large number of parameter combinations (perhaps in a grid) - for example, drilling a pilot hole every 100m in a grid. If the grid is fine enough, it will locate the minimum (maximum) no matter what the function's profile is. However, the finer the grid, the more effort is required - lower efficiency.

The following code chunk evaluates and plots the negative log-likelihood for a full (\(100\times 100\)) grid of parameter combinations. Negative log-likelihood is represented as a colour along the green to white spectrum. The blue lines represent major contours in the profile. The optimum parameters (those associated with the minimum negative log-likelihood and thus maximum log-likelihood) are indicated by the black solid point.

mu = seq(15,40,len=100)
sigma=seq(10,20,len=100)
theta = expand.grid(mu=mu, sigma=sigma)
theta$LL=apply(theta, 1, LL.gaus, x=dat$y)
ggplot(data=theta, aes(y=mu, x=sigma)) +
    geom_tile(aes(fill=LL), show.legend=FALSE) + geom_contour(aes(z=LL)) +
    geom_point(data=theta[which(theta$LL==min(theta$LL)),], aes(y=mu, x=sigma), color='black') +
    scale_fill_gradientn(colors=terrain.colors(12)) +
    scale_x_continuous(expand=c(0,0)) +
    scale_y_continuous(expand=c(0,0))

theta[which(theta$LL==min(theta$LL)),]
##            mu    sigma       LL
## 4252 27.87879 14.24242 40.76342

The optimum solution yielded an estimated mean of 27.88 and sigma of 14.24. These values are very similar to the empirical calculations.

The brute force approach is a useful illustration, but it is practically limited to just one or two parameters, since the entire search space within the domain needs to be evaluated. This is essentially an exhaustive search algorithm. Furthermore, the accuracy is dependent on the resolution of the grid. The finer the grid, the more accurate the result, however it does require \(n^p\) function evaluations (where \(n\) is the number of grid increments and \(p\) is the number of parameters).

Of course, we can always use a relatively coarse search grid and, having identified the ‘optimum’ parameter configuration within this grid, apply a finer resolution grid over a narrower search domain. This is analogous to starting with a 100x100m drilling grid and then centring a 1x1m drilling grid around the ‘best’ location.
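The following sketch illustrates this two-stage idea with the LL.gaus() function defined above (the grid resolutions and the width of the refined search window are arbitrary choices).

## coarse grid over the full search domain
theta.coarse = expand.grid(mu = seq(15, 40, len = 20), sigma = seq(10, 20, len = 20))
theta.coarse$LL = apply(theta.coarse[, c('mu', 'sigma')], 1, LL.gaus, x = dat$y)
best = theta.coarse[which.min(theta.coarse$LL), ]
## finer grid centred on the coarse optimum
theta.fine = expand.grid(mu = seq(best$mu - 1, best$mu + 1, len = 20),
                         sigma = seq(best$sigma - 1, best$sigma + 1, len = 20))
theta.fine$LL = apply(theta.fine[, c('mu', 'sigma')], 1, LL.gaus, x = dat$y)
theta.fine[which.min(theta.fine$LL), ]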

Simplex methods

A simplex is a geometric shape in multidimensional parameter space that has one more vertex than there are parameter dimensions (\(p+1\)). In this case, we are estimating two parameters, therefore the simplex is a triangle (three vertices). The most common form of simplex optimisation is the Nelder-Mead algorithm. As with most optimisation algorithms, the Nelder-Mead algorithm is a search algorithm that aims to find a point in multidimensional space as efficiently as possible.

Optimisation routines work on a wide variety of functions (not just likelihood functions). To keep the terminology about what is being optimised general, the function to be optimised is often referred to as a loss or objective function. A loss function is any function that evaluates the performance (or fit) of a model against a set of observations (i.e. data). The further the loss is from zero, the worse the fit (hence the desire to find the minimum).

The Nelder-Mead algorithm can be described as (keep in mind that it is a minimisation rather than maximisation):

  1. Start with some initial estimates - i.e. a set of \(p+1\) vertices, each comprising a value for every parameter.

  2. Identify the vertex with the highest (i.e. worst) loss (negative log-likelihood).

  3. Reflect the worst vertex through the centroid of the other vertices.

    3a. If the reflected point is better (lower negative log-likelihood) than the second worst point, but not better than the best point, replace the worst point with the reflected point.

    3b. If instead the reflected point is the best vertex so far, then expand this point by doubling the reflection distance.

     3b.1.  If this expanded point is better than the reflected point, replace the worst point with the expanded point.
     3b.2.  Otherwise, replace the worst point with the reflected point.

    3c. If neither 3a nor 3b applies (the reflected point is not better than the second worst point), then contract the point by halving the reflection distance.

     3c.1.  If this contracted point is better than the worst point, replace the worst point with the contracted point.
     3c.2.  Otherwise, shrink the entire simplex (contract all vertices towards the centroid).
  4. Repeat Steps 2-3 until either the maximum number of iterations have occurred or the change in loss between two successive iterations is below a certain threshold.

Clearly, the more iterations are performed, the more accurate the estimates, and yet the longer the search will take.

In the well analogy, the simplex represents three points of the search triangle. Reflecting the triangle allows us to move away from the direction we know to be deeper. We expand the triangle in order to explore a new direction and contract the triangle to narrow our search area.

Code for illustrating the process is listed as details below. This code is a modification of the code presented in [https://github.com/nicolaivicol/nelder-mead-R/blob/master/optimNM.R]

get.optim.NM <- function(X.vert, params.init, objective.fn, iter.max=250, abs.tol=0.0001, x, control=list(fnscale=1,refl=1,expan=2,contr=-0.5,shrink=0.5))
{
                                        # input dimension
    X.len <- length(params.init)
                                        # initialize controls before iterations of searching
    iter <- 0; not.converged <- 1; not.max.iter <- 1
    X.optim <- params.init; f_X.optim <- control$fnscale*objective.fn(X.optim, x=x)
                                        # while loop, iterations
    while (not.converged & not.max.iter)
    {
                                        # get values at vertices
        f_X.vert <- control$fnscale*apply(X = X.vert, MARGIN = 1, FUN = objective.fn, x) 
                                        # order ascending X.vert and f(X.vert), by f(X.vert)
        X.order <- sort(f_X.vert, index.return = TRUE)$ix
        X.vert <- X.vert[X.order, ]
        f_X.vert <- f_X.vert[X.order]
                                        # get centroid (mean on each dimension) of all points except the worst
        X.centr <- apply(X = X.vert[1:X.len, ], MARGIN = 2, FUN = mean)
                                        # get reflected point
        X.refl <- X.centr + control$refl*(X.centr - X.vert[X.len+1, ])
        f_X.refl <- control$fnscale*objective.fn(X.refl,x)
        if ((f_X.vert[1] <= f_X.refl) & (f_X.refl < f_X.vert[X.len]))
        { 
                                        # if the reflected point is better than the second worst, but not better than the best...
                                        # ... then obtain a new simplex by replacing the worst point with the reflected point
            X.vert[X.len+1, ] <- X.refl 
        } else if (f_X.refl < f_X.vert[1]) {
                                        # if the reflected point is the best point so far
                                        # ... then compute the expanded point
            X.expan <- X.centr + control$expan*(X.centr - X.vert[X.len+1, ])
            f_X.expan <- control$fnscale*objective.fn(X.expan,x)
                                        # ... if the expanded point is better than the reflected point
            if (f_X.expan < f_X.refl)
            {
                                        # ... then obtain a new simplex by replacing the worst point with the expanded point
                X.vert[X.len+1, ] <- X.expan   
            } else {
                                        # ... else obtain a new simplex by replacing the worst point with the reflected point
                X.vert[X.len+1, ] <- X.refl
            }
        } else {
                                        # ... reflected point is not better than second worst
                                        # ... then compute the contracted point
            X.contr <- X.centr + control$contr*(X.centr - X.vert[X.len+1, ])
            f_X.contr <- control$fnscale*objective.fn(X.contr,x)
                                        # ... if the contracted point is better than the worst point
            if (f_X.contr < f_X.vert[X.len+1])
            {
                                        # ... then obtain a new simplex by replacing the worst point with the contracted point
                X.vert[X.len+1, ] <- X.contr
            } else {
                                        # ... shrink the simplex: X = X1 + coef.shrink(X-X1)
                X.vert <- sweep(control$shrink*sweep(X.vert, 2, X.vert[1, ], FUN = "-"), 2, X.vert[1, ], FUN="+")
            }    
        }
                                        # get values at vertices
        f_X.vert <- control$fnscale*apply(X = X.vert, MARGIN = 1, FUN = objective.fn, x) 
                                        # order asc X.vert and f(X.vert)
        X.order <- sort(f_X.vert, index.return = TRUE)$ix
        X.vert <- X.vert[X.order, ]
        f_X.vert <- f_X.vert[X.order]   
                                        # update controls
        iter <- iter + 1 
        not.max.iter <- (iter < iter.max)*1
        not.converged <- (abs(control$fnscale*objective.fn(X.vert[X.len, ],x)- control$fnscale*objective.fn(X.vert[1, ],x)) > abs.tol)*1
        
        X.optim <- X.vert[1, ]; f_X.optim <- control$fnscale*objective.fn(X.optim,x)
    }
    return(list(X.optim=X.optim, f_X.optim=f_X.optim, X.vert=X.vert, iter=iter))   
}

We can illustrate the iterative process by plotting the outcome of a limited number of iterations. In this illustration, the filled-in triangle represents the current optimum simplex. Previous simplexes remain as open triangles.

## Starting values at the centre of the vertices
params.init=c(mu=20, sigma=12)
d.mu=0.5
d.sigma=0.3
simplex = rbind(Vertex1=params.init + c(d.mu,d.sigma),
                Vertex2=params.init + c(-d.mu,d.sigma),
                Vertex3=params.init + c(-d.mu,-d.sigma)) %>%
    data.frame
base.plot = ggplot(data=theta,aes(y=mu, x=sigma)) +
    geom_tile(aes(fill=LL), show.legend=FALSE) + geom_contour(aes(z=LL)) +
    scale_fill_gradientn(colors=terrain.colors(12)) +
    scale_x_continuous(expand=c(0,0)) +
    scale_y_continuous(expand=c(0,0))

newdata =vector('list', 15)
a=get.optim.NM(X.vert=as.matrix(simplex), params.init, objective.fn=LL.gaus, x=dat$y, iter.max=1, control=list(fnscale=1,refl=1,expan=2,contr=-0.5,shrink=0.5))
newdata[[1]] =data.frame(a$X.vert)
for (i in 2:15) {
    a=get.optim.NM(X.vert=a$X.vert, params.init=a$X.optim, objective.fn=LL.gaus, x=dat$y, iter.max=1, control=list(fnscale=1,refl=1,expan=2,contr=-0.5,shrink=0.5))
    newdata[[i]] = data.frame(a$X.vert)
}
newdata = bind_rows(newdata, .id='Iter') %>% mutate(Iter=factor(Iter, levels=unique(Iter)))
g=base.plot + 
    geom_polygon(data=newdata, aes(y=mu, x=sigma, group=Iter), color='black', fill='grey40') +
    transition_manual(Iter) + shadow_trail(distance=0.1,alpha=0.4, color='grey40', fill=NA) + labs(title='Iter: {current_frame}')
ga=animate(g, fps = 20, duration = 15)
#ga=animate(g, renderer=av_renderer())
anim_save('simplexAnim.gif', animation=ga, path='../tut/glm_part1_files',  renderer=av_renderer())

Having seen this illustration, we could allow the Nelder-Mead simplex optimization to iterate more thoroughly. We will now instruct the routine to iterate a maximum of 250 times. Along with setting a maximum number of iterations, most optimizations also have a stopping trigger based around the extent of improvement between iterations. This convergence tolerance defines a threshold difference below which two successive iteration outcomes are considered the same. The lower the value, the more accuracy is demanded.

It is also a good idea to repeat the optimisation from multiple starting configurations to ensure that any single optimisation has not just settled on a local minimum (maximum).

get.optim.NM(X.vert=as.matrix(simplex), params.init, objective.fn=LL.gaus, x=dat$y, iter.max=250, abs.tol=1e-08, control=list(fnscale=1,refl=1,expan=2,contr=-0.5,shrink=0.5))
## $X.optim
##       mu    sigma 
## 27.81238 14.25925 
## 
## $f_X.optim
##    sigma 
## 40.76329 
## 
## $X.vert
##               mu    sigma
## Vertex3 27.81238 14.25925
## Vertex2 27.80955 14.25881
## Vertex1 27.81327 14.25858
## 
## $iter
## [1] 32
get.optim.NM(X.vert=as.matrix(simplex - c(-0.5,0.2)), params.init, objective.fn=LL.gaus, x=dat$y, iter.max=250, abs.tol=1e-08, control=list(fnscale=1,refl=1,expan=2,contr=-0.5,shrink=0.5))
## $X.optim
##       mu    sigma 
## 27.81122 14.25812 
## 
## $f_X.optim
##    sigma 
## 40.76329 
## 
## $X.vert
##               mu    sigma
## Vertex1 27.81122 14.25812
## Vertex2 27.80970 14.25939
## Vertex3 27.81149 14.26005
## 
## $iter
## [1] 38

The two sets of estimated parameters (listed as X.optim) are very similar (converged) and are very similar to those calculated empirically.

R has an inbuilt function (optim()) that is an interface to numerous optimization algorithms, and the default algorithm is Nelder-Mead. As with other optimizations, optim() defaults to minimizing. To force it to maximize (if our likelihood function returned log-likelihood rather than negative log-likelihood), we can indicate that the fnscale is -1. Other important optimization control parameters include:

  • maxit - the maximum number of iterations to perform (500 by default for Nelder-Mead).
  • abstol - the absolute convergence tolerance. The lower the tolerance, the smaller the change in optimized value (log-likelihood) required before convergence is reached.
optim(par=c(20,12), LL.gaus, x=dat$y, control=list(fnscale=1))
## $par
## [1] 27.81143 14.25771
## 
## $value
## [1] 40.76329
## 
## $counts
## function gradient 
##       53       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL

We can see that the negative log-likelihood calculation was performed 53 times before convergence. Whilst this is not the same as the number of iterations, it does provide an estimate of the total computational load.

If we compare this to the brute force approach (which required \(100\times 100=10,000\) evaluations), the Nelder-Mead simplex approach is a substantial improvement.

Derivative methods

If the profile of a function is smooth enough, and the function itself is doubly differentiable (can be differentiated into first and second order derivatives), then we can make use of a group of algorithms based on a root-finding algorithm devised by Isaac Newton. In mathematical contexts, a root is the value of the parameters (\(x\)) when the function equals zero (\(f(x)=0\)).

A simplified version of Newton’s method, the Newton-Raphson method, shows that the root of a function (\(f(x)\)) can be approximated by iteratively computing an updated estimate (\(x_{n+1}\)) from the current estimate (\(x_n\)):

\[ x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)} \]

where \(x_n\) is the initial parameter estimate, \(x_{n+1}\) is the improved parameter estimate, \(f(x_n)\) is the value of the function at \(x_n\) and \(f'(x_n)\) is the first order derivative (the slope) of the function at \(x_n\).

Before we use this approach to obtain maximum likelihood estimates, let's see how it works with a polynomial function (e.g. \(y=0.1x^4 + x^3 - x^2 + 10x + 30\)).

In the following animation, we will use the Newton-Raphson method to estimate the value of \(x\) when \(y=0\). The animation will go through five iterations. For each iteration, the red line represents the slope associated with the current estimate of \(x\). The point where this line intersects with the dashed line (\(y=0\)) is the updated estimate for \(x\). We can see that by the fifth iteration, the estimated \(x\) has begun to stabilise (converge) on a value of approximately \(-1.98\).
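As a point of comparison (a sketch only), R's built-in uniroot() function performs one-dimensional root finding over a nominated interval and should arrive at essentially the same root for this polynomial.

f = function(x) {0.1*x^4 + x^3 - x^2 + 10*x + 30}
uniroot(f, interval = c(-4, 0))$root   ## should be approximately -1.98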

Code used to implement Newton-Raphson method for simple functions and generate animation
NR = function(f, x, return.grid=FALSE) {
    if (is.function(f)) f=body(f)[[2]]    
    f_x = eval(f)
    f._x = eval(D(f,'x'))
    if (return.grid) {
        list(x1=x - f_x/f._x, f_x=f_x, f._x=f._x)
    }else {
        list(x1=x - f_x/f._x)
    }
}
optim.NR = function(f, x0, abs.tol=1.0E-6, iter.max=250, return.grid=FALSE) {
    iter <- 0; not.converged <- 1; not.max.iter <- 1
    fgrid = list()
    while (not.converged & not.max.iter) {
        nr=NR(f, x=x0,return.grid)
        x1=nr$x1
        iter <- iter + 1 
        not.max.iter <- (iter < iter.max)*1
        not.converged=(abs(x0-x1)>abs.tol)*1
        if (return.grid) fgrid[[iter]] = c(x0=x0, x1=x1, f_x=nr$f_x, f._x=nr$f._x)
        x0=x1
    }
    list(X.optim=x0,iter=iter, grid=do.call('rbind',fgrid))
}
f = function(x) {0.1*x^4 + x^3 - x^2 + 10*x + 30} 
a=optim.NR(f, x0=3, abs.tol=0.001, return.grid=TRUE)

x = seq(-4,4,len=100)
dat1 = data.frame(x=x, y=f(x))
gdat = a$grid %>% as.data.frame %>%
    mutate(Iter=1:n(),
           x=ifelse(x0>0,x0+1, x0-1),
           y=f_x + (x-x0)*f._x)

g=ggplot() + geom_line(data=dat1, aes(y=y, x=x)) +
    geom_hline(yintercept=0, linetype='dashed') +
    geom_segment(data=gdat, aes(x=x,y=y,xend=x1,yend=0), color='red') +
    geom_segment(data=gdat, aes(x=x0,xend=x0,y=0,yend=f_x), linetype='dashed')+
    geom_text(data=gdat %>% filter(Iter<4), aes(x=x0, y=-5, label="x[0]"), parse=TRUE) +
    geom_text(data=gdat %>% filter(Iter<4), aes(x=x1, y=-5, label="x[1]"), parse=TRUE) +
    geom_point(data=gdat, aes(x=x0, y=f_x)) +
    geom_text(data=gdat %>% filter(Iter<4), aes(x=x0, y=f_x, label="f(x)"), parse=TRUE, nudge_x=0.1, nudge_y=-5, hjust=0, vjust=0) +
    geom_text(data=gdat %>% filter(Iter<4), aes(x=x0+0.3+(x1-x0)/2, y=(f_x/2)-5, label="f*minute*(x)"), parse=TRUE, color='red') +
    geom_text(data=gdat, aes(x=-4,y=140, label=paste0("Iter == ", Iter)), parse=TRUE, hjust=0) +
    geom_text(data=gdat, aes(x=-4,y=120, label=paste0("x[1] == ", round(x1,3))), parse=TRUE, hjust=0) +
    transition_manual(Iter)
ga=animate(g, fps = 20, duration = 5)
anim_save('NMAnim1.gif', animation=ga, path='../tut/glm_part1_files')

If we allow our implementation of the Newton-Raphson method to run from an initial guess of 3 until it converges:

optim.NR(f, x0=3)
## $X.optim
## [1] -1.982394
## 
## $iter
## [1] 6
## 
## $grid
## NULL

We see that it takes just 6 iterations to converge.

In the above, we estimated the root (the value of \(x\) when \(f(x)=0\)) of a function. If instead we want to find the value of \(x\) at which the function is at its minimum (i.e. optimisation), then we want to find the root of the first derivative (\(f'(x)\)) of the function - that is, we want to find the value of \(x\) where the slope of \(f(x)\) is 0.

Again, in order to illustrate the principles of what we are trying to achieve, we must digress from our actual example. Rather than try to find the root of a function, we will now try to find the root of the derivative of a function (so as to find the minimum of a function). So we are now shifting our focus away from the profile of the function and onto the profile of the derivative of the function (since the derivative of a function is a slope profile).

The left hand side of the following figure represents the profile of the function (\(y = 0.001x^3 - 0.001x^2 - 0.3x + 5\)).
The right hand side represents the derivative (\(0.003x^2 - 0.002x - 0.3\)) of that function.

f = function(x) {0.001*x^3 - 0.001*x^2 -0.3*x + 5}
f1=D(body(f)[[2]], 'x')
x = seq(0,20,len=100)
dat1 = data.frame(x=x, y=f(x))
g1=ggplot() + geom_line(data=dat1, aes(y=y, x=x)) 

g2=ggplot() + geom_line(data=data.frame(x=x, y=eval(f1, envi=list(x=x))), aes(y=y, x=x)) +
    geom_hline(yintercept=0, linetype='dashed')
grid.arrange(g1,g2, nrow=1)

(a=optim.NR(f1, x0=3, abs.tol=0.001, return.grid=TRUE))
## $X.optim
## [1] 10.33889
## 
## $iter
## [1] 6
## 
## $grid
##            x0       x1           f_x       f._x
## [1,]  3.00000 20.43750 -2.790000e-01 0.01600000
## [2,] 20.43750 12.87523  9.121992e-01 0.12062500
## [3,] 12.87523 10.59535  1.715639e-01 0.07525136
## [4,] 10.59535 10.34209  1.559353e-02 0.06157209
## [5,] 10.34209 10.33889  1.924166e-04 0.06005255
## [6,] 10.33889 10.33889  3.079948e-08 0.06003333

If we animate the process:

Animation code
dat1 = data.frame(x, y=eval(f1, env=list(x=x)))
gdat = a$grid %>% as.data.frame %>%
    mutate(Iter=1:n(),
           x=ifelse(x0>0,x0+1, x0-1),
           y=f_x + (x-x0)*f._x)

g2=ggplot() + geom_line(data=dat1, aes(y=y, x=x)) +
    geom_hline(yintercept=0, linetype='dashed') +
    geom_segment(data=gdat, aes(x=x,y=y,xend=x1,yend=0), color='red') +
    geom_segment(data=gdat, aes(x=x0,xend=x0,y=0,yend=f_x), linetype='dashed') +
    geom_text(data=gdat %>% filter(Iter<4), aes(x=x0, y=-0.05, label="x[0]"), parse=TRUE) +
    geom_text(data=gdat %>% filter(Iter<4), aes(x=x1, y=-0.05, label="x[1]"), parse=TRUE) +
    geom_point(data=gdat, aes(x=x0, y=f_x)) +
    geom_text(data=gdat %>% filter(Iter<4), aes(x=x0, y=f_x, label="f(x)"), parse=TRUE, nudge_x=0.1, nudge_y=-0.05, hjust=0, vjust=0) +
    geom_text(data=gdat %>% filter(Iter<4), aes(x=x0+0.3+(x1-x0)/2, y=(f_x/2)-0.05, label="f*minute*(x)"), parse=TRUE, color='red') +
    geom_text(data=gdat, aes(x=0,y=0.8, label=paste0("Iter == ", Iter)), parse=TRUE, hjust=0) +
    geom_text(data=gdat, aes(x=-0,y=1, label=paste0("x[1] == ", round(x1,3))), parse=TRUE, hjust=0) +
    transition_manual(Iter)
ga=animate(g2, fps = 20, duration = 6)
anim_save('NMAnim2.gif', animation=ga, path='../tut/glm_part1_files')

Note that we are now looking for where the slope of the profile is equal to zero. This makes no distinction between a peak (maximum) and a valley (minimum), as both have a slope of zero. Provided the (negative) log-likelihood profile is unimodal (has either a single peak or a single valley), this should be OK. It can however be problematic if the profile has local minima and maxima. To minimize issues, it is best to select starting values (initial parameter values) that are likely to be reasonably close to the optimum values.
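As an aside (and purely as a sketch), R also provides a built-in one-dimensional optimiser, optimize(), which searches a nominated interval for a function's minimum using golden-section search (combined with parabolic interpolation) rather than derivatives. Applied to the cubic from the previous illustration, it should settle on much the same minimum.

f = function(x) {0.001*x^3 - 0.001*x^2 - 0.3*x + 5}
optimize(f, interval = c(0, 20))   ## the minimum should be near x = 10.34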

Ok, great. Now how do we use this approach to optimize for multiple parameters?

Recall that the Newton-Raphson method for optimisation updates the parameter estimates by subtracting the ratio of the first order derivative of the (loss) function to its second order derivative.
\[ x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)} \]

When there are multiple parameters to estimate, then there are multiple partial gradients (slopes).

Now that we are back to estimating the mean and variance of our example data, we have two parameters to estimate. Therefore we need:

  • the partial derivative of the negative log-likelihood with respect to the mean parameter
  • the partial derivative of the negative log-likelihood with respect to the sigma parameter
  • the second order partial derivative with respect to mean of the first order derivative with respect to the mean
  • the second order partial derivative with respect to sigma of the first order derivative with respect to the mean
  • the second order partial derivative with respect to mean of the first order derivative with respect to the sigma
  • the second order partial derivative with respect to sigma of the first order derivative with respect to the sigma

The second order partial derivatives form a square matrix called a Hessian matrix.

When there are multiple parameters to estimate, the above formula becomes: \[ \boldsymbol{x_{n+1}} = \boldsymbol{x_n} - \frac{\boldsymbol{g_n}}{\boldsymbol{H_n}} \]

where \(\boldsymbol{g_n}\) is a vector of partial gradients (first order derivatives), \(\boldsymbol{H_n}\) is a matrix of second order partial derivatives and \(\boldsymbol{x_n}\) and \(\boldsymbol{x_{n+1}}\) are the previous and updated parameter estimates respectively.

Recall that matrices cannot be divided. Rather we must multiply by the matrix inverse. Hence the above equation becomes:

\[ \boldsymbol{x_{n+1}} = \boldsymbol{x_n} - (\boldsymbol{H_n})^{-1}\boldsymbol{g_n} \]

Recall also that to invert a matrix, it must be decomposed into an identity matrix and the inverse. In R, this can be achieved via either solve() or qr.solve().

On top of this, there is the need to calculate the first and second order derivatives of a function for which there is no analytical expression for the derivatives. Hence, we need to approximate the derivatives using finite differences. That is, we can estimate a derivative (the gradient at a specific point on a profile) by calculating the rise over run for a very small run (\(\Delta x\)).

\[ \frac{df(x)}{dx} \approx \frac{f(x + \Delta x/2) - f(x - \Delta x/2)}{\Delta x} \]

Now that all the pieces are in place, we can demonstrate this by iterating through a number of Newton-Raphson cycles to estimate the mean and sigma of our example 1 data.

Newton-Raphson and animation code
params.init=c(mu=20, sigma=12)

## The following function calculates the difference-quotient approximation of gradient
approx.grad = function(theta, fn, eps=1e-05) {
    p = length(theta)
    nf = length(fn(theta))
    eps.mat = diag(p) * (eps/2)
    Gmat = array(0, dim=c(nf,p))
    for (i in 1:p) {
        Gmat[,i] = (fn(theta + eps.mat[,i]) - fn(theta - eps.mat[,i]))/eps
    }
    if (nf>1) Gmat else c(Gmat)
}


optim.NR = function(params.init, fn, x, iter.max=250, abs.tol=1e-05, control=list(eps=1e-06)) {
    fnc = function(bb) t(approx.grad(bb, fn))
    eps = control$eps
    gradfn = function(x) approx.grad(x, fnc, eps)
    iter <- 0; not.converged <- 1; not.max.iter <- 1
    newpar = params.init
    oldpar = params.init - 1
    while (not.converged & not.max.iter) {
        oldpar = newpar
        newpar = oldpar - solve(gradfn(oldpar), t(fnc(oldpar)))
        iter = iter + 1
        not.max.iter <- (iter < iter.max)*1
        not.converged = (abs(fn(oldpar) - fn(newpar))>abs.tol)*1
    }
    list(iter=iter, final=as.vector(newpar), LL=fn(as.vector(newpar)), gradient = fnc(newpar), hessian = gradfn(newpar))
}


#optim.NR(params.init, fn=function(t) LL.gaus(t, x=y), iter.max=1)

mu = seq(15,040,len=100)
sigma=seq(10,20,len=100)
theta = expand.grid(mu=mu, sigma=sigma)
theta$LL=apply(theta, 1, LL.gaus, x=dat$y)

base.plot = ggplot(data=theta,aes(y=mu, x=sigma)) +
    geom_tile(aes(fill=LL), show.legend=FALSE) + geom_contour(aes(z=LL)) +
    scale_fill_gradientn(colors=terrain.colors(12)) +
    scale_x_continuous(expand=c(0,0)) +
    scale_y_continuous(expand=c(0,0))

newdata =vector('list', 10)
newdata[[1]] =data.frame(t(setNames(params.init, c('mu','sigma'))))
p.i = params.init
for (i in 2:10) {
    a=optim.NR(params.init=p.i, fn=function(t) LL.gaus(t, x=dat$y), iter.max=1)
    newdata[[i]] = data.frame(t(setNames(a$final,c('mu','sigma'))))
    p.i = as.vector(a$final)
}
newdata = bind_rows(newdata, .id='Iter') %>% mutate(Iter=factor(Iter, levels=unique(Iter)), nIter=as.numeric(Iter))
g=base.plot + 
    geom_point(data=newdata, aes(y=mu, x=sigma, group=Iter), color='black') +
    geom_path(data=newdata, aes(y=mu, x=sigma), color='black') +
    transition_reveal(nIter) +
    labs(title = "Iteration: {frame_along}")
ga=animate(g, nframes=10, fps=1)

anim_save('NMAnim3.gif', animation=ga, path='../tut/glm_part1_files')

Now let's allow the routine (optim.NR() defined in the concealed code above) to run to convergence.

optim.NR(params.init, fn=function(t) LL.gaus(t, x=dat$y))
## $iter
## [1] 5
## 
## $final
## [1] 27.81098 14.25903
## 
## $LL
## [1] 40.76329
## 
## $gradient
##              [,1]          [,2]
## [1,] -1.04734e-06 -5.478284e-07
## 
## $hessian
##              [,1]      [,2]
## [1,] 0.0490274488 0.0000000
## [2,] 0.0007105427 0.0980549

Again, we see that these estimates (final in the output) are very similar to the empirical calculations. Furthermore, notice that convergence took only 5 iterations.

The inverse of the Hessian matrix is the variance-covariance matrix of the parameter estimates. Therefore, we can also generate estimates of the standard errors of the estimates:

\[ SE = \sqrt{diag(\boldsymbol{H}^{-1})} \]

H=optim.NR(params.init, fn=function(t) LL.gaus(t, x=dat$y))$hessian
sqrt(diag(solve(H)))
## [1] 4.516275 3.193488
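As a rough cross-check (a sketch relying on the standard result that the standard error of a mean is \(\sigma/\sqrt{n}\)), the first of these standard errors should be close to the maximum likelihood estimate of \(\sigma\) divided by the square root of the sample size.

sigma.ml = optim.NR(params.init, fn=function(t) LL.gaus(t, x=dat$y))$final[2]
sigma.ml/sqrt(length(dat$y))   ## should be close to the first standard error above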

R has an inbuilt routine (nlm()) that performs the Newton-like method for optimisation.

nlm(LL.gaus, params.init, x=dat$y, hessian=TRUE, gradtol=1e-03)
## $minimum
## [1] 40.76329
## 
## $estimate
## [1] 27.81205 14.25769
## 
## $gradient
## [1]  5.249938e-05 -1.312170e-04
## 
## $hessian
##               [,1]          [,2]
## [1,]  4.919280e-02 -1.686155e-05
## [2,] -1.686155e-05  9.836416e-02
## 
## $code
## [1] 1
## 
## $iterations
## [1] 9
H=nlm(LL.gaus, params.init, x=dat$y, hessian=TRUE, gradtol=1e-06*2)$hessian
sqrt(diag(solve(H)))
## [1] 4.509099 3.189208

Stochastic global optimisation

When the profile of a function is not smooth, derivative methods can struggle to locate a minimum efficiently (if at all) and when the profile has multiple local minima, both derivative and simplex methods can fail to converge on the “true” parameter values. Stochastic global methods add a random (stochastic) component to increase the potential of finding the global minimum when there are multiple local minima.

If we return briefly to our well analogy, we might appreciate that there could be multiple underground caves and depending on where our initial pilot hole is drilled, the more efficient search algorithms might quickly hone in on a point that is locally shallow yet not the shallowest location in the entire landscape. Of course repeating the search from multiple random locations could help alleviate this issue, yet it might still be difficult to discover the shallowest point if this is associated with a very abrupt (rather than gradual) underground geological feature.

The classic stochastic global method is called simulated annealing (or the Metropolis algorithm) and it works as follows (a minimal sketch of the acceptance step is given after the list):

  1. Start with some initial estimate values (one for each parameter) and calculate the loss function (i.e. the negative log-likelihood)
  2. Pick a new set of parameter estimates close to the previous - e.g. jump a small distance in multidimensional space - and again calculate the loss function.
  3. If the value of the loss function is better (lower), accept the new parameters and return to Step 2.
  4. If the value of the loss function is not better,
    • Calculate the difference (\(\Delta L = L_{new} - L_{old}\)) in loss between the new and old parameter estimates
    • Pick a random number between 0 and 1
    • Accept the new parameter estimates if the random number is less than \(e^{-\Delta L/k}\) (where \(k\) is a constant that regulates the acceptance propensity), otherwise retain the previous parameter estimates.
    • Periodically (e.g. every 100 iterations), decrease \(k\) so as to reduce the acceptance propensity.
    • Return to Step 2.
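Putting the acceptance rule (Step 4) into code, a minimal sketch might look like the following (accept.step is a hypothetical name; loss stands in for the negative log-likelihood function and k for the acceptance constant).

accept.step <- function(oldpar, newpar, loss, k) {
    deltaL <- loss(newpar) - loss(oldpar)
    if (deltaL < 0) return(newpar)                   # always accept an improvement
    if (runif(1) < exp(-deltaL / k)) return(newpar)  # occasionally accept a worse move
    oldpar                                           # otherwise retain the previous estimates
}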

Rather than having a stopping rule based on convergence, simulated annealing continues until the maximum number of iterations has been performed. Along with the capacity (and encouragement) to occasionally move ‘uphill’ (towards higher loss values), a large number of iterations increases the chances that the global minimum will be discovered even in the presence of multiple local minima. Throughout the iterations, the algorithm keeps track of the parameter values associated with the ‘best’ configuration encountered so far.

There are variants of this algorithm that control the jump distance used to select the next point. By adaptively increasing and reducing the jump length following acceptance and non-acceptance respectively, these variants are able to further encourage wider exploration of the parameter space.
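As a purely illustrative sketch of that idea (the adjustment constants are arbitrary), the jump length might be updated after each proposal as follows.

adapt.jump <- function(jump, accepted) {
    if (accepted) jump * 1.1 else jump * 0.9  # widen the search after acceptance, narrow it after rejection
}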

The following animation illustrates 4096 iterations (stepping up in \(log_2\) increments). The red point indicates the current ‘best’ parameter estimates. Note that whilst the algorithm ‘discovered’ the ‘best’ solution after approximately 100 iterations, it continued to explore the profile thoroughly. If there had been other minima, it would likely have discovered them as well.

Simulated Annealing code
optim.SANN = function(params.init, fn, iter.max=2500, jump=0.05, k=100) {
    bestpar <- newpar <- oldpar <- params.init
    iter=0; not.max.iter <- 1
    inck=k/10
    while(not.max.iter) {
        ## jump to new parameter set
        newpar = oldpar + replicate(length(params.init), runif(1,-jump,jump))
        deltaL = fn(newpar)-fn(oldpar)
        if (deltaL<0) { #improvement
            oldpar=newpar
        } else {
            rnd = runif(1)
            if (rnd <= exp(-deltaL/k)) oldpar=newpar
        }
        if (fn(newpar)<fn(bestpar)) bestpar=newpar
        iter=iter+1
        if ((iter %% inck)==0) k = k/10
        not.max.iter = (iter < iter.max)*1
    }
    list(iter=iter, final=as.vector(bestpar), LL=fn(bestpar), last=as.vector(oldpar), LL=fn(bestpar))
}
#optim.SANN(params.init=c(-0.25,1), fn=function(p) LL.gaus(p, x=y), iter.max=2500)
Simulated Annealing animation code
p.i = params.init
iter.cnt = 1
iters = 2500
newdata =vector('list', iters)
bestpar = p.i
for (i in 1:iters) {
    a=optim.SANN(params.init=p.i, fn=function(t) LL.gaus(t, x=dat$y), iter.max=1, jump=0.5, k=0.1)
    if (LL.gaus(a$last, x=dat$y)<LL.gaus(bestpar, x=dat$y)) bestpar=a$last
    newdata[[i]] = data.frame(t(setNames(a$last,c('mu','sigma'))), iter=floor(log2(i))+1, t(setNames(bestpar,c('bestmu','bestsigma'))))
    p.i = as.vector(a$last)
}
newdata = bind_rows(newdata, .id='Iter') %>% mutate(iter=factor(iter, levels=unique(iter)), nIter=as.numeric(as.character(iter)))
g=base.plot + 
    geom_point(data=newdata, aes(y=mu, x=sigma, group=iter), color='black') +
    geom_path(data=newdata, aes(y=mu, x=sigma), color='black') +
    geom_point(data=newdata, aes(y=bestmu, x=bestsigma), color='red') + 
    transition_reveal(nIter) +
    labs(title = "Iteration: {2^frame_along}")
ga=animate(g, nframes=12, fps=1)
#ga
anim_save('SANNAnim1.gif', animation=ga, path='../tut/glm_part1_files')

Allowing the simulated annealing routine to iterate 2500 times (with \(k\) being updated along the way):

optim.SANN(params.init=c(20,12), fn=function(t) LL.gaus(t, x=dat$y), iter.max=2500, jump=0.5, k=0.1)
## $iter
## [1] 2500
## 
## $final
## [1] 27.76059 14.27346
## 
## $LL
## [1] 40.76336
## 
## $last
## [1] 25.85511 13.81652
## 
## $LL
## [1] 40.76336

These estimates are again very similar to the empirical calculations.

Linear regression via ML (example 2)

Recall the motivating example 2 data as:

dat
##     x     y
## 1   1  9.64
## 2   2  3.79
## 3   3 11.00
## 4   4 27.88
## 5   5 32.84
## 6   6 32.56
## 7   7 37.84
## 8   8 29.86
## 9   9 45.05
## 10 10 47.65

and the statistical model as:

\[ \begin{align} y_i =& \beta_0 + \beta_1 x_i + \varepsilon_i\\ Y =& \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \end{align} \]

This could be re-written as:

\[ \begin{align} y_i \sim{}& N(\mu_i, \sigma^2)\\ \mu_i =& \beta_0 + \beta_1 x_i\\ \end{align} \]

OR more generally as: \[ \begin{align} Y \sim{}& N(\boldsymbol{\mu}, \sigma^2)\\ \boldsymbol{\mu} =& \boldsymbol{X}\boldsymbol{\beta}\\ \end{align} \]

This allows us to generalise to other (exponential) families (such as Binomial, Poisson, Negative Binomial, Gamma etc). For example, if our response (\(Y\)) were count data, we might consider a Poisson.

\[ \begin{align} Y \sim{}& Pois(\boldsymbol{\lambda})\\ \boldsymbol{\lambda} =& e^{\boldsymbol{X}\boldsymbol{\beta}}\\ &\text{OR equivalently}\\ log(\boldsymbol{\lambda}) =& \boldsymbol{X}\boldsymbol{\beta}\\ \end{align} \]

The Poisson distribution (\(P(x|\lambda) = e^{-\lambda}\lambda^x/x!\)) is parameterised by a single parameter (\(\lambda\)) that represents both the mean and the variance. Poisson data are bounded at the lower end by zero (negative values are not defined) - it is not possible to count fewer than 0 things. Since the expected values must therefore remain positive, the mean is related to the linear predictor via a link function (in the case of the Poisson, a log link), whose inverse maps the linear predictor back onto the natural scale (the scale of the observed data). Hence the above can be generalized even further to:

\[ \begin{align} Y \sim{}& D(\boldsymbol{\mu}, ...)\\ g(\boldsymbol{\mu}) =& \boldsymbol{X}\boldsymbol{\beta}\\ \end{align} \]

where \(D\) is the nominated family, \(g\) is the link function and \(...\) represents any additional parameters required by the nominated distribution (such as \(\sigma^2\) in the case of the Gaussian distribution).
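As a brief aside, the same maximum likelihood machinery carries over directly to such a family. The following is a minimal sketch of a Poisson negative log-likelihood with a log link (the name LL.pois is purely illustrative, and it assumes a data frame dat2 containing an integer count response y and a predictor x, as used in the iterative re-weighted least squares section later on).

LL.pois <- function(theta, y, X) {
    eta    <- X %*% theta                        # linear predictor (link scale)
    lambda <- exp(eta)                           # inverse (log) link back onto the natural scale
    -sum(dpois(y, lambda = lambda, log = TRUE))  # negative log-likelihood
}
## e.g. optim(c(0, 0), LL.pois, y = dat2$y, X = model.matrix(~x, data = dat2))

Returning to the Gaussian model for example 2, the negative log-likelihood can be written to accept a design matrix: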

LL.gaus <- function(theta, y, X){
  k           <- ncol(X) # number of columns of the design matrix (intercept plus predictors)
  beta        <- theta[1:k] # vector of coefficients, initialized with the starting values
  mu          <- X %*% beta  # X is dimension (n x k) and beta is dimension (k x 1)
  sigma       <- theta[k+1] # residual standard deviation, taken from the last element of the starting values vector
  ll          <- sum(dnorm(y, mean = mu, sd = sigma, log = TRUE)) # sum of the log PDF over the observations
  return(-ll)
}
X = model.matrix(~x, data=dat)

Simplex methods (Nelder-Mead)

optim(c(1,1,1), LL.gaus, y=dat$y, X=X, method='Nelder-Mead')
## $par
## [1] 2.632546 4.576511 5.544126
## 
## $value
## [1] 31.30618
## 
## $counts
## function gradient 
##      186       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL
optim(c(277,1,100), LL.gaus, y=dat$y, X=X, method='Nelder-Mead')
## $par
## [1] 2.641518 4.575765 5.540469
## 
## $value
## [1] 31.30616
## 
## $counts
## function gradient 
##      202       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL

Derivative methods (Newton-Raphson)

nlm(LL.gaus, c(1,1,1), y=dat$y, X=X, hessian=TRUE)
## $minimum
## [1] 31.30615
## 
## $estimate
## [1] 2.650723 4.574596 5.538237
## 
## $gradient
## [1]  9.140716e-07  4.504384e-06 -4.240237e-07
## 
## $hessian
##               [,1]        [,2]          [,3]
## [1,]  3.260295e-01  1.79316224 -1.580293e-05
## [2,]  1.793162e+00 12.55213575 -1.027930e-03
## [3,] -1.580293e-05 -0.00102793  6.517343e-01
## 
## $code
## [1] 1
## 
## $iterations
## [1] 60
nlm(LL.gaus, c(277,1,100), y=dat$y, X=X, hessian=TRUE)
## $minimum
## [1] 31.30615
## 
## $estimate
## [1] 2.650667 4.574606 5.538240
## 
## $gradient
## [1]  4.250791e-08  1.960451e-07 -1.429171e-07
## 
## $hessian
##               [,1]         [,2]          [,3]
## [1,]  3.260291e-01  1.793160125 -1.563384e-05
## [2,]  1.793160e+00 12.552121005 -1.036734e-03
## [3,] -1.563384e-05 -0.001036734  6.517324e-01
## 
## $code
## [1] 1
## 
## $iterations
## [1] 34

Stochastic global methods (Simulated annealing)

optim(c(1,1,1), LL.gaus, y=dat$y, X=X, method='SANN')
## $par
## [1] 2.103100 4.662195 5.650878
## 
## $value
## [1] 31.32074
## 
## $counts
## function gradient 
##    10000       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL
optim(c(277,1,100), LL.gaus, y=dat$y, X=X, method='SANN')
## $par
## [1] 280.77992 -36.48519 112.79404
## 
## $value
## [1] 62.99827
## 
## $counts
## function gradient 
##    10000       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL

Iterative re-weighted least squares

Recall from the previous section that the simple linear model can be written as: \[ \begin{align} y_i =& \beta_0 + \beta_1 x_i + \varepsilon_i\\ Y =& \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \end{align} \]

This could be re-written as:

\[ \begin{align} y_i \sim{}& N(\mu_i, \sigma^2I)\\ \mu_i =& \beta_0 + \beta_1 x_i\\ \end{align} \]

indicating that the response (\(y\)) is distributed as a Gaussian (normal) whose mean is determined by the linear relationship with \(x\) and whose variance (\(\sigma^2\)) is constant - that is, all observations are independently and identically distributed (iid), each drawn independently from a distribution with the same variance.

The \(I\) signifies an identity matrix (a square matrix whose diagonal elements are all one and whose off-diagonal elements are all zero).

More generally, the above can be written in vector/matrix form as: \[ \begin{align} Y \sim{}& N(\boldsymbol{\mu}, \sigma^2I)\\ \boldsymbol{\mu} =& \boldsymbol{X}\boldsymbol{\beta}\\ \end{align} \]

If we suspect that the residuals are not independently and identically distributed (if, for example, the response were Poisson distributed, the variance would increase with the mean), then we could alter (generalise) the above to:

\[ \begin{align} Y \sim{}& N(\boldsymbol{\mu}, \sigma^2\boldsymbol{W}^{-1})\\ \boldsymbol{\mu} =& \boldsymbol{X}\boldsymbol{\beta}\\ \end{align} \]

where \(\boldsymbol{W}\) is a diagonal matrix of positive weights.

As outlined in the previous section, this formulation also allows us to generalise to other (exponential) families (such as Binomial, Poisson, Negative Binomial, Gamma etc.), with a link function relating the mean to the linear predictor (a log link in the case of the Poisson). Hence the above can be generalised even further to:

\[ \begin{align} Y \sim{}& D(\boldsymbol{\mu}, ...)\\ g(\boldsymbol{\mu}) =& \boldsymbol{X}\boldsymbol{\beta}\\ \end{align} \]

where \(D\) is the nominated family, \(g\) is the link function and \(...\) represents any additional parameters required by the nominated distribution (such as \(\sigma^2\boldsymbol{W}^{-1}\) in the case of the Gaussian distribution).
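In R, these ingredients (the link function, its inverse, the derivative of the mean with respect to the linear predictor, and the variance function) are bundled together in family objects, and it is these components that the iterative routine further below makes use of. A quick sketch (output not shown):

fam <- poisson()     # Poisson family with its default log link
fam$linkfun(7.3)     # the link g(): log(7.3)
fam$linkinv(2)       # the inverse link: exp(2)
fam$mu.eta(2)        # derivative of the mean with respect to the linear predictor
fam$variance(7.3)    # variance function: for the Poisson, the variance equals the mean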

In regular OLS regression we are trying to minimize the sum of squared residuals (\(||Y - \boldsymbol{X}\boldsymbol{\beta}||^2\)). This assumes that all residuals are independent and identically distributed, as each is given equal weight when summing. However, we may wish to weight some observations (residuals) more heavily than others. For example, a reasonably common situation is for the variance to be related to the mean (typically, as the mean increases, so does the variance). We can partly compensate for this by weighting the residuals in inverse proportion to the mean (\(w_i = 1/\mu_i\)).

Now, rather than trying to minimise the plain sum of squared residuals (\(||Y - \boldsymbol{X}\boldsymbol{\beta}||^2\)), we try to minimise a weighted sum of squared residuals - more generally, a norm of the residuals (\(||Y - \boldsymbol{X}\boldsymbol{\beta}||_p\)) in which the power \(p\) need not be 2.

Once the link function is taken into consideration, this can be written as the weighted sum of squared residuals, \((g(\boldsymbol{X}\boldsymbol{\beta}) - Y)^T\boldsymbol{W}(g(\boldsymbol{X}\boldsymbol{\beta}) - Y)\), where \(\boldsymbol{W}\) is a diagonal matrix of weights and \(g\) here denotes the inverse link (mapping the linear predictor back onto the response scale).

Rather than calculate the residuals on the response scale (by backtransforming the linear predictor \(\boldsymbol{X}\boldsymbol{\beta}\) to the same scale as the response \(Y\)), the residuals are typically calculated on the link scale after adjusting the response variable onto this scale.

The adjusted response (\(\boldsymbol{z}\)) is formed by taking the expected value on the link scale and adding to it the difference between the observed and expected values on the natural scale, re-mapped onto the link scale by dividing by the derivative of the expected value with respect to the linear predictor.

\[ \boldsymbol{z} = \boldsymbol{\eta} + \frac{Y-g(\boldsymbol{\eta})}{g'(\boldsymbol{\eta})} \]

where \(\eta\) is the estimated linear predictor (predicted value of \(y\) on the scale of the link function), \(g(\eta)\) is the inverse link of \(\eta\) and \(g'(\eta)\) is the derivative of \(g(\eta)\).

The weighted least squares formulation then becomes: \[ \boldsymbol{\beta} = (\boldsymbol{X}^T\boldsymbol{W}\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{W}\boldsymbol{z} \]

where \(\boldsymbol{W}\) is a diagonal matrix of weights:

\[ \boldsymbol{W} = diag\left(\frac{g'(\boldsymbol{\eta})^2}{var(g(\boldsymbol{\eta}))}\right) \]
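To make the last few formulas concrete, a single re-weighted update could be sketched as the following function (irls.step is a hypothetical name; X, y and beta.current stand in for the design matrix, the response and the current coefficient estimates).

irls.step <- function(beta.current, X, y, fam = poisson()) {
    eta <- X %*% beta.current                        # current linear predictor
    mu  <- fam$linkinv(eta)                          # expected values on the natural scale
    dmu <- fam$mu.eta(eta)                           # derivative of the mean w.r.t. the linear predictor
    z   <- eta + (y - mu) / dmu                      # adjusted (working) response
    W   <- as.vector(dmu^2 / fam$variance(mu))       # working weights
    solve(crossprod(X, W * X), crossprod(X, W * z))  # one weighted least squares solution
}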

Unfortunately, there is no closed-form solution to the above (the weights \(\boldsymbol{W}\) and adjusted response \(\boldsymbol{z}\) themselves depend on \(\boldsymbol{\beta}\)), so it is solved iteratively - hence the name iterative re-weighted least squares. A very simplistic implementation could be:

  1. Start with some initial values for the parameters (\(\boldsymbol{\beta}\)) - typically 0 for each parameter on the link scale (and thus 1 on the response scale)
  2. Calculate the predicted value of the response given the current estimates of \(\boldsymbol{\beta}\) as well as the adjusted response (\(\boldsymbol{z}\)) and weights (\(\boldsymbol{W}\)).
  3. Update the estimated parameter values
  4. Repeat steps 2-3 until either the maximum number of iterations has been reached or a measure of the difference between successive iterations falls below a set tolerance.
Code used to implement a simple iterative re-weighted least squares algorithm
optim.IRLS = function(X, y, family, iter.max=25, abs.tol=1e-06) {
    oldpar <- newpar <- rep(0, ncol(X))
    iter <- 0; not.converged <- 1; not.max.iter <- 1
    while (not.converged & not.max.iter) {
        eta = X %*% newpar         # calculate (update) the linear predictor
        g = family()$linkinv(eta)  # calculate (update) the linear predictor on the natural scale (inverse link)
        g. = family()$mu.eta(eta)  # calculate (update) the derivative of linear predictor on natural scale
        z = eta + (y - g) / g.     # calculate (update) the  adjusted response
        W = as.vector(g.^2 / family()$variance(g))  # calculate (update) the weights
        oldpar = newpar
        newpar = solve(crossprod(X,W*X), crossprod(X,W*z), tol=2*.Machine$double.eps)
        iter = iter + 1
        not.max.iter <- (iter < iter.max)*1
        not.converged=(max(abs((1-newpar/oldpar)))>abs.tol)*1
    }
    list(B.optim=newpar, Iter=iter-1)
}

Note that when the family is Gaussian (with an identity link), the above converges in a single step to the familiar OLS form:

\[ \begin{align} \boldsymbol{\beta} =& 0\\ \boldsymbol{\eta} =& \boldsymbol{X}\boldsymbol{\beta} = 0\\ \boldsymbol{z} =& \boldsymbol{\eta} + \frac{Y-g(\boldsymbol{\eta})}{g'(\boldsymbol{\eta})} = 0 + \frac{Y-0}{1} = Y\\ \boldsymbol{W} =& diag\left(\frac{g'(\boldsymbol{\eta})^2}{var(g(\boldsymbol{\eta}))}\right) = diag(1^2/1) = 1\\ \boldsymbol{\beta} =& (\boldsymbol{X}^T\boldsymbol{W}\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{W}\boldsymbol{z} = (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^TY \end{align} \]

Let's try it out, starting with an intercept-only model - which simply estimates the mean response value (here using a Gaussian family).

optim.IRLS(cbind(rep(1,length(dat$y))), dat$y, family=gaussian)
## $B.optim
##        [,1]
## [1,] 27.811
## 
## $Iter
## [1] 1

This estimate is (not surprisingly) identical to the OLS estimate.
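As a quick sanity check (output not shown), the same value can be obtained directly from the sample mean or from R's lm(); both should agree with the 27.811 above.

mean(dat$y)                  # the OLS estimate of an intercept-only model is just the sample mean
coef(lm(y ~ 1, data = dat))  # the same estimate via lm()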

So what if we modified the response so that it comprised positive integers (the dat2 data) and nominated a Poisson distribution?

optim.IRLS(cbind(rep(1,length(dat2$y))), dat2$y, family=poisson)
## $B.optim
##          [,1]
## [1,] 1.987874
## 
## $Iter
## [1] 9
exp(optim.IRLS(cbind(rep(1,length(dat2$y))), dat2$y, family=poisson)$B.optim)
##      [,1]
## [1,]  7.3

After 9 iterations, the algorithm converged on an estimate of 1.9878743, which on the natural scale would be 7.3.
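As a quick check (output not shown), the maximum likelihood estimate of \(\lambda\) for an intercept-only Poisson model is simply the sample mean, so the following should reproduce the same value.

log(mean(dat2$y))   # should match the intercept estimate above (log(7.3) is approximately 1.988)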

In R, there is an inbuilt function called glm() that fits generalized linear models (by default via iterative re-weighted least squares).

glm(y ~ 1, data=dat2, family=poisson)
## 
## Call:  glm(formula = y ~ 1, family = poisson, data = dat2)
## 
## Coefficients:
## (Intercept)  
##       1.988  
## 
## Degrees of Freedom: 9 Total (i.e. Null);  9 Residual
## Null Deviance:       34 
## Residual Deviance: 34    AIC: 71.77

And now we will explore the full model:

X = model.matrix(~x, dat2)
irls = optim.IRLS(X, dat2$y, family=poisson)
irls
## $B.optim
##                  [,1]
## (Intercept) 0.6786313
## x           0.2069949
## 
## $Iter
## [1] 14
exp(irls$B.optim)
##                 [,1]
## (Intercept) 1.971178
## x           1.229976

Both estimates are very close to the parameters from which the data were simulated, and identical to those returned by the inbuilt glm() function:

glm(y ~ x, data=dat2, family=poisson)
## 
## Call:  glm(formula = y ~ x, family = poisson, data = dat2)
## 
## Coefficients:
## (Intercept)            x  
##      0.6786       0.2070  
## 
## Degrees of Freedom: 9 Total (i.e. Null);  8 Residual
## Null Deviance:       34 
## Residual Deviance: 10.71     AIC: 50.48
