Introduction to Generalized Additive Models (GAM)

30 July, 2023

By: Murray Logan

Preparations

Load the necessary packages.

library(splines)
library(ggfortify)
library(ggeffects)
library(mgcv)
library(gratia)
library(emmeans)
library(gridExtra)
library(tidyverse)

Motivating example

To illustrate non-parametric models (and GAMs in particular), let's simulate two non-linear trends:

  1. a sine wave
  2. a declining linear trend with a sine wave overlaid

We will also add random noise to both trends. We will then plot the sine wave (as a dashed line) along with the realised samples (solid dots).

set.seed(123)
x = seq(0,2*pi, len=50)
yp=sin(x)
y = sin(x) + rnorm(length(x), mean=0, sd=0.5)
dat = data.frame(yp=yp, y=y, x=x)
g1=ggplot(dat, aes(y=y, x=x)) + geom_point() +
    geom_line(aes(y=yp), linetype='dashed') + ggtitle('Example 1')

set.seed(123)
x = seq(0,3*pi, len=50)
yp = 2 - 0.5*x + sin(x)
y = 2 - 0.5*x + sin(x) + rnorm(length(x), mean=0, sd=0.5)
dat2 = data.frame(yp=yp, y=y, x=x)
g2=ggplot(dat2, aes(y=y, x=x)) + geom_point() +
    geom_line(aes(y=yp), linetype='dashed') + ggtitle('Example 2')

grid.arrange(g1,g2, nrow=1)

Polynomials

A simple (generalized) linear model relates a response variable to a linear predictor (a linear combination of predictor variables) according to the familiar form:

\[ g(E(y)) = \beta_0 + x_1\beta_1 \]

where \(E(y)\) is the expected value of y (which depends on the nominated family), \(g()\) represents the link function that maps the expected value of y onto the scale of the linear predictor, \(\beta_0\) is the y-intercept and \(\beta_1\) is the slope associated with the predictor (\(x_1\)).

If we fit a simple linear model to each of our example data sets, we will see that a straight line is not a good fit for either - there are curvy patterns left in the residuals.

dat.lm = lm(y~x, data=dat)
dat2.lm = lm(y~x, data=dat2)
autoplot(list(dat.lm, dat2.lm), which=1)

grid.arrange(
    ggpredict(dat.lm, ~x) %>% plot(rawdata=TRUE),
    ggpredict(dat2.lm, ~x) %>% plot(rawdata=TRUE),
    nrow=1
    )

Although the model above (\(g(E(y)) = \beta_0 + x_1\beta_1\)) is a linear model, it does not mean that it is restricted to modelling purely linear trends. We could, for example, fit a model in which \(y\) is modelled against a polynomial of \(x\). A third order polynomial includes \(x\), \(x^2\) and \(x^3\):

\[ g(E(y)) = \beta_0 + x_1\beta_1 + x_1^2\beta_2 + x_1^3\beta_3 \]

When fitting such a model, we need to wrap the squared and cubed terms in I() to ensure that they are each treated as separate arithmetic terms. If the model was specified as y ~ x + x^2 + x^3, the ^ operator would be interpreted as formula crossing rather than exponentiation, and the formula would collapse to a single linear x term - not what we want.

dat.lm2 = lm(y~x+I(x^2)+I(x^3), data=dat)
dat2.lm2 = lm(y~x+I(x^2)+I(x^3), data=dat2)
autoplot(list(dat.lm2,dat2.lm2), which=1)

grid.arrange(
    ggpredict(dat.lm2, ~x) %>% plot(rawdata=TRUE),
    ggpredict(dat2.lm2, ~x) %>% plot(rawdata=TRUE),
    nrow=1)

The model fitted in the right hand panel (linear decline with sine wave example) appears a little oversimplified. While it has captured some elements of the trend (a general decline that is not purely linear), it has not captured the sine wave properly. We will return to this point after exploring the left hand panel.

Although the overall fit for the left hand panel (simple sine wave) is fine, the individual terms (linear, quadratic and cubic) are not orthogonal (independent) - they are not separately identifiable. Recall the (multi)collinearity assumption of multiple linear regression, in which no covariate should be correlated with the others. Hence, we cannot examine the estimated \(\beta\) parameters in order to describe each of the linear, quadratic and cubic components of the trend.
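To see the problem numerically, we can compare the correlations among the raw polynomial terms with those among orthogonal polynomial terms (a quick sketch using the Example 1 data; the poly() function is introduced below):

# raw polynomial terms are strongly correlated with one another
round(cor(cbind(x=dat$x, x2=dat$x^2, x3=dat$x^3)), 2)
# orthogonal polynomial terms generated by poly() are uncorrelated
round(cor(poly(dat$x, 3)), 2)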

The poly() function can be used to generate orthogonal polynomials:

dat.lm3 = lm(y~poly(x,3), data=dat)
autoplot(dat.lm3, which=1)

ggpredict(dat.lm3, ~x) %>% plot(rawdata=TRUE)

summary(dat.lm3)
## 
## Call:
## lm(formula = y ~ poly(x, 3), data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.84027 -0.27405 -0.09366  0.31064  1.13336 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.01720    0.06557   0.262    0.794    
## poly(x, 3)1 -3.80897    0.46365  -8.215 1.43e-10 ***
## poly(x, 3)2  0.33194    0.46365   0.716    0.478    
## poly(x, 3)3  2.91779    0.46365   6.293 1.05e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4637 on 46 degrees of freedom
## Multiple R-squared:  0.7005, Adjusted R-squared:  0.681 
## F-statistic: 35.87 on 3 and 46 DF,  p-value: 4.198e-12

We can see that there is a significant cubic component to the trend.

Splines

Whilst the fit for the Example 1 data was reasonable, the same could not be said for the Example 2 data. For the latter, a simple third order polynomial might be inadequate. What if, instead of fitting a single polynomial through the entire data cloud, we fit multiple shorter polynomials through shorter windows of the data - that is, piecewise cubic polynomials?

We could naively start by splitting the range of \(x\) into multiple parts (let's start with two, with a split just left of center), and fit a separate third order polynomial to each part. One of the windows will span from the minimum \(x\) value to a point 0.5 units left of the mean of \(x\) (arbitrarily selected) and the other will span from that point to the maximum value of \(x\). The point where the windows meet is called the breakpoint.

Rather than refer to windows of the data, we could instead refer to knots - in this case, the knots represent the bounds of the windows. Hence, there are three knots: one at the start, one near the middle and one at the end.

mid=mean(dat2$x)-0.5
dat3a = dat2 %>% filter(x<=mid)
dat3a.lm = lm(y~poly(x,3), data=dat3a)
dat3b = dat2 %>% filter(x>mid)
dat3b.lm = lm(y~poly(x,3), data=dat3b)

fita=data.frame(x=seq(min(dat3a$x), mid, len=100)) %>%
    cbind(predict(dat3a.lm, newdata=., interval='confidence'))
fitb=data.frame(x=seq(mid, max(dat3b$x), len=100)) %>%
    cbind(predict(dat3b.lm, newdata=., interval='confidence'))

ggplot() +
    geom_ribbon(data=fita, aes(ymin=lwr, ymax=upr, x=x, fill='1'), alpha=0.3, show.legend=FALSE) +
    geom_ribbon(data=fitb, aes(ymin=lwr, ymax=upr, x=x, fill='2'), alpha=0.3, show.legend=FALSE) +
    geom_line(data=fita, aes(y=fit, x=x, color='1'), show.legend=FALSE) +
    geom_line(data=fitb, aes(y=fit, x=x, color='2'), show.legend=FALSE) +
    geom_point(data=dat2, aes(y=y, x=x))

Or, equivalently:

mid=mean(dat2$x)-0.5
dat3 = dat2 %>% mutate(x2=ifelse(x-mid > 0, x, 0))
dat3.lm = lm(y~poly(x,3) + poly(x2,3), data=dat3)
newdata=with(dat2, data.frame(x=seq(min(x), max(x), len=1000))) %>%
    mutate(x2=ifelse(x-mid > 0, x, 0))

dat3 = newdata %>% cbind(predict(dat3.lm, newdata=newdata, interval='confidence')) %>%
    mutate(Part=ifelse(x2==0, '1', '2'))

ggplot(dat3, aes(y=fit, x=x)) +
    geom_ribbon(aes(ymin=lwr, ymax=upr, fill=Part), alpha=0.3, show.legend=FALSE) +
    geom_line(aes(color=Part), show.legend=FALSE) +
    geom_point(data=dat2, aes(y=y, x=x))

Hmmm.. Although this does a reasonable job of fitting the data, there is an issue. There is an abrupt change between the two windows. In this case, the two lines do line up reasonably well, but this need not be the case. If the split of the windows happened to coincide with an abrupt change in the data cloud, it is likely that the two lines would be displaced substantially (leading to two different predictions at the same value of the covariate).
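We can check this directly by comparing the two models' predictions at the breakpoint itself (using the piecewise fits, dat3a.lm and dat3b.lm, from above):

# the two piecewise fits give (slightly) different predictions at the breakpoint
c(left = predict(dat3a.lm, newdata=data.frame(x=mid)),
  right = predict(dat3b.lm, newdata=data.frame(x=mid)))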

More obvious in this case is the very abrupt change in trend trajectory (and uncertainty) at the boundary of the two windows. What we need to do is fit the piecewise cubic polynomials in a way that ensures that the ends of the lines 1) meet up and 2) have the same trajectory. In essence, we need to be sure that the first and second derivatives at the points where two lines meet (the breakpoints) are consistent.

This is exactly what splines achieve.

Splines can be decomposed into a set of basis functions.

B-spline basis matrices for polynomial splines can easily be created with the bs() function. We simply provide the continuous predictor variable (in this case x) along with the internal knots (boundary knots are included by default) and the polynomial degree (e.g. 1, 2, 3, etc; 3, the default, gives a cubic spline). In this case, we will nominate a single internal knot (our off-center point).

A 3rd order polynomial spline with one internal knot will decompose into a matrix with four variables (representing the four basis functions). We can also visualize this set of basis functions.

k=mid
basis = with(dat2, data.frame(x=x, bs(x, knots=k, degree = 3))) %>%
    gather(key=Basis, value=Value, -x)

basis %>% ggplot(aes(y=Value, x=x, color=Basis)) + geom_line()

The basis matrix forms a model matrix that can be applied to a linear model,

\[ g(E(y)) = \beta_0 + \sum^k_{j=1} b_j(x_1)\beta_j \]

where \(b_j(x_1)\) represents the \(j^{th}\) column of the basis matrix and \(\beta_j\) represents the contribution of each basis function. The \(\beta_j\) can be estimated via a linear model (along with the intercept).

k=mid
dat2.lm4 = lm(y~bs(x, knots=k, degree=3), data=dat2)
summary(dat2.lm4)
## 
## Call:
## lm(formula = y ~ bs(x, knots = k, degree = 3), data = dat2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.89902 -0.27432 -0.03338  0.38772  1.06711 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     1.7101     0.2996   5.708 8.45e-07 ***
## bs(x, knots = k, degree = 3)1   2.9489     0.5996   4.918 1.21e-05 ***
## bs(x, knots = k, degree = 3)2  -7.4907     0.5150 -14.544  < 2e-16 ***
## bs(x, knots = k, degree = 3)3  -0.1992     0.5621  -0.354    0.725    
## bs(x, knots = k, degree = 3)4  -4.6648     0.3916 -11.911 1.65e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4768 on 45 degrees of freedom
## Multiple R-squared:  0.925,  Adjusted R-squared:  0.9183 
## F-statistic: 138.7 on 4 and 45 DF,  p-value: < 2.2e-16

So having estimated the intercept (\(\beta_0\)) and each of the basis parameters (\(\beta_j\)), we can multiply each basis function by its respective parameter so as to see the contribution of each basis function (colored lines) to the overall model. When doing so, we should also include the intercept (V1 in the graph below). If we then sum up each of the contributions (including the intercept), we will generate the model fit (black line).

basis = cbind(1,bs(dat2$x, knots=k, degree=3))
basis = sweep(basis, MARGIN=2, coef(dat2.lm4), '*')
#basis = basis + coef(dat2.lm4)[1]
fit = rowSums(basis)
basis = basis %>% as.data.frame %>% 
    gather(key=Basis, value=Value) %>%
    cbind(x=dat2$x)

ggplot() + geom_line(data=basis, aes(y=Value, x=x,color=Basis)) +
    geom_line(data=NULL, aes(y=fit, x=dat2$x)) +
    geom_point(data=dat2, aes(y=y, x=x))

The full prediction from the fitted model can be obtained more simply with:

ggpredict(dat2.lm4, ~x) %>% plot(rawdata=TRUE) 
## Warning: Some model terms could not be found in model data.
##   You probably need to load the data into the environment.

That looks ok, but what if we selected more knots? We will try 10 and 20 evenly spaced internal knots.

k=with(dat2, seq(min(x), max(x), len=12))
k=k[c(-1,-length(k))]
dat2.lm5 = lm(y~bs(x, knots=k, degree=3), data=dat2)

k=with(dat2, seq(min(x), max(x), len=22))
k=k[c(-1,-length(k))]
dat2.lm6 = lm(y~bs(x, knots=k, degree=3), data=dat2)

grid.arrange(
    ggpredict(dat2.lm5, ~x) %>% plot(rawdata=TRUE),
    ggpredict(dat2.lm6, ~x) %>% plot(rawdata=TRUE), 
    nrow=1)
## Warning: Some model terms could not be found in model data.
##   You probably need to load the data into the environment.

## Warning: Some model terms could not be found in model data.
##   You probably need to load the data into the environment.

It may be that we have too many knots and are starting to force the model to learn about features that are specific to the sample rather than purely about the underlying process. This is referred to as overfitting. In this context, the spline is under smoothed (or too wiggly).

Of course in this case we have the luxury of knowing the true underlying process. Normally we would not have this insight, so how can we select the most appropriate number of internal knots (and thus the degree of smoothing)?

Essentially, we are seeking a degree of smoothing (number of knots) that is a balance between:

  • explaining the data (minimizing the sum of squared residuals). Wigglier lines fit the data more closely.
  • reducing bias. Wiggly lines are more likely to be exhibiting a pattern that is biased towards the stochasticity of the sample data.

Models are naturally geared to estimate parameters in a manner that minimize the residuals (or maximize likelihood), so to counter this, we need to penalize for wiggliness. If we impose a smoothing penalty on the basis functions, we can attempt to strike the above balance by estimating the optimal smoothing parameter. This can be achieved by either:

  • cross validation. Cross validation generally involves fitting a model numerous times, each time withholding a random selection of data, and using measures of the accuracy of withheld data predictions to help guide certain model fitting parameters (such as the smoothing penalty in this case).
  • REML optimization, in which the optimum smoothing penalty is estimated directly as part of model fitting.

Generalized Additive Models (GAMs), via the mgcv package, help us achieve this.
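As a quick preview (the gam() function is covered in detail in the next section, and the object names here are purely illustrative), both approaches are available via the method argument, and we can compare the smoothing parameters each selects:

dat2.gam.gcv = gam(y~s(x, bs='cr'), data=dat2, method='GCV.Cp')  # (generalized) cross validation
dat2.gam.reml = gam(y~s(x, bs='cr'), data=dat2, method='REML')   # REML
# compare the selected smoothing parameters (lambda)
c(GCV=dat2.gam.gcv$sp, REML=dat2.gam.reml$sp)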

Introduction to GAMs

Recall that a simple (generalized) linear model relates a response variable to a linear predictor (a linear combination of predictor variables) according to the familiar form:

\[ g(E(y)) = \beta_0 + x_1\beta_1 \]

where \(E(y)\) is the expected value of y (which depends on the nominated family), \(g()\) represents the link function that maps the expected value of y onto the scale of the linear predictor, \(\beta_0\) is the y-intercept and \(\beta_1\) is the slope associated with the predictor (\(x_1\)).

In a GAM, we can add a smoothing term:

\[ g(E(y)) = \beta_0 + f_1(x_1) \]

where \(f_1(x_1)\) represents a smooth term. It actually represents a set of basis functions.

\[ f_1(x_1) = \sum^k_{j=1}{b_j(x_1)\beta_j} \]

That is, \(f_1(x_1)\) is the sum of \(b_j(x_1)\) basis functions of \(x_1\) multiplied by a set of parameters that determine the contribution of each basis function. The basis functions are determined by 1) the data (\(x_1\)) and 2) the form of basis function selected (e.g. penalized splines, cubic regression splines, etc). The \(\beta_j\) parameters are estimated during the modelling process.

We are going to attempt to fit a smoother (a cubic regression spline to start with) to the samples. However, to get a better appreciation for how GAMs work, let's visualize what the cubic regression spline basis functions look like. This can be performed with the help of the smoothCon() function in the mgcv package.

  1. use the smoothCon() function to construct a set of cubic regression spline functions for our predictor variable (x).
  2. extract the model matrix (X) from the resulting list. This matrix will be a (\(n\times k\)) matrix, where \(n\) is the number of x values and \(k\) is the number of basis functions (for a cubic regression spline, there are 10 basis functions by default).
  3. bind the model matrix to the predictor (x) and reshape the result into a long data frame to facilitate plotting.
bs = smoothCon(s(x, bs='cr'), data=dat2)
B=bs[[1]]$X
B = cbind(x=dat2$x, B) %>%
    as.data.frame %>%
    gather(key=Basis, value=y, -x) %>%
    mutate(Basis = factor(Basis, levels=unique(Basis)))

We can now plot these 10 basis functions.

ggplot(B, aes(y=y, x=x)) + geom_line() + facet_wrap(~Basis)

For a cubic regression spline, each basis function has the same shape, just shifted along the x-axis. The peak of each function corresponds to one of the 10 knots (which in this case are evenly spaced along the axis, because the x data are uniformly distributed).
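The knot locations themselves are stored in the smooth object returned by smoothCon() (for a 'cr' smooth they are held in the xp component):

# knot locations used by the cubic regression spline basis
bs[[1]]$xp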

We could alternatively plot each of the basis functions on the same graph (with different colors).

ggplot(B, aes(y=y, x=x)) + geom_line(aes(color=Basis))

When we then fit the GAM with a cubic regression spline, we can fix the number of knots (functions) to a set number. Normally we would not want to do this, but it can be useful for the purpose of illustration. Let's start by fixing the number of knots at 10 (since this is what we have been visualizing).

To do so, we will use the gam() function within the mgcv package. The gam() function generates a model matrix for the smoothing term (in a similar way to what we did above with the smoothCon() function) and then uses this as the linear predictor in a GLM. Indeed, we can (and will) do this for ourselves.

However, before we do, we need to make an adjustment. While the basis functions (and thus the model matrix) we created earlier are identifiable amongst themselves, they will not be orthogonal (independent or distinguishable) to the intercept (which is always included in the linear model). Hence, they should be reparameterized such that each smooth term is constrained to sum to zero (\(\sum f(x_1) = 0\)). That way, each term is orthogonal to the intercept.

We will therefore regenerate the smooth, this time indicating that the identifiability constraints should be absorbed into the basis to ensure full orthogonality.

bs = smoothCon(s(x, bs='cr'), absorb.cons=TRUE, data=dat2)
dat2.glm = glm(y ~ bs[[1]]$X, data=dat2)
coef(dat2.glm)
##  (Intercept)   bs[[1]]$X1   bs[[1]]$X2   bs[[1]]$X3   bs[[1]]$X4   bs[[1]]$X5 
## -0.131671794  1.932725294  1.402661653  0.006710458 -1.880284432 -2.044086090 
##   bs[[1]]$X6   bs[[1]]$X7   bs[[1]]$X8   bs[[1]]$X9 
## -1.239461463 -1.374067840 -1.828866369 -2.927270326

Now, we will use the gam() function to perform the same analysis (remember, we are going to fix the number of basis functions \(k\) to be 10).

dat2.gam = gam(y ~ s(x, bs='cr', k=10, fx=TRUE), data=dat2)
coefs = coef(dat2.gam)
coefs
##  (Intercept)       s(x).1       s(x).2       s(x).3       s(x).4       s(x).5 
## -0.131671794  1.932725294  1.402661653  0.006710458 -1.880284432 -2.044086090 
##       s(x).6       s(x).7       s(x).8       s(x).9 
## -1.239461463 -1.374067840 -1.828866369 -2.927270326

The estimated coefficients represent the \(\beta_j\) parameters.

We could now scale the 10 cubic regression spline basis functions by multiplying each by its corresponding \(\beta\) coefficient. Then, for each value of \(x\), we could sum up the 10 functions to get the overall spline. In R, the sweep() function is used to multiply the columns (MARGIN=2) of each row of a matrix (in this case the matrix representing the basis functions) by a vector (in this case the vector of \(\beta\) parameters). The rowSums() function can then be used to sum each row (each row represents an x value).

We will plot the individual basis functions (colored lines), the sum of these functions (the fitted GAM smoother - black line), the true underlying trend (dashed line) and the original samples (solid dots).

newdata = data.frame(x=seq(0,3*pi, len=100))
Xmat = predict(dat2.gam, type='lpmatrix', newdata=newdata)
fits = sweep(Xmat,MARGIN=2,coefs,'*')
newdata = newdata %>% mutate(fit=rowSums(fits))
fits %>% as.data.frame %>% 
    bind_cols(newdata) %>% 
    gather(key=Basis, value=y, -x,-fit) %>%
    ggplot(aes(y=y, x=x)) + geom_line(aes(color=Basis)) +
    geom_line(aes(y=fit)) +
    geom_point(data=dat2) +
    geom_line(data=dat2, aes(y=yp), linetype='dashed')

The trend (black line) appears to be overfit (not smooth enough, too wiggly). That is, the trend seems to be starting to focus on the more stochastic elements of the sample rather than the pure underlying trend. What if instead we ask for three knots instead of 10?

dat2.gam = gam(y ~ s(x, bs='cr', k=3, fx=TRUE), data=dat2)
coefs = coef(dat2.gam)
newdata = data.frame(x=seq(0,3*pi, len=100))
Xmat = predict(dat2.gam, type='lpmatrix', newdata=newdata)
fits = sweep(Xmat,MARGIN=2,coefs,'*')
newdata = newdata %>% mutate(fit=rowSums(fits))
fits %>% as.data.frame %>% 
    bind_cols(newdata) %>% 
    gather(key=Basis, value=y, -x,-fit) %>%
    ggplot(aes(y=y, x=x)) + geom_line(aes(color=Basis)) +
    geom_line(aes(y=fit)) +
    geom_point(data=dat2) +
    geom_line(data=dat2, aes(y=yp), linetype='dashed')

This is not a particularly good match either - the trend now appears to be too smooth.

If the number of basis functions (knots) is not fixed, the gam() function will attempt to estimate the ‘optimum’ degree of smoothing. This is essentially a balance between bias and variance:

  • the more wiggly (less smooth) the trend, the lower the variance between the trend and the observed data
  • however, if the trend is too wiggly, the bias (the difference between the fitted and the ‘real’ trend) increases

It does this via a smoothing penalty that penalises for wiggliness by maximising the penalized likelihood function:

\[ 2L(\boldsymbol{\beta}) - \boldsymbol{\lambda} (\boldsymbol{\beta^T}\boldsymbol{S}\boldsymbol{\beta}) \]

where \(\boldsymbol{S}\) is a matrix representing the wiggliness of the basis functions (typically based on their squared second order derivatives), which combines with the coefficients to form the wiggliness penalty (\(\boldsymbol{\beta^T}\boldsymbol{S}\boldsymbol{\beta}\)), and \(\boldsymbol{\lambda}\) is the smoothing parameter that controls the trade-off between bias and variance.

The higher the smoothing parameter (\(\boldsymbol{\lambda}\)), the smoother (flatter) the trend. As \(\lambda\) approaches \(\infty\), the trend approaches a straight line. A \(\lambda\) of 0 applies no penalty at all - the equivalent of fixing the number of knots as above.
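To see the effect of \(\lambda\) directly, we can fix it ourselves via the sp argument of gam() (the values below are arbitrary, purely for illustration) and compare the estimated degrees of freedom: the unpenalized fit uses all of the available wiggliness, whereas a heavy penalty shrinks the smooth towards a straight line.

dat2.gam.sp0 = gam(y~s(x, bs='cr'), data=dat2, sp=0)        # no penalty (maximally wiggly)
dat2.gam.sp1e4 = gam(y~s(x, bs='cr'), data=dat2, sp=10000)  # heavy penalty (nearly linear)
c(unpenalized=summary(dat2.gam.sp0)$edf, penalized=summary(dat2.gam.sp1e4)$edf)

Letting gam() estimate \(\lambda\) itself (here via REML) strikes this balance for us.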

dat2.gam = gam(y ~ s(x, bs='cr'), data=dat2, method='REML')
coefs = coef(dat2.gam)
newdata = data.frame(x=seq(0,3*pi, len=100))
Xmat = predict(dat2.gam, type='lpmatrix', newdata=newdata)
fits = sweep(Xmat,MARGIN=2,coefs,'*')
newdata = newdata %>% mutate(fit=rowSums(fits))
fits %>% as.data.frame %>% 
    bind_cols(newdata) %>% 
    gather(key=Basis, value=y, -x,-fit) %>%
    ggplot(aes(y=y, x=x)) + geom_line(aes(color=Basis)) +
    geom_line(aes(y=fit)) +
    geom_point(data=dat2) +
    geom_line(data=dat2, aes(y=yp), linetype='dashed')

That’s better.

Model diagnostics

However, before getting too excited, (just like all analysis) we should check the various model diagnostics so as to validate the model.

Firstly, we should establish whether the maximum number of knots (basis functions) is enough or whether there is any evidence that the smoother may be underfit. This is done via the k.check() function.

k.check(dat2.gam)
##      k'      edf  k-index p-value
## s(x)  9 6.768293 1.165484  0.8225

For each smooth term (in this case there is only one), it lists:

  • k' - the maximum number of knots. In this case there are 9 - the basis had a total of 10 functions, however one is absorbed by the intercept (via the sum-to-zero constraint)
  • edf - the estimated degrees of freedom. This is a measure of the complexity of the smooth terms. It is the number of knots minus any penalized likelihood constraints (and is typically not an integer).
  • k-index - a measure of the amount of pattern left in the residuals. Technically, it is the difference between neighbouring residuals standardized by the residual variance. The further the k-index falls below 1, the greater the chance that there is still some pattern remaining in the residuals (and thus that the term is oversmoothed).
  • p-value - proportion of k-index values from repeated randomization of neighbouring residuals that are less than the actual k-index

If the k-index is less than 1, the p-value is low and the edf is close to k', then there is evidence that the initial number of knots (\(k\)) was too low. It is recommended that the number of knots be doubled and the model refitted.
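For example (purely illustrative here, since our k turns out to be adequate), the refit would simply double k within s() and re-run the check:

dat2.gam.k20 = gam(y~s(x, bs='cr', k=20), data=dat2, method='REML')
k.check(dat2.gam.k20)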

In the current case, there is no issue with \(k\).

Now we should turn our attention to diagnostics about the actual fit.

  • Quantile-Quantile plots of the residuals. This plots the actual quantiles against quantiles expected based on the nominated family (in this case Gaussian). Ideally, we are hoping to see a straight line.
  • Residual plot - plot of residuals versus the linear predictor. As with all residual plots, we are hoping to find no patterns.
  • Histogram of residuals - ideally the residuals should be normally distributed and centered around 0.
  • Observed versus fitted values - the closer to a straight line, the better the model performs at predicting the training data.

Note, in all cases, deviance residuals are used.

gratia::appraise(dat2.gam)

In this case, all diagnostics appear reasonable.

In multiple linear models, (multi)collinearity arises when one or more of the continuous predictors are correlated with the other predictor(s). When this is the case, the offending predictors compete within the model and the individual partial slope estimates can be either under or overestimated (slopes may even change polarity). In GAMs, the nonlinear equivalent is called concurvity. Concurvity measures how well each smooth could be approximated by either a combination of the other smooth(s) or each individual other smooth.

concurvity(dat2.gam)
##                  para         s(x)
## worst    3.155444e-31 1.117389e-31
## observed 3.155444e-31 3.459132e-34
## estimate 3.155444e-31 6.755856e-33
concurvity(dat2.gam, full=FALSE)
## $worst
##              para         s(x)
## para 1.000000e+00 1.117389e-31
## s(x) 3.155444e-31 1.000000e+00
## 
## $observed
##              para         s(x)
## para 1.000000e+00 3.459132e-34
## s(x) 3.155444e-31 1.000000e+00
## 
## $estimate
##              para         s(x)
## para 1.000000e+00 6.755856e-33
## s(x) 3.155444e-31 1.000000e+00

In this case, we only have a single parametric term (the intercept) and a single smooth term. And since the smoother is always constructed to be orthogonal to the intercept, there should not be an issue - this is confirmed in the output above.

Exploring the model

Before looking at the estimated model parameters, it is often useful to have a look at a partial effects plot to get a feel for the model. As with other models, it can really help in interpreting the model output(s).

gratia::draw(dat2.gam)

Note that the plot is always zero centered on the y-axis. Recall that all smooth terms are constrained to sum to zero. Hence, the partial effects plot represents the trend, not the absolute values. In this particular case, our actual response was approximately zero centered anyway (since it was fabricated from a sine wave that oscillates between -1 and 1).
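We can verify the sum-to-zero constraint directly: the smooth's contributions (on the link scale), summed over the observed data, are numerically zero.

# the 's(x)' column of the term-wise predictions sums to (approximately) zero
sum(predict(dat2.gam, type='terms')[, 's(x)'])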

Now we can explore the estimated parameters..

summary(dat2.gam)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ s(x, bs = "cr")
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -0.13167    0.06474  -2.034   0.0483 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##        edf Ref.df     F p-value    
## s(x) 6.768  7.862 76.61  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.925   Deviance explained = 93.5%
## -REML = 42.322  Scale est. = 0.20954   n = 50

In this case:

  • there is a single parametric coefficient:
    • the intercept is estimated to be -0.1316718. Note that, as with other models, the hypothesis test associated with this term is typically of little interest
  • there is a single smooth term:
    • edf is 6.7682933. This is the estimated degrees of freedom. The higher the number, the more wiggly the smoother.
    • Ref.df is 7.862. This is the reference degrees of freedom used for the F test p-value calculation.
    • there is evidence that the smoother is different from a straight horizontal line.
  • the adjusted \(R^2\) is 0.9246975. This is the proportion of the variance explained - calculated by comparing the variance to a null model. Note, this can be negative if the model is a worse fit than the null model.
  • the deviance explained is 0.9350989. This is the proportion of the null deviance explained by the model and is typically preferred over \(R^2\) (a quick manual check is sketched after this list).
  • the scale parameter is typically calculated as the sum of squares of the Pearson residuals, divided by the effective residual degrees of freedom.
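For instance, the deviance explained can be recovered manually from the fitted model:

# proportion of the null deviance explained by the model
1 - deviance(dat2.gam)/dat2.gam$null.deviance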

Summary plots

data.list = with(dat2, list(x=seq(min(x), max(x), len=100)))
newdata = emmeans(dat2.gam, ~x, at=data.list) %>% as.data.frame
newdata %>%
    ggplot(aes(y=emmean, x=x)) +
    geom_ribbon(aes(ymin=lower.CL, ymax=upper.CL), fill='blue', color=NA, alpha=0.3) +
    geom_line() +
    geom_point(data=dat2, aes(y=y)) +
    theme_bw()

We can also perform all the necessary underlying calculations manually.

newdata = with(dat2, data.frame(x=seq(min(x), max(x), len=100)))
Xmat = predict(dat2.gam, newdata=newdata, type='lpmatrix')
coefs = coef(dat2.gam)
Fit = as.vector(coefs %*% t(Xmat))
SE = sqrt(diag(Xmat %*% vcov(dat2.gam) %*% t(Xmat)))
Q = qt(0.975, df=df.residual(dat2.gam))
newdata = newdata %>% mutate(fit = Fit,
                             lower = Fit-Q*SE,
                             upper = Fit+Q*SE)
newdata %>%
    ggplot(aes(y=Fit, x=x)) +
    geom_ribbon(aes(ymin=lower, ymax=upper), fill='blue', color=NA, alpha=0.3) +
    geom_line() +
    geom_point(data=dat2, aes(y=y)) +
    theme_bw()

Basis functions available in mgcv

  • tp - Thin plate spline: low rank (far fewer parameters than data), isotropic (equal smoothing in any direction) regression splines
  • ts - Thin plate spline: a thin plate spline with penalties on the null space such that it can be shrunk to zero. Useful in combination with by= as this does not penalize the null space.
  • cr - Cubic regression spline: splines with knots spread evenly throughout the covariate domain
  • cc - Cyclic cubic regression spline: cubic regression splines with ends that meet (have the same second order derivatives)
  • ps - Penalized B-spline: allows the distribution of knots to be based on the data distribution
  • cp - Cyclic penalized B-spline: penalized B-splines with ends that meet
  • gp - Gaussian process: Gaussian process smooths with five sets of correlation structures
  • mrf - Markov random fields: useful for modelling of discrete space. Uses penalties based on neighbourhood matrices (pairwise distances between discrete locations)
  • re - Random effects: parametric terms penalized via identity matrices. Equivalent to i.i.d terms in mixed effects models
  • fs - Factor smooth interaction: for single dimension smoothers. Duplicates the basis functions for each level of the categorical covariate, yet uses a single smoothing parameter across all.
  • sos - Splines on the sphere: isotropic 2D splines in spherical space. Useful for large spatial domains.
  • so - Soap film smooths: smooths within polygon boundaries. These are useful for modelling complex spatial areas.

Smoother definition functions

  • s() - general spline smoother: for multidimensional smooths, assumes that each component is on the same scale as there is only a single smoothing parameter for the smooth
  • te() - tensor product smooth: smooth functions of numerous covariates that are built as the tensor product of the comprising smooths (and penalties); represents the interaction between numerous terms, each with their own smoothing parameter that penalizes the average wiggliness of that term
  • ti() - tensor product interaction smooth: smooths that include only the highest order interactions (exclude the basis functions associated with the main effects of the comprising smooths)
  • t2() - alternative tensor product smooth with non-overlapping …: creates basis functions and penalties for paired combinations of separate penalized and unpenalized components of each term
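Whichever basis is chosen, it is selected via the bs argument of the smoother definition function. For example (a sketch, refitting the simulated data from above with a few alternative bases; the object names are purely illustrative):

dat2.gam.tp = gam(y~s(x, bs='tp'), data=dat2, method='REML')  # thin plate regression spline (the default)
dat2.gam.ps = gam(y~s(x, bs='ps'), data=dat2, method='REML')  # penalized B-spline
dat2.gam.cr = gam(y~s(x, bs='cr'), data=dat2, method='REML')  # cubic regression spline (as used above)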