Causal inference
Other useful tutorials or resources
1 Preparations
This tutorial makes use of a couple of specialist R packages (dagitty
and ggdag
) in addition to the tidyverse
and Bayesian modelling package (brms
). So we will start by loading the necessary libraries
2 Background
Understanding the relationships between variables is a fundamental aspect of ecological research. However, interpreting these relationships requires careful consideration of statistical principles and study design. Carefully controlled experimental designs have traditionally been the gold standard for inferring causation. However, observational studies are becoming increasingly important to answer large-scale ecological questions for which experiments are not feasible or suitable. Whilst most studies aim to answer causal questions, doing so can be challenging and requires careful consideration of causal relationships if we are to avoid confounding and other biases - hence the phrase “correlation does not imply causation”.
Moreover, there is often an urge to measure as many variables as possible just in case they might turn out to be useful in statistical analyses - particularly in observational studies, or surveys. Then when it comes to performing the statistical analyses, there is a temptation to put everything in the model under the false belief that the model will be “better”. Blindly doing so will often result in the models behaving quite differently to what you might expect and produce spurious and missleading estimates.
This tutorial introduces the concepts of correlation, confounding, and causation, emphasizing their importance in ecological studies as well as how we can identify and adjust models for causal inference.
3 Structural Causal Modelling (SCM) framework
Structural Causal Modelling (SCM) is a useful and robust logic-based framework for identifying cause and effect relationships and thus supporting causal inference from observational data. The framework comprises the following seven steps (the details each step will be covered below):
Create a conceptual model of the system of interest
Choose a statistical test
Test the consistency of your conceptual model with the data
Identify biases (confounding, overcontrol, collider) that you need to adjust for when testing the cause of interest
Test for causality with the appropriate statistical test
Repeat steps 3-5 for each cause of interest
Repeat steps 1-6 for other conceptual models
3.1 Step 1 - create the conceptual model
The conceptual model represents the researchers’ understanding of the system of interest as well as their assumptions about the causal structure between variables. This structure should be based on domain knowledge, prior research, and expert opinion.
For example, a conceptual model might indicate that X
effects Y
(X -> Y
), or perhaps X
and Y
might both be effected by Z
and thus Z
is a confounder of the relationship between X
and Y
(X <- Z -> Y
).
The conceptual model may include both observed and unobserved variables. Essentially, the conceptual model encodes our structural assumptions about the relationships in the system. Importantly, they do not impose the manner in which those relationships are investigated. That is, they are agnostic to the kinds of statistical models you use - they could be linear/additive/non-linear, frequentist/Bayesian, hierarchical etc.
The conceptual model is visually represented as a Directed Acyclic Graph (DAG) where nodes (dots) represent variables and arrows represent assumed causal relationships. Following are some important properties of a DAG:
- the direction of an arrow indicates the direction of causality, e.g.
X -> Y
means thatX
causesY
. - it is acyclic, meaning that there are no cycles or loops in the graph, i.e. you cannot go from
X
toY
and then back toX
. - the presence of an arrow implies a possible relationship
- the absence of an arrow encodes an explicit assumption of no relationship
To help describe DAGs, I think it would be useful to provide some examples with real variables. So lets consider a hypothetical example presented in Arif and MacNeil (2022) that explored the effects of marine protected areas (MPAs) on reef fish biomass.
There is an R package (dagitty) to help with the construction and interrogation of DAGs. For constructing the DAG, we can use either the dagitty()
function or the dagify()
function.
dag <- dagitty("
dag {
depth -> mpa
depth -> fishing
mpa -> fishing
fishing -> biomass
complexity -> mpa
complexity -> biomass
human -> fishing
human -> biomass
mpa -> coral
biomass -> coral
}
")
ggdag(dag, text = TRUE, text_size = 2.5) +
theme_dag_blank()
If the coordinates of the nodes are not specified (which they are not in the above example), they are determined by the “spring” layout algorithm. This algorithm attempts to position the nodes in a way that minimizes the overall energy of the system, similar to how physical systems tend to settle into stable configurations. Importantly, it is a stoichastic algorithm, meaning that the positions of the nodes will change each time the code is run. This can be useful for visualizing complex relationships, but it can also make it difficult to reproduce specific layouts. If you want to ensure that the layout remains consistent across runs, you can specify the coordinates of the nodes manually.
Furthermore, we have an opportunity to tag our response (outcome) and main predictor(s) (exposure).
dag <- dagitty('
dag {
depth -> mpa
depth -> fishing
mpa -> fishing
fishing -> biomass
complexity -> mpa
complexity -> biomass
human -> fishing
human -> biomass
mpa -> coral
biomass -> coral
biomass[outcome, pos="3, 1"]
fishing[pos = "2, 1"]
mpa[exposure, pos = "1, 1"]
depth[pos = "1, 2"]
complexity[pos = "2, 2"]
human[pos = "3, 2"]
coral[pos = "2, 0"]
}
')
ggdag(dag, text = TRUE, text_size = 2.5) +
theme_dag_blank()
dag <- dagify(
biomass ~ fishing + complexity + human,
fishing ~ mpa + depth + human,
mpa ~ depth + complexity,
coral ~ mpa + biomass
)
ggdag(dag, text = TRUE, text_size = 2.5) +
theme_dag_blank()
If the coordinates of the nodes are not specified (which they are not in the above example), they are determined by the “spring” layout algorithm. This algorithm attempts to position the nodes in a way that minimizes the overall energy of the system, similar to how physical systems tend to settle into stable configurations. Importantly, it is a stoichastic algorithm, meaning that the positions of the nodes will change each time the code is run. This can be useful for visualizing complex relationships, but it can also make it difficult to reproduce specific layouts. If you want to ensure that the layout remains consistent across runs, you can specify the coordinates of the nodes manually.
Furthermore, we have an opportunity to tag our response (outcome) and main predictor(s) (exposure).
coords <- list(
x = c(
biomass = 3, fishing = 2, mpa = 1, depth = 1,
complexity = 2, human = 3, coral = 2
),
y = c(
biomass = 1, fishing = 1, mpa = 1, depth = 2,
complexity = 2, human = 2, coral = 0
)
)
dag <- dagify(
biomass ~ fishing + complexity + human,
fishing ~ mpa + depth + human,
mpa ~ depth + complexity,
coral ~ mpa + biomass,
exposure = "mpa",
outcome = "biomass",
coords = coords
)
ggdag(dag, text = TRUE, text_size = 2.5) +
theme_dag_blank()
3.2 Step 2 - choose a statistical test
This step is largely independent of the first step, but will have consequences for Step 5. As always, the decisions about what sort of statistical analyses to use should be informed by the nature of the questions being asked as well as exploratory data analyses. For the purpose of Step 5, this tutorial will use frequentist linear models, yet for the specific causal inferences we will use Bayesian models.
3.3 Step 3
As indicated above, a DAG encodes all our structural relationship assumptions. For example, in our hypothetical reef observation study, we make numerous assumptions. Some of the more obvious assumptions include:
- depth and human gravity are independent
- complexity and depth are independent
- MPA and human gravity are independent
In each of these cases, there are no paths in which arrows can be followed from one of these variables to the other (even via other variables).
Others are a more complex and less obvious, such as:
- fishing and structural complexity are independent if we condition on MPA and depth. There is a path (
complexity -> mpa -> fishing
) unless we includempa
in which case, this path is blocked. Similarly, sincedepth
is a determinant ofmpa
, we also need to block this path. - coral cover and fishing are independent if we condition on biomass and MPA. There is a path (
fishing -> biomass -> coral
) unless we block this path by includingbiomass
. There is also a path (fishing <- mpa -> coral
) wherefishing
andcoral
are linked because they have a common cause (mpa
) and are thus only independent once we condition onmpa
.
There are also many more assumptions like those above, most of which are very difficult to identify without knowing all the complex rules and having a lot practice. Fortunately, once we have a DAG, we can then check to see which variables are conditionally independent of one another via the dagitty::impliedConditionalIndependencies()
function. Conditionally independent means that two variables are independent (not correlated) once we condition on (similar to controlling for - e.g. adding to a model) another variable. This provides a set of the explicit assumptions that are encapsulated within our conceptual model.
bmss _||_ dpth | cmpl, fshn, humn
bmss _||_ mpa | cmpl, fshn, humn
cmpl _||_ corl | bmss, mpa
cmpl _||_ dpth
cmpl _||_ fshn | dpth, mpa
cmpl _||_ humn
corl _||_ dpth | cmpl, fshn, humn, mpa
corl _||_ dpth | bmss, mpa
corl _||_ fshn | bmss, mpa
corl _||_ humn | bmss, mpa
dpth _||_ humn
humn _||_ mpa
The above function abbreviates variables and uses the following notation devices:
- \(\perp\!\!\!\perp\) is indendent of
- \(|\) is conditional on
Lets go through how to interpret some of these:
dpth _||_ humn
(\(Depth\perp\!\!\!\perp Human~gravity\)), depth is independent of human gravitycorl _||_ fshn | bmss, mpa
(\(Coral~cover\perp\!\!\!\perp Fishing~pressure | Fish~biomass, MPA\)), coral cover is independent of fishing pressure conditional on fish biomass and MPA status.
Graphically, if two variables are independent (there is no path between them), then they are d-separated (blocked) by the conditioning set.
Once we have the set of assumptions, we should systematically go through and test each of them against our data. For example, from above, we should fit a model of depth against human gravity. If any of the assumptions are violated, then we need to revise the DAG (but this should always be informed by domain knowledge)
The example, as presented in Arif and MacNeil (2022) is intentionally fictitious. This allows us to simulate data and in so doing be in the otherwise impossible situation of knowing the “truth”. This means that we can analyse the data and see how well the estimates align with the true values. So lets simulate data similar to that used by Arif and MacNeil (2022). Note, the observations generated in this simulation are completely unrealistic as they are simply generated from standard normal distributions. Nevertheless, the principles that follow are still valid.
set.seed(123)
N <- 10000
depth <- rnorm(N, mean = 0, sd = 1)
human <- rnorm(N, mean = 0, sd = 1)
complexity <- rnorm(N, mean = 0, sd = 1)
mpa <- rbinom(N, 1, prob = plogis(0.2 * depth + 2.8 * complexity))
fishing <- rnorm(N, mean = -0.99 * mpa + -0.2 * depth + 0.3 * human, sd = 1)
biomass <- rnorm(N, -1.1 * fishing + -0.4 * human + 1.65 * complexity, sd = 1)
coral <- rnorm(N, mean = 0.5 * mpa + 2.5 * biomass, sd = 1)
dat <- data.frame(depth, human, complexity, mpa, fishing, coral, biomass)
head(dat)
depth human complexity mpa fishing coral biomass
1 -0.56047565 2.3707252 -0.8362967 0 -0.06474014 -2.798193 -1.2848044
2 -0.23017749 -0.1668120 -0.2205730 1 -0.63374371 2.524107 0.1764197
3 1.55870831 0.9269614 -2.1035148 0 0.38843668 -10.319247 -4.1216673
4 0.07050839 -0.5681517 -1.6678075 0 -0.92777814 -3.268197 -1.7298705
5 0.12928774 0.2250901 -1.0979629 0 0.22189078 -3.114242 -1.5624274
6 1.71506499 1.1319859 -1.6656212 0 0.96233946 -7.754280 -3.2893050
For illustrative purposes, there is a function in the dagitty
package that systematically goes through the set of conditional independences and tests each one (localTests()
). Although we should perform the usual exploratory data analyses and carefully consider each of the individual analyses (e.g. the model structure, error distribution etc), in the interest of brevity, we will use the localTests()
function on this occasion.
By default, the localTests()
function performs a series of correlation tests.
estimate p.value 2.5%
bmss _||_ dpth | cmpl, fshn, humn -1.326104e-02 0.18491146 -0.032855508
bmss _||_ mpa | cmpl, fshn, humn 1.255555e-02 0.20938999 -0.007049193
cmpl _||_ corl | bmss, mpa -2.162561e-03 0.82882998 -0.021763670
cmpl _||_ dpth 2.000494e-02 0.04545055 0.000405027
cmpl _||_ fshn | dpth, mpa 4.665034e-05 0.99627879 -0.019555398
cmpl _||_ humn -3.067806e-03 0.75904469 -0.022666513
corl _||_ dpth | cmpl, fshn, humn, mpa -2.190013e-02 0.02855351 -0.041486319
corl _||_ dpth | bmss, mpa -2.557142e-02 0.01055610 -0.045150831
corl _||_ fshn | bmss, mpa -2.631970e-03 0.79244899 -0.022232854
corl _||_ humn | bmss, mpa -7.609865e-03 0.44677002 -0.027207838
dpth _||_ humn 6.021718e-03 0.54711505 -0.013579955
humn _||_ mpa 8.364386e-03 0.40296780 -0.011237526
97.5%
bmss _||_ dpth | cmpl, fshn, humn 0.006343624
bmss _||_ mpa | cmpl, fshn, humn 0.032150653
cmpl _||_ corl | bmss, mpa 0.017440208
cmpl _||_ dpth 0.039589492
cmpl _||_ fshn | dpth, mpa 0.019648663
cmpl _||_ humn 0.016533258
corl _||_ dpth | cmpl, fshn, humn, mpa -0.002297126
corl _||_ dpth | bmss, mpa -0.005972383
corl _||_ fshn | bmss, mpa 0.016970936
corl _||_ humn | bmss, mpa 0.011993955
dpth _||_ humn 0.025618765
humn _||_ mpa 0.027959873
Interestingly, although the DAG assumptions are broadly consistent with the data, there are three conditional independencies (assumptions) that are unsupported by the data. This is curious given that these data were simulated and these correlations were not baked into the data. That said, if we used a cut off of 0.05, then we are accepting that there is a 5% chance of a false rejection and we did not adjust for multiple tests - if we did, there would definitely not be any p-values less than 0.05 given the number of tests performed.
With real data, we would scrutinise the outcomes of these tests more thoroughly and potentially even either reconsider our conceptual model or otherwise attempt to diagnose why the conceptual model was seemingly inconsistent with the data.
3.4 Step 4 - Identify and adjust for biases
We now need to determine what covariates must (and must not) be added to a statistical model in order to estimate the causal inference of our main predictor (exposure: MPA) on the response (outcome: fish biomass). That is, what combination of covariates must be added in order to avoid biases. This is largely achieved via two rules/criterion:
Backdoor criterion: rules used to identify a set of (covariates) variables that blocks all backdoor (non-causal) paths from the predictor (exposure) to the response (outcome) while leaving all causal paths open. A backdoor path is a path that starts with an arrow pointing into the exposure variable (MPA in this case).
Frontdoor criterion: rules used to identify a set of variables that blocks all frontdoor paths from the exposure to the outcome. A frontdoor path blocks all non-causal paths from exposure to outcome as well as from Z to Y (e.g. it blocks all confounding). A frontdoor path is a path that starts with an arrow pointing from X and into Y via intermediate variables, i.e. X -> Z -> Y, d-separation The fontdoor criterion is useful if the backdoor criterion nominates a variable that is missing from the data - it provides an alternative.
There are three biases that we need to be aware of that operate respectively within the following four relational structures:
Relation type / DAG | Description |
---|---|
|
|
|
|
|
|
|
Overcontrol bias: is caused when we condition on a mediator thereby blocking the path from the exposure to the outcome.
Confounding bias: is caused when a variable that is a common cause of both the exposure and the outcome (i.e. it is a backdoor path) is not included in a model.
Collider bias: is caused when a variable that is a common effect of the exposure and the outcome (i.e. it is a frontdoor path) is included in a model.
So many rules. If this all seems confusing, it is because it is! Fortunately, yet again, there are tools in the dagitty
package to help identify what adjustment sets (covariates required to allow unbiased estimates of causal inferences).
Firstly, we could explore the open paths that exist between the exposure (MPA) and the outcome (Fish biomass).
ggdag_paths(dag, from = "mpa", to = "biomass", text_col = "black") +
coord_cartesian(expand = FALSE, xlim=c(0.5, 3.5), ylim=c(0.5,2.5)) +
theme_dag_blank(panel.border = element_rect(fill = NA))
It appears that there are three open paths between MPA and fish biomass:
- path 1 is a pipe and thus the backdoor criterion would suggest we do not condition on fishing pressure if we wish to avoid overcontrol bias and estimate the total effect of mpa on fish biomass
- path 2 is a fork and thus the backdoor criterion would suggest we do condition on structural complexity if we wish to avoid confounding bias and estimate the total effect of MPA on fish biomass
- path 3 has a fork on top of a pipe. In order to open the flow of information between MPA and fishing (and thus complete the pipe) the backdoor criterion would suggest we do need to condition on depth, but do not condition on fishing pressure.
Even more help comes in the form of the adjustmentSets()
function which will identify these adjustments for us such that we do not need to concern ourselves with pipes, forks and colliders.
Now we will inquire about the adjustment sets when seeking to explore the total effect of MPA on fish biomass.
{ complexity, depth }
set.seed(123)
ggdag_adjustment_set(dag,
exposure = "mpa",
outcome = "biomass",
effect = "total",
shadow = TRUE, text_col = "black"
) +
theme_dag_blank()
This confirms that in addition to MPA, a statistical model would need to include both structural complexity and depth, yet should not include fishing pressure, human gravity or coral cover.
So what if we also wanted to estimate the effects of structural complexity on fish biomass? We can follow the same steps.
ggdag_paths(dag, from = "complexity", to = "biomass", text_col = "black") +
coord_cartesian(expand = FALSE, xlim=c(0.5, 3.5), ylim=c(0.5,2.5)) +
theme_dag_blank(panel.border = element_rect(fill = NA))
There are two paths, the first of which is a direct path and the second is a pipe flowing through two mediators (MPA and fishing). In both cases, the backdoor criterion would suggest that we do not add any other covariates to a model containing structural complexity.
{}
set.seed(123)
ggdag_adjustment_set(dag,
exposure = "complexity",
outcome = "biomass",
effect = "total",
shadow = TRUE, text_col = "black"
) +
theme_dag_blank()
For this last exploration (effect of structural complexity on fish biomass), we noted that there were two paths. When we inquired about the adjustment sets required to estimate the total effects, the backdoor criterion suggested the model should not include any other covariates. But can we distinguish between the direct effects of structural complexity (pathway complexity -> biomass
) and the indirect effects (pathway complexity -> mpa -> fishing -> biomass
)?
{ fishing, human }
{ depth, mpa }
set.seed(123)
ggdag_adjustment_set(dag,
exposure = "complexity",
outcome = "biomass",
effect = "direct",
shadow = TRUE, text_col = "black"
) +
theme_dag_blank()
In order to estimate the direct causal effects of structural complexity on fish biomass, we need to condition on either:
- fishing pressure and human gravity or
- depth and MPA
3.5 Step 5 - test for causality
To get more of a feel for these data, lets plot some of the relationships.
g1 <-
dat |>
ggplot(aes(y = biomass, x = factor(mpa))) +
geom_boxplot()
g2 <-
dat |>
ggplot(aes(y = biomass, x = fishing)) +
geom_point()
g3 <-
dat |>
ggplot(aes(y = fishing, x = factor(mpa))) +
geom_boxplot()
g1 + g2 + g3 & theme_bw()
Before we attempt to estimate the total causal effects of MPA on fish biomass, lets calculate what it should be based on the parameters used to simulated the data. We can do this by decomposing the simulation formulas into their parts. To do so, we just need to focus on the mpa -> fishing -> biomass
pathway.
- biomass was drawn from a normal distribution with a mean of \(-1.1\times fishing + -0.4\times human + 1.65\times complexity\) of which only the fishing component (-1.1) is relevant
- fishing was drawn from a normal distribution with a mean of \(-0.99\times mpa + -0.2\times depth + 0.3\times human\) of which only the MPA component (-0.99) is relevant.
Hence the effect of MPA on fish biomass should be approximately \(-1.1\times -0.99 = 1.098\).
Step 4 above indicated that to yield unbiased causal effects of MPA on fish biomass, the model would need to condition on both depth and structural complexity.
form <- bf(biomass ~ factor(mpa) + depth + complexity)
## summarise data to help inform priors
dat |>
group_by(mpa) |>
summarise(median(biomass), sd(biomass),
depth_sd = sd(biomass) / sd(depth),
complexity_sd = sd(biomass) / sd(complexity)
)
## define priors
priors <- prior(normal(-1, 4), class = "Intercept") +
prior(normal(0, 5), class = "b") +
prior(student_t(3, 0, 4), class = "sigma")
## fit the model
mod_brms1a <- brm(
form,
prior = priors,
data = dat,
refresh = 0,
## backend = "cmdstanr"
backend = "rstan"
)
Family: gaussian
Links: mu = identity; sigma = identity
Formula: biomass ~ factor(mpa) + depth + complexity
Data: dat (Number of observations: 10000)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 0.02 0.03 -0.04 0.07 1.00 3620 3199
factormpa1 1.11 0.05 1.02 1.20 1.00 3331 3053
depth 0.21 0.02 0.18 0.24 1.00 4348 2398
complexity 1.64 0.02 1.60 1.69 1.00 3235 2958
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 1.66 0.01 1.63 1.68 1.00 6002 3105
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
The estimate for the effect of MPA on fish biomass is 1.108 (1.018 - 1.2) which is consistent with the true value of \(1.089\).
If we fail to condition on both structural complexity and depth, a fork structure will exist resulting in confounding bias.
form <- bf(biomass ~ factor(mpa))
## summarise data to help inform priors
dat |>
group_by(mpa) |>
summarise(median(biomass), sd(biomass))
## define priors
priors <- prior(normal(-1, 4), class = "Intercept") +
prior(normal(0, 5), class = "b") +
prior(student_t(3, 0, 4), class = "sigma")
## fit the model
mod_brms1b <- brm(
form,
prior = priors,
data = dat,
refresh = 0,
## backend = "cmdstanr"
backend = "rstan"
)
Family: gaussian
Links: mu = identity; sigma = identity
Formula: biomass ~ factor(mpa)
Data: dat (Number of observations: 10000)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -1.12 0.03 -1.18 -1.07 1.00 3905 2963
factormpa1 3.38 0.04 3.29 3.45 1.00 3536 2785
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 2.05 0.01 2.03 2.08 1.00 4932 3381
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
The estimate for the effect of MPA on fish biomass is 3.375 (3.294 - 3.453) which is not consistent with the true value of \(1.089\).
If we correctly recognised that we need to condition on depth and structural complexity (leaving just a pipe structure), yet we erroneously also conditioned on fishing pressure, we would create a overcontrol bias.
form <- bf(biomass ~ factor(mpa) + depth + complexity + fishing)
## summarise data to help inform priors
dat |>
group_by(mpa) |>
summarise(median(biomass), sd(biomass),
depth_sd = sd(biomass) / sd(depth),
complexity_sd = sd(biomass) / sd(complexity),
fishing_sd = sd(biomass) / sd(fishing)
)
## define priors
priors <- prior(normal(-1, 4), class = "Intercept") +
prior(normal(0, 5), class = "b") +
prior(student_t(3, 0, 4), class = "sigma")
## fit the model
mod_brms1c <- brm(
form,
prior = priors,
data = dat,
refresh = 0,
## backend = "cmdstanr"
backend = "rstan"
)
Family: gaussian
Links: mu = identity; sigma = identity
Formula: biomass ~ factor(mpa) + depth + complexity + fishing
Data: dat (Number of observations: 10000)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 0.01 0.02 -0.02 0.05 1.00 4234 3413
factormpa1 -0.09 0.03 -0.15 -0.03 1.00 3169 2854
depth -0.04 0.01 -0.06 -0.02 1.00 5470 3348
complexity 1.64 0.01 1.61 1.67 1.00 3155 3077
fishing -1.20 0.01 -1.22 -1.18 1.00 4498 3086
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 1.08 0.01 1.07 1.10 1.00 6757 2510
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
The estimate for the effect of MPA on fish biomass is -0.092 (-0.154 - -0.032) which is not consistent with the true value of \(1.089\).
If we correctly recognised that we need to condition on depth and structural complexity (leaving just a pipe structure), yet we erroneously also conditioned on coral cover, we would create a collider bias.
form <- bf(biomass ~ factor(mpa) + depth + complexity + coral)
## summarise data to help inform priors
dat |>
group_by(mpa) |>
summarise(median(biomass), sd(biomass),
depth_sd = sd(biomass) / sd(depth),
complexity_sd = sd(biomass) / sd(complexity),
coral_sd = sd(biomass) / sd(coral)
)
## define priors
priors <- prior(normal(-1, 4), class = "Intercept") +
prior(normal(0, 5), class = "b") +
prior(student_t(3, 0, 4), class = "sigma")
## fit the model
mod_brms1d <- brm(
form,
prior = priors,
data = dat,
refresh = 0,
## backend = "cmdstanr"
backend = "rstan"
)
Family: gaussian
Links: mu = identity; sigma = identity
Formula: biomass ~ factor(mpa) + depth + complexity + coral
Data: dat (Number of observations: 10000)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -0.01 0.01 -0.02 0.01 1.00 2789 2884
factormpa1 -0.11 0.01 -0.13 -0.09 1.00 2542 2671
depth 0.02 0.00 0.01 0.03 1.00 4086 2820
complexity 0.09 0.01 0.08 0.11 1.00 2568 2519
coral 0.38 0.00 0.37 0.38 1.00 4638 3913
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 0.39 0.00 0.39 0.40 1.00 3061 2412
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
The estimate for the effect of MPA on fish biomass is -0.113 (-0.134 - -0.091) which is not consistent with the true value of \(1.089\).
To get more of a feel for these data, lets plot some of the relationships.
g1 <-
dat |>
ggplot(aes(y = biomass, x = complexity)) +
geom_point()
g2 <-
dat |>
ggplot(aes(y = biomass, x = fishing)) +
geom_point()
g3 <-
dat |>
ggplot(aes(y = fishing, x = factor(mpa))) +
geom_boxplot()
g4 <-
dat |>
ggplot(aes(y = mpa, x = complexity)) +
geom_point()
g1 + g2 + g3 + g4 & theme_bw()
Recall that there are two pathways between structural complexity and biomass and one of these pathways is a direct pathway (complexity -> biomass
). Hence, we can calculate the direct effects, indirect effects and total effects of structural complexity on fish biomass.
Before we attempt to estimate each of these effects, lets calculate what they should be based on the parameters used to simulated the data. We can do this by decomposing the simulation formulas into their parts.
- biomass was drawn from a normal distribution with a mean of \(-1.1\times fishing + -0.4\times human + 1.65\times complexity\) of which only the complexity component (1.65) is relevant
According to the adjusted sets we explored earlier, the direct effects of structural complexity on fish biomass can be achieved by conditioning on either MPA and depth (which we did when modelling MPA -> biomass
) or fishing pressure and human gravity.
form <- bf(biomass ~ complexity + fishing + human)
## summarise data to help inform priors
dat |>
summarise(median(biomass), sd(biomass),
complexity_sd = sd(biomass) / sd(complexity),
fishing_sd = sd(biomass) / sd(fishing),
human_sd = sd(biomass) / sd(human)
)
## define priors
priors <- prior(normal(-1, 4), class = "Intercept") +
prior(normal(0, 5), class = "b") +
prior(student_t(3, 0, 4), class = "sigma")
## fit the model
mod_brms2a <- brm(
form,
prior = priors,
data = dat,
refresh = 0,
## backend = "cmdstanr"
backend = "rstan"
)
Family: gaussian
Links: mu = identity; sigma = identity
Formula: biomass ~ complexity + fishing + human
Data: dat (Number of observations: 10000)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 0.02 0.01 -0.00 0.04 1.00 4938 3194
complexity 1.65 0.01 1.63 1.67 1.00 4299 3238
fishing -1.09 0.01 -1.11 -1.08 1.00 5000 3582
human -0.41 0.01 -0.43 -0.39 1.00 4384 3082
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 1.01 0.01 0.99 1.02 1.00 4714 2926
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
The estimate for the effect of MPA on fish biomass is 1.647 (1.626 - 1.668) which is consistent with the true value of \(1.65\).
Family: gaussian
Links: mu = identity; sigma = identity
Formula: biomass ~ factor(mpa) + depth + complexity
Data: dat (Number of observations: 10000)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 0.02 0.03 -0.04 0.07 1.00 3620 3199
factormpa1 1.11 0.05 1.02 1.20 1.00 3331 3053
depth 0.21 0.02 0.18 0.24 1.00 4348 2398
complexity 1.64 0.02 1.60 1.69 1.00 3235 2958
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 1.66 0.01 1.63 1.68 1.00 6002 3105
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
form <- bf(biomass ~ complexity)
## summarise data to help inform priors
dat |>
summarise(median(biomass), sd(biomass),
complexity_sd = sd(biomass) / sd(complexity)
)
## define priors
priors <- prior(normal(-1, 4), class = "Intercept") +
prior(normal(0, 5), class = "b") +
prior(student_t(3, 0, 4), class = "sigma")
## fit the model
mod_brms2b <- brm(
form,
prior = priors,
data = dat,
refresh = 0,
## backend = "cmdstanr"
backend = "rstan"
)
Family: gaussian
Links: mu = identity; sigma = identity
Formula: biomass ~ complexity
Data: dat (Number of observations: 10000)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 0.57 0.02 0.54 0.61 1.00 3877 3085
complexity 2.03 0.02 1.99 2.06 1.00 4001 3037
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 1.72 0.01 1.70 1.75 1.00 4433 2907
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
biomass was drawn from a normal distribution with a mean of \(-1.1\times fishing + -0.4\times human + 1.65\times complexity\) of which only the fishing component (-1.1) is relevant
fishing was drawn from a normal distribution with a mean of \(-0.99\times mpa + -0.2\times depth + 0.3\times human\) of which only the MPA component (-0.99) is relevant.
MPA was drawn from a binomial distribution with a probability of \(0.2\times depth + 2.8\times complexity\) (on the logit scale) of which only the complexity component (2.8) is relevant. Since 2.8 is on the logit scale, it does not represent a constant slope (unlike the other Gaussian parameters). We can approximate the linear slope by calculating the average slope.
\[ \frac{dp}{dx} = p(1-p) \cdot \beta \] where \(p = \frac{1}{1 + e^{-\beta\times X}}\) and \(X\) is a vector of the predictor values.
Hence the indirect effect of complexity on fish biomass should be approximately \((-1.1\times -0.99)\times 0.335 = 0.365\)