Causal inference

Author

Murray Logan

Published

July 10, 2025

1 Preparations

This tutorial makes use of a couple of specialist R packages (dagitty and ggdag) in addition to the tidyverse, the Bayesian modelling package brms, and patchwork (for arranging plots). So we will start by loading the necessary libraries.

library(tidyverse)
library(brms)
library(dagitty)
library(ggdag)
library(patchwork)

2 Background

Understanding the relationships between variables is a fundamental aspect of ecological research. However, interpreting these relationships requires careful consideration of statistical principles and study design. Carefully controlled experimental designs have traditionally been the gold standard for inferring causation. However, observational studies are becoming increasingly important to answer large-scale ecological questions for which experiments are not feasible or suitable. Whilst most studies aim to answer causal questions, doing so can be challenging and requires careful consideration of causal relationships if we are to avoid confounding and other biases - hence the phrase “correlation does not imply causation”.

Moreover, there is often an urge to measure as many variables as possible just in case they might turn out to be useful in statistical analyses - particularly in observational studies, or surveys. Then when it comes to performing the statistical analyses, there is a temptation to put everything in the model under the false belief that the model will be “better”. Blindly doing so will often result in the models behaving quite differently from what you might expect and producing spurious and misleading estimates.

This tutorial introduces the concepts of correlation, confounding, and causation, emphasizing their importance in ecological studies as well as how we can identify and adjust models for causal inference.

3 Structural Causal Modelling (SCM) framework

Structural Causal Modelling (SCM) is a useful and robust logic-based framework for identifying cause and effect relationships and thus supporting causal inference from observational data. The framework comprises the following seven steps (the details of each step are covered below):

  1. Create a conceptual model of the system of interest

  2. Choose a statistical test

  3. Test the consistency of your conceptual model with the data

  4. Identify biases (confounding, overcontrol, collider) that you need to adjust for when testing the cause of interest

  5. Test for causality with the appropriate statistical test

  6. Repeat steps 3-5 for each cause of interest

  7. Repeat steps 1-6 for other conceptual models

3.1 Step 1 - create the conceptual model

The conceptual model represents the researchers’ understanding of the system of interest as well as their assumptions about the causal structure between variables. This structure should be based on domain knowledge, prior research, and expert opinion.

For example, a conceptual model might indicate that X affects Y (X -> Y), or perhaps X and Y might both be affected by Z, in which case Z is a confounder of the relationship between X and Y (X <- Z -> Y).

The conceptual model may include both observed and unobserved variables. Essentially, the conceptual model encodes our structural assumptions about the relationships in the system. Importantly, it does not impose the manner in which those relationships are investigated. That is, it is agnostic to the kinds of statistical models you use - they could be linear/additive/non-linear, frequentist/Bayesian, hierarchical, etc.

The conceptual model is visually represented as a Directed Acyclic Graph (DAG) where nodes (dots) represent variables and arrows represent assumed causal relationships. Following are some important properties of a DAG:

  • the direction of an arrow indicates the direction of causality, e.g. X -> Y means that X causes Y.
  • it is acyclic, meaning that there are no cycles or loops in the graph, i.e. you cannot go from X to Y and then back to X.
  • the presence of an arrow implies a possible causal relationship.
  • the absence of an arrow encodes an explicit assumption of no direct causal relationship.

To help describe DAGs, I think it would be useful to provide some examples with real variables. So let's consider a hypothetical example presented in Arif and MacNeil (2022) that explored the effects of marine protected areas (MPAs) on reef fish biomass.

There is an R package (dagitty) to help with the construction and interrogation of DAGs. For constructing the DAG, we can use either the dagitty() function or the dagify() function.

dag <- dagitty("
dag {
  depth -> mpa
  depth -> fishing
  mpa -> fishing
  fishing -> biomass
  complexity -> mpa
  complexity -> biomass
  human -> fishing
  human -> biomass
  mpa -> coral
  biomass -> coral
}
")
ggdag(dag, text = TRUE, text_size = 2.5) +
  theme_dag_blank()

If the coordinates of the nodes are not specified (which they are not in the above example), they are determined by the “spring” layout algorithm. This algorithm attempts to position the nodes in a way that minimizes the overall energy of the system, similar to how physical systems tend to settle into stable configurations. Importantly, it is a stochastic algorithm, meaning that the positions of the nodes will change each time the code is run. This can be useful for visualizing complex relationships, but it can also make it difficult to reproduce specific layouts. If you want to ensure that the layout remains consistent across runs, you can specify the coordinates of the nodes manually.
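Alternatively, if you are happy with the spring layout but want it to be repeatable, a minimal sketch (assuming the layout draws on R's random number generator, as the set.seed() calls used later in this tutorial suggest) is to fix the seed immediately before plotting:

## fixing the RNG seed should make the stochastic layout repeatable
set.seed(123)
ggdag(dag, text = TRUE, text_size = 2.5) +
  theme_dag_blank()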

Furthermore, we have an opportunity to tag our response (outcome) and main predictor(s) (exposure).

dag <- dagitty('
dag {
  depth -> mpa
  depth -> fishing
  mpa -> fishing
  fishing -> biomass
  complexity -> mpa
  complexity -> biomass
  human -> fishing
  human -> biomass
  mpa -> coral
  biomass -> coral
  biomass[outcome, pos="3, 1"]
  fishing[pos = "2, 1"]
  mpa[exposure, pos = "1, 1"]
  depth[pos = "1, 2"]
  complexity[pos = "2, 2"]
  human[pos = "3, 2"]
  coral[pos = "2, 0"]
}
')
ggdag(dag, text = TRUE, text_size = 2.5) +
  theme_dag_blank()

The same DAG can alternatively be constructed with the dagify() function, which uses R model formula syntax (each formula reads as a variable ~ its direct causes).

dag <- dagify(
  biomass ~ fishing + complexity + human,
  fishing ~ mpa + depth + human,
  mpa ~ depth + complexity,
  coral ~ mpa + biomass
)
ggdag(dag, text = TRUE, text_size = 2.5) +
  theme_dag_blank()

As before, the node positions are otherwise determined by the stochastic “spring” layout algorithm, so we can supply the coordinates manually and tag our response (outcome) and main predictor(s) (exposure).

coords <- list(
    x = c(
        biomass = 3, fishing = 2, mpa = 1, depth = 1,
        complexity = 2, human = 3, coral = 2
    ),
    y = c(
        biomass = 1, fishing = 1, mpa = 1, depth = 2,
        complexity = 2, human = 2, coral = 0
    )
)
dag <- dagify(
  biomass ~ fishing + complexity + human,
  fishing ~ mpa + depth + human,
  mpa ~ depth + complexity,
  coral ~ mpa + biomass,
  exposure = "mpa",
  outcome = "biomass",
  coords = coords
)
ggdag(dag, text = TRUE, text_size = 2.5) +
  theme_dag_blank()

3.2 Step 2 - choose a statistical test

This step is largely independent of the first step, but will have consequences for Step 5. As always, decisions about what sort of statistical analyses to use should be informed by the nature of the questions being asked as well as by exploratory data analyses. In this tutorial we will use frequentist correlation tests to check the consistency of the conceptual model with the data (Step 3), and Bayesian linear models (via brms) for the causal inferences themselves (Step 5).

3.3 Step 3 - test the consistency of the conceptual model with the data

As indicated above, a DAG encodes all our structural relationship assumptions. For example, in our hypothetical reef observation study, we make numerous assumptions. Some of the more obvious assumptions include:

  • depth and human gravity are independent
  • complexity and depth are independent
  • MPA and human gravity are independent

In each of these cases, every path between the two variables is blocked without conditioning on anything (the only connecting paths pass through colliders), so the variables are marginally independent.

Others are more complex and less obvious, such as:

  • fishing and structural complexity are independent if we condition on MPA and depth. There is a path (complexity -> mpa -> fishing) that is blocked once we condition on mpa. However, conditioning on mpa (a collider on the path complexity -> mpa <- depth -> fishing) opens a second path through depth, so we must also condition on depth to block it.
  • coral cover and fishing are independent if we condition on biomass and MPA. There is a path (fishing -> biomass -> coral) that we block by conditioning on biomass. There is also a path (fishing <- mpa -> coral) in which fishing and coral are linked through a common cause (mpa) and are thus only independent once we condition on mpa. We can verify this sort of path reasoning programmatically, as sketched below.
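Here is a quick check of that reasoning (a sketch using the dagitty::paths() function, which lists every path between two variables and whether each is currently open given a conditioning set Z):

## all paths between fishing and coral cover
dagitty::paths(dag, from = "fishing", to = "coral")
## the same paths after conditioning on biomass and mpa (all should be closed)
dagitty::paths(dag, from = "fishing", to = "coral", Z = c("biomass", "mpa"))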

There are also many more assumptions like those above, most of which are very difficult to identify without knowing all the complex rules and having a lot of practice. Fortunately, once we have a DAG, we can check which variables are conditionally independent of one another via the dagitty::impliedConditionalIndependencies() function. Conditionally independent means that two variables are independent (not correlated) once we condition on (similar to controlling for - e.g. adding to a model) another variable. This provides the set of explicit assumptions that are encapsulated within our conceptual model.

dagitty::impliedConditionalIndependencies(dag) 
bmss _||_ dpth | cmpl, fshn, humn
bmss _||_ mpa | cmpl, fshn, humn
cmpl _||_ corl | bmss, mpa
cmpl _||_ dpth
cmpl _||_ fshn | dpth, mpa
cmpl _||_ humn
corl _||_ dpth | cmpl, fshn, humn, mpa
corl _||_ dpth | bmss, mpa
corl _||_ fshn | bmss, mpa
corl _||_ humn | bmss, mpa
dpth _||_ humn
humn _||_ mpa

The above function abbreviates variable names and uses the following notation:

  • \(\perp\!\!\!\perp\) (printed as _||_) means “is independent of”
  • \(|\) means “conditional on”

Let's go through how to interpret some of these:

  • dpth _||_ humn (\(Depth\perp\!\!\!\perp Human~gravity\)), depth is independent of human gravity
  • corl _||_ fshn | bmss, mpa (\(Coral~cover\perp\!\!\!\perp Fishing~pressure | Fish~biomass, MPA\)), coral cover is independent of fishing pressure conditional on fish biomass and MPA status.

Graphically, two variables are conditionally independent when every path between them is blocked by the conditioning set - they are then said to be d-separated.
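d-separation claims can also be queried directly (a sketch using the dagitty::dseparated() function):

## is coral d-separated from fishing given {biomass, mpa}? (expect TRUE)
dagitty::dseparated(dag, "coral", "fishing", c("biomass", "mpa"))
## and marginally, with no conditioning set? (expect FALSE)
dagitty::dseparated(dag, "coral", "fishing")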

Once we have the set of assumptions, we should systematically go through and test each of them against our data. For example, from above, we should fit a model of depth against human gravity. If any of the assumptions are violated, then we need to revise the DAG (though any revision should always be informed by domain knowledge).

The example, as presented in Arif and MacNeil (2022), is intentionally fictitious. This allows us to simulate data and, in so doing, be in the otherwise impossible situation of knowing the “truth”. This means that we can analyse the data and see how well the estimates align with the true values. So let's simulate data similar to that used by Arif and MacNeil (2022). Note, the observations generated in this simulation are completely unrealistic as they are simply drawn from standard normal distributions. Nevertheless, the principles that follow are still valid.

set.seed(123)
N <- 10000
## exogenous variables (no causes within the DAG)
depth <- rnorm(N, mean = 0, sd = 1)
human <- rnorm(N, mean = 0, sd = 1)
complexity <- rnorm(N, mean = 0, sd = 1)
## mpa is caused by depth and complexity (effects on the logit scale)
mpa <- rbinom(N, 1, prob = plogis(0.2 * depth + 2.8 * complexity))
## fishing is caused by mpa, depth and human gravity
fishing <- rnorm(N, mean = -0.99 * mpa + -0.2 * depth + 0.3 * human, sd = 1)
## biomass is caused by fishing, human gravity and complexity
biomass <- rnorm(N, -1.1 * fishing + -0.4 * human + 1.65 * complexity, sd = 1)
## coral is caused by mpa and biomass
coral <- rnorm(N, mean = 0.5 * mpa + 2.5 * biomass, sd = 1)
dat <- data.frame(depth, human, complexity, mpa, fishing, coral, biomass)
head(dat)
        depth      human complexity mpa     fishing      coral    biomass
1 -0.56047565  2.3707252 -0.8362967   0 -0.06474014  -2.798193 -1.2848044
2 -0.23017749 -0.1668120 -0.2205730   1 -0.63374371   2.524107  0.1764197
3  1.55870831  0.9269614 -2.1035148   0  0.38843668 -10.319247 -4.1216673
4  0.07050839 -0.5681517 -1.6678075   0 -0.92777814  -3.268197 -1.7298705
5  0.12928774  0.2250901 -1.0979629   0  0.22189078  -3.114242 -1.5624274
6  1.71506499  1.1319859 -1.6656212   0  0.96233946  -7.754280 -3.2893050

For illustrative purposes, there is a function in the dagitty package, localTests(), that systematically goes through the set of conditional independencies and tests each one. Although we should perform the usual exploratory data analyses and carefully consider each of the individual analyses (e.g. the model structure, error distribution etc.), in the interest of brevity, we will use the localTests() function on this occasion.

By default, the localTests() function performs a series of correlation tests.

tests <- localTests(x = dag, data = dat)
tests
                                            estimate    p.value         2.5%
bmss _||_ dpth | cmpl, fshn, humn      -1.326104e-02 0.18491146 -0.032855508
bmss _||_ mpa | cmpl, fshn, humn        1.255555e-02 0.20938999 -0.007049193
cmpl _||_ corl | bmss, mpa             -2.162561e-03 0.82882998 -0.021763670
cmpl _||_ dpth                          2.000494e-02 0.04545055  0.000405027
cmpl _||_ fshn | dpth, mpa              4.665034e-05 0.99627879 -0.019555398
cmpl _||_ humn                         -3.067806e-03 0.75904469 -0.022666513
corl _||_ dpth | cmpl, fshn, humn, mpa -2.190013e-02 0.02855351 -0.041486319
corl _||_ dpth | bmss, mpa             -2.557142e-02 0.01055610 -0.045150831
corl _||_ fshn | bmss, mpa             -2.631970e-03 0.79244899 -0.022232854
corl _||_ humn | bmss, mpa             -7.609865e-03 0.44677002 -0.027207838
dpth _||_ humn                          6.021718e-03 0.54711505 -0.013579955
humn _||_ mpa                           8.364386e-03 0.40296780 -0.011237526
                                              97.5%
bmss _||_ dpth | cmpl, fshn, humn       0.006343624
bmss _||_ mpa | cmpl, fshn, humn        0.032150653
cmpl _||_ corl | bmss, mpa              0.017440208
cmpl _||_ dpth                          0.039589492
cmpl _||_ fshn | dpth, mpa              0.019648663
cmpl _||_ humn                          0.016533258
corl _||_ dpth | cmpl, fshn, humn, mpa -0.002297126
corl _||_ dpth | bmss, mpa             -0.005972383
corl _||_ fshn | bmss, mpa              0.016970936
corl _||_ humn | bmss, mpa              0.011993955
dpth _||_ humn                          0.025618765
humn _||_ mpa                           0.027959873
plotLocalTestResults(tests)

Interestingly, although the DAG assumptions are broadly consistent with the data, there are three conditional independencies (assumptions) that are unsupported by the data. This is curious given that these data were simulated and these correlations were not baked into the data. That said, with a cutoff of 0.05 we are accepting a 5% chance of falsely rejecting each assumption, and we did not adjust for multiple tests - if we had, none of the twelve p-values would fall below the adjusted threshold.
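We can check this with a standard multiple-testing correction applied to the p-values returned by localTests() (a sketch using Holm's method; localTests() returns its results as a data frame with a p.value column):

## adjust the localTests() p-values for multiple comparisons
round(p.adjust(tests$p.value, method = "holm"), 3)
## none of the adjusted p-values fall below 0.05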

With real data, we would scrutinise the outcomes of these tests more thoroughly and potentially even either reconsider our conceptual model or otherwise attempt to diagnose why the conceptual model was seemingly inconsistent with the data.

3.4 Step 4 - identify and adjust for biases

We now need to determine which covariates must (and must not) be added to a statistical model in order to estimate the causal effect of our main predictor (exposure: MPA) on the response (outcome: fish biomass). That is, what combination of covariates must be added in order to avoid biases. This is largely achieved via two criteria:

  • Backdoor criterion: rules used to identify a set of variables (covariates) that blocks all backdoor (non-causal) paths from the predictor (exposure) to the response (outcome) while leaving all causal paths open. A backdoor path is a path that starts with an arrow pointing into the exposure variable (MPA in this case).

  • Frontdoor criterion: rules used to identify a set of mediating variables (Z) that intercept all directed paths from the exposure (X) to the outcome (Y). A frontdoor path is a path that starts with an arrow pointing from X and reaches Y via intermediate variables, i.e. X -> Z -> Y. Because the effect of X on Y can then be estimated through Z, the frontdoor criterion is useful if the backdoor criterion nominates a variable that is missing from the data - it provides an alternative.

There are four relational structures, three of which give rise to biases that we need to be aware of:

  • Pipe (X -> M -> Y): X indirectly affects Y through a mediator, M. Conditioning on M blocks the path, resulting in overcontrol bias. The backdoor criterion would instruct not to condition on M, unless we want to estimate the direct effect of X on Y.
  • Fork (X <- C -> Y): C (confounder) is a common cause of both X and Y. Not conditioning on C results in confounding bias; conditioning on C blocks the path. The backdoor criterion would instruct to condition on C.
  • Collider (X -> C <- Y): C (collider) is a common effect of both X and Y. X and Y are independent unless we condition on C, in which case collider bias results. The backdoor criterion would instruct not to condition on C.
  • Descendant: a variable downstream of a mediator, confounder or collider; conditioning on a descendant acts as a weaker form of conditioning on its parent.
  • Overcontrol bias: caused when we condition on a mediator, thereby blocking the causal path from the exposure to the outcome.

  • Confounding bias: caused when a variable that is a common cause of both the exposure and the outcome (i.e. the source of a backdoor path) is not included in a model.

  • Collider bias: caused when a variable that is a common effect of the exposure and the outcome is included in a model, thereby opening a non-causal path.

So many rules! If this all seems confusing, it is because it is. Fortunately, yet again, there are tools in the dagitty package to help identify the adjustment sets (the covariates required to obtain unbiased estimates of causal effects).

Firstly, we could explore the open paths that exist between the exposure (MPA) and the outcome (Fish biomass).

ggdag_paths(dag, from = "mpa", to = "biomass", text_col = "black") +
    coord_cartesian(expand = FALSE, xlim=c(0.5, 3.5), ylim=c(0.5,2.5)) +
    theme_dag_blank(panel.border = element_rect(fill = NA)) 

It appears that there are three open paths between MPA and fish biomass:

  • path 1 (mpa -> fishing -> biomass) is a pipe, and thus the backdoor criterion would suggest we do not condition on fishing pressure if we wish to avoid overcontrol bias and estimate the total effect of MPA on fish biomass
  • path 2 (mpa <- complexity -> biomass) is a fork, and thus the backdoor criterion would suggest we do condition on structural complexity if we wish to avoid confounding bias and estimate the total effect of MPA on fish biomass
  • path 3 (mpa <- depth -> fishing -> biomass) is a backdoor path comprising a fork feeding into a pipe. The backdoor criterion would suggest we condition on depth to block it; we cannot block it at fishing pressure, since conditioning on fishing would also block the causal path 1.

Even more help comes in the form of the adjustmentSets() function, which identifies these adjustments for us so that we do not need to concern ourselves with pipes, forks and colliders.

Now we will inquire about the adjustment sets when seeking to explore the total effect of MPA on fish biomass.

dagitty::adjustmentSets(dag,
                        exposure = "mpa",
                        outcome = "biomass",
                        effect = "total"
                        )
{ complexity, depth }
set.seed(123)
ggdag_adjustment_set(dag,
                     exposure = "mpa",
                     outcome = "biomass",
                     effect = "total",
                     shadow = TRUE, text_col = "black"
                     ) +
  theme_dag_blank()

This confirms that in addition to MPA, a statistical model would need to include both structural complexity and depth, yet should not include fishing pressure, human gravity or coral cover.

So what if we also wanted to estimate the effects of structural complexity on fish biomass? We can follow the same steps.

ggdag_paths(dag, from = "complexity", to = "biomass", text_col = "black") +
    coord_cartesian(expand = FALSE, xlim=c(0.5, 3.5), ylim=c(0.5,2.5)) +
    theme_dag_blank(panel.border = element_rect(fill = NA)) 

There are two paths, the first of which is a direct path and the second is a pipe flowing through two mediators (MPA and fishing). In both cases, the backdoor criterion would suggest that we do not add any other covariates to a model containing structural complexity.

dagitty::adjustmentSets(dag,
                        exposure = "complexity",
                        outcome = "biomass",
                        effect = "total"
                        )
 {}
set.seed(123)
ggdag_adjustment_set(dag,
                     exposure = "complexity",
                     outcome = "biomass",
                     effect = "total",
                     shadow = TRUE, text_col = "black"
                     ) +
  theme_dag_blank()

For this last exploration (effect of structural complexity on fish biomass), we noted that there were two paths. When we inquired about the adjustment sets required to estimate the total effects, the backdoor criterion suggested the model should not include any other covariates. But can we distinguish between the direct effects of structural complexity (pathway complexity -> biomass) and the indirect effects (pathway complexity -> mpa -> fishing -> biomass)?

dagitty::adjustmentSets(dag,
                        exposure = "complexity",
                        outcome = "biomass",
                        effect = "direct"
                        )
{ fishing, human }
{ depth, mpa }
set.seed(123)
ggdag_adjustment_set(dag,
                     exposure = "complexity",
                     outcome = "biomass",
                     effect = "direct",
                     shadow = TRUE, text_col = "black"
                     ) +
  theme_dag_blank()

In order to estimate the direct causal effects of structural complexity on fish biomass, we need to condition on either:

  • fishing pressure and human gravity or
  • depth and MPA

3.5 Step 5 - test for causality

To get more of a feel for these data, let's plot some of the relationships.

g1 <-
  dat |>
  ggplot(aes(y = biomass, x = factor(mpa))) +
  geom_boxplot()
g2 <-
  dat |>
  ggplot(aes(y = biomass, x = fishing)) +
  geom_point()
g3 <-
  dat |>
  ggplot(aes(y = fishing, x = factor(mpa))) +
  geom_boxplot()

g1 + g2 + g3 & theme_bw()

Before we attempt to estimate the total causal effects of MPA on fish biomass, let's calculate what it should be based on the parameters used to simulate the data. We can do this by decomposing the simulation formulas into their parts. To do so, we just need to focus on the mpa -> fishing -> biomass pathway.

  • biomass was drawn from a normal distribution with a mean of \(-1.1\times fishing + -0.4\times human + 1.65\times complexity\) of which only the fishing component (-1.1) is relevant
  • fishing was drawn from a normal distribution with a mean of \(-0.99\times mpa + -0.2\times depth + 0.3\times human\) of which only the MPA component (-0.99) is relevant.

Hence the effect of MPA on fish biomass should be approximately \(-1.1\times -0.99 = 1.089\).
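As a quick sanity check in R:

## expected total effect of MPA on biomass via mpa -> fishing -> biomass
-1.1 * -0.99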

Step 4 above indicated that to yield unbiased causal effects of MPA on fish biomass, the model would need to condition on both depth and structural complexity.

form <- bf(biomass ~ factor(mpa) + depth + complexity)
## summarise data to help inform priors
dat |>
    group_by(mpa) |>
    summarise(median(biomass), sd(biomass),
        depth_sd = sd(biomass) / sd(depth),
        complexity_sd = sd(biomass) / sd(complexity)
    )
## define priors
priors <- prior(normal(-1, 4), class = "Intercept") +
  prior(normal(0, 5), class = "b") +
  prior(student_t(3, 0, 4), class = "sigma")
## fit the model
mod_brms1a <- brm(
  form,
  prior = priors,
  data = dat,
  refresh = 0,
  ## backend = "cmdstanr"
  backend = "rstan"
)
summary(mod_brms1a)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: biomass ~ factor(mpa) + depth + complexity 
   Data: dat (Number of observations: 10000) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept      0.02      0.03    -0.04     0.07 1.00     3620     3199
factormpa1     1.11      0.05     1.02     1.20 1.00     3331     3053
depth          0.21      0.02     0.18     0.24 1.00     4348     2398
complexity     1.64      0.02     1.60     1.69 1.00     3235     2958

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     1.66      0.01     1.63     1.68 1.00     6002     3105

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

The estimate for the effect of MPA on fish biomass is 1.108 (1.018 - 1.2), which is consistent with the true value of \(1.089\).
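As an aside, the quoted point estimate and credible interval can be extracted directly from the fitted object (a sketch; fixef() is the brms accessor for population-level effects):

## posterior mean, error and 95% credible interval for the MPA effect
brms::fixef(mod_brms1a)["factormpa1", ]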

If we fail to condition on structural complexity and depth, the fork structures (backdoor paths) remain open, resulting in confounding bias.

form <- bf(biomass ~ factor(mpa))
## summarise data to help inform priors
dat |>
  group_by(mpa) |>
  summarise(median(biomass), sd(biomass))
## define priors
priors <- prior(normal(-1, 4), class = "Intercept") +
  prior(normal(0, 5), class = "b") +
  prior(student_t(3, 0, 4), class = "sigma")
## fit the model
mod_brms1b <- brm(
  form,
  prior = priors,
  data = dat,
  refresh = 0,
  ## backend = "cmdstanr"
  backend = "rstan"
)
summary(mod_brms1b) 
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: biomass ~ factor(mpa) 
   Data: dat (Number of observations: 10000) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     -1.12      0.03    -1.18    -1.07 1.00     3905     2963
factormpa1     3.38      0.04     3.29     3.45 1.00     3536     2785

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     2.05      0.01     2.03     2.08 1.00     4932     3381

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

The estimate for the effect of MPA on fish biomass is 3.375 (3.294 - 3.453), which is not consistent with the true value of \(1.089\).

If we correctly recognised that we need to condition on depth and structural complexity (leaving just a pipe structure), yet we erroneously also conditioned on fishing pressure, we would create an overcontrol bias.

form <- bf(biomass ~ factor(mpa) + depth + complexity + fishing)
## summarise data to help inform priors
dat |>
    group_by(mpa) |>
    summarise(median(biomass), sd(biomass),
        depth_sd = sd(biomass) / sd(depth),
        complexity_sd = sd(biomass) / sd(complexity),
        fishing_sd = sd(biomass) / sd(fishing)
    )
## define priors
priors <- prior(normal(-1, 4), class = "Intercept") +
  prior(normal(0, 5), class = "b") +
  prior(student_t(3, 0, 4), class = "sigma")
## fit the model
mod_brms1c <- brm(
  form,
  prior = priors,
  data = dat,
  refresh = 0,
  ## backend = "cmdstanr"
  backend = "rstan"
)
summary(mod_brms1c) 
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: biomass ~ factor(mpa) + depth + complexity + fishing 
   Data: dat (Number of observations: 10000) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept      0.01      0.02    -0.02     0.05 1.00     4234     3413
factormpa1    -0.09      0.03    -0.15    -0.03 1.00     3169     2854
depth         -0.04      0.01    -0.06    -0.02 1.00     5470     3348
complexity     1.64      0.01     1.61     1.67 1.00     3155     3077
fishing       -1.20      0.01    -1.22    -1.18 1.00     4498     3086

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     1.08      0.01     1.07     1.10 1.00     6757     2510

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

The estimate for the effect of MPA on fish biomass is -0.092 (-0.154 - -0.032), which is not consistent with the true value of \(1.089\).

If we correctly recognised that we need to condition on depth and structural complexity (leaving just a pipe structure), yet we erroneously also conditioned on coral cover, we would create a collider bias.

form <- bf(biomass ~ factor(mpa) + depth + complexity + coral)
## summarise data to help inform priors
dat |>
    group_by(mpa) |>
    summarise(median(biomass), sd(biomass),
        depth_sd = sd(biomass) / sd(depth),
        complexity_sd = sd(biomass) / sd(complexity),
        coral_sd = sd(biomass) / sd(coral)
    )
## define priors
priors <- prior(normal(-1, 4), class = "Intercept") +
  prior(normal(0, 5), class = "b") +
  prior(student_t(3, 0, 4), class = "sigma")
## fit the model
mod_brms1d <- brm(
  form,
  prior = priors,
  data = dat,
  refresh = 0,
  ## backend = "cmdstanr"
  backend = "rstan"
)
summary(mod_brms1d) 
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: biomass ~ factor(mpa) + depth + complexity + coral 
   Data: dat (Number of observations: 10000) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     -0.01      0.01    -0.02     0.01 1.00     2789     2884
factormpa1    -0.11      0.01    -0.13    -0.09 1.00     2542     2671
depth          0.02      0.00     0.01     0.03 1.00     4086     2820
complexity     0.09      0.01     0.08     0.11 1.00     2568     2519
coral          0.38      0.00     0.37     0.38 1.00     4638     3913

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.39      0.00     0.39     0.40 1.00     3061     2412

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

The estimate for the effect of MPA on fish biomass is -0.113 (-0.134 - -0.091), which is not consistent with the true value of \(1.089\).

To get more of a feel for the relationships involving structural complexity, let's plot some of them.

g1 <-
  dat |>
  ggplot(aes(y = biomass, x = complexity)) +
  geom_point()
g2 <-
  dat |>
  ggplot(aes(y = biomass, x = fishing)) +
  geom_point()
g3 <-
  dat |>
  ggplot(aes(y = fishing, x = factor(mpa))) +
  geom_boxplot()
g4 <-
  dat |>
  ggplot(aes(y = mpa, x = complexity)) +
  geom_point()

g1 + g2 + g3 + g4 & theme_bw()

Recall that there are two pathways between structural complexity and biomass and one of these pathways is a direct pathway (complexity -> biomass). Hence, we can calculate the direct effects, indirect effects and total effects of structural complexity on fish biomass.

Before we attempt to estimate each of these effects, let's calculate what they should be based on the parameters used to simulate the data. We can do this by decomposing the simulation formulas into their parts.

  • biomass was drawn from a normal distribution with a mean of \(-1.1\times fishing + -0.4\times human + 1.65\times complexity\) of which only the complexity component (1.65) is relevant

According to the adjustment sets we explored earlier, the direct effect of structural complexity on fish biomass can be estimated by conditioning on either MPA and depth (which we did when modelling MPA -> biomass) or fishing pressure and human gravity.

form <- bf(biomass ~ complexity + fishing + human)
## summarise data to help inform priors
dat |>
    summarise(median(biomass), sd(biomass),
        complexity_sd = sd(biomass) / sd(complexity),
        fishing_sd = sd(biomass) / sd(fishing),
        human_sd = sd(biomass) / sd(human)
    )
## define priors
priors <- prior(normal(-1, 4), class = "Intercept") +
  prior(normal(0, 5), class = "b") +
  prior(student_t(3, 0, 4), class = "sigma")
## fit the model
mod_brms2a <- brm(
  form,
  prior = priors,
  data = dat,
  refresh = 0,
  ## backend = "cmdstanr"
  backend = "rstan"
)
summary(mod_brms2a)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: biomass ~ complexity + fishing + human 
   Data: dat (Number of observations: 10000) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept      0.02      0.01    -0.00     0.04 1.00     4938     3194
complexity     1.65      0.01     1.63     1.67 1.00     4299     3238
fishing       -1.09      0.01    -1.11    -1.08 1.00     5000     3582
human         -0.41      0.01    -0.43    -0.39 1.00     4384     3082

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     1.01      0.01     0.99     1.02 1.00     4714     2926

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

The estimate for the direct effect of structural complexity on fish biomass is 1.647 (1.626 - 1.668), which is consistent with the true value of \(1.65\).

Note that the alternative adjustment set (MPA and depth) was already satisfied by the model fitted earlier for the total effect of MPA - its complexity coefficient likewise estimates the direct effect:

summary(mod_brms1a)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: biomass ~ factor(mpa) + depth + complexity 
   Data: dat (Number of observations: 10000) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept      0.02      0.03    -0.04     0.07 1.00     3620     3199
factormpa1     1.11      0.05     1.02     1.20 1.00     3331     3053
depth          0.21      0.02     0.18     0.24 1.00     4348     2398
complexity     1.64      0.02     1.60     1.69 1.00     3235     2958

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     1.66      0.01     1.63     1.68 1.00     6002     3105

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
In contrast, the adjustment set for the total effect of structural complexity was empty, so the model should contain complexity alone.

form <- bf(biomass ~ complexity)
## summarise data to help inform priors
dat |>
    summarise(median(biomass), sd(biomass),
        complexity_sd = sd(biomass) / sd(complexity)
    )
## define priors
priors <- prior(normal(-1, 4), class = "Intercept") +
  prior(normal(0, 5), class = "b") +
  prior(student_t(3, 0, 4), class = "sigma")
## fit the model
mod_brms2b <- brm(
  form,
  prior = priors,
  data = dat,
  refresh = 0,
  ## backend = "cmdstanr"
  backend = "rstan"
)
summary(mod_brms2b)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: biomass ~ complexity 
   Data: dat (Number of observations: 10000) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept      0.57      0.02     0.54     0.61 1.00     3877     3085
complexity     2.03      0.02     1.99     2.06 1.00     4001     3037

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     1.72      0.01     1.70     1.75 1.00     4433     2907

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
The estimated total effect of structural complexity on fish biomass is 2.03 (1.99 - 2.06). To see what this should be, we again decompose the simulation formulas, this time following the indirect pathway (complexity -> mpa -> fishing -> biomass):

  • biomass was drawn from a normal distribution with a mean of \(-1.1\times fishing + -0.4\times human + 1.65\times complexity\) of which only the fishing component (-1.1) is relevant

  • fishing was drawn from a normal distribution with a mean of \(-0.99\times mpa + -0.2\times depth + 0.3\times human\) of which only the MPA component (-0.99) is relevant.

  • MPA was drawn from a binomial distribution with a probability of \(0.2\times depth + 2.8\times complexity\) (on the logit scale) of which only the complexity component (2.8) is relevant. Since 2.8 is on the logit scale, it does not represent a constant slope (unlike the other Gaussian parameters). We can approximate the linear slope by calculating the average slope.

    \[ \frac{dp}{dx} = p(1-p) \cdot \beta \] where \(p = \frac{1}{1 + e^{-\beta\times X}}\) and \(X\) is a vector of the predictor values.

p <- plogis(2.8 * dat$complexity)
dp_dx <- 2.8 * p * (1 - p)
mean(dp_dx)
[1] 0.3349129

Hence the indirect effect of complexity on fish biomass should be approximately \((-1.1\times -0.99)\times 0.335 = 0.365\).
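Putting the pieces together (a sketch reusing dp_dx from above):

## indirect effect: complexity -> mpa -> fishing -> biomass
indirect <- (-1.1 * -0.99) * mean(dp_dx)
indirect
## expected total effect: direct (1.65) plus indirect
1.65 + indirect

The expected total effect of approximately 2.01 is consistent with the estimate of 2.03 (1.99 - 2.06) obtained above from the model of biomass against complexity alone.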

References

Arif, Suchinta, and M. Aaron MacNeil. 2022. “Utilizing Causal Diagrams Across Quasi‐experimental Approaches.” Ecosphere 13 (4). https://doi.org/10.1002/ecs2.4009.