# Introduction to mediation analysis with structural equation modeling

## 1. What is mediation analysis?

In mediation, we consider an intermediate variable, called the *mediator*, that helps explain how or why an independent variable influences an outcome. In the context of a treatment study, it is often of great interest to identify and study the mechanisms by which an intervention achieves its effect. By investigating mediational processes that clarify how the treatment achieves the study outcome, not only can we further our understanding of the pathology of the disease and the mechanisms of treatment, but we may also be able to identify alternative, more efficient, intervention strategies. For example, a tobacco prevention program may teach participants how to stop taking smoking breaks at work (the intervention) which changes their social norms about tobacco use (the intermediate *mediator*) and subsequently leads to a reduction in smoking behavior (study outcome).[1]

With mediation analysis, we gain insight and acquire deep understanding about the mechanism of action of pharmacological and psychotherapeutic treatments. Such information provides an added dimension to understand the etiology of disease and the pathways of therapeutic effects, which can stimulate the identification of more efficacious and cost-efficient alternative therapies.

## 2. What is structural equation modeling?

Structural equation modeling (SEM) is a very general, very powerful multivariate technique. It uses a conceptual model, path diagram and system of linked regression-style equations to capture complex and dynamic relationships within a web of observed and unobserved variables. Although similar in appearance, SEM is fundamentally different from regression. In a regression model, there exists a clear distinction between dependent and independent variables. In SEM, however, such concepts only apply in relative terms since a dependent variable in one model equation can become an independent variable in other components of the SEM system.[2],[3] It is precisely this type of reciprocal role a variable plays that enables SEM to infer causal relationships.

SEM models include both *endogenous* and *exogenous* variables. Endogenous variables act as a dependent variable in at least one of the SEM equations; they are called endogenous variables rather than response variables because they may become independent variables in other equations of the system. Exogenous variables are always independent variables in the SEM equations. SEM equations model both the causal relationships between endogenous and exogenous variables and the causal relationships among endogenous variables.
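The distinction can be made concrete with a small sketch. It assumes a model is described as a list of (cause, effect) pairs, one per single-headed arrow; the helper `classify_variables` and the variable names are hypothetical and use the tobacco prevention example from Section 1.

```python
def classify_variables(paths):
    """Split variables into exogenous and endogenous sets.

    `paths` is a list of (cause, effect) pairs, one per single-headed arrow.
    A variable is endogenous if it is the effect in at least one equation;
    it is exogenous if it only ever appears as a cause.
    """
    causes = {cause for cause, _ in paths}
    effects = {effect for _, effect in paths}
    endogenous = effects
    exogenous = causes - effects
    return exogenous, endogenous


# Hypothetical encoding of the tobacco prevention example.
paths = [
    ("program", "social_norms"),
    ("social_norms", "smoking"),
    ("program", "smoking"),
]
exogenous, endogenous = classify_variables(paths)
print("exogenous:", sorted(exogenous))
print("endogenous:", sorted(endogenous))
```

Note that `social_norms` appears both as an effect (of the program) and as a cause (of smoking), which is exactly the dual role that makes it endogenous rather than a plain response variable.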

SEM models are best represented by path diagrams. A path diagram consists of nodes representing the variables and arrows showing relations among these variables. By convention, in a path diagram latent variables (e.g., depression) are represented by a circle or ellipse and observed variables (e.g., a score on a rating scale) are represented by a rectangle or square. Arrows are generally used to represent relationships among the variables. A single straight arrow indicates a causal relation from the base of the arrow to the head of the arrow. Two straight single-headed arrows in opposing directions connecting two variables indicate a reciprocal causal relationship. A curved two-headed arrow indicates there may be some association between the two variables. Error terms for a variable are inserted into the path diagram by drawing an arrow from the value of the error term to the variable with which the term is associated.

For example, in most path diagrams for cross-sectional data, error terms are not connected, indicating stochastic independence across the error terms. But if we suspect association between error terms – which is likely to occur in most longitudinal studies – the error terms should be connected by curved two-headed arrows. See Bollen[2] and Kowalski and Tu[3] for more details about modeling complex relationships involving latent constructs using SEM.

## 3. Advantages of using structural equation modeling instead of standard regression methods for mediation analysis

Baron and Kenny,[4] in the first paper addressing mediation analysis, tested the mediation process using a series of regression equations. However, mediation assumes both causality and a temporal ordering among the three variables under study (i.e. intervention, mediator and response). Since variables in a causal relationship can be both causes and effects, the standard regression paradigm is ill-suited for modeling such a relationship because of its *a priori* assignment of each variable as either a cause or an effect.[1],[5],[6] Structural equation modeling (SEM) provides a more appropriate inference framework for mediation analyses and for other types of causal analyses.

There are many advantages to using the SEM framework in the context of mediation analysis. When a model contains latent variables such as happiness, quality of life and stress, SEM allows for ease of interpretation and estimation. SEM simplifies testing of mediation hypotheses because it is designed, in part, to test these more complicated mediation models in a single analysis.[7] SEM can be used when extending a mediation process to multiple independent variables, mediators or outcomes. This contrasts with standard regression, in which ad hoc methods must be used for inference about indirect and total effects.[4],[8],[9] These ad hoc methods rely on combining the results of two or more equations to derive the asymptotic variance. This is especially problematic when there are different numbers of observations missing in the different regression equations representing a mediation process. Also, in standard regression, we handle missing data via listwise deletion since there is no built-in missing data mechanism when using ordinary least squares (OLS).

Another important advantage of SEM over standard regression methods is that the SEM analysis approach provides model fit information about the consistency of the hypothesized mediational model to the data and evidence for the plausibility of the causality assumptions[10],[11] made when constructing the mediation model. The standard regression procedure initially recommended by Baron and Kenny[4] has also been shown to be low powered.[7] Moreover, unlike standard regression approaches, SEM allows for ease of extension to longitudinal data within a single framework, corresponding with a study’s conceptual framework for clear hypothesis articulation.[12] Finally, Bollen and Pearl[10] note that even when the same equation is used in SEM and in regression analysis, the results will be different because they are based on completely different assumptions. Standard regression analysis implies a statistical relationship based on a conditional expected value, while SEM implies a functional relationship expressed via a conceptual model, path diagram, and mathematical equations. Thus, the causal relationships in a hypothesized mediation process, the simultaneous nature of the indirect and direct effects, and the dual role the mediator plays as both a cause for the outcome and an effect of the intervention are more appropriately expressed using structural equations than using regression analysis.

## 4. Use of SEM for mediation analysis

Figure 1 shows a path diagram for the causal relationships between the three variables in the smoking prevention example discussed earlier: prevention program ($x_i$), social norm ($z_i$), and amount of smoking ($y_i$). In this example, all variables that are affected by other variables – social norms and amount of smoking – are endogenous variables, while variables that only impart an effect on other variables without being affected by other variables – the prevention program – are exogenous variables. All three variables in this smoking prevention example are assumed to be observed, so rectangles (not circles) are used to represent the variables.

**Figure 1. Pathway of a mediation process for a tobacco prevention program**

The SEM for this mediation model for the *i*th subject (1 ≤ *i* ≤ *n*) is given by:
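Written out with the coefficient names used in the rest of this section, the two structural equations would take roughly the following form. The intercept symbols $\beta_0$ and $\gamma_0$ are assumed notation and are not taken from the text.

$$
\begin{aligned}
z_i &= \beta_0 + \beta_{xz}\, x_i + \epsilon_{zi}, \\
y_i &= \gamma_0 + \gamma_{zy}\, z_i + \gamma_{xy}\, x_i + \epsilon_{yi}.
\end{aligned}
$$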

We assume the error terms $(\epsilon_{zi}, \epsilon_{yi})$ are uncorrelated, an important assumption for causal inference in performing mediation analysis.[10],[11] We also assume multivariate normality for the error terms; this is a necessary underlying condition of the definition of direct, indirect and total effects. Note that the two structural equations are linked together and inference about them is simultaneous, unlike two independent standard regression equations.

The *direct effect* is the pathway from the exogenous variable to the outcome while controlling for the mediator. Therefore, in our path diagram $\gamma_{xy}$ is the direct effect. The *indirect effect* describes the pathway from the exogenous variable to the outcome through the mediator. This path is represented through the product of $\beta_{xz}$ and $\gamma_{zy}$. Finally, the *total effect* is the sum of the direct and indirect effects of the exogenous variable on the outcome, $\gamma_{xy} + \beta_{xz}\gamma_{zy}$.

The primary hypothesis of interest in a mediation analysis is whether the effect of the independent variable (intervention) on the outcome can be mediated by a change in the mediating variable. In a full mediation process, the effect is 100% mediated by the mediator, that is, in the presence of the mediator, the pathway connecting the intervention to the outcome is completely broken so that the intervention has no direct effect on the outcome. In most applications, however, partial mediation is more common, in which case the mediator only mediates part of the effect of the intervention on the outcome, that is, the intervention has some residual direct effect even after the mediator is introduced into the model.
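The direct, indirect and total effects can be illustrated numerically. The sketch below simulates data from a three-variable mediation model with made-up coefficients and recovers the effects with two least-squares fits. Equation-by-equation least squares is used here only as a simple stand-in for a full SEM fit, and the simulated values and the helper `ols` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical true coefficients, chosen only for illustration.
beta_xz, gamma_zy, gamma_xy = 0.5, 0.4, 0.2

x = rng.binomial(1, 0.5, size=n).astype(float)               # intervention
z = 1.0 + beta_xz * x + rng.normal(size=n)                   # mediator
y = 2.0 + gamma_zy * z + gamma_xy * x + rng.normal(size=n)   # outcome


def ols(response, *predictors):
    """Least-squares coefficients: intercept first, then one per predictor."""
    design = np.column_stack([np.ones(len(response)), *predictors])
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    return coef


_, b_xz = ols(z, x)            # path from intervention to mediator
_, g_zy, g_xy = ols(y, z, x)   # mediator and intervention paths to outcome

direct = g_xy
indirect = b_xz * g_zy
total = direct + indirect

print(f"direct={direct:.3f} indirect={indirect:.3f} total={total:.3f}")
```

With these made-up coefficients the true direct effect is 0.2 and the true indirect effect is 0.5 × 0.4 = 0.2, so this would be a case of partial mediation.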

In terms of testing the primary hypothesis of interest, we start by examining a reduced regression equation without the mediator:
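A plausible form of this reduced equation, reusing the symbol $\gamma_{xy}$ as the hypothesis test in the text does, is shown below. The intercept $\gamma_0$ and the reuse of $\epsilon_{yi}$ are assumed notation; in this reduced equation the coefficient on $x_i$ captures the total effect of the intervention rather than the direct effect.

$$
y_i = \gamma_0 + \gamma_{xy}\, x_i + \epsilon_{yi}.
$$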

If we fail to reject the null hypothesis ($H_0$: $\gamma_{xy} = 0$) for this reduced regression equation, then $x$ and $y$ (i.e., the intervention and the outcome) are not shown to be related and we should not consider potential mediators. If we reject the null hypothesis for this reduced regression equation, we proceed to evaluate the SEM for the mediation model. Full mediation (i.e., the intervention has no direct effect on the outcome) corresponds to the null hypothesis $H_0$: $\gamma_{xy} = 0$ in the mediation model. If this null is rejected, it becomes of interest to assess partial mediation via the direct, indirect and total effects. Inference (standard errors and p-values) about such effects is easily performed using the Delta or Bootstrap methods.[8],[9],[13]
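Both approaches to inference about the indirect effect can be sketched in a few lines. The example below uses simulated data with made-up coefficients, a first-order delta-method standard error that treats the two path estimates as uncorrelated, and a percentile bootstrap. The helpers `ols_with_se`, `indirect_effect` and `bootstrap_ci` are hypothetical and do not reproduce the procedure of any particular SEM package.

```python
import numpy as np


def ols_with_se(response, *predictors):
    """Least-squares coefficients and their standard errors."""
    X = np.column_stack([np.ones(len(response)), *predictors])
    coef, *_ = np.linalg.lstsq(X, response, rcond=None)
    resid = response - X @ coef
    sigma2 = resid @ resid / (len(response) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return coef, se


def indirect_effect(x, z, y):
    """Product-of-paths estimate and its first-order delta-method SE."""
    (_, a), (_, se_a) = ols_with_se(z, x)          # x -> z
    (_, b, _), (_, se_b, _) = ols_with_se(y, z, x)  # z -> y, controlling for x
    delta_se = np.sqrt(b**2 * se_a**2 + a**2 * se_b**2)
    return a * b, delta_se


def bootstrap_ci(x, z, y, n_boot=1000, seed=0):
    """95% percentile bootstrap interval, resampling subjects with replacement."""
    rng = np.random.default_rng(seed)
    n = len(x)
    estimates = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)
        estimates[i], _ = indirect_effect(x[idx], z[idx], y[idx])
    return np.percentile(estimates, [2.5, 97.5])


# Simulated data with hypothetical coefficients (illustration only).
rng = np.random.default_rng(42)
n = 500
x = rng.binomial(1, 0.5, size=n).astype(float)
z = 0.5 * x + rng.normal(size=n)
y = 0.4 * z + 0.2 * x + rng.normal(size=n)

estimate, delta_se = indirect_effect(x, z, y)
ci_low, ci_high = bootstrap_ci(x, z, y)
print(f"indirect={estimate:.3f}, delta SE={delta_se:.3f}, "
      f"bootstrap 95% CI=({ci_low:.3f}, {ci_high:.3f})")
```

The bootstrap interval does not rely on the product of two coefficients being normally distributed, which is one reason it is often preferred in smaller samples.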

Significant advances have been made over the past few decades in the theory, applications and associated software development for fitting SEM models that can be used in the context of mediation analysis. For example, in addition to specialized packages such as LISREL,[14] MPlus,[15] EQS,[16] and Amos,[17] procedures for fitting SEM are also available from general-purpose statistical packages such as R, SAS, STATA and Statistica. These packages provide inference based on maximum likelihood, generalized least squares, and weighted least squares.

## 5. An example of mediation analysis using SEM to model the relationship of drinking to suicidal risk

Project MATCH[18] is a multisite treatment trial for alcohol use disorders that enrolled 1,726 participants (including 24% women) with a mean (sd) age of 40.2 (11.0) years. Previously, studies of alcohol dependent individuals established that drinking promotes depressive symptoms and depressive disorders and that depression is an important risk factor for suicidal thoughts and behavior.[19] Therefore, considering the context of the study and prior theory, mediation analysis was used to evaluate the hypothesis that greater drinking intensity leads to higher levels of depression which, in turn, leads to suicidal ideation.[19] In the model, drinking intensity was a latent construct based on three months of data about drinking behavior, while depression and suicidal ideation were measured using the Beck Depression Inventory.[20]

Mediation analysis with SEM was performed using MPlus software. Age, gender, race, treatment assignment, study arm, and baseline percent days abstinent were controlled for in the structural equations for each endogenous variable in the structural model. The outcome – the presence or absence of suicidal ideation – was analyzed via the probit link (which is used to transform outcome probabilities to the standard normal variable), which made it possible to interpret the indirect, direct and total effects on an interval scale. Subjects were assessed at baseline and at 3-, 9-, and 15-month follow-up, but in order to derive a single direct, indirect and total effect in the model (as in models of cross-sectional data) we constrained all model parameters at the three follow-up times to be equal and controlled for the baseline value of the outcome measure. Standardized estimates (between -1 and 1) were reported rather than raw estimates, so that estimates from different structural equations are on the same scale, simplifying interpretation.
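The probit link mentioned above can be illustrated directly: it maps an outcome probability to the corresponding quantile of the standard normal distribution, and the inverse maps a value on that interval scale back to a probability. This is a minimal sketch using Python's standard library; the function names `probit` and `inverse_probit` are chosen here for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def probit(p):
    """Map a probability in (0, 1) to the standard normal scale."""
    return std_normal.inv_cdf(p)


def inverse_probit(eta):
    """Map a value on the standard normal scale back to a probability."""
    return std_normal.cdf(eta)


for p in (0.05, 0.20, 0.50):
    print(f"p={p:.2f} -> probit={probit(p):.3f}")
```

A probability of 0.5 maps to 0 on the probit scale, and effects estimated on this scale are additive in a way that effects on the probability scale are not.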

In the regression equation without the mediator, the estimate of the causal path from drinking intensity to suicidality was significant.

The path diagram of Figure 2 of the mediation model includes the standardized estimates for the causal paths for the indirect and direct effects. Both estimated paths for the indirect effect were statistically significant, while the estimate of the direct effect from drinking intensity to suicidal ideation was close to zero and not significant. Therefore, potentially, depression fully mediates the path between drinking intensity and suicidal ideation. The model showed reasonably good model fit according to multiple SEM fit statistics and indices: $\chi^2(df = 59) = 218.29$, $p \le 0.001$; Root Mean Square Error of Approximation (RMSEA)=0.042; Comparative fit index (CFI)=0.947; Tucker-Lewis index (TLI)=0.933. Rule of thumb guidelines are that CFI ≥0.95, TLI ≥0.95 and RMSEA ≤0.05 represent a good fitting model.
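The reported indices can be compared against the rule-of-thumb cutoffs with a short check. The values below are the ones reported in the text; the dictionary layout and the helper `meets_cutoff` are assumptions made for this illustration.

```python
# Rule-of-thumb cutoffs and reported values, both as stated in the text.
CUTOFFS = {"CFI": (">=", 0.95), "TLI": (">=", 0.95), "RMSEA": ("<=", 0.05)}
reported = {"CFI": 0.947, "TLI": 0.933, "RMSEA": 0.042}


def meets_cutoff(name, value):
    """Return True when the index satisfies its rule-of-thumb cutoff."""
    direction, cutoff = CUTOFFS[name]
    return value >= cutoff if direction == ">=" else value <= cutoff


results = {name: meets_cutoff(name, value) for name, value in reported.items()}
for name, ok in results.items():
    print(f"{name}={reported[name]}: {'meets' if ok else 'below'} guideline")
```

RMSEA satisfies its guideline while CFI and TLI fall slightly short of 0.95, which is consistent with describing the fit as reasonably good rather than strictly good.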

**Figure 2. Pathway of a mediation process for a clinical model of drinking and suicidal risk (*p<0.05)**

## 6. Other issues to consider when performing mediation analysis

Baron and Kenny[4] distinguished mediation from moderation, in which a third variable affects the strength or direction of the relationship between an independent variable and an outcome. In multi-group analyses a moderator is typically either part of an interaction term or a grouping variable. For example, if males are known to react differently than females to a particular intervention for lowering cholesterol, in a gender by treatment interaction effect, gender is a moderator. In mediated-moderation, such an interaction is used as an independent (i.e., exogenous) variable in the SEM path diagram.
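The gender-by-treatment example can be sketched as a regression with an interaction term, where the interaction coefficient measures how much the treatment effect differs between the two groups. The data are simulated and every number below is a made-up assumption, not a result from any study.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4000

treated = rng.integers(0, 2, size=n).astype(float)  # 1 = received intervention
male = rng.integers(0, 2, size=n).astype(float)     # 1 = male (the moderator)

# Hypothetical effects: treatment lowers cholesterol by 10 units in females
# but only by 4 units in males, so the interaction is +6.
cholesterol = (
    200.0
    - 10.0 * treated
    + 5.0 * male
    + 6.0 * treated * male
    + rng.normal(scale=5.0, size=n)
)

# Design matrix: intercept, treatment, gender, gender-by-treatment interaction.
design = np.column_stack([np.ones(n), treated, male, treated * male])
coef, *_ = np.linalg.lstsq(design, cholesterol, rcond=None)
intercept, b_treat, b_male, b_interaction = coef

effect_in_females = b_treat
effect_in_males = b_treat + b_interaction
print(f"treatment effect: females={effect_in_females:.2f}, "
      f"males={effect_in_males:.2f}, interaction={b_interaction:.2f}")
```

A nonzero interaction coefficient indicates that gender changes the strength of the treatment effect, which is moderation; it says nothing about the mechanism through which the treatment works, which is the question mediation addresses.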

Longitudinal data help capture both within-individual dynamics and between-individual differences over time. Also, longitudinal data allow for the examination of whether changes in the mediator are more likely to precede changes in the outcome, presenting more accurate representations of the temporal order of change over time that lead to more accurate conclusions about mediation.[7] Latent growth modeling is an SEM extension for longitudinal data that can flexibly evaluate mediating relationships between multiple time-varying measures.[12] Autoregressive and multilevel models have also been used for longitudinal mediation analyses with SEM.

Causal inference methods, which use the language of counterfactuals and potential outcomes, have been used in mediation analysis.[21] These approaches address the issues of potential confounders of the mediator-outcome relationship and of potential interactions between the mediator and treatment. They also provide definitions for deriving effects for analyses involving mediators and outcomes that are not on an interval scale (e.g., count data, categorical data). These causal inference methods can be applied in the SEM framework.[22],[23] Imai and colleagues[11] proposed approaches to extend SEM by using causal inference methods to generate a more general definition, identification, estimation, and sensitivity analysis of causal mediation effects that are not based on any specific statistical model; they also introduced an R package for performing causal mediation analysis using their approaches.[11]

## 7. Conclusion

Mediation helps explain the mechanism through which an intervention influences an outcome and assumes both causal and temporal relations. When performed using strong prior theory and with appropriate context, mediation analysis helps provide a focus for future intervention research so more efficacious and cost-efficient alternative therapies may be developed. Structural equation modeling provides a very general, flexible framework for performing mediation analysis.

## Biography

Dr. Douglas Gunzler is a Senior Instructor of Medicine at the Center for Health Care Research and Policy, Case Western Reserve University. His research has focused on structural equation modeling and longitudinal analysis, emphasizing mediation analysis, missing data, multi-level modeling and distribution-free models, with applications in mental health and neurology. Dr. Gunzler received his PhD in Statistics from the Department of Biostatistics and Computational Biology at the University of Rochester in 2011.

## Footnotes

**Conflict of Interest:** The authors report no conflict of interest related to this manuscript.

**Funding:** Financial support for this study was provided by a grant from NIH/NCRR CTSA KL2TR000440. The funding agreement ensured the authors’ independence in designing the study, interpreting the data, writing, and publishing the report.