A Brief Introduction to Mixed-Effects Models
In mixed-effects models, additional variance components are introduced into the fixed-effects (traditional regression) structure. Broadly speaking, these can be thought of as associating different types of variance with different groupings – for example, by items or by subjects, with the remaining unexplained variance included as the error or “residual” term. These variance groupings can be applied to any term in the model: to the intercept (e.g. allowing the intercept to vary between test subjects, so that each subject has a different baseline response) or to the various slopes (e.g. the differential response to experimental manipulations, thus accommodating different responses between subjects).
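As a minimal sketch of this in lme4 model syntax, assume a hypothetical data frame dat with a dependent measure amplitude, an experimental factor condition, and grouping factors subject and item (these names are illustrative, not taken from the original analyses):

```r
library(lme4)

## By-subject and by-item random intercepts only: each subject and each item
## gets its own baseline, but the effect of condition is shared.
m_int <- lmer(amplitude ~ condition + (1 | subject) + (1 | item), data = dat)

## Additionally allowing the slope for condition to vary by subject:
m_slp <- lmer(amplitude ~ condition + (1 + condition | subject) + (1 | item),
              data = dat)
```

The term (1 | subject) adds a by-subject random intercept, while (1 + condition | subject) additionally lets the effect of condition differ between subjects.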
Selection of Random-Effects Structure
A major topic of debate in the application of mixed models to psycho- and neurolinguistic data is the structure of the random effects.
While Baayen, Davidson, and Bates (2008) recommend forward selection of the random-effects structure, starting from the minimal intercepts-only structure, Barr et al. (2013) recommend backward selection from the maximal random-effects structure, and Barr (2013) takes this suggestion one step further and suggests including all interactions in the random-effects structure. In practice, the suggestion of Barr et al. (2013) is somewhat problematic, as complex random-effects structures are costly to compute and often fail to converge on real data sets (D. Bates, Kliegl, et al. 2015). Moreover, the backward selection procedure suggested by Barr et al. potentially leads to issues with overparameterization (D. Bates, Kliegl, et al. 2015). Another suggestion common in the mixed-model literature is to follow the random-effects structure that best models the experimental design (see for example the GLMM wiki) and to use a parsimonious model (D. Bates, Kliegl, et al. 2015).
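For concreteness, the difference between a maximal and a reduced random-effects structure can be sketched with the hypothetical dat example from above; the particular reduction shown here is only one possibility, not the procedure recommended by any single reference:

```r
## Maximal random-effects structure in the sense of Barr et al. (2013):
## random intercepts and condition slopes for both subjects and items.
m_max <- lmer(amplitude ~ condition +
                (1 + condition | subject) + (1 + condition | item),
              data = dat)

## A reduced, design-motivated alternative, e.g. dropping the by-item slope
## when it is not supported by the data (cf. D. Bates, Kliegl, et al. 2015):
m_red <- lmer(amplitude ~ condition +
                (1 + condition | subject) + (1 | item),
              data = dat)
```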
Content of Model Summaries
The full model summary, as produced by lme4 (D. Bates, Maechler, et al. 2015) and formatted by lmerOut, includes the following components (a brief sketch of how such a summary is obtained follows the list):
- Statement of fitting method – maximum likelihood or REML (restricted or residual maximum likelihood).
- Measures of fit, dependent on fitting method:
- ML:
- AIC: Akaike Information Criterion (Akaike 1974) – goodness-of-fit measure penalized for number of parameters. Smaller is better, but there is no absolute “good” value.
- BIC: Bayesian Information Criterion (Schwarz 1978) – goodness-of-fit measure penalized for number of parameters. Similar to AIC, but with larger penalty for number of parameters. Smaller is better, but there is no absolute “good” value.
- logLik: Log-Likelihood – unpenalized goodness-of-fit measure. Always negative; larger (closer to zero) is better.
- deviance – unpenalized goodness-of-fit measure, equal to -2 logLik. Similar to Residual Sum of Squares (RSS) in traditional linear regression. Smaller is better.
- REML: REML criterion at convergence – similar to deviance but dependent on the fixed-effects parameterization.
- Summary of variance components:
- Scaled residuals – summary of the distribution of the final residual error in the model
- Random effects – summary of the different variance components by group on both the Variance and Standard Deviation scales. As residuals are technically a random effect, they are also listed here again.
- Number of observations and the number of groups in each grouping term.
- Summary of fixed effects – comparable to traditional regression output with:
- Parameter estimates
- Standard Error
- \(t\)-values (ratio of estimate to error)
- but no \(p\)-values, as it is non trivial to estimate the degrees of freedom in general. See the R FAQ 7.35 “Why are p-values not displayed when using lmer()?” for more information. The package
lmerTest
implements a variant where the degrees of freedom are estimated via the Satterthwaite approximation and the variances optionally corrected by the Kenward-Roger approximation. Beyond doubts about the accuracy of such approximations, these corrections are not necessary models for typical ERP datasets, as the effective degrees of freedom is always high enough that we can use the fact that the \(t\) distribution converges asymptotically to the normal distribution with increasing degrees of freedom.
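As a sketch of how these components are obtained in practice (again using the hypothetical dat example), fitting with maximum likelihood yields a summary containing the AIC, BIC, logLik and deviance, followed by the variance components and the fixed effects:

```r
m_ml <- lmer(amplitude ~ condition + (1 + condition | subject) + (1 | item),
             data = dat, REML = FALSE)

summary(m_ml)   # fitting method, measures of fit, scaled residuals,
                # random effects, number of obs./groups, fixed effects
AIC(m_ml)       # individual measures of fit can also be extracted directly
BIC(m_ml)
logLik(m_ml)
```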
Fitting Method
Several methods for fitting mixed-effects models exist, based on different approximation methods. For the non-generalized linear case (linear mixed model, LMM), the three most common methods are restricted (or residual) maximum likelihood (REML), maximum likelihood estimation (ML or MLE) and Markov-Chain Monte Carlo (MCMC) methods. The MCMC class of methods has been widely adopted in Bayesian settings and is extremely flexible and accurate in the limit, but is comparatively slow and complex and requires the specification of a prior (Bolker et al. 2009). Likelihood-based methods are faster and directly interpretable in a frequentist framework. In some sense, likelihood-based methods attempt to find the best model in terms of parameter values for a given specification and dataset: the general form of the model is specified by the experimenter and this is “fitted” to the data such that the probability that the data arises from the model (\(P(D|M)\), i.e. the likelihood of the model given the data) is maximized.
Very coarsely, REML attempts to provide less biased estimates of the variance components (random effects), in much the same way that Bessel’s correction (using \(n-1\) in the denominator instead of \(n\)) works for estimates of the population variance, \(\hat{\sigma}^2\). One of the paradoxes of basic statistics is that Bessel’s correction provides an unbiased estimate of the variance but a biased estimate of the standard deviation, \(\hat{\sigma}\). This is related to the basic observation that ML estimators are invariant under many common transformations (e.g. logarithms and square roots) – the square root of the ML estimate for the variance is the ML estimate of the standard deviation, but unbiased non-ML estimators, such as Bessel’s correction, are not invariant in this way. Combined with the observation that the distribution of variance estimates is highly skewed and thus poorly summarized by traditional measures of location, the necessity of unbiased estimates becomes questionable.[1]
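For reference, the ML estimator of the variance divides by \(n\), while Bessel’s correction divides by \(n-1\):
\[
\hat{\sigma}^2_{\mathrm{ML}} = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2,
\qquad
\hat{\sigma}^2_{\mathrm{Bessel}} = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2.
\]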
Moreover, the unbiased estimator comes at a cost. The numerical formulation for REML leads to its “likelihood” being dependent on the (formulation of the) fixed-effects model matrix, and as such REML-fitted models differing in their fixed effects are not in general comparable via likelihood-ratio tests (see below). This again highlights a fundamental issue with REML: it is not actually the “likelihood”, which is completely determined by the probability model (as noted by Douglas Bates on the R-SIG-ME mailing list), and as such it is somewhat misleading to label other optimization criteria as being some “form” of maximum likelihood.
Due to these issues with REML, full-model summaries and estimates of coefficients (including graphical presentations) are given using ML-fitted models. Nonetheless, REML-estimates are needed for computing \(F\)-tests with the Kenward-Roger variance correction and Satterthwaite degrees-of-freedom correction presented in the main text. The more anticonservative and much faster \(\chi^2\) tests can be computed with ML-estimates, in part because they are not dependent on variance estimates for the denominator.
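In lme4 the fitting method is controlled by the REML argument; a minimal sketch, continuing the hypothetical example:

```r
## REML fit (the lme4 default), used for the F tests with variance and
## degrees-of-freedom corrections:
m_reml <- lmer(amplitude ~ condition + (1 + condition | subject) + (1 | item),
               data = dat)          # REML = TRUE is the default

## ML fit, used for model summaries, coefficient estimates and
## likelihood-ratio tests:
m_ml <- update(m_reml, REML = FALSE)
```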
Comparing Model Fit
It is often useful to determine which of several models best describes or “fits” the data. (Log) Likelihood provides a direct measure of model fit – likelihood is simply the probability that a given model would generate the exact dataset in question. (This is occasionally called the marginal likelihood of the model, as it measures the likelihood of the model as a whole, i.e. over all conditions and terms, and not that of any particular effect.[2]) However, it is always possible to increase the likelihood by adding additional parameters, e.g. by adding a per-observation parameter, thus perfectly modelling the observed data. Additional parameters have two potential costs: (1) loss of parsimony and (2) loss of predictive power. Following Occam’s Razor, we should find the most parsimonious solution; we also want to use our models to make inferences not just about the observed data but also about future data, and thus we want a model that maximizes predictive power (even at the cost of descriptive power, i.e. poorer fit on the observed data). Many measures have been suggested as a combined measure of model fit and model parsimony; two of the most popular for likelihood-based methods are the Akaike Information Criterion (AIC, Akaike 1974) and the Bayesian Information Criterion (BIC, Schwarz 1978). Both are based on the deviance (i.e. \(-2\) logLik) penalized for the number of parameters, but BIC penalizes more heavily – the penalty for AIC increases linearly with the number of parameters[3], while the penalty for BIC increases with the number of parameters multiplied by the logarithm of the number of observations[4]. There is no absolute “good” or “bad” value for log likelihood, AIC or BIC, but in all cases the general rule of thumb is that closer to zero is better.
For nested models, there is also a significance test for comparing fits, as measured by likelihood, called the likelihood-ratio test (LRT). The test statistic, twice the log of the likelihood ratio, asymptotically follows a \(\chi^2\) distribution, which can thus be used as the reference distribution. The \(\chi^2\) degrees of freedom are equal to the difference in model degrees of freedom. This is why this test is only valid for nested models – the difference in model degrees of freedom between non-nested models does not measure the restriction of one model to another (i.e. the creation of a simpler, special case), which is required for the asymptotic \(\chi^2\) distribution. Because the deviance is equal to \(-2\) times the log likelihood, the \(\chi^2\) statistic for the LRT reduces to the difference between the deviances.[5] The LRT can be overly conservative when the value of a parameter is on the edge of its parameter space, e.g. when testing (the addition of) variance components whose estimates are near zero, as variance is by definition non-negative (Stram and Lee 1994; Bates 2010; Pinheiro and Bates 2010).
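A likelihood-ratio test of nested models can be sketched as follows (the models are again hypothetical; anova() on ML-fitted lme4 models reports the change in deviance as a \(\chi^2\) statistic):

```r
## Nested models differing only in the fixed effect of interest, both fitted with ML:
m0 <- lmer(amplitude ~ 1         + (1 | subject) + (1 | item), data = dat, REML = FALSE)
m1 <- lmer(amplitude ~ condition + (1 | subject) + (1 | item), data = dat, REML = FALSE)

## Likelihood-ratio test: the chi-square statistic is the difference in deviance,
## with degrees of freedom equal to the difference in the number of parameters.
anova(m0, m1)
```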
Analogues to the Coefficient of Determination (\(R^2\))
In simple linear regression, the coefficient of determination, \(R^2\), can be interpreted as representing the proportion of variance explained by the model. Conveniently, \(R^2\) is the square of Pearson’s correlation coefficient, \(r\), for the correlation between the dependent and independent measures. When extended to multiple regression, \(R^2\) can still be a useful tool but becomes more complex, in much the same way that correlation becomes more complex in a multivariate setting. Nonetheless, an adjusted (for parsimony/model complexity) \(R^2\) value is still often reported for multiple regression. Both \(R^2\) and adjusted \(R^2\) have a number of useful properties and a relatively straightforward interpretation.
In a mixed-effects context, however, there is no measure with all the properties of \(R^2\). Intuitively, this can be thought of as related to the complexity of correlation in a within-subjects design. Calculating correlation across trials without reference to subjects loses information and suffers from violations of independence. Calculating correlation within subjects and then averaging ignores issues related to Simpson’s Paradox. Both methods have advantages and disadvantages, but neither has all the properties and ease of interpretation of bivariate correlation. Similarly, several pseudo \(R^2\) measures have been proposed for mixed-effects models (e.g. Edwards et al. 2008; Gelman and Pardoe 2006; Heinzl, Waldhör, and Mittlböck 2005; Orelien and Edwards 2008; Xu 2003; Snijders and Bosker 1994; Hössjer 2008), but they do not enjoy the ease of interpretation of \(R^2\) in simple regression, especially in mixed-effects models with multiple, interacting predictors, as has been noted by Douglas Bates on R-SIG-mixed-models and elsewhere (see the GLMM FAQ for a discussion of the issues involved as well as links to previous, longer discussion). Finally, as Payne, Lee, and Federmeier (2015) noted in their sentence-level analyses, where the noise is already lower than in a naturalistic context, the inter-trial variability of the EEG is such that pseudo \(R^2\) measures do not offer much insight, even where the effects are strong.
Analysis of Deviance and Wald Tests for Linear Hypotheses
For large models, examining individual coefficients can be tedious and difficult. One possibility is repeated testing of nested models via likelihood-ratio tests, removing each predictor and its higher-order interactions one at a time to determine the (marginal) contribution of each predictor to model fit. This is, however, an extremely computationally intensive procedure for models with multiple predictors.
Alternatively, we can use Type-II (marginal) Wald tests in an Analysis of Deviance. (Technically, the \(t\) tests on the coefficients are also Wald tests.) These tests measure the impact on the model of removing a particular term from the model, e.g. the change in deviance, and are, under certain assumptions, asymptotically equivalent to likelihood-ratio tests. As they measure the impact of a particular term (e.g. morphology) and not that of a particular coefficient or contrast (e.g. “nominative > mean”), they also provide a more compact way of examining the effect of a given manipulation across contrast levels. This is analogous to the \(F\)-tests in a traditional Analysis of Variance, instead of an examination of the coefficients of the linear model that ANOVA is based upon.
The naive and computationally simple approach is to assume “infinite” denominator degrees of freedom for the \(F\) tests in an Analysis of Deviance. This is equivalent to the assumption that the coefficients follow a normal and not a \(t\) distribution. This yields \(\chi^2\) tests where the degrees of freedom are the number of coefficients in the model for a given term (i.e. the same as in the LRT). A more accurate but much more computationally complex approach uses the Kenward-Roger corrected estimates for the variance and the Satterthwaite approximation for the degrees of freedom. When there are sufficient numbers of both observations and levels of the grouping variables (for the random effects), the two methods yield similar results, but the \(\chi^2\) test statistic generally yields anticonservative estimates compared to the \(F\) test statistic.
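One possible implementation of both variants uses the Anova function from the car package (Fox and Weisberg 2011), applied to the hypothetical models from the earlier sketches; this is an illustration of the general approach, not necessarily the exact procedure used in the main text:

```r
library(car)

## Chi-square Wald tests assuming infinite denominator degrees of freedom
## (fast, potentially anticonservative), computed on the ML fit:
Anova(m_ml, type = "II", test.statistic = "Chisq")

## Kenward-Roger F tests (slower; uses the pbkrtest package and is
## typically applied to a REML fit):
Anova(m_reml, type = "II", test.statistic = "F")
```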
Type-II Wald tests have a number of problems (cf. Fox 2016, 724–725, 737–738, and discussions on R-SIG-mixed-models), but even assuming that their results are anticonservative, they still allow for a much simpler summary of the effect structure (cf. Bolker et al. 2009). A simple way to compensate for an anticonservative estimate is to adopt a stricter significance threshold.
As the Wald tests presented here (Type-II) are marginal tests, they test the effect of completely removing a given term – and thus all of its interactions – from the model. As such, it is possible that the Wald test for a lower-order effect (e.g. main effect) is significant, although the corresponding \(t\)-value in the summary fails to achieve the \(|t| \geq 2\) threshold. Since it is problematic to interpret main effects in the presence of interactions anyway, this is not a large problem (cf. Venables 1998).[6]
Bibliography
Akaike, Hirotugu. 1974. “A New Look at the Statistical Model Identification.” IEEE Transactions on Automatic Control 19 (6): 716–23. doi:10.1109/TAC.1974.1100705.
Baayen, R. H., D. J. Davidson, and D. M. Bates. 2008. “Mixed-Effects Modeling with Crossed Random Effects for Subjects and Items.” Journal of Memory and Language 59: 390–412.
Barr, Dale J. 2013. “Random Effects Structure for Testing Interactions in Linear Mixed-Effects Models.” Frontiers in Psychology 4 (328). doi:10.3389/fpsyg.2013.00328.
Barr, Dale J., Roger Levy, Christoph Scheepers, and Harry J. Tily. 2013. “Random Effects Structure for Confirmatory Hypothesis Testing: Keep It Maximal.” Journal of Memory and Language 68: 255–78. doi:10.1016/j.jml.2012.11.001.
Bates, Douglas M. 2010. lme4: Mixed-Effects Modeling with R. Draft. http://lme4.r-forge.r-project.org/lMMwR/lrgprt.pdf.
Bates, Douglas, Reinhold Kliegl, Shravan Vasishth, and Harald Baayen. 2015. “Parsimonious Mixed Models.” ArXiv, 1506.04967v1.
Bates, Douglas, Martin Maechler, Benjamin M. Bolker, and Steven Walker. 2015. “Fitting Linear Mixed-Effects Models Using lme4.” ArXiv, 1406.5823.
Bolker, Benjamin M, Mollie E Brooks, Connie J Clark, Shane W Geange, John R Poulsen, M Henry H Stevens, and Jada-Simone S White. 2009. “Generalized Linear Mixed Models: A Practical Guide for Ecology and Evolution.” Trends Ecol Evol 24 (3): 127–35. doi:10.1016/j.tree.2008.10.008.
Edwards, Lloyd J., Keith E. Muller, Russell D. Wolfinger, Bahjat F. Qaqish, and Oliver Schabenberger. 2008. “An \(R^2\) Statistic for Fixed Effects in the Linear Mixed Model.” Statistics in Medicine 27 (29). John Wiley & Sons, Ltd.: 6137–57. doi:10.1002/sim.3429.
Fox, John. 2016. Applied Regression Analysis and Generalized Linear Models. 3rd ed. Thousand Oaks, CA: Sage.
Fox, John, and Sanford Weisberg. 2011. An R Companion to Applied Regression. Second. Thousand Oaks CA: Sage. http://socserv.socsci.mcmaster.ca/jfox/Books/Companion.
Gelman, Andrew, and Iain Pardoe. 2006. “Bayesian Measures of Explained Variance and Pooling in Multilevel (Hierarchical) Models.” Technometrics 48 (2): 241–51. doi:10.1198/004017005000000517.
Heinzl, Harald, Thomas Waldhör, and Martina Mittlböck. 2005. “Careful Use of Pseudo R-Squared Measures in Epidemiological Studies.” Statistics in Medicine 24 (18): 2867–72. doi:10.1002/sim.2168.
Hössjer, Ola. 2008. “On the Coefficient of Determination for Mixed Regression Models.” Journal of Statistical Planning and Inference 138 (10): 3022–38. doi:10.1016/j.jspi.2007.11.010.
Orelien, Jean G., and Lloyd J. Edwards. 2008. “Fixed-Effect Variable Selection in Linear Mixed Models Using \(R^2\) Statistics.” Computational Statistics and Data Analysis 52 (4): 1896–1907. doi:10.1016/j.csda.2007.06.006.
Payne, Brennan R., Chia-Lin Lee, and Kara D. Federmeier. 2015. “Revisiting the Incremental Effects of Context on Word Processing: Evidence from Single-Word Event-Related Brain Potentials.” Psychophysiology. doi:10.1111/psyp.12515.
Pinheiro, José, and Douglas Bates. 2010. Mixed-Effects Models in S and S-PLUS. Springer New York. https://books.google.de/books?id=3TVDAAAAQBAJ.
Schwarz, Gideon. 1978. “Estimating the Dimension of a Model.” Annals of Statistics 6 (2): 461–64.
Snijders, Tom A. B., and Roel J. Bosker. 1994. “Modeled Variance in Two-Level Models.” Sociological Methods and Research 22 (3): 342–63. doi:10.1177/0049124194022003004.
Stram, Daniel O., and Jae Won Lee. 1994. “Variance Components Testing in the Longitudinal Mixed Effects Model.” Biometrics 50 (4): 1171–77. doi:10.2307/2533455.
Venables, W N. 1998. “Exegeses on Linear Models.” Washington, DC.
Xu, Ronghui. 2003. “Measuring Explained Variation in Linear Mixed Effects Models.” Statistics in Medicine 22 (22): 3527–41. doi:10.1002/sim.1572.
Notes
1. This is a brief summary of a longer comment by Douglas Bates on the R-SIG-ME mailing list.↩︎
2. See below for an explanation of the term “marginal”.↩︎
3. AIC = \(-2 \log L + 2p\), where \(L\) is the likelihood and \(p\) is the number of parameters. ↩︎
4. BIC = \(-2 \log L + p\log n\), where \(L\) is the likelihood, \(p\) is the number of parameters and \(n\) is the number of observations.↩︎
5. Recall that \(\log\frac{a}{b} = \log a - \log b\).↩︎
6. Indeed, this is how Type-II (marginal) tests differ from Type-I (sequential) and Type-III (non-sequential, i.e. not sensitive to order but keeping higher-order interactions when testing a lower-order term) tests.↩︎