FaST(er) linear mixed models

Categories: Mixed models, Genetics, Open project

Author: Tom Michoel

Published: April 2, 2025

Background

Mixed models and random effects models have become popular in computational biology to model the correlation structure of data where samples are not independent (for instance due to spatial, temporal, or population structure). Much of the current popularity is thanks to the FaST-LMM algorithm described in Lippert et al. (2011). Although there have been further improvements, the core algorithm has remained unchanged. It is based on two principles:

  1. Reducing the likelihood function to a function depending on a single parameter by using analytic solutions for the other parameters given the free parameter.
  2. Using the singular value decomposition to speed up the likelihood evaluations.

Here I want to describe a simple improvement by rewriting the likelihood function in a form that is amenable to automatic differentiation and therefore the use of more efficient optimization algorithms than the grid-based search in the original algorithm, which itself was based on earlier work by Kang et al. (2008). The method presented here is implemented in the FaSTLMMlight package, and the same method description is also included in the package documentation.

I have not done a head-to-head speed comparison between the original and the new algorithm (which would require reimplementing the original one to exclude differences due to programming language choice). Anyone interested in taking this forward, let me know!

If you haven’t read and fully understood the supplementary text of the FaST-LMM paper, then this note is probably not for you. For a thorough review of parameter estimation in mixed models in general, this paper is highly recommended.

Setup

We consider the setup of Lippert et al. (2011), where:

  • \(y\in\mathbb{R}^n\) is a response vector with values for \(n\) samples,
  • \(X\in\mathbb{R}^{n\times d}\) is a matrix with data of \(d\) covariates (fixed effects) in the same \(n\) samples,
  • \(K\in\mathbb{R}^{n\times n}\) is a positive semi-definite sample similarity matrix, scaled such that \(\mathrm{tr}(K)=n\),
  • \(\mu\) is an (unknown) overall mean parameter,
  • \(\beta\in\mathbb{R}^d\) is the (unknown) vector of fixed effect weights,
  • \(\sigma^2\) is the (unknown) variance explained by \(K\),
  • \(\sigma_e^2\) is the (unknown) residual error variance,
  • \(\delta = \frac{\sigma_e^2}{\sigma^2}\) is the variance ratio,

and \(y\) is distributed as

\[ y \sim N\bigl(\mu + X\beta, \sigma^2 K + \sigma_e^2 I\bigr) = N\bigl( \mu + X\beta, \sigma^2 (K + \delta I)\bigr) \]

where \(N\) denotes a multivariate normal distribution.
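
To make this setup concrete, here is a minimal simulation sketch in JAX; the dimensions, the marker-like matrix used to build \(K\), and all variable names are illustrative and not part of FaSTLMMlight:

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)

n, d, p = 200, 3, 500                        # samples, covariates, markers (illustrative sizes)
k1, k2, k3, k4 = jax.random.split(jax.random.PRNGKey(0), 4)

Z = jax.random.normal(k1, (n, p))            # simulated marker-like data
K = Z @ Z.T
K = n * K / jnp.trace(K)                     # scale so that tr(K) = n

X = jax.random.normal(k2, (n, d))            # covariates (fixed effects)
mu, beta = 0.7, jax.random.normal(k3, (d,))  # overall mean and fixed-effect weights
sigma2, sigma2_e = 1.0, 0.5                  # variance explained by K and residual variance
delta = sigma2_e / sigma2                    # variance ratio

# y ~ N(mu + X beta, sigma^2 (K + delta I)), drawn via a Cholesky factor of the covariance
C = jnp.linalg.cholesky(sigma2 * (K + delta * jnp.eye(n)))
y = mu + X @ beta + C @ jax.random.normal(k4, (n,))
```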

By adding a column of ones to \(X\) we can absorb the parameter \(\mu\) in the vector of fixed effect weights \(\beta\) and are left with the unknown parameters \((\beta,\sigma^2,\delta)\). These parameters are estimated by maximum-likelihood (or restricted maximum-likelihood, see below), that is, by minimizing the negative log-likelihood function

\[ \mathcal{L} = \log\det\bigl[ \sigma^2 (K + \delta I) \bigr] + \frac1{\sigma^2} \bigl\langle y-X\beta, (K + \delta I)^{-1} (y-X\beta)\bigr\rangle, \]

where \(\langle u,v\rangle=u^Tv=\sum_{i=1}^n u_i v_i\) is the usual inner product on \(\mathbb{R}^n\); note that for any matrix \(A\in\mathbb{R}^{n\times n}\) and vectors \(u,v \in \mathbb{R}^{n}\), \(\langle u,Av\rangle=\mathrm{tr}(vu^TA)\).

Analytic expressions for the maximum-likelihood estimates \(\hat{\beta}\) and \(\hat{\sigma}^2\) as a function of \(\delta\) are easy to obtain:

\[ \begin{aligned} \hat\beta &= \bigl[ X^T(K+\delta I)^{-1}X\bigr]^{-1} X^T(K+\delta I)^{-1} y \\ \hat{\sigma}^2 &= \frac1n \bigl\langle y-X\hat\beta, (K+\delta I)^{-1} (y-X\hat\beta)\bigr\rangle, \end{aligned} \]

Plugging these expressions into the negative log-likelihood results in a (non-convex) function of the parameter \(\delta\), which up to an additive constant is given by:

\[ \mathcal{L}(\delta) = \log\det (K+\delta I) + n \log \hat{\sigma}^2 \]
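
As a reference point for what follows, a naive evaluation of \(\hat\beta(\delta)\), \(\hat{\sigma}^2(\delta)\) and \(\mathcal{L}(\delta)\) (up to the additive constant) could look as follows; this sketch solves the linear systems directly and makes no use of the decompositions introduced below:

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)

def profile_negloglik(delta, y, X, K):
    """Profiled negative log-likelihood L(delta), up to an additive constant.

    X is assumed to already contain a column of ones absorbing the mean mu.
    """
    n = y.shape[0]
    V = K + delta * jnp.eye(n)
    Vinv_X = jnp.linalg.solve(V, X)
    Vinv_y = jnp.linalg.solve(V, y)
    beta_hat = jnp.linalg.solve(X.T @ Vinv_X, X.T @ Vinv_y)   # [X'(K+dI)^{-1}X]^{-1} X'(K+dI)^{-1} y
    r = y - X @ beta_hat
    sigma2_hat = (r @ jnp.linalg.solve(V, r)) / n             # (1/n) <r, (K+dI)^{-1} r>
    return jnp.linalg.slogdet(V)[1] + n * jnp.log(sigma2_hat)
```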

In Lippert et al. (2011), the eigenvalue decomposition of \(K\) is used to increase the efficiency of evaluating \(\mathcal{L}(\delta)\), and thereby speed up the process of finding the maximum-likelihood estimate \(\hat\delta\). However, their method still expresses the evaluation of \(\hat\beta(\delta)\) as the solution of a linear system (if the number of covariates \(d>1\)). This implies that the gradient of \(\mathcal{L}(\delta)\) cannot be computed using automatic differentiation, which means that only gradient-free optimization algorithms can be used (Lippert et al. (2011) used a basic grid-based search).

Instead, FaSTLMMlight first uses the singular value decomposition of the fixed effects matrix \(X\) to express the negative log-likelihood on the space orthogonal to the columns of \(X\) (in other words, uses restricted maximum likelihood). Then the spectral decomposition of \(K\) on the restricted space is used in the same manner as in the original FaST-LMM method, which results in a restricted negative log-likelihood function \(\mathcal{L}_R(\delta)\) whose gradient can be evaluated using automatic differentiation.

SVD of the fixed effects

We assume that the number of fixed effects is less than the number of samples, \(d < n\), and that the matrix \(X\) of fixed effects has full rank, \(\mathrm{rank}(X)=d\). We can then write the “thin” singular value decomposition of \(X\) as \(X = V_1 \Gamma W^T\), where \(V_1\in\mathbb{R}^{n\times d}\) with \(V_1^TV_1=I\), \(\Gamma \in\mathbb{R}^{d\times d}\) diagonal with \(\gamma_i=\Gamma_{ii}>0\) for all \(i\), and \(W\in\mathbb{R}^{d\times d}\) with \(W^TW=WW^T=I\). Furthermore there exists \(V_2\in\mathbb{R}^{n\times (n-d)}\) with \(V_2^TV_2=I\), \(V_1^TV_2=0\) and \(V_1V_1^T+V_2V_2^T=I\), i.e. the matrix \(V=(V_1, V_2)\) is unitary, and \(V_1V_1^T\) and \(V_2V_2^T\) are the orthogonal projection matrices on the range and (left) null space of \(X\), respectively.
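
In code, the thin SVD and the orthogonal complement \(V_2\) can both be read off from the full SVD; a sketch with an arbitrary full-rank \(X\) (names illustrative):

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)

n, d = 200, 3
X = jax.random.normal(jax.random.PRNGKey(1), (n, d))   # any full-rank fixed-effects matrix

V, gamma, Wt = jnp.linalg.svd(X, full_matrices=True)   # X = V[:, :d] @ diag(gamma) @ Wt
V1, V2 = V[:, :d], V[:, d:]                            # range of X and its orthogonal complement

# sanity checks: X = V1 Gamma W^T and V = (V1, V2) is orthogonal
assert jnp.allclose(V1 @ jnp.diag(gamma) @ Wt, X)
assert jnp.allclose(V1.T @ V2, jnp.zeros((d, n - d)))
assert jnp.allclose(V1 @ V1.T + V2 @ V2.T, jnp.eye(n))
```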

Using the singular value decomposition of \(X\), we can write any matrix \(A\in\mathbb{R}^{n\times n}\) and vector \(y\in\mathbb{R}^n\) in block matrix/vector notation as

\[ \begin{aligned} A &= \begin{pmatrix} A_{11} & A_{12}\\ A_{21} & A_{22} \end{pmatrix},\;\text{where}\quad A_{ij} = V_i^T A V_j\\ y &= \begin{pmatrix} y_1\\ y_2 \end{pmatrix} ,\;\text{where}\quad y_i = V_i^T y. \end{aligned} \]

Using standard results for the inverse and determinant of a block matrix, we have

\[ \log\det(A) = \log\det(A_{11}) + \log\det(A_{22}-A_{21}A_{11}^{-1}A_{12}) = \log \det(A_{11}) - \log\det([A^{-1}]_{22}) \]
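
This is the standard Schur-complement identity; a quick numerical check on a random positive definite matrix with a generic block partition (purely illustrative):

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)

n, d = 8, 3
M = jax.random.normal(jax.random.PRNGKey(2), (n, n))
A = M @ M.T + n * jnp.eye(n)                       # a positive definite test matrix
A11, A12, A21, A22 = A[:d, :d], A[:d, d:], A[d:, :d], A[d:, d:]

logdet = lambda B: jnp.linalg.slogdet(B)[1]
lhs = logdet(A)
via_schur = logdet(A11) + logdet(A22 - A21 @ jnp.linalg.solve(A11, A12))
via_inverse = logdet(A11) - logdet(jnp.linalg.inv(A)[d:, d:])
assert jnp.allclose(lhs, via_schur) and jnp.allclose(lhs, via_inverse)
```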

Furthermore, we have

\[ X (X^T A X)^{-1} X^T = V_1\Gamma W^T (W\Gamma V_1^T A V_1\Gamma W^T)^{-1} W \Gamma V_1^T = V_1 (V_1^T A V_1)^{-1} V_1^T = V_1 (A_{11})^{-1} V_1^T. \]

Using this for the specific case where \(A=(K+\delta I)^{-1}\), together with the identity \(y=(V_1V_1^T+V_2V_2^T)y = V_1y_1 + V_2y_2\), in the equation

\[ \bigl\langle y-X\hat\beta, A (y-X\hat\beta)\bigr\rangle= \langle y,A y\rangle - \langle y, A X (X^TAX)^{-1} X^T A y\rangle = \bigl\langle y, \bigl(A - A X (X^TAX)^{-1} X^T A\bigr) y \bigr\rangle, \]

gives

\[ \begin{aligned} &\bigl\langle y-X\hat\beta, A (y-X\hat\beta)\bigr\rangle\\ \qquad&= \begin{pmatrix} y_1^T & y_2^T \end{pmatrix} \begin{pmatrix} A_{11} & A_{12}\\ A_{21} & A_{22} \end{pmatrix} \begin{pmatrix} y_1\\ y_2 \end{pmatrix} -\begin{pmatrix} y_1^T & y_2^T \end{pmatrix} \begin{pmatrix} A_{11} (A_{11})^{-1} A_{11} & A_{11} (A_{11})^{-1} A_{12}\\ A_{21}A_{11} (A_{11})^{-1} & A_{21} (A_{11})^{-1}A_{12} \end{pmatrix} \begin{pmatrix} y_1\\ y_2 \end{pmatrix} \\ \qquad&= \begin{pmatrix} y_1^T & y_2^T \end{pmatrix} \begin{pmatrix} 0 & 0\\ 0 & A_{22} - A_{21} (A_{11})^{-1}A_{12} \end{pmatrix} \begin{pmatrix} y_1\\ y_2 \end{pmatrix}\\ \qquad&= \langle y_2,[(A^{-1})_{22}]^{-1} y_2\rangle \end{aligned} \]
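
The net effect is that the residual quadratic form only involves the projection \(y_2\) of the data onto the space orthogonal to the fixed effects. A small numerical check of this identity, using \(A=(K+\delta I)^{-1}\) so that \([(A^{-1})_{22}]^{-1}=(K_{22}+\delta I)^{-1}\) (all names and sizes illustrative):

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)

n, d, delta = 50, 3, 0.5
kX, kZ, ky = jax.random.split(jax.random.PRNGKey(3), 3)
X = jax.random.normal(kX, (n, d))
Z = jax.random.normal(kZ, (n, 2 * n))
K = Z @ Z.T
K = n * K / jnp.trace(K)
y = jax.random.normal(ky, (n,))

A = jnp.linalg.inv(K + delta * jnp.eye(n))

# left-hand side: residual quadratic form at the ML estimate beta_hat
beta_hat = jnp.linalg.solve(X.T @ A @ X, X.T @ A @ y)
r = y - X @ beta_hat
lhs = r @ A @ r

# right-hand side: the same quantity using only y2 and K22
V = jnp.linalg.svd(X, full_matrices=True)[0]
V2 = V[:, d:]
y2, K22 = V2.T @ y, V2.T @ K @ V2
rhs = y2 @ jnp.linalg.solve(K22 + delta * jnp.eye(n - d), y2)

assert jnp.allclose(lhs, rhs)
```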

Because using the maximum-likelihood estimates \(\hat\beta\) has the effect of projecting out the fixed effects from the response \(y\) in the residual variance, it is common to also remove the contribution of the fixed effect space from the determinant term in the log-likelihood, i.e. replace

\[ \begin{aligned} \log\det(K+\delta I) = -\log\det\bigl[(K+\delta I)^{-1}\bigr] = -\log\det(A) \to \log\det([A^{-1}]_{22}) = \log\det\bigl[K_{22}+\delta I\bigr], \end{aligned} \]

and consider the residual or restricted negative log-likelihood function

\[ \mathcal{L}_R(\sigma^2,\delta) = \log\det\bigl[\sigma^2(K_{22}+\delta I)\bigr] + \frac1{\sigma^2} \langle y_2,(K_{22}+\delta I)^{-1} y_2\rangle. \]

This results in the restricted maximum-likelihood estimate

\[ \hat\sigma^2 = \frac1{n-d} \langle y_2,(K_{22}+\delta I)^{-1} y_2\rangle, \]

which is \(n/(n-d)\) times the unrestricted maximum-likelihood estimate. Plugging this into the restricted negative log-likelihood function gives, up to an additive constant,

\[ \mathcal{L}_R(\delta) = \log\det\bigl(K_{22}+\delta I\bigr) + (n-d)\log \hat\sigma^2. \]

Because the sample size \(n\) may be large, we divide this function by \(n-d\) to obtain

\[ \ell_R(\delta) = \frac1{n-d} \mathcal{L}_R(\delta) = \frac1{n-d}\log\det\bigl(K_{22}+\delta I\bigr) + \log \hat\sigma^2, \]

which needs to be minimized to obtain the restricted maximum-likelihood estimate \(\hat\delta\).
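
Before introducing the spectral decomposition, a direct sketch of \(\ell_R(\delta)\) and the REML variance estimate, assuming \(y_2=V_2^Ty\) and \(K_{22}=V_2^TKV_2\) have already been formed (names illustrative):

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)

def reml_objective_direct(delta, y2, K22):
    """Scaled restricted negative log-likelihood l_R(delta), up to an additive constant."""
    m = y2.shape[0]                                   # m = n - d
    S = K22 + delta * jnp.eye(m)
    sigma2_hat = (y2 @ jnp.linalg.solve(S, y2)) / m   # REML estimate of sigma^2 for this delta
    return jnp.linalg.slogdet(S)[1] / m + jnp.log(sigma2_hat)
```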

Fixed effects weights

The maximum-likelihood estimate \(\hat\beta\) for the fixed effects weights can also be expressed conveniently using the same block matrix notation. Still writing \(A=(K+\delta I)^{-1}\) we have

\[ \begin{aligned} \hat \beta &= (X^TAX)^{-1} X^T A y \\ &= W \Gamma^{-1} (A_{11})^{-1} \Gamma^{-1} W^T W \Gamma V_1^T A y\\ &= W \Gamma^{-1} (A_{11})^{-1} (Ay)_1\\ &= W \Gamma^{-1} (A_{11})^{-1} (A_{11}y_1 + A_{12}y_2)\\ &= W \Gamma^{-1} ( y_1 + (A_{11})^{-1} A_{12} y_2 ) \end{aligned} \]

Again using properties of the inverse of a block matrix, we have \((A_{11})^{-1} A_{12} = - B_{12} (B_{22})^{-1}\) where \(B=A^{-1}=K+\delta I\), or

\[ \hat \beta = W \Gamma^{-1} ( y_1 - K_{12} (K_{22}+\delta I)^{-1} y_2 ) \tag{1}\]
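
A small check of Equation 1 against the direct generalized least squares estimate, on random data (names illustrative):

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)

n, d, delta = 50, 3, 0.5
kX, kZ, ky = jax.random.split(jax.random.PRNGKey(4), 3)
X = jax.random.normal(kX, (n, d))
Z = jax.random.normal(kZ, (n, 2 * n))
K = Z @ Z.T
K = n * K / jnp.trace(K)
y = jax.random.normal(ky, (n,))

V, gamma, Wt = jnp.linalg.svd(X, full_matrices=True)
V1, V2 = V[:, :d], V[:, d:]
y1, y2 = V1.T @ y, V2.T @ y
K12, K22 = V1.T @ K @ V2, V2.T @ K @ V2

# Equation 1: beta_hat = W Gamma^{-1} ( y1 - K12 (K22 + delta I)^{-1} y2 )
beta_eq1 = Wt.T @ ((y1 - K12 @ jnp.linalg.solve(K22 + delta * jnp.eye(n - d), y2)) / gamma)

# direct estimate (X' A X)^{-1} X' A y with A = (K + delta I)^{-1}
A = jnp.linalg.inv(K + delta * jnp.eye(n))
beta_direct = jnp.linalg.solve(X.T @ A @ X, X.T @ A @ y)
assert jnp.allclose(beta_eq1, beta_direct)
```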

Spectral decomposition of the kernel matrix and data rotation

We first consider the scenario where \(K\) is given as a square positive semi-definite matrix and \(K_{22}\) has been computed. Following Lippert et al. (2011), we call this the “full rank” scenario, although neither \(K\) nor \(K_{22}\) actually has to have full rank.

The spectral decomposition of \(K_{22}\) can be written as \(K_{22}=U \Lambda U^T\), with \(U\in\mathbb{R}^{(n-d)\times (n-d)}\) unitary and \(\Lambda\in\mathbb{R}^{(n-d)\times (n-d)}\) diagonal with non-negative diagonal elements \(\lambda_i=\Lambda_{ii}\geq 0\), which we assume are ordered, \(\lambda_1\geq\lambda_2\geq \dots \geq \lambda_{n-d}\geq 0\).

The columns of \(U\) are the eigenvectors of \(K_{22}\), and we denote them as \(u_i \in\mathbb{R}^{n-d}\), with \(K_{22} u_i=\lambda_i u_i\). The matrix \(U\) can be used to rotate the data \(y_2\):

\[ \tilde{y} = U^T y_2 \in \mathbb{R}^{n-d} \]

with components

\[ \tilde{y}_i = \langle u_i,y_2\rangle,\quad i=1,\dots,n-d \]
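
In code, the decomposition and the rotation are one call each. In this sketch \(K_{22}\) and \(y_2\) are again simulated for illustration; note that jnp.linalg.eigh returns eigenvalues in ascending order, so they are flipped to match the ordering used here:

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)

m = 47                                      # m = n - d, illustrative
kZ, ky = jax.random.split(jax.random.PRNGKey(5))
Z2 = jax.random.normal(kZ, (m, 2 * m))
K22 = Z2 @ Z2.T / (2 * m)                   # stand-in for V2' K V2
y2 = jax.random.normal(ky, (m,))            # stand-in for V2' y

lam, U = jnp.linalg.eigh(K22)               # eigenvalues in ascending order
lam, U = lam[::-1], U[:, ::-1]              # reorder to lambda_1 >= ... >= lambda_{n-d}
y_tilde = U.T @ y2                          # rotated data, components <u_i, y2>
assert jnp.allclose(U @ jnp.diag(lam) @ U.T, K22)
```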

Likelihood function and variance parameter estimation

These results allow the terms in \(\ell_R\) to be expressed as

\[ \begin{aligned} \frac1{n-d}\log\det(K_{22}+\delta I) &= \frac1{n-d}\sum_{i=1}^{n-d} \log(\lambda_i+\delta)\\ \hat\sigma^2 = \frac1{n-d}\langle y_2,(K_{22}+\delta I)^{-1} y_2\rangle &= \frac1{n-d}\sum_{i=1}^{n-d} \frac1{\lambda_i+\delta} \langle y_2,u_i\rangle^2 = \frac1{n-d}\sum_{i=1}^{n-d} \frac{\tilde{y}_i^2}{\lambda_i+\delta} \end{aligned} \]

Note the crucial difference with the original FaST-LMM method. There the eigenvalue decomposition of the full matrix \(K\) is used to facilitate the computation of the negative log-likelihood \(\mathcal{L}\), which still involves the solution of a linear system to compute \(\hat\beta\). By working in the restricted space orthogonal to the fixed effects and using the eigenvalue decomposition of the reduced matrix \(K_{22}\), we have obtained a restricted negative log-likelihood \(\ell_R\) which, given the eigenvalues \(\lambda_i\) and the rotated data \(\tilde y\), is trivial to evaluate and to differentiate using automatic differentiation. In FaSTLMMlight, the optimization of the function \(\ell_R\) is done using the LBFGS algorithm. Because \(\delta\) must be greater than zero, we write \(\delta\) as the “softplus” function of an unconstrained variable \(x\), that is \(\delta=\log(1+e^x)\), and optimize with respect to \(x\).
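
To illustrate, here is a sketch of the scaled objective with the softplus reparameterization, its gradient obtained by automatic differentiation, and a quasi-Newton minimization. FaSTLMMlight itself uses LBFGS; this sketch substitutes the plain BFGS method shipped with jax.scipy.optimize, and the inputs lam and y_tilde are simulated stand-ins for the eigenvalues and rotated data computed above:

```python
import jax
import jax.numpy as jnp
from jax.scipy.optimize import minimize

jax.config.update("jax_enable_x64", True)

def make_objective(lam, y_tilde):
    """Scaled restricted negative log-likelihood as a function of the unconstrained x."""
    def ell_R(x):
        delta = jnp.log1p(jnp.exp(x[0]))                    # softplus keeps delta > 0
        sigma2_hat = jnp.mean(y_tilde**2 / (lam + delta))   # REML variance estimate
        return jnp.mean(jnp.log(lam + delta)) + jnp.log(sigma2_hat)
    return ell_R

# illustrative inputs; in practice these come from the decompositions above
m = 47
lam = jnp.sort(jax.random.uniform(jax.random.PRNGKey(6), (m,)) * 3.0)[::-1]
y_tilde = jax.random.normal(jax.random.PRNGKey(7), (m,))

ell_R = make_objective(lam, y_tilde)
grad_ell_R = jax.grad(ell_R)                                # gradient by automatic differentiation

res = minimize(ell_R, x0=jnp.zeros(1), method="BFGS")       # quasi-Newton search over x
delta_hat = jnp.log1p(jnp.exp(res.x[0]))
sigma2_hat = jnp.mean(y_tilde**2 / (lam + delta_hat))
print(delta_hat, sigma2_hat, grad_ell_R(res.x))
```

Once lam and y_tilde are fixed, each evaluation of the objective and its gradient costs only \(O(n-d)\) operations, so essentially all of the work sits in the one-off singular value and eigenvalue decompositions.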

Fixed effects weights using the rotated basis

The fixed effects weights need to be computed only once, after \(\hat\delta\) has been estimated. \(\hat\beta(\hat\delta)\) is computed using Equation 1. Using the eigendecomposition of \(K_{22}\) and the rotated data \(\tilde y = U^Ty_2\) we obtain

\[ \begin{aligned} \hat \beta(\hat\delta) &= W \Gamma^{-1} ( y_1 - K_{12} (K_{22}+\hat\delta I)^{-1} y_2 )\\ &= W \Gamma^{-1} ( y_1 - K_{12} U(\Lambda+\hat\delta I)^{-1} U^Ty_2 )\\ &= W \Gamma^{-1} ( y_1 - K_{12} U (\Lambda+\hat\delta I)^{-1} \tilde y ) \end{aligned} \]

The components of the vector \((\Lambda+\hat\delta I)^{-1} \tilde y\) are easily computed elementwise as \(\tilde y_i/(\lambda_i+\hat\delta)\).
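
A minimal sketch of this final step, assuming \(\hat\delta\) and the quantities \(y_1\), \(\tilde y\), the eigenvalues, the product \(K_{12}U\), \(\Gamma\) and \(W\) have been computed as described above (all argument names are illustrative):

```python
import jax.numpy as jnp

def beta_hat(delta_hat, y1, y_tilde, lam, K12U, gamma, W):
    """Fixed-effect weights from Equation 1, evaluated in the rotated basis.

    K12U is the precomputed product K12 @ U; gamma holds the singular values of X
    and W its right singular vectors, so dividing by gamma applies Gamma^{-1}.
    """
    t = y_tilde / (lam + delta_hat)           # (Lambda + delta I)^{-1} y_tilde, elementwise
    return W @ ((y1 - K12U @ t) / gamma)      # W Gamma^{-1} ( y1 - K12 U t )
```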