Parameter estimation in random effects models with many samples

Mixed models, Genetics

Author: Tom Michoel
Published: April 3, 2025

Background

A few years ago, Ammar Malik and I wrote a paper on learning latent variance components in gene expression data. The method was based on a result for estimating the variance parameters in random effects models when a large number of samples from the model are available (Theorem 1 on page 8 of the supplementary material). “Samples” in this case correspond to genes, and the result depended on a spectral condition for the gene expression covariance matrix. Later on I was able to prove the theorem without this condition, but shifting priorities and lack of a compelling use case (beyond what we had already published) meant this new result was never published. I present it here in case someone else finds a use for it.

For a general introduction to parameter estimation in linear mixed models and random effects models, this review paper is highly recommended.

The model

Consider a random effects model

\[ \mathbf{y} = \mathbf{Z} \mathbf{u} + \mathbf{e} \tag{1}\]

where \(\mathbf{y}\in\mathbb{R}^{n}\) is a vector of responses, \(\mathbf{Z}\in \mathbb{R}^{n\times d}\) is a (known) design matrix of the random effects, \(\mathbf{u}\in\mathbb{R}^{d}\) is a vector of random effects, and \(\mathbf{e}\in\mathbb{R}^{n}\) is a vector of random errors. We assume \(\mathbb{E}(\mathbf{u})=0\), \(\mathbb{E}(\mathbf{e})=0\), and

\[ \mathrm{Cov}(\mathbf{u},\mathbf{e}) = \begin{pmatrix} \mathbf{B} & 0\\ 0 & \sigma^2 \mathbb{1} \end{pmatrix} \]

with \(\mathbf{B}\in \mathbb{R}^{d\times d}\) a positive definite matrix, \(\mathbb{1}\) the identity matrix, and \(\sigma^2>0\). These assumptions imply that

\[ \mathbf{y} \sim \mathcal{N}\left(0, \mathbf{K}\right) \tag{2}\]

with

\[ \mathbf{K} = \mathbf{Z}\mathbf{B}\mathbf{Z}^{T} + \sigma^2\mathbb{1}. \]

The columns of \(\mathbf{Z}\) are called the confounding factors, that is, the factors that give rise to the non-zero covariance between the entries of \(\mathbf{y}\).

The goal is to find maximum-likelihood estimates for the variance parameters \(\mathbf{B}\) and \(\sigma^2\).

Usually, this type of model is used to model the correlation structure between \(n\) non-independent measurements1 \(y_i\) of some variable of interest, for instance to account for population structure in genetic association studies. Note that this is a rather odd situation, statistically speaking, because we are then estimating the parameters of a distribution (the multivariate normal \(\mathcal{N}(0,\mathbf{K})\)) from a single sample (the vector \(\mathbf{y}\)) from this distribution!

What if instead we have a more conventional situation where we have \(m>n\) independent samples of this distribution, that is, if we have a matrix \(\mathbf{Y}\in\mathbb{R}^{n\times m}\) where each column is a sample from the distribution in Equation 2? It turns out that in this case maximum-likelihood solutions for \(\mathbf{B}\) and \(\sigma^2\) can be found analytically.

The situation just described arises in computational biology, when we measure expression of a (very!) large number \(m\) of genes in a set of \(n\) related individuals. In this case, we can, plausibly or implausibly2, assume that all genes have the same correlation structure among the measurements, that is, all of them are sampled from Equation 2 with the same covariance matrix \(\mathbf{K}\).
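To make this setting concrete, here is a minimal numpy sketch that simulates data from the model; the dimensions and parameter values (n, d, m, B, \(\sigma^2\)) are arbitrary illustrative choices, not taken from any real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 50, 5, 2000             # individuals, confounding factors, genes (illustrative)

Z = rng.normal(size=(n, d))        # known design matrix of the random effects
A = rng.normal(size=(d, d))
B = A @ A.T + np.eye(d)            # positive definite covariance of u
sigma2 = 0.5                       # residual variance

K = Z @ B @ Z.T + sigma2 * np.eye(n)
# each column of Y is an independent draw from N(0, K), as in Equation 2
Y = rng.multivariate_normal(np.zeros(n), K, size=m).T    # n x m
```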

Likelihood function

Under the assumption that the columns of \(\mathbf{Y}\) are independent samples of Equation 2, the likelihood of observing \(\mathbf{Y}\) given covariate data \(\mathbf{Z}\) and values for the hyper-parameters \(\Theta=\{\mathbf{B}, \sigma^2\}\) is given by

\[ p(\mathbf{Y}\mid \mathbf{Z},\Theta) = \prod_{i=1}^m p(y_i\mid 0, \mathbf{K}) \]

Hence, the negative log-likelihood (which needs to be minimized) is, up to an additive constant and after dividing by \(m/2\):

\[ \mathcal{L}= \frac2{m}\Bigl[ \frac{m}2\log\det(\mathbf{K}) + \frac12 \sum_{i=1}^m \langle y_i,\mathbf{K}^{-1}y_i\rangle \Bigr] = \log\det(\mathbf{K}) + \mathop{\mathrm{tr}}\bigl(\mathbf{K}^{-1} \mathbf{C}\bigr) \]

where

\[ \mathbf{C}= \frac{\mathbf{Y}\mathbf{Y}^{T}}{m} \]

is the empirical covariance matrix.
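This expression is easy to evaluate directly. The sketch below (reusing Z, K, Y and m from the simulation above; the function name neg_log_lik is mine) computes the empirical covariance matrix and the scaled negative log-likelihood at the true covariance matrix.

```python
import numpy as np

def neg_log_lik(K, C):
    """log det(K) + tr(K^{-1} C), i.e. the scaled negative log-likelihood above."""
    sign, logdet = np.linalg.slogdet(K)
    assert sign > 0, "K must be positive definite"
    return logdet + np.trace(np.linalg.solve(K, C))

C = Y @ Y.T / m                    # empirical covariance matrix
print(neg_log_lik(K, C))           # value at the true covariance matrix
```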

Preliminary results

We start by repeating two well-known preliminary results that were also included in our supplementary material.

The first result concerns linear transformations of normally distributed variables and can be found in most textbooks or on Wikipedia:

Lemma 1 Let \(\mathbf{x}\in\mathbb{R}^{n}\) be a random, normally distributed vector,

\[ p(\mathbf{x}) = \mathcal{N}(\mu,\Psi), \]

with \(\mu\in\mathbb{R}^{n}\), and \(\Psi\in\mathbb{R}^{n\times n}\) a positive definite covariance matrix. For any linear transformation \(\mathbf{y}=\mathbf{M} \mathbf{x}\) with \(\mathbf{M}\in\mathbb{R}^{m\times n}\), we have

\[ p(\mathbf{y}) = \mathcal{N}(\mathbf{M}\mu,\mathbf{M}\Psi\mathbf{M}^{T}). \]

If the linear transformation \(\mathbf{y}=\mathbf{M} \mathbf{x}\) in this Lemma is overdetermined, that is, if \(m>n\), then the transformed covariance matrix \(\Psi'=\mathbf{M}\Psi\mathbf{M}^{T}\) will have a lower rank \(n\) than its dimension \(m\), that is, \(\Psi'\in\mathbb{R}^{m\times m}\) is a positive semi-definite matrix (i.e., has one or more zero eigenvalues). Thus we can extend the definition of normal distributions to include degenerate distributions with positive semi-definite covariance matrix, by interpreting them as the distributions of overdetermined linear combinations of normally distributed vectors. A degenerate one-dimensional normal distribution is simply defined as a \(\delta\)-distribution, that is, for \(x\in\mathbb{R}\)

\[ p(x) = \mathcal{N}(\mu,0) = \delta(x-\mu), \]

which can be derived as a limit \(\sigma^2\to 0\) of normal distribution density functions \(\mathcal{N}(\mu,\sigma^2)\).

The second result is one that is attributed to von Neumann:

Lemma 2 Let \(\mathbf{P},\mathbf{Q}\in\mathbb{R}^{n\times n}\) be two positive definite matrices. Then

\[ \mathrm{tr}(\mathbf{P}^{-1}\mathbf{Q}) \geq \sum_{i=1}^n \pi^{-1}_i \chi_i, \tag{3}\]

where \(\pi_1\geq \dots \geq \pi_n\) and \(\chi_1\geq\dots\geq \chi_n\) are the ordered eigenvalues of \(\mathbf{P}\) and \(\mathbf{Q}\), respectively, and equality in Equation 3 is achieved if and only if the eigenvector of \(\mathbf{P}\) corresponding to \(\pi_i\) is equal to the eigenvector of \(\mathbf{Q}\) corresponding to \(\chi_{i}\), \(i=1,\dots,n\).
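A brute-force numerical check of the inequality in Equation 3 (not of the equality condition) on random positive definite matrices; this is only a sanity check, not part of the argument.

```python
import numpy as np

rng = np.random.default_rng(1)
for _ in range(1000):
    k = 6
    X = rng.normal(size=(k, k)); P = X @ X.T + np.eye(k)
    X = rng.normal(size=(k, k)); Q = X @ X.T + np.eye(k)
    pi = np.sort(np.linalg.eigvalsh(P))[::-1]      # pi_1 >= ... >= pi_k
    chi = np.sort(np.linalg.eigvalsh(Q))[::-1]     # chi_1 >= ... >= chi_k
    assert np.trace(np.linalg.solve(P, Q)) >= np.sum(chi / pi) - 1e-8
```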

Singular value decomposition of the design matrix

Note first of all that we may assume the set of confounding factors \(\{z_1,\dots, z_d\}\) (the columns of \(\mathbf{Z}\)) to be linearly independent, because if not, the expression in Equation 1 can be rearranged in terms of a linearly independent subset of factors whose coefficients are still normally distributed due to Lemma 1. Linear independence of \(\{z_1,\dots, z_d\}\) implies that we must have \(d\leq n\) and \(\mathop{\mathrm{rank}}(\mathbf{Z})=d\).

The singular value decomposition allows us to decompose \(\mathbf{Z}\) as \(\mathbf{Z}= \mathbf{U}\boldsymbol\Gamma\mathbf{V}^{T}\), where \(\mathbf{U}\in\mathbb{R}^{n\times n}\), \(\mathbf{U}^{T}\mathbf{U}=\mathbf{U}\mathbf{U}^{T}=\mathbb{1}\), \(\boldsymbol\Gamma\in\mathbb{R}^{n\times d}\) diagonal with \(\gamma_k^2\equiv\Gamma_{kk}>0\) for \(k\in\{1,\dots,d\}\) [this uses \(\mathop{\mathrm{rank}}(\mathbf{Z})=d\)], and \(\mathbf{V}\in\mathbb{R}^{d\times d}\), \(\mathbf{V}^{T}\mathbf{V}=\mathbf{V}\mathbf{V}^{T}=\mathbb{1}\). There is also a “thin” SVD, \(\mathbf{Z}= \mathbf{U}_1\boldsymbol\Gamma_1\mathbf{V}^{T}\), where \(\mathbf{U}_1\in\mathbb{R}^{n\times d}\), \(\mathbf{U}_1^{T}\mathbf{U}_1=\mathbb{1}\), and \(\boldsymbol\Gamma_1\in\mathbb{R}^{d\times d}\) diagonal with diagonal elements \(\gamma_k^2\). In block matrix notation, \(\mathbf{U}=(\mathbf{U}_1, \mathbf{U}_2)\) and

\[ \mathbf{Z}= \begin{pmatrix} \mathbf{U}_1& \mathbf{U}_2 \end{pmatrix} \begin{pmatrix} \boldsymbol\Gamma_1\\ 0 \end{pmatrix} \mathbf{V}^{T} \tag{4}\]

Note that unitarity of \(\mathbf{U}\) implies \(\mathbf{U}_1^{T}\mathbf{U}_2=0\).

Denote by \(\mathcal{H}_Z\) the space spanned by the columns (i.e. covariate vectors) of \(\mathbf{Z}\). The projection matrix \(\mathbf{P}_Z\) onto \(\mathcal{H}_Z\) is given by

\[ \mathbf{P}_Z= \mathbf{Z}(\mathbf{Z}^{T}\mathbf{Z})^{-1}\mathbf{Z}^{T} = \mathbf{U}_1\boldsymbol\Gamma_1 \mathbf{V}^{T} (\mathbf{V}\boldsymbol\Gamma_1^{-2}\mathbf{V}^{T}) \mathbf{V}\boldsymbol\Gamma_1 \mathbf{U}_1^{T} = \mathbf{U}_1\mathbf{U}_1^{T}. \]
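Numerically, the thin SVD and the projection identity above can be checked in a few lines, reusing the simulated \(\mathbf{Z}\) (note that numpy returns the diagonal of \(\boldsymbol\Gamma_1\) as a vector):

```python
import numpy as np

U1, g, Vt = np.linalg.svd(Z, full_matrices=False)    # thin SVD: Z = U1 diag(g) Vt
P_Z = Z @ np.linalg.solve(Z.T @ Z, Z.T)              # Z (Z^T Z)^{-1} Z^T
assert np.allclose(P_Z, U1 @ U1.T)                   # equals U1 U1^T
```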

Using the basis of column vectors of \(\mathbf{U}\), we can write any matrix \(\mathbf{M}\in\mathbb{R}^{n\times n}\) as a partitioned matrix

\[ \mathbf{U}^{T}\mathbf{M}\mathbf{U}= \begin{pmatrix} \mathbf{M}_{11} & \mathbf{M}_{12}\\ \mathbf{M}_{21} & \mathbf{M}_{22} \end{pmatrix} \tag{5}\]

where

\[ \mathbf{M}_{ij} = \mathbf{U}_i^{T} \mathbf{M}\mathbf{U}_j \]

The following results for partitioned matrices are easily derived or can be found in Horn & Johnson’s Matrix Analysis or on Wikipedia (the determinant identity assumes that \(\mathbf{M}_{22}\) is invertible):

\[ \begin{aligned} \mathop{\mathrm{tr}}(\mathbf{M}) &= \mathop{\mathrm{tr}}(\mathbf{M}_{11}) + \mathop{\mathrm{tr}}(\mathbf{M}_{22})\\ \det(\mathbf{M}) &= \det\bigl(\mathbf{M}_{11}-\mathbf{M}_{12}\mathbf{M}_{22}^{-1}\mathbf{M}_{21}\bigr) \, \det(\mathbf{M}_{22}) \end{aligned} \tag{6}\]

Main result

The following result solves our problem:

Theorem 1 Let \(\mathbf{C}\in\mathbb{R}^{n\times n}\) be a positive definite matrix. Then the maximum-likelihood solution

\[ \hat{\mathbf{K}} = \mathop{\mathrm{argmin}}_{\{\mathbf{K}\colon \mathbf{K}= \mathbf{Z}\mathbf{B}\mathbf{Z}^{T} +\sigma^2\mathbb{1}\}} \log\det \mathbf{K}+ \mathop{\mathrm{tr}}\bigl(\mathbf{K}^{-1}\mathbf{C}\bigr), \]

subject to \(\mathbf{B}\) being positive semi-definite and \(\sigma^2\geq 0\), is given by

\[ \begin{aligned} \hat{\mathbf{B}} &= \mathbf{V}\boldsymbol\Gamma_1^{-1} \mathbf{W}\mathop{\mathrm{diag}}(\lambda_1-\hat{\sigma}^2, \dots,\lambda_{d'} -\hat{\sigma}^2, 0,\dots,0) \mathbf{W}^{T} \boldsymbol\Gamma_1^{-1}\mathbf{V}^{T} \\ \hat{\sigma}^2 &= \frac{\mathop{\mathrm{tr}}(\mathbf{C}) - \sum_{i=1}^{d'}\lambda_i}{n-d'} \end{aligned} \]

where \(\mathbf{W}\) and \(\lambda_1\geq\dots\geq \lambda_d\) are the orthogonal matrix of eigenvectors and the eigenvalues of \(\mathbf{C}_{11}\), and the index \(d'\) is defined as

\[ d' = \max\biggl\{k \mid k=0, \text{ or }1\leq k\leq d \text{ and } \lambda_k > \frac{\mathop{\mathrm{tr}}(\mathbf{C}) - \sum_{i=1}^{k}\lambda_i}{n-k}\biggr\}. \]
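Before going through the proof, here is what the theorem looks like in code, applied to the simulated data from above. This is a sketch only; the function and variable names are mine, and the loop over k simply implements the definition of \(d'\).

```python
import numpy as np

def ml_variance_components(Z, C):
    """Closed-form ML estimates of (B, sigma^2) in K = Z B Z^T + sigma^2 * I."""
    n, d = Z.shape                                        # assumes d < n and rank(Z) = d
    U1, g, Vt = np.linalg.svd(Z, full_matrices=False)     # thin SVD of Z
    C11 = U1.T @ C @ U1
    lam, W = np.linalg.eigh(C11)
    lam, W = lam[::-1], W[:, ::-1]                        # lambda_1 >= ... >= lambda_d
    trC = np.trace(C)
    # d' = largest k with lambda_k > (tr(C) - sum_{i<=k} lambda_i) / (n - k), else 0
    d_prime = 0
    for k in range(1, d + 1):
        if lam[k - 1] > (trC - lam[:k].sum()) / (n - k):
            d_prime = k
    sigma2_hat = (trC - lam[:d_prime].sum()) / (n - d_prime)
    # B_hat = V Gamma_1^{-1} W diag(lambda_1 - s2, ..., lambda_d' - s2, 0, ..., 0) W^T Gamma_1^{-1} V^T
    diag_vals = np.concatenate([lam[:d_prime] - sigma2_hat, np.zeros(d - d_prime)])
    Ginv = np.diag(1.0 / g)
    B_hat = Vt.T @ Ginv @ W @ np.diag(diag_vals) @ W.T @ Ginv @ Vt
    return B_hat, sigma2_hat

B_hat, sigma2_hat = ml_variance_components(Z, C)
print(sigma2_hat)     # should be close to the simulated value 0.5 for large m
```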

Proof. Using Equation 4, we can write

\[ \begin{aligned} \mathbf{K}= \mathbf{Z}\mathbf{B}\mathbf{Z}^{T} + \sigma^2\mathbb{1}&= \mathbf{U}_1\boldsymbol\Gamma_1\mathbf{V}^{T}\mathbf{B}\mathbf{V}\boldsymbol\Gamma_1\mathbf{U}_1^{T} + \sigma^2 (\mathbf{U}_1\mathbf{U}_1^{T} + \mathbf{U}_2\mathbf{U}_2^{T})\\ &= \mathbf{U}_1\boldsymbol\Gamma_1\mathbf{V}^{T} \bigl( \mathbf{B}+ \sigma^2 \mathbf{V}\boldsymbol\Gamma_1^{-2}\mathbf{V}^{T}\bigr) \mathbf{V}\boldsymbol\Gamma_1\mathbf{U}_1^{T} + \sigma^2 \mathbf{U}_2\mathbf{U}_2^{T} \end{aligned} \]

Hence, in the block matrix notation of Equation 5, we have \[ \begin{aligned} \mathbf{K}_{11} &= \boldsymbol\Gamma_1\mathbf{V}^{T} \bigl( \mathbf{B}+ \sigma^2 \mathbf{V}\boldsymbol\Gamma_1^{-2}\mathbf{V}^{T}\bigr) \mathbf{V}\boldsymbol\Gamma_1 \\ \mathbf{K}_{22} &= \sigma^2 \mathbb{1}\\ \mathbf{K}_{12} &= \mathbf{K}_{21} = 0. \end{aligned} \tag{7}\]

Note that \(\mathbf{B}\) is positive semi-definite if and only if for all \(v\in\mathbb{R}^d\)

\[ 0\leq \langle v, \mathbf{B}v\rangle = \langle w, (\mathbf{K}_{11}-\sigma^2\mathbb{1}) w\rangle, \]

where \(w=\boldsymbol\Gamma_1^{-1}\mathbf{V}^{T}v\). Because \(\mathbf{V}\) is unitary and \(\boldsymbol\Gamma_1\) diagonal with strictly positive elements, \(\langle v, \mathbf{B} v\rangle\geq0\) for all \(v\in\mathbb{R}^d\) if and only if \(\langle w, (\mathbf{K}_{11}-\sigma^2\mathbb{1}) w\rangle\geq 0\) for all \(w\in\mathbb{R}^d\). In other words, \(\mathbf{B}\) is positive semi-definite if and only if

\[ \lambda_{\min}(\mathbf{K}_{11})\geq \sigma^2, \tag{8}\]

where \(\lambda_{\min}(\cdot)\) denotes the smallest eigenvalue of a matrix.

It further follows from Equation 7 that, in the block notation of Equation 5,

\[ \mathbf{K}^{-1} = \begin{pmatrix} \mathbf{K}_{11}^{-1} & 0\\ 0 & \mathbf{K}_{22}^{-1} \end{pmatrix} \]

and, using Equation 6,

\[ \begin{aligned} \log\det(\mathbf{K}) &= \log\det(\mathbf{K}_{11}) + \log\det(\mathbf{K}_{22}) = \log\det(\mathbf{K}_{11}) + (n-d) \log(\sigma^2)\\ \mathop{\mathrm{tr}}(\mathbf{K}^{-1}\mathbf{C}) &= \mathop{\mathrm{tr}}(\mathbf{K}_{11}^{-1}\mathbf{C}_{11}) + \mathop{\mathrm{tr}}(\mathbf{K}_{22}^{-1}\mathbf{C}_{22}) = \mathop{\mathrm{tr}}(\mathbf{K}_{11}^{-1}\mathbf{C}_{11}) + \frac{\mathop{\mathrm{tr}}(\mathbf{C}_{22})}{\sigma^2} \end{aligned} \]
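These block identities are easily verified numerically on the simulated example (reusing K, Z, sigma2, n and d from the earlier snippets; the full SVD is only needed here to construct \(\mathbf{U}_2\)):

```python
import numpy as np

U = np.linalg.svd(Z)[0]                          # full SVD: U = (U1, U2)
U1, U2 = U[:, :d], U[:, d:]
K11, K22, K12 = U1.T @ K @ U1, U2.T @ K @ U2, U1.T @ K @ U2
assert np.allclose(K12, 0)                       # Equation 7: off-diagonal blocks vanish
assert np.allclose(K22, sigma2 * np.eye(n - d))  # K_22 = sigma^2 * identity
assert np.isclose(np.linalg.slogdet(K)[1],
                  np.linalg.slogdet(K11)[1] + (n - d) * np.log(sigma2))
```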

Let \(\mathbf{C}_{11}\) have eigenvalues \(\lambda_1\geq \dots\geq \lambda_d\) with corresponding eigenvectors \(w_1,\dots, w_d\in \mathbb{R}^d\), which we collect in a unitary matrix \(\mathbf{W}\in\mathbb{R}^{d\times d}\). Applying Lemma 2 to the term \(\mathop{\mathrm{tr}}(\mathbf{K}_{11}^{-1}\mathbf{C}_{11})\), it follows that for the minimizer \(\hat{\mathbf{K}}\), \(\hat{\mathbf{K}}_{11}\) must have eigenvalues \(\kappa_1\geq \dots\geq \kappa_d\) with the same eigenvectors \(w_1,\dots, w_d\) as \(\mathbf{C}_{11}\). Expressing the negative log-likelihood in terms of these eigenvalues results in a function

\[ \mathcal{F}(\kappa_1,\dots,\kappa_d,\sigma^2) = \sum_{i=1}^d f_i(\kappa_i) + (n-d) f(\sigma^2) \tag{9}\]

where

\[ \begin{aligned} f_i(x) &= \log(x) + \frac{\lambda_i}x\\ f(x) &= \log(x) + \frac{\mathop{\mathrm{tr}}(\mathbf{C}_{22})}{(n-d)x} \end{aligned} \]

are functions defined for \(x>0\). Their first and second derivatives are

\[ \begin{aligned} f'_i(x)&= \frac1{x} - \frac{\lambda_i}{x^2} & f'(x) &= \frac1x - \frac{\mathop{\mathrm{tr}}(\mathbf{C}_{22})}{(n-d)x^2}\\ f''_i(x) &= -\frac1{x^2} + \frac{2\lambda_i}{x^3} & f''(x) &= -\frac1{x^2} + \frac{2\mathop{\mathrm{tr}}(\mathbf{C}_{22})}{(n-d)x^3}. \end{aligned} \tag{10}\]

From this we see that \(f_i\) is convex on \((0,2\lambda_i]\), with a minimum at \(x=\lambda_i\), and strictly increasing for \(x>2\lambda_i\); \(f\) is convex on \((0,2\mathop{\mathrm{tr}}(\mathbf{C}_{22})/(n-d)]\), with a minimum at \(x=\mathop{\mathrm{tr}}(\mathbf{C}_{22})/(n-d)\), and strictly increasing for \(x>2\mathop{\mathrm{tr}}(\mathbf{C}_{22})/(n-d)\).

Hence it follows that \(\mathcal{F}\) has an absolute minimizer at

\[ \begin{aligned} \kappa_i &= \lambda_i,\quad i=1,\dots,d\\ \sigma^2 &= \frac{\mathop{\mathrm{tr}}(\mathbf{C}_{22})}{(n-d)} \end{aligned} \tag{11}\]

If \(\lambda_d\), the smallest eigenvalue of \(\mathbf{C}_{11}\), is greater than \(\frac{\mathop{\mathrm{tr}}(\mathbf{C}_{22})}{n-d}\), then this solution satisfies the constraint in Equation 8, and hence the minimizer \(\hat{\mathbf{K}}_{11}\) has the same eigenvalues and eigenvectors as \(\mathbf{C}_{11}\), that is, \(\hat{\mathbf{K}}_{11} = \mathbf{C}_{11}\).

To solve the general case, we need to consider the constrained optimization problem:

\[ \begin{aligned} \text{minimize} \qquad& \mathcal{F}(\kappa_1,\dots,\kappa_d,\sigma^2)\\ \text{subject to} \qquad & \kappa_i \geq \sigma^2, \quad i=1,\dots,d \nonumber\\ & \sigma^2\geq 0 \nonumber \end{aligned} \tag{12}\]

Because \(\mathcal{F}\) is not convex over the entire feasible region defined by the constraints, we cannot directly apply results from convex optimization. Nevertheless the constrained optimization problem can be solved using the properties in Equation 10. First we show that if the absolute minimizer in Equation 11 does not belong to the feasible set, then the constrained minimizer lies on the boundary of the feasible set determined by the lower bounds in Equation 12, which is characterized by “segments” of the form

\[ \mathcal{S}_k=\{(\kappa_1,\dots,\kappa_d,\sigma^2)\mid \kappa_1\geq\dots\geq\kappa_{k}>\kappa_{k+1}=\dots=\kappa_d= \sigma^2\} \]

for \(0\leq k<d\), where the case \(k=0\) is interpreted as all \(\kappa_i\) being equal to \(\sigma^2\). We also write

\[ \overline{\mathcal{S}}_k = \{(\kappa_1,\dots,\kappa_d,\sigma^2)\mid \kappa_1\geq\dots\geq\kappa_{k}\geq\kappa_{k+1}=\dots=\kappa_d= \sigma^2\}, \]

such that

\[ \mathcal{S}_{k-1} \subset \overline{\mathcal{S}}_k, \quad k=1,\dots,d-1 \tag{13}\]

Now, let \(x_1\geq\dots\geq x_d>\sigma^2\) be a point in the interior of the feasible region. If \(\lambda_k\geq\sigma^2>\lambda_{k+1}\) for some \(k=0,\dots,d-1\), where the case \(k=0\) is interpreted as \(\sigma^2>\lambda_1\), then Equation 10 implies that \(f_i(x_i)\geq f_i(\lambda_i)\) for \(i=1,\dots,k\) and \(f_i(x_i)> f_i(\sigma^2)\) for \(i=k+1,\dots,d\). Hence

\[ \mathcal{F}(x_1,\dots,x_d,\sigma^2) > \mathcal{F}(\lambda_1,\dots,\lambda_k,\sigma^2,\dots,\sigma^2) \]

and \((\lambda_1,\dots,\lambda_k,\sigma^2,\dots,\sigma^2)\in\overline{\mathcal{S}}_k\).

If \(\lambda_1\geq\dots\geq\lambda_d>\sigma^2\), then \(f_i(x_i)\geq f_i(\lambda_i)\) for \(i=1,\dots,d\). Because the absolute minimizer in Equation 11 does not belong to the feasible region by assumption, we must have that \(\sigma^2<\lambda_d< \mathop{\mathrm{tr}}(\mathbf{C}_{22})/(n-d)\). Because \(f\) is decreasing in the interval \((0,\mathop{\mathrm{tr}}(\mathbf{C}_{22})/(n-d)]\) it follows that \(f(\sigma^2)>f(\lambda_d)\). Hence \[ \mathcal{F}(x_1,\dots,x_d,\sigma^2) > \mathcal{F}(\lambda_1,\dots,\lambda_d,\lambda_d), \] and \((\lambda_1,\dots,\lambda_d,\lambda_d)\in\overline{\mathcal{S}}_{d-1}\).

Hence, given a point in the interior of the feasible region, we can always find a point on the boundary with lower value for \(\mathcal{F}\), and therefore the minimizer of \(\mathcal{F}\) must lie on the boundary.

We now show how to find the minimizer on the boundary. Evaluating points on the boundary segment \(\mathcal{S}_k\) results in \[ \mathcal{F}(\kappa_1,\dots,\kappa_{k}, \kappa_{k+1}=\sigma^2,\dots,\kappa_d=\sigma^2,\sigma^2) =\sum_{i=1}^{k} f_i(\kappa_i) + \sum_{i=k+1}^d f_i(\sigma^2) + (n-d) f(\sigma^2) \]

Minimizing w.r.t. \(\kappa_i\), \(i=1,\dots, k\), and \(\sigma^2\) results in the solution

\[ \begin{aligned} \kappa_i &= \lambda_i, \qquad i=1,\dots, k\\ \sigma^2 &= \frac{\sum_{i=k+1}^d\lambda_i + \mathop{\mathrm{tr}}(\mathbf{C}_{22})}{n-k} = \frac{\mathop{\mathrm{tr}}(\mathbf{C}) - \sum_{i=1}^{k}\lambda_i}{n-k}, \end{aligned} \tag{14}\]

where the last equality results from \(\mathop{\mathrm{tr}}(\mathbf{C})=\mathop{\mathrm{tr}}(\mathbf{C}_{11})+\mathop{\mathrm{tr}}(\mathbf{C}_{22})\). This solution belongs to \(\mathcal{S}_k\) if and only if \(\lambda_{k}> \frac{\mathop{\mathrm{tr}}(\mathbf{C}) - \sum_{i=1}^{k}\lambda_i}{n-k}\). If this is not the case, the minimum along this segment lies on the boundary of the segment \(\overline{\mathcal{S}}_k\), that is, the intersection with another segment that has one or more additional \(\kappa_i=\sigma^2\). Because the value at such an intersection is necessarily larger than or equal to the minimum along the new segment, it follows that the solution to the optimization problem lies on a segment with a feasible minimizer. That such a segment always exists follows from the fact that \(\mathcal{S}_0\) has a minimizer \[ \sigma^2 = \frac{\sum_{i=1}^d\lambda_i + \mathop{\mathrm{tr}}(\mathbf{C}_{22})}{n} = \frac{\mathop{\mathrm{tr}}(\mathbf{C})}n \]

which is always in the feasible set.

Let us now define

\[ d' = \max\biggl\{k \mid k=0, \text{ or }1\leq k\leq d \text{ and } \lambda_k > \frac{\mathop{\mathrm{tr}}(\mathbf{C}) - \sum_{i=1}^{k}\lambda_i}{n-k}\biggr\} \]

that is, \(d'\) is the largest index for which the minimizer on the corresponding boundary segment is in the feasible set, where the value \(k=0\) is again interpreted as the case where all \(\kappa_i=\sigma^2\), and the case \(k=d\) corresponds to the case where the absolute minimizer of \(\mathcal{F}\) is in the feasible set. We now show that the solution in Equation 14 for \(k=d'\) solves the constrained optimization problem for \(\mathcal{F}\).

Define

\[ \tau_k = \frac{\mathop{\mathrm{tr}}(\mathbf{C}) - \sum_{i=1}^{k}\lambda_i}{n-k},\qquad 0\leq k\leq d. \]

We trivially find \((n-k+1)\tau_{k-1} - (n-k)\tau_{k} = \lambda_{k}\), or

\[ \begin{aligned} (n-k+1)(\tau_{k-1}-\tau_{k}) &= \lambda_{k}-\tau_k \\ (n-k)(\tau_{k-1}-\tau_{k}) &= \lambda_{k}-\tau_{k-1} \leq \lambda_{k-1}-\tau_{k-1} \end{aligned} \tag{15}\]

Let us start with \(k=d\). If \(d'=d\), the absolute minimizer is feasible and hence also solves the constrained optimization problem. If \(d'<d\), then \(\lambda_d\leq\tau_d\), and by Equation 15, \(\tau_{d-1}\leq\tau_{d}\). Iterating this argument, we obtain a decreasing sequence \(\tau_d\geq \tau_{d-1} \geq \dots \geq \tau_{d'}\), whereas the corresponding eigenvalues are increasing, \(\lambda_d\leq \lambda_{d-1}\leq\dots\leq\lambda_{d'}\). Hence \(d'\) is the first index (counting backwards from \(k=d\)) where the sequences intersect, and if this does not happen, \(d'=0\). We have a corresponding feasible minimizer on \(\mathcal{S}_{d'}\) given by

\[ \begin{aligned} \hat\kappa_i &= \begin{cases} \lambda_i,\quad i=1,\dots,d'\\ \tau_{d'},\quad i=d'+1,\dots, d \end{cases} \\ \hat{\sigma}^2 &= \tau_{d'} \end{aligned} \tag{16}\]

Continuing with indices smaller than \(d'\), the first equation in Equation 15 now implies that \(\tau_{d'-1}>\tau_{d'}\), and plugging this into the second equation gives \(\lambda_{d'-1}>\tau_{d'-1}\). Iterating this argument, we find that both sequences are increasing, but in such a way that \(\lambda_k>\tau_k\) for \(k=1,\dots,d'\), that is, we have feasible minimizers for each segment \(\mathcal{S}_k\), \(k\leq d'\). Using Equation 13, this implies that

\[ \min_{\mathcal{S}_{k-1}} \mathcal{F}\geq \min_{\overline{\mathcal{S}}_k} \mathcal{F}= \min_{\mathcal{S}_{k}} \mathcal{F} \]

for \(k=1,\dots,d'\). In other words, the minima for decreasing values of \(k\) form an increasing sequence, and hence the absolute minimizer for \(\mathcal{F}\) must be the minimizer on \(\mathcal{S}_{d'}\) given by Equation 16.
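As an aside, the backward scan just described translates directly into code. The sketch below recomputes the \(\lambda_i\) and \(\tau_k\) from the simulated Z and C; it should return the same \(d'\) as the forward search in the earlier implementation.

```python
import numpy as np

n, d = Z.shape
U1 = np.linalg.svd(Z, full_matrices=False)[0]
lam = np.sort(np.linalg.eigvalsh(U1.T @ C @ U1))[::-1]   # eigenvalues of C_11, descending
trC = np.trace(C)
tau = np.array([(trC - lam[:k].sum()) / (n - k) for k in range(d + 1)])  # tau_0, ..., tau_d

k = d
while k > 0 and lam[k - 1] <= tau[k]:    # step backwards while the segment minimizer is infeasible
    k -= 1
print("d' =", k, " sigma2_hat =", tau[k])
```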

Hence we have found that

\[ \begin{aligned} \hat{\mathbf{K}}_{11} &= \mathbf{W}\mathop{\mathrm{diag}}(\lambda_1,\dots,\lambda_{d'},\tau_{d'},\dots,\tau_{d'}) \mathbf{W}^{T}\\ \hat{\sigma}^2 &= \tau_{d'}, \end{aligned} \]

where \(\mathbf{W}\) is the orthogonal matrix of eigenvectors of \(\mathbf{C}_{11}\) as above, and \(\mathop{\mathrm{diag}}(\cdot)\) denotes a diagonal matrix with the values in the argument on the diagonal. Plugging this into Equation 7 gives

\[ \begin{aligned} \hat{\mathbf{B}} &= \mathbf{V}\boldsymbol\Gamma_1^{-1}\bigl(\hat{\mathbf{K}}_{11} - \tau_{d'}\mathbb{1}\bigr)\boldsymbol\Gamma_1^{-1}\mathbf{V}^{T}\\ &= \mathbf{V}\boldsymbol\Gamma_1^{-1} \mathbf{W}\mathop{\mathrm{diag}}(\lambda_1-\tau_{d'}, \dots,\lambda_{d'} - \tau_{d'}, 0,\dots,0) \mathbf{W}^{T} \boldsymbol\Gamma_1^{-1}\mathbf{V}^{T} \end{aligned} \]

concluding the proof of the theorem.
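Finally, a sanity check on the simulated example: because the true \((\mathbf{B},\sigma^2)\) are a feasible point, the closed-form estimate should achieve a negative log-likelihood that is no larger, and \(\hat{\mathbf{B}}\) should come out positive semi-definite (reusing B_hat, sigma2_hat, neg_log_lik and the simulated data from the snippets above).

```python
import numpy as np

K_hat = Z @ B_hat @ Z.T + sigma2_hat * np.eye(n)
assert neg_log_lik(K_hat, C) <= neg_log_lik(K, C) + 1e-9
assert np.all(np.linalg.eigvalsh(B_hat) >= -1e-9)   # B_hat is positive semi-definite
```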

Footnotes

  1. I’m trying to avoid using the word “samples” here, because I want to reserve it for samples of the multivariate normal \(\mathcal{N}(0,\mathbf{K})\).↩︎

  2. In fact, this assumption is rather implausible, because not all genes are equally affected by population structure, and they are definitely not independent (functionally related genes tend to be coexpressed). However, a plausible approach is to estimate \(\mathbf{K}\) as described by borrowing information across genes, and then fit a gene-specific model \(y_i\sim\mathcal{N}(0,\tau_i^2 \hat{\mathbf{K}} + \sigma_i^2\mathbb{1})\). Simultaneously fitting gene-gene and inter-individual correlation structure using mixed models is still an open problem as far as I know.↩︎