Genetic finemapping and path analysis
\[ \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator{\cov}{Cov} \]
Background
Statistical finemapping in GWAS is a technique used to pinpoint which genetic variants are truly causal for a particular phenotype. Many finemapping methods and tools exist, based on statistical models with varying degrees of complexity. Here I want to show how the basic principle of statistical finemapping can be understood using Sewall Wright’s method of path coefficients. For introductions to the method, see the wiki page on path analysis, Judea Pearl’s excellent Linear Models: A Useful “Microscope” for Causal Analysis, or this introductory page from the TETRAD group.
The method of path coefficients is based on path tracing rules. I find Pearl’s description of the rules to be the clearest:
Wright’s method consists of equating the (standardized) covariance between any pair of variables \(X\) and \(Y\) to the sum of products of path coefficients and error covariances along all d-connected paths between \(X\) and \(Y\). A path is d-connected if it does not traverse any collider (i.e., head-to-head arrows). The method is valid for standardized variables, namely variables normalized to have mean zero and unit variance. For non-standardized variables the method needs to be modified slightly, multiplying the product associated with a path \(p\) by the variance of the variable that acts as the “root” for path \(p\).
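As a minimal illustration of these rules, consider three standardized variables \(X\), \(Z\), \(Y\) with path coefficients \(a\) and \(b\). These names are hypothetical and not part of the finemapping model, and no other paths are assumed to connect \(X\) and \(Y\). A chain and a fork are d-connected and contribute the product of their coefficients, whereas a collider blocks the path:
\[ X \xrightarrow{a} Z \xrightarrow{b} Y:\ \cov(X,Y) = ab, \qquad X \xleftarrow{a} Z \xrightarrow{b} Y:\ \cov(X,Y) = ab, \qquad X \xrightarrow{a} Z \xleftarrow{b} Y:\ \cov(X,Y) = 0 \]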
To apply this method to the finemapping problem, consider an outcome variable \(Y\) and a set of associated variables \(X_1, X_2, \dots, X_n\). In GWAS, the \(X_i\) are SNPs. We assume:
- A subset of \(X_i\) have a direct causal effect on \(Y\) (indicated by solid arrows in the diagram below).
- All \(X_i\) are mutually correlated (due to linkage disequilibrium, indicated by bidirected dashed arrows).
- There are no causal relations among the \(X_i\).
We also assume for simplicity that all variables are standardized to zero mean and unit variance. Hence the covariances between variables are equal to their correlation coefficients. We write \(\cov(X_i,X_j)=\rho_{ij}\).
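In equation form, these assumptions can be sketched as a single linear structural equation for the outcome. The noise term \(\varepsilon\) is notation introduced here for illustration, and it is assumed to be uncorrelated with every \(X_i\):
\[ Y = \sum_{i=1}^{n} a_i X_i + \varepsilon, \qquad \cov(X_i, X_j) = \rho_{ij}, \qquad \cov(X_i, \varepsilon) = 0, \]
where \(a_i\) is the direct causal effect of \(X_i\) on \(Y\) and \(a_i = 0\) for every non-causal variant.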
The example causal diagram below shows a situation with four candidate SNPs, of which two are causal, and is a graphical representation of a linear structural equation model with correlated error terms:
Single causal variant
Let’s assume for a moment that there is only one causal variable \(X_c\) with causal effect \(a_c\). By the path tracing rules, the covariance between \(X_c\) and \(Y\) is simply
\[ \cov(X_c,Y) = a_c \]
For any other variable \(X_i\), there is only one allowed path that connects it to the outcome \(Y\), namely \(X_i \leftrightarrow X_c \to Y\). Hence
\[ \cov(X_i,Y) = \cov(X_i,X_c) a_c = \rho_{ic} a_c,\qquad i\neq c \]
Because \(|\rho_{ic}|<1\) for \(i\neq c\) (that is, assuming no other variant is in perfect LD with \(X_c\)), it follows immediately that
\[ |\cov(X_i,Y)| \leq |\cov(X_c,Y)| \]
with equality only if \(i=c\). Hence, under the hypothesis of a single causal variant, the causal variant is the variant with maximum correlation with the outcome.
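The single-variant result can be checked numerically. The sketch below is only an illustration under stated assumptions: the LD matrix is a made-up AR(1)-type pattern \(\rho_{ij} = \rho^{|i-j|}\), the variants are simulated as Gaussian variables and not as genotypes, and the helper name `ar1_ld` and all parameter values are hypothetical.

```python
import numpy as np

def ar1_ld(n, rho):
    """Hypothetical LD matrix with correlation rho**|i-j| between variants i and j."""
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :])

rng = np.random.default_rng(0)
n_snps, n_samples = 6, 200_000
R = ar1_ld(n_snps, rho=0.8)
c, a_c = 2, 0.3  # index and effect of the single causal variant (assumed values)

# Path tracing prediction: cov(X_i, Y) = rho_ic * a_c for every variant i
b_theory = R[:, c] * a_c

# Simulate standardized X with correlation matrix R and Y = a_c * X_c + noise;
# the noise variance 1 - a_c**2 makes Var(Y) = 1.
X = rng.multivariate_normal(np.zeros(n_snps), R, size=n_samples)
Y = a_c * X[:, c] + rng.normal(scale=np.sqrt(1 - a_c**2), size=n_samples)
b_hat = X.T @ Y / n_samples  # sample covariances (all means are zero by design)

print(np.round(b_theory, 3))             # predicted covariance profile
print(np.round(b_hat, 3))                # simulated covariance profile
print(int(np.argmax(np.abs(b_theory))))  # position of the maximum, equal to c
```

The simulated profile should track the predicted one up to sampling noise, with its peak at the causal variant.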
Multiple causal variants
In general there may be multiple causal variants. Assume the causal variants are \(X_{c_1}, \dots, X_{c_k}\), where both \(k\) and the identities \(c_1,\dots, c_k\) are unknown. To compute the covariances with \(Y\), first consider a causal variant \(c_j\). It has a direct path \(X_{c_j}\to Y\) and indirect paths \(X_{c_j} \leftrightarrow X_{c_{j'}} \to Y\):
\[ \cov(X_{c_j}, Y) = a_{c_j} + \sum_{j'\neq j} a_{c_{j'}} \rho_{c_j, c_{j'}} \]
For a non-causal variant, the contributing paths are \(X_{i} \leftrightarrow X_{c_{j}} \to Y\) and
\[ \cov(X_{i}, Y) = \sum_{j} a_{c_{j}} \rho_{i, c_{j}} \tag{1}\]
Since \(\rho_{ii}=1\), this equation is in fact valid for causal and non-causal variants alike: when \(i\) is itself a causal variant, the term with \(c_j=i\) reproduces its direct effect. Now it is no longer the case that the variant(s) with largest absolute correlation with the outcome are the causal ones.
For fixed \(c_j\), we can see \(\rho_{i,c_j}\) as a spatial profile which decreases (in absolute value) with increasing distance between variants \(i\) and \(c_j\), with a single maximum at \(i=c_j\). The covariance profile with the outcome is then a superposition of such profiles, and the causal variants will appear as local maxima in this profile only if they are sufficiently well separated on the genome.
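A small numerical sketch makes the superposition concrete. It assumes a hypothetical AR(1)-type LD pattern \(\rho_{ij} = 0.9^{|i-j|}\) over seven variants and two causal variants with unequal, made-up effects, and evaluates Equation 1 at the population level (no sampling noise).

```python
import numpy as np

idx = np.arange(7)
R = 0.9 ** np.abs(idx[:, None] - idx[None, :])  # hypothetical LD matrix

a = np.zeros(7)
a[2], a[4] = 1.0, 0.3  # two causal variants with unequal effects (assumed values)

b = R @ a                         # Equation 1 evaluated for all variants at once
ranking = np.argsort(-np.abs(b))  # variants ordered by decreasing |cov(X_i, Y)|

print(np.round(b, 3))
print(ranking[:2])  # the two variants most strongly correlated with Y
```

In this configuration the weaker causal variant (index 4) is neither a local maximum of the profile nor among the two top-ranked variants: the non-causal variant between the two causal ones (index 3) has a larger covariance with the outcome.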
Using the vector and matrix notation \(\mathbf{b} = (\cov(X_{i}, Y))_i\), \(\mathbf{a}=(a_j)_j\), and \(\mathbf{R} = (\rho_{ij})_{i,j}\), where it is understood that \(a_i=0\) for the non-causal SNPs, we can write Equation 1 as
\[ \mathbf{b} = \mathbf{R} \mathbf{a} \]
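If \(\mathbf{b}\) and \(\mathbf{R}\) were known exactly and \(\mathbf{R}\) were invertible, this linear system could simply be solved for \(\mathbf{a}\). The sketch below illustrates this, and also what happens when \(\mathbf{b}\) is slightly perturbed. The AR(1)-type LD matrix, the effect sizes, and the perturbation scale are all assumptions made for the example.

```python
import numpy as np

idx = np.arange(7)
R = 0.9 ** np.abs(idx[:, None] - idx[None, :])  # hypothetical LD matrix (invertible)

a_true = np.zeros(7)
a_true[[2, 4]] = [1.0, 0.3]  # assumed causal effects
b = R @ a_true               # exact covariances with the outcome

# Exact inputs: solving the linear system recovers the causal effects
a_exact = np.linalg.solve(R, b)
err_exact = np.max(np.abs(a_exact - a_true))

# Slightly perturbed inputs: the error in b is amplified when R is ill-conditioned
rng = np.random.default_rng(1)
b_noisy = b + 0.01 * rng.standard_normal(7)
a_noisy = np.linalg.solve(R, b_noisy)
err_noisy = np.max(np.abs(a_noisy - a_true))

print(np.round(a_exact, 6))
print(err_exact, err_noisy)
print(np.linalg.cond(R))  # condition number of the LD matrix
```

How strongly the perturbation is amplified depends on the condition number of \(\mathbf{R}\), which grows as the variants become more strongly correlated.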
In general, we don’t know the true \(\mathbf{b}\) and \(\mathbf{R}\) and must make do with finite-sample estimates \(\hat{\mathbf{b}}\) and \(\hat{\mathbf{R}}\). We then estimate the causal effects by minimizing a loss function \(\mathcal{L}\) together with a sparsity constraint \(\mathcal{S}(\mathbf{a})\), which enters the objective as a penalty term:
\[ \hat{\mathbf{a}} = \argmin_{\mathbf{a}} \left[\mathcal{L}( \hat{\mathbf{b}} - \hat{\mathbf{R}} \mathbf{a} ) + \mathcal{S}(\mathbf{a})\right] \tag{2}\]
In other words, we have reduced the problem to a standard feature selection problem.
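As one possible instance of Equation 2, the sketch below uses squared error loss with an \(\ell_1\) penalty, \(\tfrac12\|\hat{\mathbf{b}} - \hat{\mathbf{R}}\mathbf{a}\|^2 + \lambda\|\mathbf{a}\|_1\), minimized by proximal gradient descent (ISTA). Both the penalty and the solver are choices made here for illustration, not the method of any particular finemapping tool. The function names, the LD matrix, the value of `lam`, and the use of noise-free inputs in place of real estimates are likewise assumptions.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of t * ||.||_1: shrink each entry towards zero by t."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def objective(a, b_hat, R_hat, lam):
    """Squared error loss plus L1 penalty."""
    r = b_hat - R_hat @ a
    return 0.5 * r @ r + lam * np.abs(a).sum()

def ista(b_hat, R_hat, lam, n_iter=5000):
    """Proximal gradient descent for 0.5*||b_hat - R_hat a||^2 + lam*||a||_1."""
    # Step size 1/L, where L = (largest singular value of R_hat)^2 is the
    # Lipschitz constant of the gradient of the squared error term.
    step = 1.0 / np.linalg.norm(R_hat, 2) ** 2
    a = np.zeros_like(b_hat)
    for _ in range(n_iter):
        grad = -R_hat.T @ (b_hat - R_hat @ a)
        a = soft_threshold(a - step * grad, step * lam)
    return a

idx = np.arange(6)
R = 0.5 ** np.abs(idx[:, None] - idx[None, :])        # hypothetical LD matrix
a_true = np.array([0.0, 1.0, 0.0, 0.3, 0.0, 0.0])     # assumed causal effects
b = R @ a_true                                        # noise-free stand-in for b_hat

a_hat = ista(b, R, lam=1e-3)
print(np.round(a_hat, 3))
```

With real data, `b` and `R` would be replaced by the estimates \(\hat{\mathbf{b}}\) and \(\hat{\mathbf{R}}\), and `lam` would need to be tuned; larger values shrink more entries of the estimate to exactly zero.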
Although I have not done an exhaustive literature review, it appears that most if not all existing methods use squared error loss for the loss function \(\mathcal{L}\), with variation between methods mainly in the choice of sparsity constraint \(\mathcal{S}\). Squared error loss corresponds to maximum-likelihood estimation under the assumption of normally distributed errors, but this is not a requirement for path analysis – only linearity of the structural equations is. Hence other error models and loss functions could be used as well. Some useful comments on the relative merits of different sparsity constraints are in this review paper. An exhaustive list of methods is here.
The structural equations for Figure 1 are simple enough that they lead to Equation 1 immediately, without path analysis. The main benefit of path analysis is that it builds the habit of thinking graphically, which makes it easier to construct and reason about more complex models step by step. Likewise, expressing all prior knowledge about a system in a causal graph and then applying path analysis can be a first step towards determining whether a causal effect or variable can be identified, before trying Pearl’s more general, but also more difficult, do-calculus, or applying the principles of targeted learning.