Causal inference

Introduction

The primary use of BioFindr is for causal inference from genomics and transcriptomics data. In causal inference, cis-acting eQTLs are used as causal anchors or instrumental variables to orient the direction of causality between coexpressed genes, see the Findr paper for more details. As in coexpression analysis and association analysis, the significance of a causal effect is computed using a gene-specific background distribution. This is again achieved by modelling the distribution of test statistics between a given gene \(A\) and all other genes \(B\) as a mixture distribution of real and null (random) correlations. The relative weight of each component reflects the prior probability of finding a non-null \(B\) gene for a given \(A\) gene, and is fitted for every \(A\) gene separately.

Unlike in coexpression analysis and association analysis, causal inference cannot be performed using a single statistical test, but requires the combination of multiple tests. In BioFindr, six likelihood ratio tests are implemented, which can be combined in multiple ways for causal inference. Tests are combined by addition or mulitplication of the posterior probabilities of individial tests, an approach first proposed in this paper.

We will illustrate how to run causal inference with BioFindr using preprocessed data from the GEUVADIS study. See the installation instructions for the steps you need to take to reproduce this tutorial.

Set up the environment

using DrWatson
quickactivate(@__DIR__)

using DataFrames
using Arrow

using BioFindr

Load data

Expression data

BioFindr expects that expression data are stored as floating-point numbers in a DataFrame where columns correspond to variables (genes) and rows to samples, see the coexpression analysis tutorial for more details.

This tutorial uses two tables of expression data from the same set of samples, one for mRNA expression data called dt, and one for microRNA (miRNA) expression data called dm:

dt = DataFrame(Arrow.Table(datadir("exp_pro","findr-data-geuvadis", "dt.arrow")));
dm = DataFrame(Arrow.Table(datadir("exp_pro","findr-data-geuvadis", "dm.arrow")));

Genotype data

BioFindr expects that genotype data are stored as integer numbers in a DataFrame where columns correspond to variables (genetic variants) and rows to samples. BioFindr uses a categorical model to associate genetic variation to variation in gene expression, hence how different genotypes (e.g. heterozygous vs. homozygous) are encoded as integers does not matter. Future versions will support scientific types for representing genotype data.

This tutorial uses two tables of expression data from the same set of samples, one with genotypes for mRNA eQTLs called dgt, and one for microRNA (miRNA) eQTLs called dgm:

dgt = DataFrame(Arrow.Table(datadir("exp_pro","findr-data-geuvadis", "dgt.arrow")));
dgm = DataFrame(Arrow.Table(datadir("exp_pro","findr-data-geuvadis", "dgm.arrow")));

eQTL mapping data

The preprocessed GEUVADIS data has been organized such that each column of the genotype data is the strongest eQTLs for the corresponding column in the matching expression data. Usually however, eQTL mapping data will be available in the form of a table with variant IDs, gene IDs, and various eQTL associaion statistics (see the original GEUVADIS file for an example). BioFindr expects that such a table is read into a DataFrame, and that only the best (most strongly associated) eQTL is kept for each gene, that is, genes appear only once in the eQTL mapping table. Let’s artificially create such tables for our data:

dpt = DataFrame(SNP_ID = names(dgt), GENE_ID=names(dt)[1:ncol(dgt)]);
dpm = DataFrame(SNP_ID = names(dgm), GENE_ID=names(dm)[1:ncol(dgm)]);

Run BioFindr

Subset-to-all causal inference

In the most common scenario, we have a dataset of gene expression values, with significant cis-eQTL instruments for a subset of genes (“eGenes”). We can then infer causal interactions from the eGenes to all other genes (including other eGenes). For instance, to infer causal microRNA \(\to\) microRNA interaction, run:

dP = findr(dm, dgm, dpm; FDR=0.25)
84×4 DataFrame
59 rows omitted
Row Source Target Probability qvalue
String String Float64 Float64
1 hsa-miR-550a-2-5p hsa-miR-550a-3-5p 1.0 0.0
2 hsa-miR-96-5p hsa-miR-182-5p 1.0 4.81948e-13
3 hsa-miR-550a-2-5p hsa-miR-550a-2-3p 1.0 2.53784e-11
4 hsa-miR-3130-1-5p hsa-miR-3130-2-5p 1.0 1.04578e-10
5 hsa-miR-550a-2-3p hsa-miR-550a-3-5p 1.0 1.5224e-10
6 hsa-miR-550a-2-3p hsa-miR-550a-2-5p 1.0 4.87261e-10
7 hsa-miR-183-5p hsa-miR-182-5p 1.0 5.45614e-9
8 hsa-miR-30a-3p hsa-miR-30a-5p 1.0 4.29665e-8
9 hsa-miR-574-5p hsa-miR-574-3p 0.999994 6.6706e-7
10 hsa-miR-335-5p hsa-miR-335-3p 0.999991 1.51939e-6
11 hsa-miR-30a-5p hsa-miR-30a-3p 0.999986 2.61404e-6
12 hsa-miR-200a-3p hsa-miR-200b-3p 0.999981 3.95989e-6
13 hsa-miR-2116-5p hsa-miR-2116-3p 0.999885 1.24838e-5
73 hsa-miR-641-5p hsa-miR-26a-2-3p 0.578209 0.218109
74 hsa-miR-1304-3p hsa-miR-619-5p 0.574428 0.220912
75 hsa-miR-3176-3p hsa-miR-143-3p 0.572199 0.223671
76 hsa-miR-96-5p hsa-miR-155-3p 0.57146 0.226367
77 hsa-miR-3176-3p hsa-miR-30b-5p 0.55876 0.229157
78 hsa-miR-3176-3p hsa-miR-4524a-5p 0.557952 0.231886
79 hsa-miR-1304-3p hsa-miR-4421-3p 0.556114 0.23457
80 hsa-miR-641-5p hsa-miR-629-5p 0.550412 0.237258
81 hsa-miR-641-5p hsa-let-7d-5p 0.541004 0.239995
82 hsa-miR-4482-1-3p hsa-miR-30a-3p 0.537326 0.242711
83 hsa-miR-3176-3p hsa-miR-138-1-3p 0.537207 0.245362
84 hsa-miR-193b-3p hsa-miR-181c-3p 0.528992 0.248049

BioFindr computes a posterior probability of a non-zero causal interaction for every pair pf Source and Target genes (columns of dm). The possible Source genes are the eGenes, the subset of genes with cis-eQTLs (columns of dgm), as defined by the eQTL-to-gene mapping in dpm.

By default the output is sorted by decreasing Probability. The optional parameter FDR can be used to limit the output to the set of pairs that has a global false discovery rate (FDR) less than a desired value (here set to 10%). The qvalue column in the output can be used for further filtering of the output, see the coexpression analysis tutorial for more details.

Note the order of input arguments: first dm, the expression data, then dgm, the genotype data, and then dpm, the eQTL mapping of variants to eGenes.

By default, BioFindr assumes that the first column of dpm is a list of variant names that can be found in the column names of dgm, and that the second column of dpm is a list of gene names that can be found in the column names of dm:

dpm
55×2 DataFrame
30 rows omitted
Row SNP_ID GENE_ID
String String
1 rs201731283 hsa-miR-4804-5p
2 rs73933236 hsa-miR-641-5p
3 rs17269517 hsa-miR-2116-3p
4 rs2273626 hsa-miR-4707-3p
5 rs174559 hsa-miR-1908-5p
6 rs7095504 hsa-miR-2110-5p
7 rs2583391 hsa-miR-1255a-5p
8 rs5743580 hsa-miR-574-3p
9 rs143756085 hsa-miR-1270-2-5p
10 rs17269517_1 hsa-miR-2116-5p
11 rs11191968 hsa-miR-4482-1-3p
12 rs8141807 hsa-miR-3667-5p
13 rs648571 hsa-miR-5680-3p
44 rs1378942 hsa-miR-4513-5p
45 rs768533_1 hsa-miR-3130-1-5p
46 rs78562044 hsa-miR-30a-3p
47 rs35671783 hsa-miR-1304-3p
48 rs2537621 hsa-miR-589-5p
49 rs6462386 hsa-miR-550a-2-3p
50 rs745666 hsa-miR-3615-3p
51 rs817478 hsa-miR-4423-5p
52 rs73054305 hsa-miR-642a-5p
53 rs3807344_1 hsa-miR-335-3p
54 rs4393680 hsa-miR-4741-3p
55 rs2583392 hsa-miR-1255a-3p

If your eQTL mapping DataFrame has the relevant columns in a different place, you need to use the optional arguments colX (for the gene expression IDs) and colG (for the eQTL genotype IDs) to specify either the relevant columns index or name:

dP = findr(dm, dgm, dpm; colX=2, colG=1, FDR=0.1);

or

dP = findr(dm, dgm, dpm; colX="GENE_ID", colG="SNP_ID", FDR=0.1);

Any other columns in the eQTL mapping DataFrame (such as association summary statistics) are ignored.

Bipartite causal inference

In some applications, we have multiple omics datasets from the same samples, and may be interested in causal inference from one set of variables to another one. In the GEUVADIS study for instance, we may be interested in inferring causal interactions between microRNAs and target genes. To infer interactions from microRNAs to mRNAs, run:

dP = findr(dt, dm, dgm, dpm; FDR=0.1)
36×4 DataFrame
11 rows omitted
Row Source Target Probability qvalue
String String Float64 Float64
1 hsa-miR-335-5p ENSG00000128510.5 1.0 0.0
2 hsa-miR-335-3p ENSG00000128510.5 1.0 0.0
3 hsa-miR-3150b-3p ENSG00000245080.3 1.0 3.4931e-8
4 hsa-miR-345-5p ENSG00000258504.1 1.0 1.39759e-7
5 hsa-miR-345-5p ENSG00000197119.7 0.999992 1.70218e-6
6 hsa-let-7b-5p ENSG00000241990.1 0.999987 3.63495e-6
7 hsa-miR-574-3p ENSG00000197712.6 0.999967 7.86251e-6
8 hsa-miR-574-5p ENSG00000197712.6 0.999965 1.12617e-5
9 hsa-let-7b-5p ENSG00000197182.8 0.999911 1.99429e-5
10 hsa-miR-1270-2-5p ENSG00000231205.6 0.998541 0.0001638
11 hsa-miR-3150b-3p ENSG00000175895.3 0.997247 0.000399176
12 hsa-miR-335-5p ENSG00000106484.8 0.991704 0.00105721
13 hsa-miR-335-3p ENSG00000106484.8 0.991704 0.00161402
25 hsa-miR-335-5p ENSG00000158516.7 0.848667 0.0513298
26 hsa-miR-1270-2-5p ENSG00000237440.2 0.838225 0.0555776
27 hsa-miR-574-5p ENSG00000186767.5 0.831116 0.0597742
28 hsa-miR-30a-5p ENSG00000169220.12 0.820983 0.0640328
29 hsa-let-7b-5p ENSG00000165071.10 0.809671 0.0683879
30 hsa-miR-574-5p ENSG00000154803.7 0.789247 0.0731334
31 hsa-miR-181c-5p ENSG00000116157.5 0.789079 0.0775781
32 hsa-miR-181c-5p ENSG00000129465.11 0.782729 0.0819435
33 hsa-miR-1908-5p ENSG00000134824.9 0.778177 0.0861823
34 hsa-miR-345-5p ENSG00000131116.5 0.776538 0.0902199
35 hsa-miR-345-5p ENSG00000253239.1 0.761621 0.0944531
36 hsa-miR-200b-3p ENSG00000247679.2 0.743472 0.0989551

Causal test combinations

As indicated above, causal inference requires the combination of multiple statistical tests, and BioFindr supports multiple such combinations, each with their own strengths and weaknesses. Comparisons between the different methods can be found in the original BioFindr paper and in a follow-up paper. In brief, the following combinations are available:

Instrumental variables (default)

The instrumental variables (IV) test combination infers a causal \(A\to B\) interaction if there is a genetic association \(E\to B\), where \(E\) is the causal anchor (best cis-eQTL for \(A\)), and if \(A\) and \(B\) are not independently associated with \(E\) (no pleiotropic effect). This test combination has been shown to provide a good balance between false negative and false positive rate and has a clear genetic interpretation. It is therefore recommended as the default combination. To use the instrumental variable test combination, you don’t need to do anything (it’s the default):

dP = findr(dm, dgm, dpm; FDR=0.1);

or, you can explicitly set the optional parameter combination="IV":

dP = findr(dm, dgm, dpm; FDR=0.1, combination="IV");

Mediation

The mediation test infers a causal \(A\to B\) interaction if there is a genetic association \(E\to B\) and if this association disappears after conditioning on \(A\). If these two tests are true, the causal graph must be \(E\to A\to B\). As a result, this test combination has a very low false positive rate, but in most applications it is hampered by a high false negative rate: if \(A\) and \(B\) are jointly regulated by a third factor \(C\), the mediation test will fail even in the presence of a true \(A\to B\) interaction. To use the mediation test combination, set the optional parameter combination="mediation":

dP_med = findr(dm, dgm, dpm; FDR=0.1, combination="mediation");

Original BioFindr combination

In the original BioFindr paper, a new test combination was proposed that combines the instrumental variable test combination with a new relevance test. On simulated data, this new combination further reduced the false negative rate compared to the IV combination. at the expense of an increased false positive rate, for an overall improved performance. However, the relevance test does not have a clear genetic or causal interpretation, and on real data, performance is generally equivalent to the IV combination, and therefore the IV combination is now recommended as the default. To use the original BioFindr test combination, set the optional parameter combination="orig":

dP_orig = findr(dm, dgm, dpm; FDR=0.1, combination="orig");