Causal inference

Introduction

The primary use of BioFindr is for causal inference from genomics and transcriptomics data. In causal inference, cis-acting eQTLs are used as causal anchors or instrumental variables to orient the direction of causality between coexpressed genes, see the Findr paper for more details. As in coexpression analysis and association analysis, the significance of a causal effect is computed using a gene-specific background distribution. This is again achieved by modelling the distribution of test statistics between a given gene \(A\) and all other genes \(B\) as a mixture distribution of real and null (random) correlations. The relative weight of each component reflects the prior probability of finding a non-null \(B\) gene for a given \(A\) gene, and is fitted for every \(A\) gene separately.

Unlike in coexpression analysis and association analysis, causal inference cannot be performed using a single statistical test, but requires the combination of multiple tests. In BioFindr, six likelihood ratio tests are implemented, which can be combined in multiple ways for causal inference. Tests are combined by addition or mulitplication of the posterior probabilities of individial tests, an approach first proposed in this paper.

We will illustrate how to run causal inference with BioFindr using preprocessed data from the GEUVADIS study. See the installation instructions for the steps you need to take to reproduce this tutorial.

Set up the environment

using DrWatson
quickactivate(@__DIR__)

using DataFrames
using Arrow

using BioFindr

Load data

Expression data

BioFindr expects that expression data are stored as floating-point numbers in a DataFrame where columns correspond to variables (genes) and rows to samples, see the coexpression analysis tutorial for more details.

This tutorial uses two tables of expression data from the same set of samples, one for mRNA expression data called dt, and one for microRNA (miRNA) expression data called dm:

dt = DataFrame(Arrow.Table(datadir("exp_pro","findr-data-geuvadis", "dt.arrow")));
dm = DataFrame(Arrow.Table(datadir("exp_pro","findr-data-geuvadis", "dm.arrow")));

Genotype data

BioFindr expects that genotype data are stored as integer numbers in a DataFrame where columns correspond to variables (genetic variants) and rows to samples. BioFindr uses a categorical model to associate genetic variation to variation in gene expression, hence how different genotypes (e.g. heterozygous vs. homozygous) are encoded as integers does not matter. Future versions will support scientific types for representing genotype data.

This tutorial uses two tables of expression data from the same set of samples, one with genotypes for mRNA eQTLs called dgt, and one for microRNA (miRNA) eQTLs called dgm:

dgt = DataFrame(Arrow.Table(datadir("exp_pro","findr-data-geuvadis", "dgt.arrow")));
dgm = DataFrame(Arrow.Table(datadir("exp_pro","findr-data-geuvadis", "dgm.arrow")));

eQTL mapping data

The preprocessed GEUVADIS data has been organized such that each column of the genotype data is the strongest eQTLs for the corresponding column in the matching expression data. Usually however, eQTL mapping data will be available in the form of a table with variant IDs, gene IDs, and various eQTL associaion statistics (see the original GEUVADIS file for an example). BioFindr expects that such a table is read into a DataFrame, and that only the best (most strongly associated) eQTL is kept for each gene, that is, genes appear only once in the eQTL mapping table. Let’s artificially create such tables for our data:

dpt = DataFrame(SNP_ID = names(dgt), GENE_ID=names(dt)[1:ncol(dgt)]);
dpm = DataFrame(SNP_ID = names(dgm), GENE_ID=names(dm)[1:ncol(dgm)]);

Run BioFindr

Subset-to-all causal inference

In the most common scenario, we have a dataset of gene expression values, with significant cis-eQTL instruments for a subset of genes (“eGenes”). We can then infer causal interactions from the eGenes to all other genes (including other eGenes). For instance, to infer causal microRNA \(\to\) microRNA interaction, run:

dP = findr(dm, dgm, dpm; FDR=0.25)

84×4 DataFrame

59 rows omitted

Row	Source	Target	Probability	qvalue
	String	String	Float64	Float64
1	hsa-miR-550a-2-5p	hsa-miR-550a-3-5p	1.0	0.0
2	hsa-miR-96-5p	hsa-miR-182-5p	1.0	4.81948e-13
3	hsa-miR-550a-2-5p	hsa-miR-550a-2-3p	1.0	2.53784e-11
4	hsa-miR-3130-1-5p	hsa-miR-3130-2-5p	1.0	1.04578e-10
5	hsa-miR-550a-2-3p	hsa-miR-550a-3-5p	1.0	1.5224e-10
6	hsa-miR-550a-2-3p	hsa-miR-550a-2-5p	1.0	4.87261e-10
7	hsa-miR-183-5p	hsa-miR-182-5p	1.0	5.45614e-9
8	hsa-miR-30a-3p	hsa-miR-30a-5p	1.0	4.29665e-8
9	hsa-miR-574-5p	hsa-miR-574-3p	0.999994	6.6706e-7
10	hsa-miR-335-5p	hsa-miR-335-3p	0.999991	1.51939e-6
11	hsa-miR-30a-5p	hsa-miR-30a-3p	0.999986	2.61404e-6
12	hsa-miR-200a-3p	hsa-miR-200b-3p	0.999981	3.95989e-6
13	hsa-miR-2116-5p	hsa-miR-2116-3p	0.999885	1.24838e-5
⋮	⋮	⋮	⋮	⋮
73	hsa-miR-641-5p	hsa-miR-26a-2-3p	0.578209	0.218109
74	hsa-miR-1304-3p	hsa-miR-619-5p	0.574428	0.220912
75	hsa-miR-3176-3p	hsa-miR-143-3p	0.572199	0.223671
76	hsa-miR-96-5p	hsa-miR-155-3p	0.57146	0.226367
77	hsa-miR-3176-3p	hsa-miR-30b-5p	0.55876	0.229157
78	hsa-miR-3176-3p	hsa-miR-4524a-5p	0.557952	0.231886
79	hsa-miR-1304-3p	hsa-miR-4421-3p	0.556114	0.23457
80	hsa-miR-641-5p	hsa-miR-629-5p	0.550412	0.237258
81	hsa-miR-641-5p	hsa-let-7d-5p	0.541004	0.239995
82	hsa-miR-4482-1-3p	hsa-miR-30a-3p	0.537326	0.242711
83	hsa-miR-3176-3p	hsa-miR-138-1-3p	0.537207	0.245362
84	hsa-miR-193b-3p	hsa-miR-181c-3p	0.528992	0.248049

BioFindr computes a posterior probability of a non-zero causal interaction for every pair pf Source and Target genes (columns of dm). The possible Source genes are the eGenes, the subset of genes with cis-eQTLs (columns of dgm), as defined by the eQTL-to-gene mapping in dpm.

By default the output is sorted by decreasing Probability. The optional parameter FDR can be used to limit the output to the set of pairs that has a global false discovery rate (FDR) less than a desired value (here set to 10%). The qvalue column in the output can be used for further filtering of the output, see the coexpression analysis tutorial for more details.

Note the order of input arguments: first dm, the expression data, then dgm, the genotype data, and then dpm, the eQTL mapping of variants to eGenes.

By default, BioFindr assumes that the first column of dpm is a list of variant names that can be found in the column names of dgm, and that the second column of dpm is a list of gene names that can be found in the column names of dm:

dpm

55×2 DataFrame

30 rows omitted

Row	SNP_ID	GENE_ID
	String	String
1	rs201731283	hsa-miR-4804-5p
2	rs73933236	hsa-miR-641-5p
3	rs17269517	hsa-miR-2116-3p
4	rs2273626	hsa-miR-4707-3p
5	rs174559	hsa-miR-1908-5p
6	rs7095504	hsa-miR-2110-5p
7	rs2583391	hsa-miR-1255a-5p
8	rs5743580	hsa-miR-574-3p
9	rs143756085	hsa-miR-1270-2-5p
10	rs17269517_1	hsa-miR-2116-5p
11	rs11191968	hsa-miR-4482-1-3p
12	rs8141807	hsa-miR-3667-5p
13	rs648571	hsa-miR-5680-3p
⋮	⋮	⋮
44	rs1378942	hsa-miR-4513-5p
45	rs768533_1	hsa-miR-3130-1-5p
46	rs78562044	hsa-miR-30a-3p
47	rs35671783	hsa-miR-1304-3p
48	rs2537621	hsa-miR-589-5p
49	rs6462386	hsa-miR-550a-2-3p
50	rs745666	hsa-miR-3615-3p
51	rs817478	hsa-miR-4423-5p
52	rs73054305	hsa-miR-642a-5p
53	rs3807344_1	hsa-miR-335-3p
54	rs4393680	hsa-miR-4741-3p
55	rs2583392	hsa-miR-1255a-3p

If your eQTL mapping DataFrame has the relevant columns in a different place, you need to use the optional arguments colX (for the gene expression IDs) and colG (for the eQTL genotype IDs) to specify either the relevant columns index or name:

dP = findr(dm, dgm, dpm; colX=2, colG=1, FDR=0.1);

dP = findr(dm, dgm, dpm; colX="GENE_ID", colG="SNP_ID", FDR=0.1);

Any other columns in the eQTL mapping DataFrame (such as association summary statistics) are ignored.

Bipartite causal inference

In some applications, we have multiple omics datasets from the same samples, and may be interested in causal inference from one set of variables to another one. In the GEUVADIS study for instance, we may be interested in inferring causal interactions between microRNAs and target genes. To infer interactions from microRNAs to mRNAs, run:

dP = findr(dt, dm, dgm, dpm; FDR=0.1)

36×4 DataFrame

11 rows omitted

Row	Source	Target	Probability	qvalue
	String	String	Float64	Float64
1	hsa-miR-335-5p	ENSG00000128510.5	1.0	0.0
2	hsa-miR-335-3p	ENSG00000128510.5	1.0	0.0
3	hsa-miR-3150b-3p	ENSG00000245080.3	1.0	3.4931e-8
4	hsa-miR-345-5p	ENSG00000258504.1	1.0	1.39759e-7
5	hsa-miR-345-5p	ENSG00000197119.7	0.999992	1.70218e-6
6	hsa-let-7b-5p	ENSG00000241990.1	0.999987	3.63495e-6
7	hsa-miR-574-3p	ENSG00000197712.6	0.999967	7.86251e-6
8	hsa-miR-574-5p	ENSG00000197712.6	0.999965	1.12617e-5
9	hsa-let-7b-5p	ENSG00000197182.8	0.999911	1.99429e-5
10	hsa-miR-1270-2-5p	ENSG00000231205.6	0.998541	0.0001638
11	hsa-miR-3150b-3p	ENSG00000175895.3	0.997247	0.000399176
12	hsa-miR-335-5p	ENSG00000106484.8	0.991704	0.00105721
13	hsa-miR-335-3p	ENSG00000106484.8	0.991704	0.00161402
⋮	⋮	⋮	⋮	⋮
25	hsa-miR-335-5p	ENSG00000158516.7	0.848667	0.0513298
26	hsa-miR-1270-2-5p	ENSG00000237440.2	0.838225	0.0555776
27	hsa-miR-574-5p	ENSG00000186767.5	0.831116	0.0597742
28	hsa-miR-30a-5p	ENSG00000169220.12	0.820983	0.0640328
29	hsa-let-7b-5p	ENSG00000165071.10	0.809671	0.0683879
30	hsa-miR-574-5p	ENSG00000154803.7	0.789247	0.0731334
31	hsa-miR-181c-5p	ENSG00000116157.5	0.789079	0.0775781
32	hsa-miR-181c-5p	ENSG00000129465.11	0.782729	0.0819435
33	hsa-miR-1908-5p	ENSG00000134824.9	0.778177	0.0861823
34	hsa-miR-345-5p	ENSG00000131116.5	0.776538	0.0902199
35	hsa-miR-345-5p	ENSG00000253239.1	0.761621	0.0944531
36	hsa-miR-200b-3p	ENSG00000247679.2	0.743472	0.0989551

Causal test combinations

As indicated above, causal inference requires the combination of multiple statistical tests, and BioFindr supports multiple such combinations, each with their own strengths and weaknesses. Comparisons between the different methods can be found in the original BioFindr paper and in a follow-up paper. In brief, the following combinations are available:

Instrumental variables (default)

The instrumental variables (IV) test combination infers a causal \(A\to B\) interaction if there is a genetic association \(E\to B\), where \(E\) is the causal anchor (best cis-eQTL for \(A\)), and if \(A\) and \(B\) are not independently associated with \(E\) (no pleiotropic effect). This test combination has been shown to provide a good balance between false negative and false positive rate and has a clear genetic interpretation. It is therefore recommended as the default combination. To use the instrumental variable test combination, you don’t need to do anything (it’s the default):

dP = findr(dm, dgm, dpm; FDR=0.1);

or, you can explicitly set the optional parameter combination="IV":

dP = findr(dm, dgm, dpm; FDR=0.1, combination="IV");

Mediation

The mediation test infers a causal \(A\to B\) interaction if there is a genetic association \(E\to B\) and if this association disappears after conditioning on \(A\). If these two tests are true, the causal graph must be \(E\to A\to B\). As a result, this test combination has a very low false positive rate, but in most applications it is hampered by a high false negative rate: if \(A\) and \(B\) are jointly regulated by a third factor \(C\), the mediation test will fail even in the presence of a true \(A\to B\) interaction. To use the mediation test combination, set the optional parameter combination="mediation":

dP_med = findr(dm, dgm, dpm; FDR=0.1, combination="mediation");

Original BioFindr combination

In the original BioFindr paper, a new test combination was proposed that combines the instrumental variable test combination with a new relevance test. On simulated data, this new combination further reduced the false negative rate compared to the IV combination. at the expense of an increased false positive rate, for an overall improved performance. However, the relevance test does not have a clear genetic or causal interpretation, and on real data, performance is generally equivalent to the IV combination, and therefore the IV combination is now recommended as the default. To use the original BioFindr test combination, set the optional parameter combination="orig":

dP_orig = findr(dm, dgm, dpm; FDR=0.1, combination="orig");