using DrWatson
quickactivate(@__DIR__)
using DataFrames
using Arrow
using BioFindr
Causal inference
Introduction
The primary use of BioFindr is for causal inference from genomics and transcriptomics data. In causal inference, cis-acting eQTLs are used as causal anchors or instrumental variables to orient the direction of causality between coexpressed genes, see the Findr paper for more details. As in coexpression analysis and association analysis, the significance of a causal effect is computed using a gene-specific background distribution. This is again achieved by modelling the distribution of test statistics between a given gene \(A\) and all other genes \(B\) as a mixture distribution of real and null (random) correlations. The relative weight of each component reflects the prior probability of finding a non-null \(B\) gene for a given \(A\) gene, and is fitted for every \(A\) gene separately.
Unlike in coexpression analysis and association analysis, causal inference cannot be performed using a single statistical test, but requires the combination of multiple tests. In BioFindr, six likelihood ratio tests are implemented, which can be combined in multiple ways for causal inference. Tests are combined by addition or mulitplication of the posterior probabilities of individial tests, an approach first proposed in this paper.
We will illustrate how to run causal inference with BioFindr using preprocessed data from the GEUVADIS study. See the installation instructions for the steps you need to take to reproduce this tutorial.
Set up the environment
Load data
Expression data
BioFindr expects that expression data are stored as floating-point numbers in a DataFrame where columns correspond to variables (genes) and rows to samples, see the coexpression analysis tutorial for more details.
This tutorial uses two tables of expression data from the same set of samples, one for mRNA expression data called dt
, and one for microRNA (miRNA) expression data called dm
:
= DataFrame(Arrow.Table(datadir("exp_pro","findr-data-geuvadis", "dt.arrow")));
dt = DataFrame(Arrow.Table(datadir("exp_pro","findr-data-geuvadis", "dm.arrow"))); dm
Genotype data
BioFindr expects that genotype data are stored as integer numbers in a DataFrame where columns correspond to variables (genetic variants) and rows to samples. BioFindr uses a categorical model to associate genetic variation to variation in gene expression, hence how different genotypes (e.g. heterozygous vs. homozygous) are encoded as integers does not matter. Future versions will support scientific types for representing genotype data.
This tutorial uses two tables of expression data from the same set of samples, one with genotypes for mRNA eQTLs called dgt
, and one for microRNA (miRNA) eQTLs called dgm
:
= DataFrame(Arrow.Table(datadir("exp_pro","findr-data-geuvadis", "dgt.arrow")));
dgt = DataFrame(Arrow.Table(datadir("exp_pro","findr-data-geuvadis", "dgm.arrow"))); dgm
eQTL mapping data
The preprocessed GEUVADIS data has been organized such that each column of the genotype data is the strongest eQTLs for the corresponding column in the matching expression data. Usually however, eQTL mapping data will be available in the form of a table with variant IDs, gene IDs, and various eQTL associaion statistics (see the original GEUVADIS file for an example). BioFindr expects that such a table is read into a DataFrame, and that only the best (most strongly associated) eQTL is kept for each gene, that is, genes appear only once in the eQTL mapping table. Let’s artificially create such tables for our data:
= DataFrame(SNP_ID = names(dgt), GENE_ID=names(dt)[1:ncol(dgt)]);
dpt = DataFrame(SNP_ID = names(dgm), GENE_ID=names(dm)[1:ncol(dgm)]); dpm
Run BioFindr
Subset-to-all causal inference
In the most common scenario, we have a dataset of gene expression values, with significant cis-eQTL instruments for a subset of genes (“eGenes”). We can then infer causal interactions from the eGenes to all other genes (including other eGenes). For instance, to infer causal microRNA \(\to\) microRNA interaction, run:
= findr(dm, dgm, dpm; FDR=0.25) dP
Row | Source | Target | Probability | qvalue |
---|---|---|---|---|
String | String | Float64 | Float64 | |
1 | hsa-miR-550a-2-5p | hsa-miR-550a-3-5p | 1.0 | 0.0 |
2 | hsa-miR-96-5p | hsa-miR-182-5p | 1.0 | 4.81948e-13 |
3 | hsa-miR-550a-2-5p | hsa-miR-550a-2-3p | 1.0 | 2.53784e-11 |
4 | hsa-miR-3130-1-5p | hsa-miR-3130-2-5p | 1.0 | 1.04578e-10 |
5 | hsa-miR-550a-2-3p | hsa-miR-550a-3-5p | 1.0 | 1.5224e-10 |
6 | hsa-miR-550a-2-3p | hsa-miR-550a-2-5p | 1.0 | 4.87261e-10 |
7 | hsa-miR-183-5p | hsa-miR-182-5p | 1.0 | 5.45614e-9 |
8 | hsa-miR-30a-3p | hsa-miR-30a-5p | 1.0 | 4.29665e-8 |
9 | hsa-miR-574-5p | hsa-miR-574-3p | 0.999994 | 6.6706e-7 |
10 | hsa-miR-335-5p | hsa-miR-335-3p | 0.999991 | 1.51939e-6 |
11 | hsa-miR-30a-5p | hsa-miR-30a-3p | 0.999986 | 2.61404e-6 |
12 | hsa-miR-200a-3p | hsa-miR-200b-3p | 0.999981 | 3.95989e-6 |
13 | hsa-miR-2116-5p | hsa-miR-2116-3p | 0.999885 | 1.24838e-5 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
73 | hsa-miR-641-5p | hsa-miR-26a-2-3p | 0.578209 | 0.218109 |
74 | hsa-miR-1304-3p | hsa-miR-619-5p | 0.574428 | 0.220912 |
75 | hsa-miR-3176-3p | hsa-miR-143-3p | 0.572199 | 0.223671 |
76 | hsa-miR-96-5p | hsa-miR-155-3p | 0.57146 | 0.226367 |
77 | hsa-miR-3176-3p | hsa-miR-30b-5p | 0.55876 | 0.229157 |
78 | hsa-miR-3176-3p | hsa-miR-4524a-5p | 0.557952 | 0.231886 |
79 | hsa-miR-1304-3p | hsa-miR-4421-3p | 0.556114 | 0.23457 |
80 | hsa-miR-641-5p | hsa-miR-629-5p | 0.550412 | 0.237258 |
81 | hsa-miR-641-5p | hsa-let-7d-5p | 0.541004 | 0.239995 |
82 | hsa-miR-4482-1-3p | hsa-miR-30a-3p | 0.537326 | 0.242711 |
83 | hsa-miR-3176-3p | hsa-miR-138-1-3p | 0.537207 | 0.245362 |
84 | hsa-miR-193b-3p | hsa-miR-181c-3p | 0.528992 | 0.248049 |
BioFindr computes a posterior probability of a non-zero causal interaction for every pair pf Source and Target genes (columns of dm
). The possible Source genes are the eGenes, the subset of genes with cis-eQTLs (columns of dgm
), as defined by the eQTL-to-gene mapping in dpm
.
By default the output is sorted by decreasing Probability. The optional parameter FDR can be used to limit the output to the set of pairs that has a global false discovery rate (FDR) less than a desired value (here set to 10%). The qvalue column in the output can be used for further filtering of the output, see the coexpression analysis tutorial for more details.
Note the order of input arguments: first dm
, the expression data, then dgm
, the genotype data, and then dpm
, the eQTL mapping of variants to eGenes.
By default, BioFindr assumes that the first column of dpm
is a list of variant names that can be found in the column names of dgm
, and that the second column of dpm
is a list of gene names that can be found in the column names of dm
:
dpm
Row | SNP_ID | GENE_ID |
---|---|---|
String | String | |
1 | rs201731283 | hsa-miR-4804-5p |
2 | rs73933236 | hsa-miR-641-5p |
3 | rs17269517 | hsa-miR-2116-3p |
4 | rs2273626 | hsa-miR-4707-3p |
5 | rs174559 | hsa-miR-1908-5p |
6 | rs7095504 | hsa-miR-2110-5p |
7 | rs2583391 | hsa-miR-1255a-5p |
8 | rs5743580 | hsa-miR-574-3p |
9 | rs143756085 | hsa-miR-1270-2-5p |
10 | rs17269517_1 | hsa-miR-2116-5p |
11 | rs11191968 | hsa-miR-4482-1-3p |
12 | rs8141807 | hsa-miR-3667-5p |
13 | rs648571 | hsa-miR-5680-3p |
⋮ | ⋮ | ⋮ |
44 | rs1378942 | hsa-miR-4513-5p |
45 | rs768533_1 | hsa-miR-3130-1-5p |
46 | rs78562044 | hsa-miR-30a-3p |
47 | rs35671783 | hsa-miR-1304-3p |
48 | rs2537621 | hsa-miR-589-5p |
49 | rs6462386 | hsa-miR-550a-2-3p |
50 | rs745666 | hsa-miR-3615-3p |
51 | rs817478 | hsa-miR-4423-5p |
52 | rs73054305 | hsa-miR-642a-5p |
53 | rs3807344_1 | hsa-miR-335-3p |
54 | rs4393680 | hsa-miR-4741-3p |
55 | rs2583392 | hsa-miR-1255a-3p |
If your eQTL mapping DataFrame has the relevant columns in a different place, you need to use the optional arguments colX
(for the gene expression IDs) and colG
(for the eQTL genotype IDs) to specify either the relevant columns index or name:
= findr(dm, dgm, dpm; colX=2, colG=1, FDR=0.1); dP
or
= findr(dm, dgm, dpm; colX="GENE_ID", colG="SNP_ID", FDR=0.1); dP
Any other columns in the eQTL mapping DataFrame (such as association summary statistics) are ignored.
Bipartite causal inference
In some applications, we have multiple omics datasets from the same samples, and may be interested in causal inference from one set of variables to another one. In the GEUVADIS study for instance, we may be interested in inferring causal interactions between microRNAs and target genes. To infer interactions from microRNAs to mRNAs, run:
= findr(dt, dm, dgm, dpm; FDR=0.1) dP
Row | Source | Target | Probability | qvalue |
---|---|---|---|---|
String | String | Float64 | Float64 | |
1 | hsa-miR-335-5p | ENSG00000128510.5 | 1.0 | 0.0 |
2 | hsa-miR-335-3p | ENSG00000128510.5 | 1.0 | 0.0 |
3 | hsa-miR-3150b-3p | ENSG00000245080.3 | 1.0 | 3.4931e-8 |
4 | hsa-miR-345-5p | ENSG00000258504.1 | 1.0 | 1.39759e-7 |
5 | hsa-miR-345-5p | ENSG00000197119.7 | 0.999992 | 1.70218e-6 |
6 | hsa-let-7b-5p | ENSG00000241990.1 | 0.999987 | 3.63495e-6 |
7 | hsa-miR-574-3p | ENSG00000197712.6 | 0.999967 | 7.86251e-6 |
8 | hsa-miR-574-5p | ENSG00000197712.6 | 0.999965 | 1.12617e-5 |
9 | hsa-let-7b-5p | ENSG00000197182.8 | 0.999911 | 1.99429e-5 |
10 | hsa-miR-1270-2-5p | ENSG00000231205.6 | 0.998541 | 0.0001638 |
11 | hsa-miR-3150b-3p | ENSG00000175895.3 | 0.997247 | 0.000399176 |
12 | hsa-miR-335-5p | ENSG00000106484.8 | 0.991704 | 0.00105721 |
13 | hsa-miR-335-3p | ENSG00000106484.8 | 0.991704 | 0.00161402 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
25 | hsa-miR-335-5p | ENSG00000158516.7 | 0.848667 | 0.0513298 |
26 | hsa-miR-1270-2-5p | ENSG00000237440.2 | 0.838225 | 0.0555776 |
27 | hsa-miR-574-5p | ENSG00000186767.5 | 0.831116 | 0.0597742 |
28 | hsa-miR-30a-5p | ENSG00000169220.12 | 0.820983 | 0.0640328 |
29 | hsa-let-7b-5p | ENSG00000165071.10 | 0.809671 | 0.0683879 |
30 | hsa-miR-574-5p | ENSG00000154803.7 | 0.789247 | 0.0731334 |
31 | hsa-miR-181c-5p | ENSG00000116157.5 | 0.789079 | 0.0775781 |
32 | hsa-miR-181c-5p | ENSG00000129465.11 | 0.782729 | 0.0819435 |
33 | hsa-miR-1908-5p | ENSG00000134824.9 | 0.778177 | 0.0861823 |
34 | hsa-miR-345-5p | ENSG00000131116.5 | 0.776538 | 0.0902199 |
35 | hsa-miR-345-5p | ENSG00000253239.1 | 0.761621 | 0.0944531 |
36 | hsa-miR-200b-3p | ENSG00000247679.2 | 0.743472 | 0.0989551 |
Causal test combinations
As indicated above, causal inference requires the combination of multiple statistical tests, and BioFindr supports multiple such combinations, each with their own strengths and weaknesses. Comparisons between the different methods can be found in the original BioFindr paper and in a follow-up paper. In brief, the following combinations are available:
Instrumental variables (default)
The instrumental variables (IV) test combination infers a causal \(A\to B\) interaction if there is a genetic association \(E\to B\), where \(E\) is the causal anchor (best cis-eQTL for \(A\)), and if \(A\) and \(B\) are not independently associated with \(E\) (no pleiotropic effect). This test combination has been shown to provide a good balance between false negative and false positive rate and has a clear genetic interpretation. It is therefore recommended as the default combination. To use the instrumental variable test combination, you don’t need to do anything (it’s the default):
= findr(dm, dgm, dpm; FDR=0.1); dP
or, you can explicitly set the optional parameter combination="IV"
:
= findr(dm, dgm, dpm; FDR=0.1, combination="IV"); dP
Mediation
The mediation test infers a causal \(A\to B\) interaction if there is a genetic association \(E\to B\) and if this association disappears after conditioning on \(A\). If these two tests are true, the causal graph must be \(E\to A\to B\). As a result, this test combination has a very low false positive rate, but in most applications it is hampered by a high false negative rate: if \(A\) and \(B\) are jointly regulated by a third factor \(C\), the mediation test will fail even in the presence of a true \(A\to B\) interaction. To use the mediation test combination, set the optional parameter combination="mediation"
:
= findr(dm, dgm, dpm; FDR=0.1, combination="mediation"); dP_med
Original BioFindr combination
In the original BioFindr paper, a new test combination was proposed that combines the instrumental variable test combination with a new relevance test. On simulated data, this new combination further reduced the false negative rate compared to the IV combination. at the expense of an increased false positive rate, for an overall improved performance. However, the relevance test does not have a clear genetic or causal interpretation, and on real data, performance is generally equivalent to the IV combination, and therefore the IV combination is now recommended as the default. To use the original BioFindr test combination, set the optional parameter combination="orig"
:
= findr(dm, dgm, dpm; FDR=0.1, combination="orig"); dP_orig