Association analysis

Introduction

While BioFindr is developed primarily for causal inference from genomics and transcriptomics data, association analysis between genomics and transcriptomics data is also possible. In association analysis, genetic effects on the transcriptome are measured by testing if genes are differentially expressed in different groups of samples defined by the genotype of a genetic variant of interest. In BioFindr, significance of association is computed using a categorical model and a variant-specific background distribution. Similar to what was done in the coexpression analysis tutorial, this is achieved by modelling the distribution of association values between a given variant \(A\) and all genes \(B\) as a mixture distribution of real and null (random) associations. The relative weight of each component then reflects the prior probability of finding a non-null \(B\) gene for a given variant \(A\), and is fitted for every \(A\) separately.

We will illustrate how to run association analysis with BioFindr using preprocessed data from the GEUVADIS study. See the installation instructions for the steps you need to take to reproduce this tutorial.

Set up the environment

We begin by setting up the environment and loading some necessary packages.

using DrWatson
quickactivate(@__DIR__)

using DataFrames
using Arrow

using BioFindr

Precompiling Arrow
  ✓ TimeZones
  ✓ TimeZones → TimeZonesRecipesBaseExt
  ✓ Arrow
  3 dependencies successfully precompiled in 21 seconds. 33 already precompiled.
Precompiling BioFindr
  ✓ MetaGraphsNext
  ✓ BioFindr
  2 dependencies successfully precompiled in 9 seconds. 126 already precompiled.

Load data

Expression data

BioFindr expects that expression data are stored as floating-point numbers in a DataFrame where columns correspond to variables (genes) and rows to samples, see the coexpression analysis tutorial for more details.

This tutorial uses two tables of expression data from the same set of samples, one for mRNA expression data called dt, and one for microRNA (miRNA) expression data called dm:

dt = DataFrame(Arrow.Table(datadir("exp_pro","findr-data-geuvadis", "dt.arrow")));
dm = DataFrame(Arrow.Table(datadir("exp_pro","findr-data-geuvadis", "dm.arrow")));

Genotype data

BioFindr expects that genotype data are stored as integer numbers in a DataFrame where columns correspond to variables (genetic variants) and rows to samples. Since BioFindr uses a categorical association model, it does not matter how different genotypes (e.g. heterozygous vs. homozygous) are encoded as integers. Future versions will support scientific types for representing genotype data.

This tutorial uses two tables of genotype data from the same set of samples as the expression data, one with genotypes for mRNA eQTLs called dgt, and one for microRNA (miRNA) eQTLs called dgm:

dgt = DataFrame(Arrow.Table(datadir("exp_pro","findr-data-geuvadis", "dgt.arrow")));
dgm = DataFrame(Arrow.Table(datadir("exp_pro","findr-data-geuvadis", "dgm.arrow")));

Run BioFindr

Assume we are interested in identifying mRNA genes whose expression levels are associated to microRNA eQTLs. We run:

dP = findr(dt, dgm, FDR=0.05)

245×4 DataFrame

220 rows omitted

Row	Source	Target	Probability	qvalue
	String	String	Float64	Float64
1	rs71532525	ENSG00000229358.2	1.0	0.0
2	rs6987762	ENSG00000245080.3	1.0	0.0
3	rs3807344	ENSG00000128510.5	1.0	0.0
4	rs6462386	ENSG00000229358.2	1.0	0.0
5	rs3807344_1	ENSG00000128510.5	1.0	0.0
6	rs9616333	ENSG00000212939.2	1.0	4.48436e-11
7	rs8141807	ENSG00000212939.2	1.0	2.89877e-10
8	rs5743580	ENSG00000174130.7	1.0	3.65605e-9
9	rs73236618	ENSG00000174130.7	1.0	1.30597e-8
10	rs5743580	ENSG00000174125.3	1.0	2.22497e-8
11	rs73236618	ENSG00000174125.3	1.0	4.03598e-8
12	rs6462386	ENSG00000237004.2	1.0	5.69644e-8
13	rs71532525	ENSG00000237004.2	1.0	8.03159e-8
⋮	⋮	⋮	⋮	⋮
234	rs768533	ENSG00000177239.9	0.904976	0.0477775
235	rs768533_1	ENSG00000177239.9	0.904976	0.0479786
236	rs768533	ENSG00000237172.3	0.904966	0.048178
237	rs768533_1	ENSG00000237172.3	0.904966	0.0483757
238	indel:3I_20_62542026	ENSG00000182718.11	0.903654	0.0485772
239	rs4926170	ENSG00000241129.2	0.90332	0.0487785
240	rs80283015	ENSG00000145349.12	0.903025	0.0489793
241	rs768533	ENSG00000228976.1	0.902361	0.0491812
242	rs768533_1	ENSG00000228976.1	0.902361	0.0493815
243	rs80283015	ENSG00000088992.13	0.902234	0.0495806
244	rs30221	ENSG00000249719.1	0.902213	0.0497781
245	rs80283015	ENSG00000231249.1	0.901757	0.049976

BioFindr computes a posterior probability of non-zero association for every Source variant (columns of dgm) and Target gene (columns of dt). By default the output is sorted by decreasing Probability. The optional parameter FDR can be used to limit the output to the set of pairs that has a global false discovery rate (FDR) less than a desired value (here set to 5%). The qvalue column in the output can be used for further filtering of the output, see the coexpression analysis tutorial for further details.

Note the order of the arguments. The first argument dt is the Target DataFrame, and the second argument the Source DataFrame.