Association analysis

Introduction

While BioFindr is developed primarily for causal inference from genomics and transcriptomics data, association analysis between genomics and transcriptomics data is also possible. In association analysis, genetic effects on the transcriptome are measured by testing if genes are differentially expressed in different groups of samples defined by the genotype of a genetic variant of interest. In BioFindr, significance of association is computed using a categorical model and a variant-specific background distribution. Similar to what was done in the coexpression analysis tutorial, this is achieved by modelling the distribution of association values between a given variant \(A\) and all genes \(B\) as a mixture distribution of real and null (random) associations. The relative weight of each component then reflects the prior probability of finding a non-null \(B\) gene for a given variant \(A\), and is fitted for every \(A\) separately.

We will illustrate how to run association analysis with BioFindr using preprocessed data from the GEUVADIS study. See the installation instructions for the steps you need to take to reproduce this tutorial.

Set up the environment

We begin by setting up the environment and loading some necessary packages.

using DrWatson
quickactivate(@__DIR__)

using DataFrames
using Arrow

using BioFindr
Precompiling Arrow
  ✓ TimeZones
  ✓ TimeZones → TimeZonesRecipesBaseExt
  ✓ Arrow
  3 dependencies successfully precompiled in 21 seconds. 33 already precompiled.
Precompiling BioFindr
  ✓ MetaGraphsNext
  ✓ BioFindr
  2 dependencies successfully precompiled in 9 seconds. 126 already precompiled.

Load data

Expression data

BioFindr expects that expression data are stored as floating-point numbers in a DataFrame where columns correspond to variables (genes) and rows to samples, see the coexpression analysis tutorial for more details.

This tutorial uses two tables of expression data from the same set of samples, one for mRNA expression data called dt, and one for microRNA (miRNA) expression data called dm:

dt = DataFrame(Arrow.Table(datadir("exp_pro","findr-data-geuvadis", "dt.arrow")));
dm = DataFrame(Arrow.Table(datadir("exp_pro","findr-data-geuvadis", "dm.arrow")));

Genotype data

BioFindr expects that genotype data are stored as integer numbers in a DataFrame where columns correspond to variables (genetic variants) and rows to samples. Since BioFindr uses a categorical association model, it does not matter how different genotypes (e.g. heterozygous vs. homozygous) are encoded as integers. Future versions will support scientific types for representing genotype data.

This tutorial uses two tables of genotype data from the same set of samples as the expression data, one with genotypes for mRNA eQTLs called dgt, and one for microRNA (miRNA) eQTLs called dgm:

dgt = DataFrame(Arrow.Table(datadir("exp_pro","findr-data-geuvadis", "dgt.arrow")));
dgm = DataFrame(Arrow.Table(datadir("exp_pro","findr-data-geuvadis", "dgm.arrow")));

Run BioFindr

Assume we are interested in identifying mRNA genes whose expression levels are associated to microRNA eQTLs. We run:

dP = findr(dt, dgm, FDR=0.05)
245×4 DataFrame
220 rows omitted
Row Source Target Probability qvalue
String String Float64 Float64
1 rs71532525 ENSG00000229358.2 1.0 0.0
2 rs6987762 ENSG00000245080.3 1.0 0.0
3 rs3807344 ENSG00000128510.5 1.0 0.0
4 rs6462386 ENSG00000229358.2 1.0 0.0
5 rs3807344_1 ENSG00000128510.5 1.0 0.0
6 rs9616333 ENSG00000212939.2 1.0 4.48436e-11
7 rs8141807 ENSG00000212939.2 1.0 2.89877e-10
8 rs5743580 ENSG00000174130.7 1.0 3.65605e-9
9 rs73236618 ENSG00000174130.7 1.0 1.30597e-8
10 rs5743580 ENSG00000174125.3 1.0 2.22497e-8
11 rs73236618 ENSG00000174125.3 1.0 4.03598e-8
12 rs6462386 ENSG00000237004.2 1.0 5.69644e-8
13 rs71532525 ENSG00000237004.2 1.0 8.03159e-8
234 rs768533 ENSG00000177239.9 0.904976 0.0477775
235 rs768533_1 ENSG00000177239.9 0.904976 0.0479786
236 rs768533 ENSG00000237172.3 0.904966 0.048178
237 rs768533_1 ENSG00000237172.3 0.904966 0.0483757
238 indel:3I_20_62542026 ENSG00000182718.11 0.903654 0.0485772
239 rs4926170 ENSG00000241129.2 0.90332 0.0487785
240 rs80283015 ENSG00000145349.12 0.903025 0.0489793
241 rs768533 ENSG00000228976.1 0.902361 0.0491812
242 rs768533_1 ENSG00000228976.1 0.902361 0.0493815
243 rs80283015 ENSG00000088992.13 0.902234 0.0495806
244 rs30221 ENSG00000249719.1 0.902213 0.0497781
245 rs80283015 ENSG00000231249.1 0.901757 0.049976

BioFindr computes a posterior probability of non-zero association for every Source variant (columns of dgm) and Target gene (columns of dt). By default the output is sorted by decreasing Probability. The optional parameter FDR can be used to limit the output to the set of pairs that has a global false discovery rate (FDR) less than a desired value (here set to 5%). The qvalue column in the output can be used for further filtering of the output, see the coexpression analysis tutorial for further details.

Note the order of the arguments. The first argument dt is the Target DataFrame, and the second argument the Source DataFrame.