Coexpression analysis

Introduction

While [BioFindr] is developed primarily for causal inference from genomics and transcriptomics data, coexpression analysis of transcriptomics data alone is also possible. In coexpression analysis, pairwise correlation is used as a simple measure for the probability of two genes being functionally related either through direct or indirect regulation, or through coregulation by a third factor. In BioFindr, significance of pairwise correlation is computed using a gene-specific background distribution, allowing for genes having different biological roles. For instance, it is known that many biological networks are scale-free, where a small number of so-called “hub genes” have a high number of interaction partners while most other genes only have few. In BioFindr, this is accomodated by modelling the distribution of correlation values between a given gene \(A\) and all other genes \(B\) as a mixture distribution of real and null (random) correlations. The relative weight of each component reflects the prior probability of finding a non-null \(B\) gene for a given \(A\) gene, and is fitted for every \(A\) gene separately.

We will illustrate how to run coexpression analysis with BioFindr using preprocessed data from the GEUVADIS study. See the installation instructions for the steps you need to take to reproduce this tutorial.

Set up the environment

We begin by setting up the environment and loading some necessary packages.

using DrWatson
quickactivate(@__DIR__)

using DataFrames
using Arrow
using Markdown

using BioFindr

Load expression data

BioFindr expects that expression data are stored in a DataFrame where columns correspond to variables (genes) and rows to samples. An expression DataFrame should not contain any other columns (e.g. gene annotation) than gene expression columns, and gene expression data should be stored as floating-point numbers. Internally, BioFindr operates on matrices, and if you have an expression DataFrame df, then Matrix(df) should return a matrix of floats.

At the moment, BioFindr does not support count-based expression data being provided as a DataFrame of integers. This is not an intrinsic limitation of the method, but simply to distinguish expression data from integer-valued genotype data. Future versions will remove this limitation by supporting scientific types.

This tutorial uses two tables of expression data from the same set of samples, one for mRNA expression data called dt, and one for microRNA (miRNA) expression data called dm:

dt = DataFrame(Arrow.Table(datadir("exp_pro","findr-data-geuvadis", "dt.arrow")));
dm = DataFrame(Arrow.Table(datadir("exp_pro","findr-data-geuvadis", "dm.arrow")));

The mRNA data has expression data for 23722 genes in 360 samples, while miRNA data is available for 674 miRNAs in the same 360 samples.

We can confirm that the data frames are of the right type:

[typeof(Matrix(dt)) typeof(Matrix(dm))]
1×2 Matrix{DataType}:
 Matrix{Float64}  Matrix{Float64}

Run BioFindr

All-vs-all coexpression analysis

The simplest type of coexpression analysis tests for non-zero correlation among all possible pairs in a gene expression dataset. Let’s do this for the miRNA data:

dP_mir_all = findr(dm, FDR=0.05)
28378×4 DataFrame
28353 rows omitted
Row Source Target Probability qvalue
String String Float64 Float64
1 hsa-miR-574-3p hsa-miR-574-5p 1.0 0.0
2 hsa-miR-4482-1-3p hsa-miR-1304-3p 1.0 0.0
3 hsa-miR-3667-5p hsa-miR-3667-3p 1.0 0.0
4 hsa-miR-550a-2-5p hsa-miR-550a-2-3p 1.0 0.0
5 hsa-miR-550a-2-5p hsa-miR-550a-3-5p 1.0 0.0
6 hsa-miR-3130-2-5p hsa-miR-3130-1-5p 1.0 0.0
7 hsa-let-7b-5p hsa-let-7b-3p 1.0 0.0
8 hsa-miR-96-5p hsa-miR-182-5p 1.0 0.0
9 hsa-miR-96-5p hsa-miR-1271-5p 1.0 0.0
10 hsa-miR-3667-3p hsa-miR-3667-5p 1.0 0.0
11 hsa-miR-181c-5p hsa-miR-181c-3p 1.0 0.0
12 hsa-miR-181c-5p hsa-miR-181d-5p 1.0 0.0
13 hsa-miR-335-5p hsa-miR-335-3p 1.0 0.0
28367 hsa-miR-410-3p hsa-miR-7-3-5p 0.84803 0.0499585
28368 hsa-miR-152-3p hsa-miR-181a-1-5p 0.848014 0.0499621
28369 hsa-let-7f-2-3p hsa-miR-4446-3p 0.848008 0.0499657
28370 hsa-miR-181a-1-3p hsa-miR-99b-5p 0.847993 0.0499693
28371 hsa-miR-330-5p hsa-miR-296-5p 0.84799 0.0499729
28372 hsa-miR-1307-5p hsa-miR-32-3p 0.847985 0.0499765
28373 hsa-miR-221-3p hsa-miR-424-5p 0.847972 0.0499801
28374 hsa-miR-27b-5p hsa-miR-4424-3p 0.847958 0.0499837
28375 hsa-miR-629-5p hsa-miR-5096-5p 0.847952 0.0499873
28376 hsa-miR-181c-3p hsa-miR-374b-5p 0.847946 0.0499909
28377 hsa-miR-3688-2-3p hsa-miR-1260b-5p 0.847943 0.0499945
28378 hsa-let-7d-5p hsa-miR-142-3p 0.847937 0.0499981

BioFindr computes a posterior probability of non-zero correlation for every Source and Target gene pair. By default the output is sorted by decreasing Probability and self-interactions are excluded. The optional parameter FDR can be used to limit the output to the set of pairs that has a global false discovery rate (FDR) less than a desired value. The qvalue column in the output can be used for further filtering of the output. Say you ran findr with an FRD threshold of 5% as above. If you now want to restrict the output to an FDR threshold of 1%, you can do:

filter!(row -> row.qvalue <= 0.01, dP_mir_all)
14556×4 DataFrame
14531 rows omitted
Row Source Target Probability qvalue
String String Float64 Float64
1 hsa-miR-574-3p hsa-miR-574-5p 1.0 0.0
2 hsa-miR-4482-1-3p hsa-miR-1304-3p 1.0 0.0
3 hsa-miR-3667-5p hsa-miR-3667-3p 1.0 0.0
4 hsa-miR-550a-2-5p hsa-miR-550a-2-3p 1.0 0.0
5 hsa-miR-550a-2-5p hsa-miR-550a-3-5p 1.0 0.0
6 hsa-miR-3130-2-5p hsa-miR-3130-1-5p 1.0 0.0
7 hsa-let-7b-5p hsa-let-7b-3p 1.0 0.0
8 hsa-miR-96-5p hsa-miR-182-5p 1.0 0.0
9 hsa-miR-96-5p hsa-miR-1271-5p 1.0 0.0
10 hsa-miR-3667-3p hsa-miR-3667-5p 1.0 0.0
11 hsa-miR-181c-5p hsa-miR-181c-3p 1.0 0.0
12 hsa-miR-181c-5p hsa-miR-181d-5p 1.0 0.0
13 hsa-miR-335-5p hsa-miR-335-3p 1.0 0.0
14545 hsa-miR-4662a-5p hsa-miR-486-5p 0.961667 0.0099779
14546 hsa-miR-181b-1-3p hsa-miR-3609-3p 0.961654 0.00997985
14547 hsa-miR-23a-5p hsa-miR-150-5p 0.961633 0.0099818
14548 hsa-miR-500a-3p hsa-miR-138-1-3p 0.961624 0.00998375
14549 hsa-miR-425-3p hsa-miR-200c-3p 0.961621 0.00998571
14550 hsa-miR-125a-5p hsa-miR-140-3p 0.961608 0.00998766
14551 hsa-miR-342-5p hsa-miR-335-5p 0.961607 0.00998961
14552 hsa-miR-3615-3p hsa-miR-744-5p 0.9616 0.00999156
14553 hsa-miR-320d-1-3p hsa-miR-365a-5p 0.961596 0.00999351
14554 hsa-miR-23a-5p hsa-miR-425-5p 0.961593 0.00999547
14555 hsa-miR-26b-3p hsa-miR-4687-3p 0.961592 0.00999742
14556 hsa-miR-589-5p hsa-miR-644b-3p 0.96159 0.00999937

Note that the filter! command modifies the input DataFrame in-place, that is, the rows not matching the selection criteria are deleted. Use filter to return a new DataFrame with the selected rows.

Finally, remember that the output of coexpression analysis in BioFindr is not symmetric, that is

\[ P(Source, Target) \neq P(Target, Source) \]

This is because the posterior probabilities are estimated using a Source-specific background distribution, accounting for the fact that different genes may have a different number of non-null interaction partners a priori. See the Findr paper for details.

Bipartite coexpression analysis

Since BioFindr’s posterior probabilities are Source gene-specific, we may be interested in computing probabilities only for a subset of Source genes, or using different Source and Target gene sets.

As an example of the first scenario, assume we are interested in finding microRNAs that are significantly correlated with microRNAs from the Mir-200 family. First find the Mir-200 microRNAs:

mir200 = names(dm)[startswith.(names(dm),"hsa-miR-200")]
3-element Vector{String}:
 "hsa-miR-200b-3p"
 "hsa-miR-200a-3p"
 "hsa-miR-200c-3p"

Then run

dP_mir200_mir = findr(dm, colnames=mir200, FDR=0.01)
47×4 DataFrame
22 rows omitted
Row Source Target Probability qvalue
String String Float64 Float64
1 hsa-miR-200a-3p hsa-miR-200b-3p 1.0 1.39181e-9
2 hsa-miR-200b-3p hsa-miR-200a-3p 1.0 1.55254e-9
3 hsa-miR-200c-3p hsa-miR-93-5p 0.999998 5.68454e-7
4 hsa-miR-200c-3p hsa-miR-140-3p 0.999984 4.44472e-6
5 hsa-miR-200b-3p hsa-miR-429-3p 0.999918 1.99874e-5
6 hsa-miR-200c-3p hsa-miR-16-1-5p 0.999792 5.13862e-5
7 hsa-miR-200c-3p hsa-miR-664-3p 0.999784 7.49279e-5
8 hsa-miR-200c-3p hsa-miR-17-5p 0.999641 0.000110395
9 hsa-miR-200b-3p hsa-miR-582-3p 0.999246 0.00018188
10 hsa-miR-200c-3p hsa-miR-769-5p 0.999242 0.000239459
11 hsa-miR-200c-3p hsa-miR-130b-5p 0.999139 0.000295921
12 hsa-miR-200c-3p hsa-miR-425-5p 0.999052 0.000350225
13 hsa-miR-200c-3p hsa-miR-103a-2-3p 0.999001 0.000400161
36 hsa-miR-200b-3p hsa-miR-769-3p 0.983749 0.00628075
37 hsa-miR-200c-3p hsa-miR-378a-3p 0.983227 0.00656431
38 hsa-miR-200b-3p hsa-miR-26b-5p 0.981109 0.00688869
39 hsa-miR-200c-3p hsa-miR-619-5p 0.980948 0.00720056
40 hsa-miR-200c-3p hsa-miR-500a-5p 0.97978 0.00752605
41 hsa-miR-200c-3p hsa-miR-4677-3p 0.979595 0.00784016
42 hsa-miR-200b-3p hsa-miR-328-3p 0.979019 0.00815303
43 hsa-miR-200b-3p hsa-miR-9-1-5p 0.978721 0.00845829
44 hsa-miR-200c-3p hsa-miR-141-5p 0.978487 0.00875499
45 hsa-miR-200c-3p hsa-miR-30e-3p 0.977855 0.00905255
46 hsa-miR-200a-3p hsa-miR-26b-5p 0.976313 0.0093707
47 hsa-miR-200c-3p hsa-miR-25-3p 0.971038 0.00978752

The parameter colnames must be a vector of strings containing a subset of variable names of the input DataFrame dm to be used as Source genes.

As an example of the second scenario, we may be interested in finding genes that are significantly correlated with all or a subset of microRNAs:

dP_mir_mrna = findr(dt, dm, FDR=0.01)
56268×4 DataFrame
56243 rows omitted
Row Source Target Probability qvalue
String String Float64 Float64
1 hsa-miR-3150b-3p ENSG00000245080.3 1.0 0.0
2 hsa-let-7b-5p ENSG00000197182.8 1.0 0.0
3 hsa-let-7b-5p ENSG00000241990.1 1.0 0.0
4 hsa-miR-335-5p ENSG00000106484.8 1.0 0.0
5 hsa-miR-335-5p ENSG00000128510.5 1.0 0.0
6 hsa-miR-335-5p ENSG00000220884.2 1.0 0.0
7 hsa-miR-618-5p ENSG00000111052.3 1.0 0.0
8 hsa-miR-335-3p ENSG00000106484.8 1.0 0.0
9 hsa-miR-335-3p ENSG00000128510.5 1.0 0.0
10 hsa-miR-335-3p ENSG00000220884.2 1.0 0.0
11 hsa-let-7b-3p ENSG00000241990.1 1.0 0.0
12 hsa-miR-138-1-3p ENSG00000261786.1 1.0 0.0
13 hsa-miR-138-1-5p ENSG00000261786.1 1.0 0.0
56257 hsa-miR-3909-3p ENSG00000143028.7 0.973468 0.00999676
56258 hsa-miR-193b-3p ENSG00000198833.5 0.973467 0.00999706
56259 hsa-miR-148a-3p ENSG00000112697.10 0.973467 0.00999735
56260 hsa-miR-425-3p ENSG00000072195.9 0.973467 0.00999764
56261 hsa-miR-210-5p ENSG00000013503.4 0.973466 0.00999794
56262 hsa-miR-625-5p ENSG00000214121.3 0.973465 0.00999823
56263 hsa-miR-1301-3p ENSG00000231752.1 0.973465 0.00999853
56264 hsa-miR-582-5p ENSG00000231752.1 0.973463 0.00999882
56265 hsa-miR-132-3p ENSG00000158406.2 0.973463 0.00999911
56266 hsa-miR-196a-2-5p ENSG00000229446.1 0.973462 0.00999941
56267 hsa-miR-29b-1-5p ENSG00000188010.8 0.973461 0.0099997
56268 hsa-miR-125a-5p ENSG00000047932.9 0.973461 0.01

Note the order of the arguments: here we tested all microRNAs as \(A\) or Source genes (dm argument) against all mRNA transcripts as \(B\) or Target genes (dt argument), that is, background distributions are fitted for each microRNA (column of dm) from the log-likelihood ratios for all 23,722 mRNAs (columns of dt).