Coexpression analysis

Introduction

While [BioFindr] is developed primarily for causal inference from genomics and transcriptomics data, coexpression analysis of transcriptomics data alone is also possible. In coexpression analysis, pairwise correlation is used as a simple measure for the probability of two genes being functionally related either through direct or indirect regulation, or through coregulation by a third factor. In BioFindr, significance of pairwise correlation is computed using a gene-specific background distribution, allowing for genes having different biological roles. For instance, it is known that many biological networks are scale-free, where a small number of so-called “hub genes” have a high number of interaction partners while most other genes only have few. In BioFindr, this is accomodated by modelling the distribution of correlation values between a given gene \(A\) and all other genes \(B\) as a mixture distribution of real and null (random) correlations. The relative weight of each component reflects the prior probability of finding a non-null \(B\) gene for a given \(A\) gene, and is fitted for every \(A\) gene separately.

We will illustrate how to run coexpression analysis with BioFindr using preprocessed data from the GEUVADIS study. See the installation instructions for the steps you need to take to reproduce this tutorial.

Set up the environment

We begin by setting up the environment and loading some necessary packages.

using DrWatson
quickactivate(@__DIR__)

using DataFrames
using Arrow
using Markdown

using BioFindr

Load expression data

BioFindr expects that expression data are stored in a DataFrame where columns correspond to variables (genes) and rows to samples. An expression DataFrame should not contain any other columns (e.g. gene annotation) than gene expression columns, and gene expression data should be stored as floating-point numbers. Internally, BioFindr operates on matrices, and if you have an expression DataFrame df, then Matrix(df) should return a matrix of floats.

At the moment, BioFindr does not support count-based expression data being provided as a DataFrame of integers. This is not an intrinsic limitation of the method, but simply to distinguish expression data from integer-valued genotype data. Future versions will remove this limitation by supporting scientific types.

This tutorial uses two tables of expression data from the same set of samples, one for mRNA expression data called dt, and one for microRNA (miRNA) expression data called dm:

dt = DataFrame(Arrow.Table(datadir("exp_pro","findr-data-geuvadis", "dt.arrow")));
dm = DataFrame(Arrow.Table(datadir("exp_pro","findr-data-geuvadis", "dm.arrow")));

The mRNA data has expression data for 23722 genes in 360 samples, while miRNA data is available for 674 miRNAs in the same 360 samples.

We can confirm that the data frames are of the right type:

[typeof(Matrix(dt)) typeof(Matrix(dm))]

1×2 Matrix{DataType}:
 Matrix{Float64}  Matrix{Float64}

Run BioFindr

All-vs-all coexpression analysis

The simplest type of coexpression analysis tests for non-zero correlation among all possible pairs in a gene expression dataset. Let’s do this for the miRNA data:

dP_mir_all = findr(dm, FDR=0.05)

28378×4 DataFrame

28353 rows omitted

Row	Source	Target	Probability	qvalue
	String	String	Float64	Float64
1	hsa-miR-574-3p	hsa-miR-574-5p	1.0	0.0
2	hsa-miR-4482-1-3p	hsa-miR-1304-3p	1.0	0.0
3	hsa-miR-3667-5p	hsa-miR-3667-3p	1.0	0.0
4	hsa-miR-550a-2-5p	hsa-miR-550a-2-3p	1.0	0.0
5	hsa-miR-550a-2-5p	hsa-miR-550a-3-5p	1.0	0.0
6	hsa-miR-3130-2-5p	hsa-miR-3130-1-5p	1.0	0.0
7	hsa-let-7b-5p	hsa-let-7b-3p	1.0	0.0
8	hsa-miR-96-5p	hsa-miR-182-5p	1.0	0.0
9	hsa-miR-96-5p	hsa-miR-1271-5p	1.0	0.0
10	hsa-miR-3667-3p	hsa-miR-3667-5p	1.0	0.0
11	hsa-miR-181c-5p	hsa-miR-181c-3p	1.0	0.0
12	hsa-miR-181c-5p	hsa-miR-181d-5p	1.0	0.0
13	hsa-miR-335-5p	hsa-miR-335-3p	1.0	0.0
⋮	⋮	⋮	⋮	⋮
28367	hsa-miR-410-3p	hsa-miR-7-3-5p	0.84803	0.0499585
28368	hsa-miR-152-3p	hsa-miR-181a-1-5p	0.848014	0.0499621
28369	hsa-let-7f-2-3p	hsa-miR-4446-3p	0.848008	0.0499657
28370	hsa-miR-181a-1-3p	hsa-miR-99b-5p	0.847993	0.0499693
28371	hsa-miR-330-5p	hsa-miR-296-5p	0.84799	0.0499729
28372	hsa-miR-1307-5p	hsa-miR-32-3p	0.847985	0.0499765
28373	hsa-miR-221-3p	hsa-miR-424-5p	0.847972	0.0499801
28374	hsa-miR-27b-5p	hsa-miR-4424-3p	0.847958	0.0499837
28375	hsa-miR-629-5p	hsa-miR-5096-5p	0.847952	0.0499873
28376	hsa-miR-181c-3p	hsa-miR-374b-5p	0.847946	0.0499909
28377	hsa-miR-3688-2-3p	hsa-miR-1260b-5p	0.847943	0.0499945
28378	hsa-let-7d-5p	hsa-miR-142-3p	0.847937	0.0499981

BioFindr computes a posterior probability of non-zero correlation for every Source and Target gene pair. By default the output is sorted by decreasing Probability and self-interactions are excluded. The optional parameter FDR can be used to limit the output to the set of pairs that has a global false discovery rate (FDR) less than a desired value. The qvalue column in the output can be used for further filtering of the output. Say you ran findr with an FRD threshold of 5% as above. If you now want to restrict the output to an FDR threshold of 1%, you can do:

filter!(row -> row.qvalue <= 0.01, dP_mir_all)

14556×4 DataFrame

14531 rows omitted

Row	Source	Target	Probability	qvalue
	String	String	Float64	Float64
1	hsa-miR-574-3p	hsa-miR-574-5p	1.0	0.0
2	hsa-miR-4482-1-3p	hsa-miR-1304-3p	1.0	0.0
3	hsa-miR-3667-5p	hsa-miR-3667-3p	1.0	0.0
4	hsa-miR-550a-2-5p	hsa-miR-550a-2-3p	1.0	0.0
5	hsa-miR-550a-2-5p	hsa-miR-550a-3-5p	1.0	0.0
6	hsa-miR-3130-2-5p	hsa-miR-3130-1-5p	1.0	0.0
7	hsa-let-7b-5p	hsa-let-7b-3p	1.0	0.0
8	hsa-miR-96-5p	hsa-miR-182-5p	1.0	0.0
9	hsa-miR-96-5p	hsa-miR-1271-5p	1.0	0.0
10	hsa-miR-3667-3p	hsa-miR-3667-5p	1.0	0.0
11	hsa-miR-181c-5p	hsa-miR-181c-3p	1.0	0.0
12	hsa-miR-181c-5p	hsa-miR-181d-5p	1.0	0.0
13	hsa-miR-335-5p	hsa-miR-335-3p	1.0	0.0
⋮	⋮	⋮	⋮	⋮
14545	hsa-miR-4662a-5p	hsa-miR-486-5p	0.961667	0.0099779
14546	hsa-miR-181b-1-3p	hsa-miR-3609-3p	0.961654	0.00997985
14547	hsa-miR-23a-5p	hsa-miR-150-5p	0.961633	0.0099818
14548	hsa-miR-500a-3p	hsa-miR-138-1-3p	0.961624	0.00998375
14549	hsa-miR-425-3p	hsa-miR-200c-3p	0.961621	0.00998571
14550	hsa-miR-125a-5p	hsa-miR-140-3p	0.961608	0.00998766
14551	hsa-miR-342-5p	hsa-miR-335-5p	0.961607	0.00998961
14552	hsa-miR-3615-3p	hsa-miR-744-5p	0.9616	0.00999156
14553	hsa-miR-320d-1-3p	hsa-miR-365a-5p	0.961596	0.00999351
14554	hsa-miR-23a-5p	hsa-miR-425-5p	0.961593	0.00999547
14555	hsa-miR-26b-3p	hsa-miR-4687-3p	0.961592	0.00999742
14556	hsa-miR-589-5p	hsa-miR-644b-3p	0.96159	0.00999937

Note that the filter! command modifies the input DataFrame in-place, that is, the rows not matching the selection criteria are deleted. Use filter to return a new DataFrame with the selected rows.

Finally, remember that the output of coexpression analysis in BioFindr is not symmetric, that is

\[ P(Source, Target) \neq P(Target, Source) \]

This is because the posterior probabilities are estimated using a Source-specific background distribution, accounting for the fact that different genes may have a different number of non-null interaction partners a priori. See the Findr paper for details.

Bipartite coexpression analysis

Since BioFindr’s posterior probabilities are Source gene-specific, we may be interested in computing probabilities only for a subset of Source genes, or using different Source and Target gene sets.

As an example of the first scenario, assume we are interested in finding microRNAs that are significantly correlated with microRNAs from the Mir-200 family. First find the Mir-200 microRNAs:

mir200 = names(dm)[startswith.(names(dm),"hsa-miR-200")]

3-element Vector{String}:
 "hsa-miR-200b-3p"
 "hsa-miR-200a-3p"
 "hsa-miR-200c-3p"

Then run

dP_mir200_mir = findr(dm, colnames=mir200, FDR=0.01)

47×4 DataFrame

22 rows omitted

Row	Source	Target	Probability	qvalue
	String	String	Float64	Float64
1	hsa-miR-200a-3p	hsa-miR-200b-3p	1.0	1.39181e-9
2	hsa-miR-200b-3p	hsa-miR-200a-3p	1.0	1.55254e-9
3	hsa-miR-200c-3p	hsa-miR-93-5p	0.999998	5.68454e-7
4	hsa-miR-200c-3p	hsa-miR-140-3p	0.999984	4.44472e-6
5	hsa-miR-200b-3p	hsa-miR-429-3p	0.999918	1.99874e-5
6	hsa-miR-200c-3p	hsa-miR-16-1-5p	0.999792	5.13862e-5
7	hsa-miR-200c-3p	hsa-miR-664-3p	0.999784	7.49279e-5
8	hsa-miR-200c-3p	hsa-miR-17-5p	0.999641	0.000110395
9	hsa-miR-200b-3p	hsa-miR-582-3p	0.999246	0.00018188
10	hsa-miR-200c-3p	hsa-miR-769-5p	0.999242	0.000239459
11	hsa-miR-200c-3p	hsa-miR-130b-5p	0.999139	0.000295921
12	hsa-miR-200c-3p	hsa-miR-425-5p	0.999052	0.000350225
13	hsa-miR-200c-3p	hsa-miR-103a-2-3p	0.999001	0.000400161
⋮	⋮	⋮	⋮	⋮
36	hsa-miR-200b-3p	hsa-miR-769-3p	0.983749	0.00628075
37	hsa-miR-200c-3p	hsa-miR-378a-3p	0.983227	0.00656431
38	hsa-miR-200b-3p	hsa-miR-26b-5p	0.981109	0.00688869
39	hsa-miR-200c-3p	hsa-miR-619-5p	0.980948	0.00720056
40	hsa-miR-200c-3p	hsa-miR-500a-5p	0.97978	0.00752605
41	hsa-miR-200c-3p	hsa-miR-4677-3p	0.979595	0.00784016
42	hsa-miR-200b-3p	hsa-miR-328-3p	0.979019	0.00815303
43	hsa-miR-200b-3p	hsa-miR-9-1-5p	0.978721	0.00845829
44	hsa-miR-200c-3p	hsa-miR-141-5p	0.978487	0.00875499
45	hsa-miR-200c-3p	hsa-miR-30e-3p	0.977855	0.00905255
46	hsa-miR-200a-3p	hsa-miR-26b-5p	0.976313	0.0093707
47	hsa-miR-200c-3p	hsa-miR-25-3p	0.971038	0.00978752

The parameter colnames must be a vector of strings containing a subset of variable names of the input DataFrame dm to be used as Source genes.

As an example of the second scenario, we may be interested in finding genes that are significantly correlated with all or a subset of microRNAs:

dP_mir_mrna = findr(dt, dm, FDR=0.01)

56268×4 DataFrame

56243 rows omitted

Row	Source	Target	Probability	qvalue
	String	String	Float64	Float64
1	hsa-miR-3150b-3p	ENSG00000245080.3	1.0	0.0
2	hsa-let-7b-5p	ENSG00000197182.8	1.0	0.0
3	hsa-let-7b-5p	ENSG00000241990.1	1.0	0.0
4	hsa-miR-335-5p	ENSG00000106484.8	1.0	0.0
5	hsa-miR-335-5p	ENSG00000128510.5	1.0	0.0
6	hsa-miR-335-5p	ENSG00000220884.2	1.0	0.0
7	hsa-miR-618-5p	ENSG00000111052.3	1.0	0.0
8	hsa-miR-335-3p	ENSG00000106484.8	1.0	0.0
9	hsa-miR-335-3p	ENSG00000128510.5	1.0	0.0
10	hsa-miR-335-3p	ENSG00000220884.2	1.0	0.0
11	hsa-let-7b-3p	ENSG00000241990.1	1.0	0.0
12	hsa-miR-138-1-3p	ENSG00000261786.1	1.0	0.0
13	hsa-miR-138-1-5p	ENSG00000261786.1	1.0	0.0
⋮	⋮	⋮	⋮	⋮
56257	hsa-miR-3909-3p	ENSG00000143028.7	0.973468	0.00999676
56258	hsa-miR-193b-3p	ENSG00000198833.5	0.973467	0.00999706
56259	hsa-miR-148a-3p	ENSG00000112697.10	0.973467	0.00999735
56260	hsa-miR-425-3p	ENSG00000072195.9	0.973467	0.00999764
56261	hsa-miR-210-5p	ENSG00000013503.4	0.973466	0.00999794
56262	hsa-miR-625-5p	ENSG00000214121.3	0.973465	0.00999823
56263	hsa-miR-1301-3p	ENSG00000231752.1	0.973465	0.00999853
56264	hsa-miR-582-5p	ENSG00000231752.1	0.973463	0.00999882
56265	hsa-miR-132-3p	ENSG00000158406.2	0.973463	0.00999911
56266	hsa-miR-196a-2-5p	ENSG00000229446.1	0.973462	0.00999941
56267	hsa-miR-29b-1-5p	ENSG00000188010.8	0.973461	0.0099997
56268	hsa-miR-125a-5p	ENSG00000047932.9	0.973461	0.01

Note the order of the arguments: here we tested all microRNAs as \(A\) or Source genes (dm argument) against all mRNA transcripts as \(B\) or Target genes (dt argument), that is, background distributions are fitted for each microRNA (column of dm) from the log-likelihood ratios for all 23,722 mRNAs (columns of dt).