using DrWatson
quickactivate(@__DIR__)
using DataFrames
using Arrow
using Markdown
using BioFindrCoexpression analysis
Introduction
While [BioFindr] is developed primarily for causal inference from genomics and transcriptomics data, coexpression analysis of transcriptomics data alone is also possible. In coexpression analysis, pairwise correlation is used as a simple measure for the probability of two genes being functionally related either through direct or indirect regulation, or through coregulation by a third factor. In BioFindr, significance of pairwise correlation is computed using a gene-specific background distribution, allowing for genes having different biological roles. For instance, it is known that many biological networks are scale-free, where a small number of so-called “hub genes” have a high number of interaction partners while most other genes only have few. In BioFindr, this is accomodated by modelling the distribution of correlation values between a given gene \(A\) and all other genes \(B\) as a mixture distribution of real and null (random) correlations. The relative weight of each component reflects the prior probability of finding a non-null \(B\) gene for a given \(A\) gene, and is fitted for every \(A\) gene separately.
We will illustrate how to run coexpression analysis with BioFindr using preprocessed data from the GEUVADIS study. See the installation instructions for the steps you need to take to reproduce this tutorial.
Set up the environment
We begin by setting up the environment and loading some necessary packages.
Load expression data
BioFindr expects that expression data are stored in a DataFrame where columns correspond to variables (genes) and rows to samples. An expression DataFrame should not contain any other columns (e.g. gene annotation) than gene expression columns, and gene expression data should be stored as floating-point numbers. Internally, BioFindr operates on matrices, and if you have an expression DataFrame df, then Matrix(df) should return a matrix of floats.
At the moment, BioFindr does not support count-based expression data being provided as a DataFrame of integers. This is not an intrinsic limitation of the method, but simply to distinguish expression data from integer-valued genotype data. Future versions will remove this limitation by supporting scientific types.
This tutorial uses two tables of expression data from the same set of samples, one for mRNA expression data called dt, and one for microRNA (miRNA) expression data called dm:
dt = DataFrame(Arrow.Table(datadir("exp_pro","findr-data-geuvadis", "dt.arrow")));
dm = DataFrame(Arrow.Table(datadir("exp_pro","findr-data-geuvadis", "dm.arrow")));The mRNA data has expression data for 23722 genes in 360 samples, while miRNA data is available for 674 miRNAs in the same 360 samples.
We can confirm that the data frames are of the right type:
[typeof(Matrix(dt)) typeof(Matrix(dm))]1×2 Matrix{DataType}:
Matrix{Float64} Matrix{Float64}
Run BioFindr
All-vs-all coexpression analysis
The simplest type of coexpression analysis tests for non-zero correlation among all possible pairs in a gene expression dataset. Let’s do this for the miRNA data:
dP_mir_all = findr(dm, FDR=0.05)| Row | Source | Target | Probability | qvalue |
|---|---|---|---|---|
| String | String | Float64 | Float64 | |
| 1 | hsa-miR-574-3p | hsa-miR-574-5p | 1.0 | 0.0 |
| 2 | hsa-miR-4482-1-3p | hsa-miR-1304-3p | 1.0 | 0.0 |
| 3 | hsa-miR-3667-5p | hsa-miR-3667-3p | 1.0 | 0.0 |
| 4 | hsa-miR-550a-2-5p | hsa-miR-550a-2-3p | 1.0 | 0.0 |
| 5 | hsa-miR-550a-2-5p | hsa-miR-550a-3-5p | 1.0 | 0.0 |
| 6 | hsa-miR-3130-2-5p | hsa-miR-3130-1-5p | 1.0 | 0.0 |
| 7 | hsa-let-7b-5p | hsa-let-7b-3p | 1.0 | 0.0 |
| 8 | hsa-miR-96-5p | hsa-miR-182-5p | 1.0 | 0.0 |
| 9 | hsa-miR-96-5p | hsa-miR-1271-5p | 1.0 | 0.0 |
| 10 | hsa-miR-3667-3p | hsa-miR-3667-5p | 1.0 | 0.0 |
| 11 | hsa-miR-181c-5p | hsa-miR-181c-3p | 1.0 | 0.0 |
| 12 | hsa-miR-181c-5p | hsa-miR-181d-5p | 1.0 | 0.0 |
| 13 | hsa-miR-335-5p | hsa-miR-335-3p | 1.0 | 0.0 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 28367 | hsa-miR-410-3p | hsa-miR-7-3-5p | 0.84803 | 0.0499585 |
| 28368 | hsa-miR-152-3p | hsa-miR-181a-1-5p | 0.848014 | 0.0499621 |
| 28369 | hsa-let-7f-2-3p | hsa-miR-4446-3p | 0.848008 | 0.0499657 |
| 28370 | hsa-miR-181a-1-3p | hsa-miR-99b-5p | 0.847993 | 0.0499693 |
| 28371 | hsa-miR-330-5p | hsa-miR-296-5p | 0.84799 | 0.0499729 |
| 28372 | hsa-miR-1307-5p | hsa-miR-32-3p | 0.847985 | 0.0499765 |
| 28373 | hsa-miR-221-3p | hsa-miR-424-5p | 0.847972 | 0.0499801 |
| 28374 | hsa-miR-27b-5p | hsa-miR-4424-3p | 0.847958 | 0.0499837 |
| 28375 | hsa-miR-629-5p | hsa-miR-5096-5p | 0.847952 | 0.0499873 |
| 28376 | hsa-miR-181c-3p | hsa-miR-374b-5p | 0.847946 | 0.0499909 |
| 28377 | hsa-miR-3688-2-3p | hsa-miR-1260b-5p | 0.847943 | 0.0499945 |
| 28378 | hsa-let-7d-5p | hsa-miR-142-3p | 0.847937 | 0.0499981 |
BioFindr computes a posterior probability of non-zero correlation for every Source and Target gene pair. By default the output is sorted by decreasing Probability and self-interactions are excluded. The optional parameter FDR can be used to limit the output to the set of pairs that has a global false discovery rate (FDR) less than a desired value. The qvalue column in the output can be used for further filtering of the output. Say you ran findr with an FRD threshold of 5% as above. If you now want to restrict the output to an FDR threshold of 1%, you can do:
filter!(row -> row.qvalue <= 0.01, dP_mir_all)| Row | Source | Target | Probability | qvalue |
|---|---|---|---|---|
| String | String | Float64 | Float64 | |
| 1 | hsa-miR-574-3p | hsa-miR-574-5p | 1.0 | 0.0 |
| 2 | hsa-miR-4482-1-3p | hsa-miR-1304-3p | 1.0 | 0.0 |
| 3 | hsa-miR-3667-5p | hsa-miR-3667-3p | 1.0 | 0.0 |
| 4 | hsa-miR-550a-2-5p | hsa-miR-550a-2-3p | 1.0 | 0.0 |
| 5 | hsa-miR-550a-2-5p | hsa-miR-550a-3-5p | 1.0 | 0.0 |
| 6 | hsa-miR-3130-2-5p | hsa-miR-3130-1-5p | 1.0 | 0.0 |
| 7 | hsa-let-7b-5p | hsa-let-7b-3p | 1.0 | 0.0 |
| 8 | hsa-miR-96-5p | hsa-miR-182-5p | 1.0 | 0.0 |
| 9 | hsa-miR-96-5p | hsa-miR-1271-5p | 1.0 | 0.0 |
| 10 | hsa-miR-3667-3p | hsa-miR-3667-5p | 1.0 | 0.0 |
| 11 | hsa-miR-181c-5p | hsa-miR-181c-3p | 1.0 | 0.0 |
| 12 | hsa-miR-181c-5p | hsa-miR-181d-5p | 1.0 | 0.0 |
| 13 | hsa-miR-335-5p | hsa-miR-335-3p | 1.0 | 0.0 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 14545 | hsa-miR-4662a-5p | hsa-miR-486-5p | 0.961667 | 0.0099779 |
| 14546 | hsa-miR-181b-1-3p | hsa-miR-3609-3p | 0.961654 | 0.00997985 |
| 14547 | hsa-miR-23a-5p | hsa-miR-150-5p | 0.961633 | 0.0099818 |
| 14548 | hsa-miR-500a-3p | hsa-miR-138-1-3p | 0.961624 | 0.00998375 |
| 14549 | hsa-miR-425-3p | hsa-miR-200c-3p | 0.961621 | 0.00998571 |
| 14550 | hsa-miR-125a-5p | hsa-miR-140-3p | 0.961608 | 0.00998766 |
| 14551 | hsa-miR-342-5p | hsa-miR-335-5p | 0.961607 | 0.00998961 |
| 14552 | hsa-miR-3615-3p | hsa-miR-744-5p | 0.9616 | 0.00999156 |
| 14553 | hsa-miR-320d-1-3p | hsa-miR-365a-5p | 0.961596 | 0.00999351 |
| 14554 | hsa-miR-23a-5p | hsa-miR-425-5p | 0.961593 | 0.00999547 |
| 14555 | hsa-miR-26b-3p | hsa-miR-4687-3p | 0.961592 | 0.00999742 |
| 14556 | hsa-miR-589-5p | hsa-miR-644b-3p | 0.96159 | 0.00999937 |
Note that the filter! command modifies the input DataFrame in-place, that is, the rows not matching the selection criteria are deleted. Use filter to return a new DataFrame with the selected rows.
Finally, remember that the output of coexpression analysis in BioFindr is not symmetric, that is
\[ P(Source, Target) \neq P(Target, Source) \]
This is because the posterior probabilities are estimated using a Source-specific background distribution, accounting for the fact that different genes may have a different number of non-null interaction partners a priori. See the Findr paper for details.
Bipartite coexpression analysis
Since BioFindr’s posterior probabilities are Source gene-specific, we may be interested in computing probabilities only for a subset of Source genes, or using different Source and Target gene sets.
As an example of the first scenario, assume we are interested in finding microRNAs that are significantly correlated with microRNAs from the Mir-200 family. First find the Mir-200 microRNAs:
mir200 = names(dm)[startswith.(names(dm),"hsa-miR-200")]3-element Vector{String}:
"hsa-miR-200b-3p"
"hsa-miR-200a-3p"
"hsa-miR-200c-3p"
Then run
dP_mir200_mir = findr(dm, colnames=mir200, FDR=0.01)| Row | Source | Target | Probability | qvalue |
|---|---|---|---|---|
| String | String | Float64 | Float64 | |
| 1 | hsa-miR-200a-3p | hsa-miR-200b-3p | 1.0 | 1.39181e-9 |
| 2 | hsa-miR-200b-3p | hsa-miR-200a-3p | 1.0 | 1.55254e-9 |
| 3 | hsa-miR-200c-3p | hsa-miR-93-5p | 0.999998 | 5.68454e-7 |
| 4 | hsa-miR-200c-3p | hsa-miR-140-3p | 0.999984 | 4.44472e-6 |
| 5 | hsa-miR-200b-3p | hsa-miR-429-3p | 0.999918 | 1.99874e-5 |
| 6 | hsa-miR-200c-3p | hsa-miR-16-1-5p | 0.999792 | 5.13862e-5 |
| 7 | hsa-miR-200c-3p | hsa-miR-664-3p | 0.999784 | 7.49279e-5 |
| 8 | hsa-miR-200c-3p | hsa-miR-17-5p | 0.999641 | 0.000110395 |
| 9 | hsa-miR-200b-3p | hsa-miR-582-3p | 0.999246 | 0.00018188 |
| 10 | hsa-miR-200c-3p | hsa-miR-769-5p | 0.999242 | 0.000239459 |
| 11 | hsa-miR-200c-3p | hsa-miR-130b-5p | 0.999139 | 0.000295921 |
| 12 | hsa-miR-200c-3p | hsa-miR-425-5p | 0.999052 | 0.000350225 |
| 13 | hsa-miR-200c-3p | hsa-miR-103a-2-3p | 0.999001 | 0.000400161 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 36 | hsa-miR-200b-3p | hsa-miR-769-3p | 0.983749 | 0.00628075 |
| 37 | hsa-miR-200c-3p | hsa-miR-378a-3p | 0.983227 | 0.00656431 |
| 38 | hsa-miR-200b-3p | hsa-miR-26b-5p | 0.981109 | 0.00688869 |
| 39 | hsa-miR-200c-3p | hsa-miR-619-5p | 0.980948 | 0.00720056 |
| 40 | hsa-miR-200c-3p | hsa-miR-500a-5p | 0.97978 | 0.00752605 |
| 41 | hsa-miR-200c-3p | hsa-miR-4677-3p | 0.979595 | 0.00784016 |
| 42 | hsa-miR-200b-3p | hsa-miR-328-3p | 0.979019 | 0.00815303 |
| 43 | hsa-miR-200b-3p | hsa-miR-9-1-5p | 0.978721 | 0.00845829 |
| 44 | hsa-miR-200c-3p | hsa-miR-141-5p | 0.978487 | 0.00875499 |
| 45 | hsa-miR-200c-3p | hsa-miR-30e-3p | 0.977855 | 0.00905255 |
| 46 | hsa-miR-200a-3p | hsa-miR-26b-5p | 0.976313 | 0.0093707 |
| 47 | hsa-miR-200c-3p | hsa-miR-25-3p | 0.971038 | 0.00978752 |
The parameter colnames must be a vector of strings containing a subset of variable names of the input DataFrame dm to be used as Source genes.
As an example of the second scenario, we may be interested in finding genes that are significantly correlated with all or a subset of microRNAs:
dP_mir_mrna = findr(dt, dm, FDR=0.01)| Row | Source | Target | Probability | qvalue |
|---|---|---|---|---|
| String | String | Float64 | Float64 | |
| 1 | hsa-miR-3150b-3p | ENSG00000245080.3 | 1.0 | 0.0 |
| 2 | hsa-let-7b-5p | ENSG00000197182.8 | 1.0 | 0.0 |
| 3 | hsa-let-7b-5p | ENSG00000241990.1 | 1.0 | 0.0 |
| 4 | hsa-miR-335-5p | ENSG00000106484.8 | 1.0 | 0.0 |
| 5 | hsa-miR-335-5p | ENSG00000128510.5 | 1.0 | 0.0 |
| 6 | hsa-miR-335-5p | ENSG00000220884.2 | 1.0 | 0.0 |
| 7 | hsa-miR-618-5p | ENSG00000111052.3 | 1.0 | 0.0 |
| 8 | hsa-miR-335-3p | ENSG00000106484.8 | 1.0 | 0.0 |
| 9 | hsa-miR-335-3p | ENSG00000128510.5 | 1.0 | 0.0 |
| 10 | hsa-miR-335-3p | ENSG00000220884.2 | 1.0 | 0.0 |
| 11 | hsa-let-7b-3p | ENSG00000241990.1 | 1.0 | 0.0 |
| 12 | hsa-miR-138-1-3p | ENSG00000261786.1 | 1.0 | 0.0 |
| 13 | hsa-miR-138-1-5p | ENSG00000261786.1 | 1.0 | 0.0 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 56257 | hsa-miR-3909-3p | ENSG00000143028.7 | 0.973468 | 0.00999676 |
| 56258 | hsa-miR-193b-3p | ENSG00000198833.5 | 0.973467 | 0.00999706 |
| 56259 | hsa-miR-148a-3p | ENSG00000112697.10 | 0.973467 | 0.00999735 |
| 56260 | hsa-miR-425-3p | ENSG00000072195.9 | 0.973467 | 0.00999764 |
| 56261 | hsa-miR-210-5p | ENSG00000013503.4 | 0.973466 | 0.00999794 |
| 56262 | hsa-miR-625-5p | ENSG00000214121.3 | 0.973465 | 0.00999823 |
| 56263 | hsa-miR-1301-3p | ENSG00000231752.1 | 0.973465 | 0.00999853 |
| 56264 | hsa-miR-582-5p | ENSG00000231752.1 | 0.973463 | 0.00999882 |
| 56265 | hsa-miR-132-3p | ENSG00000158406.2 | 0.973463 | 0.00999911 |
| 56266 | hsa-miR-196a-2-5p | ENSG00000229446.1 | 0.973462 | 0.00999941 |
| 56267 | hsa-miR-29b-1-5p | ENSG00000188010.8 | 0.973461 | 0.0099997 |
| 56268 | hsa-miR-125a-5p | ENSG00000047932.9 | 0.973461 | 0.01 |
Note the order of the arguments: here we tested all microRNAs as \(A\) or Source genes (dm argument) against all mRNA transcripts as \(B\) or Target genes (dt argument), that is, background distributions are fitted for each microRNA (column of dm) from the log-likelihood ratios for all 23,722 mRNAs (columns of dt).