Matrix-based data

Introduction

Internally, all BioFindr functions operate on matrices or other array-based data. The DataFrame-based findr methods used in the coexpression analysis, association analysis, and causal inference tutorials are convenience wrappers around them. If you prefer matrix-based data over DataFrames, you can call the matrix-based findr methods directly without creating DataFrames first.

Set up the environment

using DrWatson
quickactivate(@__DIR__)

using DataFrames
using Arrow

using BioFindr

Load data

Let’s pretend our GEUVADIS data is in a matrix-based format:

Xt = Matrix(DataFrame(Arrow.Table(datadir("exp_pro","findr-data-geuvadis", "dt.arrow"))));
Xm = Matrix(DataFrame(Arrow.Table(datadir("exp_pro","findr-data-geuvadis", "dm.arrow"))));
Gm = Matrix(DataFrame(Arrow.Table(datadir("exp_pro","findr-data-geuvadis", "dgm.arrow"))));

We also need the microRNA eQTL mapping (see the causal inference tutorial), in this case in the form of an array where each row corresponds to a cis-eQTL/eGene pair, represented by a column index of Gm (i.e. a SNP) and a column index of Xm (i.e. a microRNA). Recall that due to the preprocessing of the findr-geuvadis data, the two column indices in each pair are identical, but this will not be the case in general:

mirpairs = zeros(Int32,size(Gm,2),2);
for k=1:size(mirpairs,1)
    mirpairs[k,:] = [k k]
end
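
The same kind of pair array can be assembled without an explicit loop. The following is a minimal sketch in plain Julia, not part of BioFindr: the names `snp_cols` and `mirna_cols` and all index values are invented for illustration, and the column order (SNP index first, microRNA index second) follows the description above.

```julia
# Hypothetical column indices: the k-th listed SNP in Gm is taken to be the
# cis-eQTL of the k-th listed microRNA in Xm (values invented for illustration).
snp_cols   = Int32[1, 2, 3]
mirna_cols = Int32[5, 2, 9]

# One row per pair: first column indexes Gm, second column indexes Xm.
pairs_general = hcat(snp_cols, mirna_cols)

# The identity mapping used in this tutorial, built the same way.
n = 3  # stands in for size(Gm, 2)
pairs_identity = hcat(Int32.(1:n), Int32.(1:n))
```

With real data, `n` would be `size(Gm, 2)` and the index vectors would come from your own eQTL mapping.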

Note that data must be stored in matrices where columns correspond to variables (genes, SNPs, etc.) and rows correspond to observations (samples).
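
If your data is stored the other way round (variables in rows, samples in columns), it needs to be transposed first. A minimal sketch with randomly generated toy matrices (the names `raw`, `X` and `G` are illustrative only), which also checks that two inputs presumably meant to describe the same samples have matching row counts:

```julia
# Toy data stored the "wrong" way round: 4 genes (rows) by 10 samples (columns).
raw = rand(4, 10)

# permutedims returns a new Matrix with rows and columns swapped,
# giving the samples-by-variables layout described above.
X = permutedims(raw)

# A second toy matrix (e.g. genotypes coded 0/1/2) on the same 10 samples.
G = rand(0:2, 10, 3)

# Both inputs should have one row per sample, so the row counts must agree.
same_samples = size(X, 1) == size(G, 1)
```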

Run BioFindr

Below, we only show the relevant findr commands. Check the corresponding tutorials and BioFindr documentation for more details.

Coexpression analysis

All-vs-all

Coexpression analysis on a single matrix returns a square matrix with dimensions equal to the number of variables (columns) in the input matrix:

P = findr(Xm)

674×674 Matrix{Float64}:
 1.0        0.0453818   0.225592   …  0.0332217  0.107904   0.105772
 0.0889502  1.0         0.102236      0.0474569  0.731827   0.248213
 0.1729     0.0403721   1.0           0.777618   0.106922   0.1002
 0.0159859  0.106602    0.0893431     0.0699576  0.098629   0.122935
 0.120573   0.112145    0.238959      0.0346163  0.164631   0.297973
 0.0624062  0.0586858   0.0694624  …  0.569015   0.104281   0.186142
 0.423979   0.0332279   0.0971337     0.0435612  0.124816   0.094099
 0.0533054  0.0266426   0.559493      0.0550043  0.34664    0.0981003
 0.114737   0.0143626   0.1419        0.0439354  0.192045   0.131555
 0.176766   0.48804     1.0           0.0337898  0.213006   0.133968
 0.360143   0.153889    0.425408   …  0.0332874  0.163973   0.187538
 0.182787   0.0493317   0.0386118     0.0538748  0.105971   0.160988
 0.676394   0.0264916   0.444864      0.11139    0.281354   0.0970358
 ⋮                                 ⋱                        
 0.0763606  0.0436318   0.868105      0.0379098  0.0992675  0.157452
 0.190505   0.137327    0.113494      0.0334964  0.379698   0.516604
 0.378913   0.00636714  0.969245      0.0333967  0.218768   0.0972078
 0.0925918  0.00949055  0.11976    …  0.044673   0.111653   0.0953303
 0.1283     0.0292623   0.789323      0.0366376  0.190591   0.0937124
 0.115916   0.0566189   0.0567839     0.0330524  0.243382   0.123887
 0.0371264  0.0516759   0.177314      0.0620753  0.111248   0.817595
 0.286305   0.132744    0.106446      0.0361477  0.112938   0.152742
 0.162885   0.178503    0.105606   …  0.999703   0.116572   0.0951264
 0.0429484  0.123023    0.905047      1.0        0.122408   0.912701
 0.0920713  0.715456    0.108091      0.0425472  1.0        1.0
 0.0982375  0.312023    0.0995635     0.7971     1.0        1.0

In the output, columns correspond to A-genes (causal factors) and rows to B-genes (targets), that is:

\[ P_{i,j} = P(X_j \to X_i) \]

Note that the diagonal is arbitrarily set to one: BioFindr cannot make any inferences about the presence or absence of self-regulation!
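
To make the indexing convention concrete, here is a small sketch in plain Julia that uses a hand-written 3×3 matrix as a stand-in for the output (the values are invented, not BioFindr results). It ranks the targets of one candidate causal variable while skipping the uninformative diagonal entry:

```julia
# Toy stand-in for the output of a coexpression analysis (values invented).
# P[i, j] is read as the probability for X_j -> X_i.
P = [1.0 0.2 0.9;
     0.7 1.0 0.1;
     0.3 0.8 1.0]

j = 1                         # candidate causal variable (a column of P)
scores = copy(P[:, j])        # probabilities for all targets of X_j
scores[j] = -Inf              # ignore the arbitrary diagonal entry
ranked_targets = sortperm(scores; rev=true)   # row indices, best target first

# The matrix need not be symmetric: P[i, j] and P[j, i] are different quantities.
asymmetric = P[2, 1] != P[1, 2]
```

The asymmetry is also visible in the real output above, where the entries at positions (1, 2) and (2, 1) differ.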

Bipartite

Analyse coexpression from a subset of variables to the whole set:

P = findr(Xm; cols=[1,3,7,50])

674×4 Matrix{Float64}:
 1.0        0.225592   0.346674   0.245078
 0.0889502  0.102236   0.113262   0.162557
 0.1729     1.0        0.11442    0.073312
 0.0159859  0.0893431  0.146108   0.473624
 0.120573   0.238959   0.239501   0.0620736
 0.0624062  0.0694624  0.0989481  0.0511453
 0.423979   0.0971337  1.0        0.203824
 0.0533054  0.559493   0.201162   0.124401
 0.114737   0.1419     0.0934371  0.23661
 0.176766   1.0        0.133559   0.116917
 0.360143   0.425408   0.114545   0.288854
 0.182787   0.0386118  0.123711   0.146996
 0.676394   0.444864   0.135674   0.260814
 ⋮                                
 0.0763606  0.868105   0.265134   0.513929
 0.190505   0.113494   0.260401   0.114172
 0.378913   0.969245   0.213968   0.130775
 0.0925918  0.11976    0.197852   0.32143
 0.1283     0.789323   0.132733   0.283125
 0.115916   0.0567839  0.513517   0.239496
 0.0371264  0.177314   0.140021   0.298428
 0.286305   0.106446   0.158433   0.39212
 0.162885   0.105606   0.104283   0.347775
 0.0429484  0.905047   0.139504   0.0675872
 0.0920713  0.108091   0.139412   0.0202383
 0.0982375  0.0995635  0.0917764  0.375652
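
In the output above, the entries equal to one sit in rows 1, 3 and 7 of the first three columns, which suggests that output column k corresponds to the k-th entry of `cols`. Under that assumption, the following sketch (plain Julia on a random toy matrix; the threshold is an arbitrary illustrative choice) maps high-scoring entries back to the original variable indices:

```julia
cols = [1, 3, 7, 50]            # the subset passed via the cols keyword

# Toy stand-in for a bipartite result with one column per entry of cols.
P = rand(60, length(cols))

# Collect (target row, source variable) pairs whose score exceeds a cutoff.
# Assumption: column k of P belongs to variable cols[k].
threshold = 0.9                  # arbitrary cutoff chosen for illustration
hits = [(i, cols[k]) for k in axes(P, 2) for i in axes(P, 1) if P[i, k] > threshold]
```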

Analyse coexpression from the variables in Xm to the variables in Xt:

P = findr(Xt,Xm)

23722×674 Matrix{Float64}:
 0.0234184  0.00291162  0.732244     …  0.156543    0.0361393  0.0764542
 0.0221085  0.00435185  0.472921        0.306938    0.0698138  0.125194
 0.0241246  0.00209359  0.351981        0.0358197   0.0646466  0.0804391
 0.532158   0.00245132  0.616879        0.00249436  0.0958314  0.0656744
 0.0222376  0.00256429  0.0337209       0.0317953   0.0874541  0.691511
 0.0311296  0.00405144  0.282852     …  0.0512983   0.0662781  0.841356
 0.0212987  0.00689882  0.300301        0.226043    0.0351949  0.1855
 0.0590008  0.00292483  0.146968        0.0127612   0.0549558  0.0626004
 0.0217252  0.00209742  0.276152        0.237079    0.0438237  0.361715
 0.0226942  0.00344102  0.00199485      0.0140767   0.0350283  0.267969
 0.0213827  0.00532704  0.455821     …  0.197865    0.0368173  0.0158116
 0.0343095  0.00236632  0.534035        0.429556    0.0311012  0.0787474
 0.0214705  0.00207517  0.520825        0.239688    0.609959   0.99285
 ⋮                                   ⋱                         
 0.063287   0.0026124   0.955061     …  0.983501    0.103916   0.353626
 0.0230399  0.0200878   0.14564         0.00081434  0.0299894  0.152551
 0.0244435  0.00362406  0.230516        0.679986    0.0488017  0.452377
 0.0439679  0.00265316  0.0162124       0.113298    0.0386101  0.0763466
 0.0481255  0.0152828   0.10255         0.227967    0.069969   0.323395
 0.0219638  0.0025475   0.129958     …  0.797988    0.0437969  0.235947
 0.0359421  0.00217451  0.0695795       0.0309512   0.0536083  0.0518225
 0.0228804  0.00371468  0.122758        0.404031    0.0544299  0.175212
 0.0214806  0.00207504  0.000637106     0.266088    0.0540143  0.153354
 0.0225533  0.00213282  0.0144905       0.0383339   0.0452756  0.0636567
 0.0285356  0.00291061  0.0484913    …  0.044657    0.0463433  0.235425
 0.0628126  0.0021249   0.0326125       0.00878032  0.117111   0.42914

Association analysis

Testing associations between eQTL genotypes in Gm and microRNA expression levels in Xm:

P = findr(Xm,Gm)

674×55 Matrix{Float64}:
 0.99709    0.336726  0.0       0.0  …  0.000455842  0.0  0.00194948
 0.076501   0.999976  0.0       0.0     0.000743437  0.0  0.00511882
 0.0534877  0.199234  1.0       0.0     0.000448411  0.0  0.00836276
 0.0231871  0.441697  0.0       1.0     0.000418771  0.0  0.00229785
 0.0359331  0.418019  0.0       0.0     0.000657638  0.0  0.00169117
 0.0434282  0.35389   0.0       0.0  …  0.00166157   0.0  0.00197825
 0.0243615  0.370613  0.0       0.0     0.000723086  0.0  1.0
 0.0244173  0.314934  0.0       0.0     0.00219063   0.0  0.0108873
 0.0357465  0.209157  0.0       0.0     0.00064661   0.0  0.00470863
 0.110227   0.376232  0.999825  0.0     0.00167592   0.0  0.00316303
 0.0281279  0.415582  0.0       0.0  …  0.000939568  0.0  0.00779912
 0.027045   0.274934  0.0       0.0     0.00101715   0.0  0.00163995
 0.0603348  0.366114  0.0       0.0     0.000360567  0.0  0.00454491
 ⋮                                   ⋱                    
 0.0374556  0.322883  0.0       0.0     0.000363842  0.0  0.00862389
 0.02492    0.227238  0.0       0.0     0.000602233  0.0  0.00375107
 0.0311861  0.219238  0.0       0.0     0.00039574   0.0  0.00327624
 0.0795592  0.231186  0.0       0.0  …  0.00140163   0.0  0.00415341
 0.046263   0.215004  0.0       0.0     0.000416495  0.0  0.00505554
 0.0303016  0.539201  0.0       0.0     0.008878     0.0  0.00257078
 0.0548847  0.546104  0.0       0.0     0.000603634  0.0  0.00463545
 0.0267455  0.511972  0.0       0.0     0.000423386  0.0  0.00158495
 0.0463963  0.557115  0.0       0.0  …  0.00110603   0.0  0.00157338
 0.105566   0.235344  0.0       0.0     0.00102074   0.0  0.00651383
 0.0220181  0.219283  0.0       0.0     0.000538151  0.0  0.00476893
 0.0307281  0.270128  0.0       0.0     0.000459762  0.0  0.0735222

In the output, columns correspond to eQTLs and rows to genes, that is,

\[ P_{i,j} = P(E_j \to X_i) \]
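
As an illustration of this layout, the sketch below uses a small hand-written matrix as a stand-in for an association result (values invented, cutoff arbitrary) and summarises it per eQTL using only Base Julia functions:

```julia
# Toy stand-in for an association result: 4 genes (rows) by 2 eQTLs (columns).
P = [0.99 0.01;
     0.10 0.00;
     0.05 1.00;
     0.20 0.30]

# For each eQTL (column), the row index of the gene with the highest probability.
best = [argmax(P[:, j]) for j in axes(P, 2)]

# Number of genes per eQTL above an arbitrary illustrative cutoff of 0.5.
n_assoc = vec(sum(P .> 0.5; dims=1))
```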

Causal inference

Subset-to-all

When you run causal inference with findr using matrix-based inputs, the default is to return posterior probabilities for each test separately:

P = findr(Xm,Gm,mirpairs);

Note the dimensions of P:

size(P)

(674, 4, 55)

The third dimension indexes the A-genes (causes), the second dimension the tests (tests 2-5, described in the causal inference tutorial), and the first the B-genes (targets). If you are interested only in a specific combination, use the optional combination argument as explained in the causal inference tutorial:

P = findr(Xm,Gm,mirpairs; combination="IV");
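
When working with the three-dimensional array returned without the combination argument, individual tests can be pulled out by slicing. The sketch below uses a random toy array with the same axis order (B-genes, tests, A-genes). The helper `test_position` is hypothetical and encodes the assumption that tests 2-5 occupy positions 1-4 along the second dimension.

```julia
# Toy stand-in with the same layout as above:
# (B-genes, tests, A-genes) = (5 targets, 4 tests, 3 causes); values invented.
P = rand(5, 4, 3)

# Assumption: tests 2-5 occupy positions 1-4 along the second dimension.
test_position(t) = t - 1

# Targets-by-causes matrix for a single test, here test 2.
P2 = P[:, test_position(2), :]

# All four test results for one (target, cause) pair.
i, a = 4, 2
per_test = P[i, :, a]
```

Slicing with a scalar index drops that dimension, so `P2` is an ordinary matrix with targets in rows and causes in columns, matching the layout of the other analyses.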