Utilities
Data preparation
The findr function expects that input DataFrames use scientific types, mainly to distinguish between count-based expression data and categorical genotype data, which could otherwise both be presented as integer-valued data. The utility function coerce_scitypes! converts data to the right scientific type:
BioFindr.coerce_scitypes! — Function
coerce_scitypes!(df, scitype)
Coerce all columns of dataframe df to the scientific type scitype. If df contains gene expression data, scitype can be Continuous or Count. If df contains genotype data (or categorical data more generally), scitype can be Multiclass or OrderedFactor. Note, though, that genotypes are always treated as unordered categorical variables in BioFindr, and the ordering of levels for OrderedFactor data is not used. Continuous genotypes (e.g. expected allele counts output by genotype imputation methods) are not supported and must be converted to integers before calling this function.
BioFindr.test_scitype — Function
test_scitype(df, scitype)
Test if all columns of dataframe df have the scientific type scitype. If scitype is Continuous or Count, the function will return true if all columns are of that type. If scitype is Multiclass or OrderedFactor, the function will return true if all columns are of that type, although the columns may have different levels. If the columns have different types, the function will return false.
Postprocessing functions
Several utility functions are used when findr is called with DataFrame inputs, some of which may be useful when manually post-processing the output of findr_matrix calls with matrix-based inputs.
BioFindr.getpairs — Function
getpairs(dX::T, dG::T, dE::T; colG=1, colX=2)
Get pairs of indices of matching columns from dataframes dX and dG, with the column names to be matched listed in dataframe dE. The optional parameters colG (default value 1) and colX (default value 2) indicate which columns of dE are used for matching, given either as a column number (integer) or a column name (string). The optional parameter namesX can be used to match rows in dE to only a subset of the column names of dX.
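The matching idea can be sketched in plain Julia on vectors of names instead of DataFrames. This is only an illustration, not BioFindr's implementation: the function name match_pairs_sketch, the (X index, G index) tuple output, and the choice to skip rows whose names are not found are all assumptions made for the example.

```julia
# Hypothetical helper (not part of BioFindr): for each (genotype name, expression name)
# row, look up the column positions of both names and keep the index pair.
function match_pairs_sketch(namesX, namesG, pairs)
    out = Tuple{Int,Int}[]
    for (g, x) in pairs
        ix = findfirst(==(x), namesX)   # position of the expression column
        ig = findfirst(==(g), namesG)   # position of the genotype column
        # Assumption: rows that refer to an unknown column are silently skipped.
        (ix === nothing || ig === nothing) && continue
        push!(out, (ix, ig))
    end
    return out
end

namesX = ["gene1", "gene2", "gene3"]
namesG = ["snp1", "snp2"]
pairs = [("snp2", "gene1"), ("snp1", "gene3"), ("snp9", "gene2")]
idx = match_pairs_sketch(namesX, namesG, pairs)
```

Here the first element of each row plays the role of colG=1 and the second the role of colX=2.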
BioFindr.symprobs — Function
symprobs(P; combination="prod")
Symmetrize a square matrix of posterior probabilities P. The optional parameter combination defines the symmetrization method:
none: do nothing (default)
prod: $P'_{ij}=P_{ij}P_{ji}$
mean: $P'_{ij}=\frac{1}{2}(P_{ij} + P_{ji})$
anti: $P'_{ij}=\frac{1}{2}(P_{ij} + 1 - P_{ji})$
Note that the anti option defines "antisymmetric" probabilities, $P'_{ij} + P'_{ji} = 1$, where evidence for a causal interaction $i\to j$ is also considered evidence against the opposite interaction $j\to i$.
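The four rules can be written out directly on a small matrix. The following is a minimal sketch in plain Julia, assuming nothing about how BioFindr implements them; the function name symmetrize_sketch is hypothetical.

```julia
# Hypothetical helper (not BioFindr's code): apply one of the documented
# symmetrization rules to a square matrix of posterior probabilities.
function symmetrize_sketch(P::AbstractMatrix{<:Real}, combination::AbstractString)
    size(P, 1) == size(P, 2) || throw(ArgumentError("P must be square"))
    Pt = permutedims(P)              # Pt[i, j] == P[j, i]
    if combination == "none"
        return copy(P)
    elseif combination == "prod"
        return P .* Pt
    elseif combination == "mean"
        return (P .+ Pt) ./ 2
    elseif combination == "anti"
        return (P .+ 1 .- Pt) ./ 2
    else
        throw(ArgumentError("unknown combination: $combination"))
    end
end

P = [0.0 0.9; 0.2 0.0]
A = symmetrize_sketch(P, "anti")     # A[1, 2] + A[2, 1] equals 1
S = symmetrize_sketch(P, "prod")     # S is symmetric
```

With the anti rule, strong evidence for 1 → 2 (0.9) and weak evidence for 2 → 1 (0.2) reinforce each other, whereas the prod rule penalizes the pair because only one direction is supported.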
BioFindr.combineprobs — Function
combineprobs(P; combination="none")
Combine posterior probabilities P for multiple likelihood ratio tests into a single probability (local precision) value.
The optional parameter combination defines the combination test:
none: do nothing, return the input P (default)
mediation: the mediation test ($P_2 P_3$)
IV: the instrumental variable or non-independence test ($P_2 P_5$)
orig: BioFindr's original combination ($\frac{1}{2}(P_2 P_5 + P_4)$)
The input must be a three-dimensional array where the second dimension has size 4 and indexes the individual BioFindr tests (tests 2-5). The output is a matrix of size size(P,1) x size(P,3).
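As an illustration of the array layout, the sketch below applies the three combination formulas to a three-dimensional array whose second dimension holds tests 2 to 5 in that order. It is written in plain Julia under that stated layout and is not BioFindr's implementation; the name combine_sketch is hypothetical.

```julia
# Hypothetical helper (not BioFindr's code): combine per-test posterior
# probabilities stored along the second dimension of a 3-D array.
function combine_sketch(P::AbstractArray{<:Real,3}, combination::AbstractString)
    size(P, 2) == 4 || throw(ArgumentError("second dimension of P must have size 4"))
    combination == "none" && return P
    # Slice k along the second dimension holds test k + 1 (tests 2, 3, 4, 5).
    P2, P3, P4, P5 = ntuple(k -> P[:, k, :], 4)
    if combination == "mediation"
        return P2 .* P3
    elseif combination == "IV"
        return P2 .* P5
    elseif combination == "orig"
        return (P2 .* P5 .+ P4) ./ 2
    else
        throw(ArgumentError("unknown combination: $combination"))
    end
end

P = fill(0.5, 2, 4, 3)               # toy input: every test probability is 0.5
M = combine_sketch(P, "orig")        # 2 x 3 matrix
```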
BioFindr.stackprobs — Function
stackprobs(P,colnames,rownames;nodiag=true)
Convert a matrix of pairwise posterior probabilities P with column and row names colnames and rownames, respectively, to a stacked dataframe with Source, Target, and Probability columns, corresponding respectively to a column name, a row name, and the value of P in the corresponding row and column pair.
The optional parameter nodiag determines if self-interactions (equal row and column name) are excluded (nodiag=true, default) or not (nodiag=false).
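The stacking convention (Source from the column name, Target from the row name) can be illustrated without DataFrames. The sketch below returns a vector of named tuples rather than a dataframe, which is a simplification made for this example; the name stack_sketch is hypothetical and the code is not BioFindr's implementation.

```julia
# Hypothetical helper (not BioFindr's code): stack a probability matrix into
# (Source, Target, Probability) records, one per column/row pair.
function stack_sketch(P::AbstractMatrix{<:Real}, colnames, rownames; nodiag=true)
    size(P) == (length(rownames), length(colnames)) ||
        throw(ArgumentError("names do not match the size of P"))
    rows = NamedTuple{(:Source, :Target, :Probability),Tuple{String,String,Float64}}[]
    for (j, src) in enumerate(colnames), (i, tgt) in enumerate(rownames)
        nodiag && src == tgt && continue      # drop self-interactions
        push!(rows, (Source = String(src), Target = String(tgt),
                     Probability = Float64(P[i, j])))
    end
    return rows
end

P = [0.1 0.8; 0.3 0.2]
names = ["A", "B"]
stacked = stack_sketch(P, names, names)
```

In this toy matrix the entry P[2, 1] = 0.3 sits in column A and row B, so it becomes the record with Source A and Target B.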
BioFindr.globalfdr! — Function
globalfdr!(dP::T; FDR=1.0, sorted=true) where T<:AbstractDataFrame
For a DataFrame dP of posterior probabilities (local precision values), compute their corresponding q-values and keep only the rows with q-value less than a desired global false discovery rate FDR (default value 1, no selection). dP is assumed to be the output of a findr run with columns Source, Target, and Probability. The output DataFrame mirrors the structure of dP, keeping only the selected rows, and with an additional column qvalue. The output is sorted by qvalue if the optional argument sorted is true (default). If dP already contains a column qvalue, only the filtering and optional sorting are performed.
BioFindr.globalfdr — Function
globalfdr(P::Array{T},FDR) where T<:AbstractFloat
For an array (matrix or vector) P of posterior probabilities (local precision values), compute their corresponding q-values Q, and return the indices of P with q-value less than a desired global false discovery rate FDR.
See also qvalue.
BioFindr.qvalue — Function
qvalue(P::Vector{T}) where T<:AbstractFloat
Convert a vector P of posterior probabilities (local precisions) to a vector of q-values. For a threshold value c on the posterior probabilities P, the global FDR, $FDR(c)$, is defined as one minus the average local precision of the selected pairs:
$FDR(c) = 1 - \frac{1}{N_c} \sum_{i\colon P_i\geq c} P_i,$
where $N_c=\sharp\{i\colon P_i\geq c\}$ is the number of selected pairs, that is, the pairs whose posterior probability is at least the threshold. The q-value of a given index in P is then defined as the smallest FDR at which this pair is still called significant.
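The definition can be followed literally in a few lines of plain Julia. The sketch below assumes that a pair is selected when its posterior probability is at least the threshold, so the smallest selection containing pair i uses the threshold c = P[i]. It is a quadratic-time illustration rather than BioFindr's implementation, and the name qvalue_sketch and the inclusive cutoff in the filtering step are assumptions.

```julia
# Hypothetical helper (not BioFindr's code): q-values from posterior probabilities.
# The average local precision never increases as the threshold is lowered, so the
# smallest FDR at which pair i is selected is reached at the threshold c = P[i].
function qvalue_sketch(P::AbstractVector{<:Real})
    q = similar(P, Float64)
    for i in eachindex(P)
        selected = P[P .>= P[i]]                     # pairs called at threshold P[i]
        q[i] = 1 - sum(selected) / length(selected)  # one minus average local precision
    end
    return q
end

P = [0.9, 0.5, 0.7, 0.1]
q = qvalue_sketch(P)

# Keep the indices whose q-value does not exceed a target global FDR of 0.25.
keep = findall(x -> x <= 0.25, q)
```

For this toy vector, the two most probable pairs (indices 1 and 3) have an average local precision of 0.8, so both survive a 25% global FDR cutoff, while adding the third pair would raise the FDR to 0.3.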
Generating simulated data
BioFindr includes a function generate_test_data for generating simple simulated data for testing the package:
BioFindr.generate_test_data — Function
generate_test_data(nA, nB, fB, ns, ng, maf, bGA, bAB, supernormalize)
Generate test data for BioFindr with nA causal variables, nB potential target variables of which a random fraction fB are true targets for each causal variable, ns samples, ng genotype (instrumental variable) groups with minor allele frequency maf, and effect sizes bGA and bAB. Variables are sampled from a linear model with independent Gaussian noise with variance ϵ and correlated Gaussian noise with variance δ and covariance δρ. If supernormalize is true, the data is supernormalized.