Utilities

Data preparation

The findr function expects that input DataFrames use scientific types, mainly to distinguish between count-based expression data and categorical genotype data which could otherwise both be presented as integer-valued data. A utility function coerce_scitypes! is provided to convert data to the right scientific type:

BioFindr.coerce_scitypes! (Function)
coerce_scitypes!(df, scitype)

Coerce all columns of dataframe df to the scientific type scitype. If df contains gene expression data, scitype can be Continuous or Count. If df contains genotype data (or categorical data more generally), scitype can be Multiclass or OrderedFactor. Note, though, that genotypes are always treated as unordered categorical variables in BioFindr, and the ordering of levels for OrderedFactor data is not used. Continuous genotypes (e.g. expected allele counts output by genotype imputation methods) are not supported and must be converted to integers before calling this function.
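As an illustration of the last point, the following minimal sketch converts imputed expected allele counts to integer genotypes using only base Julia. Rounding to the nearest integer is an assumption made here for illustration (it discards the imputation uncertainty), and the dosages matrix is made-up example data, not BioFindr output.

```julia
# Hypothetical expected allele counts (dosages): 4 samples (rows) x 2 variants (columns)
dosages = [0.02 1.97;
           0.98 1.10;
           1.85 0.05;
           0.40 0.60]

# Round each dosage to the nearest integer genotype (0, 1 or 2).
# This is one possible conversion, chosen here only for illustration.
genotypes = round.(Int, dosages)
```

The integer matrix can then be wrapped in a DataFrame and passed to coerce_scitypes! with scitype Multiclass, as described above.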

BioFindr.test_scitype (Function)
test_scitype(df, scitype)

Test if all columns of dataframe df have the scientific type scitype. If scitype is Continuous or Count, the function will return true if all columns are of that type. If scitype is Multiclass or OrderedFactor, the function will return true if all columns are of that type, although the columns may have different levels. If the columns have different types, the function will return false.


Postprocessing functions

Several utility functions are used when findr is called with DataFrame inputs, some of which may be useful when manually post-processing the output of findr_matrix calls with matrix-based inputs.

BioFindr.getpairs (Function)
getpairs(dX::T, dG::T, dE::T; colG=1, colX=2)

Get pairs of indices of matching columns from dataframes dX and dG, with column names that should be matched listed in dataframe dE. The optional parameters colG (default value 1) and colX (default value 2) indicate which columns of dE need to be used for matching, either as a column number (integer) or column name (string). The optional parameter namesX can be used to match rows in dE to only a subset of the column names of dX.
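The matching logic can be sketched in base Julia with plain vectors of names in place of DataFrames. The helper matchpairs_sketch and the example names are hypothetical, and skipping rows whose names are not found is an assumption; the actual return format of getpairs may differ.

```julia
# Hypothetical helper, not part of BioFindr: for each (genotype name, expression name)
# row, look up the positions of both names and keep the row only if both are found.
function matchpairs_sketch(namesX, namesG, pairtable)
    out = Tuple{Int,Int}[]
    for (g, x) in pairtable
        ig = findfirst(==(g), namesG)   # column index in the genotype data
        ix = findfirst(==(x), namesX)   # column index in the expression data
        if ig !== nothing && ix !== nothing
            push!(out, (ig, ix))
        end
    end
    return out
end

namesX = ["geneA", "geneB", "geneC"]          # made-up expression column names
namesG = ["snp1", "snp2"]                     # made-up genotype column names
pairtable = [("snp2", "geneA"), ("snp1", "geneC"), ("snp9", "geneB")]

index_pairs = matchpairs_sketch(namesX, namesG, pairtable)
```

Here the third row is dropped because "snp9" is not among the genotype names.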

BioFindr.symprobs (Function)
symprobs(P; combination="prod")

Symmetrize a square matrix of posterior probabilities P. The optional parameter combination defines the symmetrization method:

  • none: do nothing (default)
  • prod: $P'_{ij}=P_{ij}P_{ji}$
  • mean: $P'_{ij}=\frac{1}{2}(P_{ij} + P_{ji})$
  • anti: $P'_{ij}=\frac{1}{2}(P_{ij} + 1 - P_{ji})$

Note that the anti option defines "antisymmetric" probabilities, $P'_{ij} + P'_{ji} = 1$, where evidence for a causal interaction $i\to j$ is also considered evidence against the opposite interaction $j\to i$.
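The formulas above can be written out directly in base Julia. The helper symmetrize_sketch below is hypothetical and is not the BioFindr implementation; it takes the method as a required argument so that it makes no assumption about the default.

```julia
# Hypothetical helper, not part of BioFindr: applies the symmetrization formulas
# listed above to a square matrix of probabilities.
function symmetrize_sketch(P::AbstractMatrix, combination::AbstractString)
    size(P, 1) == size(P, 2) || throw(ArgumentError("P must be square"))
    Pt = permutedims(P)                    # Pt[i, j] == P[j, i]
    if combination == "none"
        return copy(P)
    elseif combination == "prod"
        return P .* Pt
    elseif combination == "mean"
        return (P .+ Pt) ./ 2
    elseif combination == "anti"
        return (P .+ 1 .- Pt) ./ 2
    else
        throw(ArgumentError("unknown combination: $combination"))
    end
end

P = [1.0 0.8;
     0.2 1.0]                              # made-up example probabilities
A = symmetrize_sketch(P, "anti")
S = symmetrize_sketch(P, "prod")
```

With the anti option, each off-diagonal pair of entries in A sums to one, which is the antisymmetry property stated above.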

BioFindr.combineprobs (Function)
combineprobs(P; combination="none")

Combine posterior probabilities P for multiple likelihood ratio tests into a single probability (local precision) value.

The optional parameter combination defines the combination test:

  • none: do nothing, return the input P (default)
  • mediation: the mediation test ($P_2 P_3$)
  • IV: the instrumental variable or non-independence test ($P_2 P_5$)
  • orig: BioFindr's original combination ($\frac{1}{2}(P_2 P_5 + P_4)$)

The input must be a three-dimensional array whose second dimension has size 4 and indexes the individual BioFindr tests (tests 2 to 5). The output is a matrix of size size(P,1) x size(P,3).
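A minimal base-Julia sketch of these combinations follows. The helper combine_sketch is hypothetical, and it assumes that position k along the second dimension holds test k + 1 (so position 1 is test 2 and position 4 is test 5); the example values are made up.

```julia
# Hypothetical helper, not part of BioFindr: combines per-test probabilities
# according to the formulas listed above.
function combine_sketch(P::AbstractArray{<:Real,3}, combination::AbstractString)
    size(P, 2) == 4 || throw(ArgumentError("second dimension must have size 4"))
    # Assumed layout: P[:, k, :] holds the probabilities for test k + 1.
    P2, P3, P4, P5 = (P[:, k, :] for k in 1:4)
    if combination == "none"
        return P
    elseif combination == "mediation"
        return P2 .* P3
    elseif combination == "IV"
        return P2 .* P5
    elseif combination == "orig"
        return (P2 .* P5 .+ P4) ./ 2
    else
        throw(ArgumentError("unknown combination: $combination"))
    end
end

P = fill(0.5, 2, 4, 3)     # made-up probabilities, size 2 x 4 x 3
P[:, 1, :] .= 0.9          # test 2
P[:, 3, :] .= 0.6          # test 4
P[:, 4, :] .= 0.8          # test 5

C = combine_sketch(P, "orig")
M = combine_sketch(P, "mediation")
```

Apart from the none option, which returns the three-dimensional input unchanged, each result is a matrix of size size(P,1) x size(P,3).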

BioFindr.stackprobs (Function)
stackprobs(P,colnames,rownames;nodiag=true)

Convert a matrix of pairwise posterior probabilities P with column and row names colnames and rownames, respectively, to a stacked dataframe with Source, Target, and Probability columns, corresponding respectively to a column name, a row name, and the value of P in the corresponding row and column pair.

The optional parameter nodiag determines if self-interactions (equal row and column name) are excluded (nodiag=true, default) or not (nodiag=false).
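The reshaping can be sketched in base Julia, using a vector of named tuples in place of a DataFrame. The helper stack_sketch and the example names are hypothetical; the row order produced here (column by column) is an assumption and need not match the order used by stackprobs.

```julia
# Hypothetical helper, not part of BioFindr: one (Source, Target, Probability)
# record per matrix entry, where Source is the column name and Target the row name.
function stack_sketch(P::AbstractMatrix, colnames, rownames; nodiag=true)
    return [(Source=colnames[j], Target=rownames[i], Probability=P[i, j])
            for j in axes(P, 2) for i in axes(P, 1)
            if !(nodiag && rownames[i] == colnames[j])]
end

P = [1.0 0.3;
     0.7 1.0]                        # made-up probabilities
genes = ["A", "B"]                   # made-up names, used for rows and columns

stacked = stack_sketch(P, genes, genes)
stacked_all = stack_sketch(P, genes, genes; nodiag=false)
```

With nodiag=true the two self-interactions are dropped, so only the entries for A to B and B to A remain.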

BioFindr.globalfdr! (Function)
globalfdr!(dP::T; FDR=1.0, sorted=true) where T<:AbstractDataFrame

For a DataFrame dP of posterior probabilities (local precision values), compute their corresponding q-values and keep only the rows with q-value less than a desired global false discovery rate FDR (default value 1, no selection). dP is assumed to be the output of a findr run with columns Source, Target, and Probability. The output DataFrame mirrors the structure of dP, keeping only the selected rows, and with an additional column qvalue. The output is sorted by qvalue if the optional argument sorted is true (default). If dP already contains a column qvalue, only the filtering and optional sorting are performed.

BioFindr.globalfdr (Function)
globalfdr(P::Array{T},FDR) where T<:AbstractFloat

For an array (matrix or vector) P of posterior probabilities (local precision values), compute their corresponding q-values Q, and return the indices of P with q-value less than a desired global false discovery rate FDR.

See also qvalue

BioFindr.qvalue (Function)
qvalue(P::Vector{T}) where T<:AbstractFloat

Convert a vector P of posterior probabilities (local precisions) to a vector of q-values. For a threshold value $c$ on the posterior probabilities P, the global FDR, $\mathrm{FDR}(c)$, is defined as one minus the average local precision of the selected pairs:

$\mathrm{FDR}(c) = 1 - \frac{1}{N_c} \sum_{i\colon P_i\geq c} P_i,$

where $N_c=\sharp\{i\colon P_i\geq c\}$ is the number of selected pairs, that is, the pairs whose posterior probability is at least $c$. The q-value of a given index in P is then defined as the smallest FDR at which this pair is still called significant.
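The definition can be sketched in base Julia as follows, assuming that the selected pairs at a threshold are those whose posterior probability is at least that threshold. The helper qvalue_sketch is hypothetical and is not the BioFindr implementation. Because the running mean of probabilities sorted in decreasing order can only decrease, the FDR can only increase as more pairs are selected, so the smallest FDR at which a pair is selected is the FDR at the threshold equal to its own probability.

```julia
# Hypothetical helper, not part of BioFindr: q-values as one minus the running
# mean of the posterior probabilities sorted from highest to lowest.
function qvalue_sketch(P::Vector{<:AbstractFloat})
    perm = sortperm(P; rev=true)                        # highest probability first
    sorted = P[perm]
    fdr = 1 .- cumsum(sorted) ./ (1:length(sorted))     # FDR at threshold sorted[k]
    # Tied probabilities are selected together, so they share the FDR of the whole tie group.
    for k in length(sorted)-1:-1:1
        if sorted[k] == sorted[k+1]
            fdr[k] = fdr[k+1]
        end
    end
    q = similar(fdr)
    q[perm] = fdr                                       # back to the original order
    return q
end

P = [0.9, 0.5, 0.7]                 # made-up posterior probabilities
q = qvalue_sketch(P)
selected = findall(q .< 0.25)       # indices kept at a global FDR of 25%
```

In this example the top two probabilities (0.9 and 0.7) have a mean of 0.8, so selecting both corresponds to an FDR of about 0.2, and adding the third pair raises it to about 0.3.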


Generating simulated data

BioFindr includes a function generate_test_data for generating simple simulated data for testing the package:

BioFindr.generate_test_data (Function)
generate_test_data(nA, nB, fB, ns, ng, maf, bGA, bAB, supernormalize)

Generate test data for BioFindr with nA causal variables, nB potential target variables of which a random fraction fB are true targets for each causal variable, ns samples, ng genotype (instrumental variable) groups with minor allele frequency maf, and effect sizes bGA and bAB. Variables are sampled from a linear model with independent Gaussian noise with variance ϵ and correlated Gaussian noise with variance δ and covariance δρ. If supernormalize is true, the data is supernormalized.
