Bayesian networks
Pattern Recognition and Machine Learning: Chapter 8
A crash course in Bayesian networks
In Bayesian networks, the joint distribution over a set \(\{X_1,\dots, X_p\}\) of random variables is represented by:
- a directed acyclic graph (DAG) \(\cal G\) with the variables as vertices,
- a set of conditional probability distributions, \[\begin{aligned} P\bigl(X_i \mid \{X_j \mid j\in \mathrm{Pa}_i\}\bigr), \end{aligned}\] where \(\mathrm{Pa}_i\) is the set of parents of vertex \(i\) in \(\mathcal{G}\),
such that \[\begin{aligned} P(X_1,\dots,X_p) = \prod_{i=1}^p P\bigl(X_i \mid \{X_j \mid j\in \mathrm{Pa}_i\}\bigr) \end{aligned}\]
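To make the factorization concrete, here is a minimal sketch for a toy chain \(A \to B \to C\) with binary variables; the probability tables are invented purely for illustration.

```python
# Joint distribution of a toy DAG A -> B -> C via the factorization
# P(A, B, C) = P(A) P(B | A) P(C | B). All numbers are made up.
from itertools import product

p_a = {0: 0.6, 1: 0.4}                      # P(A)
p_b_given_a = {0: {0: 0.9, 1: 0.1},         # P(B | A): p_b_given_a[a][b]
               1: {0: 0.3, 1: 0.7}}
p_c_given_b = {0: {0: 0.8, 1: 0.2},         # P(C | B): p_c_given_b[b][c]
               1: {0: 0.25, 1: 0.75}}

def joint(a, b, c):
    """Joint probability from the DAG factorization."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

# Sanity check: the factorized joint sums to 1 over all 8 configurations.
total = sum(joint(a, b, c) for a, b, c in product([0, 1], repeat=3))
print(f"sum over all states: {total:.6f}")   # -> 1.000000
```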
In GRN reconstruction:
- \(X_i\) represents the expression level of gene \(i\),
- \(P(X_1,\dots,X_p)\) represents the joint distribution from which (independent) experimental samples are drawn,
- \(\mathcal{G}\) represents the unknown GRN.
Assume we have a set of \(N\) observations \(x_1,\dots,x_N\in \mathbb{R}^p\) of \(p\) variables, collected in the \(N\times p\) matrix \(\mathbf{X}\).
Assume we know \(\mathcal{G}\) and the conditional distributions. Then the likelihood of observing the data is:
\[\begin{aligned} P(\mathbf{X}\mid \mathcal{G}) &= \prod_{k=1}^N P(X_{k1},\dots,X_{kp}\mid \mathcal{G})\\ &= \prod_{i=1}^p \prod_{k=1}^N P\bigl(X_{ki} \mid \{X_{kj} \mid j\in \mathrm{Pa}_i\}\bigr) \end{aligned}\]
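A hedged sketch of this decomposition, assuming linear Gaussian conditional distributions so that each per-gene factor reduces to a regression problem; the function name, toy data, and parent sets are illustrative, not part of the original.

```python
import numpy as np

def gaussian_log_likelihood(X, parents):
    """Log-likelihood of an N x p data matrix X under a DAG, where
    parents[i] lists the parent column indices of variable i."""
    N, p = X.shape
    total = 0.0
    for i in range(p):
        y = X[:, i]
        if parents[i]:
            A = np.column_stack([X[:, parents[i]], np.ones(N)])  # design matrix + intercept
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)         # per-gene regression fit
            resid = y - A @ beta
        else:
            resid = y - y.mean()                                 # root node: mean only
        sigma2 = max(resid @ resid / N, 1e-12)                   # MLE of the noise variance
        total += -0.5 * N * (np.log(2 * np.pi * sigma2) + 1.0)   # Gaussian log-likelihood
    return total

# Toy usage on a chain 0 -> 1 -> 2 (simulated data, illustrative only)
rng = np.random.default_rng(0)
x0 = rng.normal(size=200)
x1 = 0.8 * x0 + 0.3 * rng.normal(size=200)
x2 = -0.5 * x1 + 0.3 * rng.normal(size=200)
X = np.column_stack([x0, x1, x2])
print(gaussian_log_likelihood(X, {0: [], 1: [0], 2: [1]}))
```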
We can now use an iterative greedy algorithm to optimize \(\mathcal{G}\) and the conditional distributions:
- Start with a random graph \(\mathcal{G}\).
- Given \(\mathcal{G}\), the likelihood decomposes into a product of independent likelihoods, one for each gene, and the conditional distributions can be optimized by standard regression analysis.
- In subsequent iterations, randomly add, delete, or reverse edges in \(\mathcal{G}\), keeping a move only if the graph remains acyclic and the likelihood improves (see the sketch below).
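A minimal sketch of this greedy search, using the move set from the list above. The score function is passed in (e.g. `gaussian_log_likelihood` from the previous sketch), and an empty-graph start replaces the random start for simplicity; all names are assumptions.

```python
import numpy as np

def is_acyclic(parents, p):
    """Check via Kahn's algorithm that the parent sets define a DAG."""
    indeg = [len(parents[i]) for i in range(p)]
    queue = [i for i in range(p) if indeg[i] == 0]
    seen = 0
    while queue:
        u = queue.pop()
        seen += 1
        for v in range(p):
            if u in parents[v]:
                indeg[v] -= 1
                if indeg[v] == 0:
                    queue.append(v)
    return seen == p

def greedy_search(X, score, n_iter=1000, seed=0):
    """Greedy hill climbing over DAGs; `score(X, parents)` returns
    the log-likelihood of the data under the candidate graph."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    parents = {i: [] for i in range(p)}       # empty graph as a simple start
    best = score(X, parents)
    for _ in range(n_iter):
        i, j = rng.choice(p, size=2, replace=False)
        prop = {k: list(v) for k, v in parents.items()}
        if i in prop[j]:
            prop[j].remove(i)                 # delete i -> j ...
            if rng.random() < 0.5:
                prop[i].append(j)             # ... or reverse it to j -> i
        else:
            prop[j].append(i)                 # add edge i -> j
        if not is_acyclic(prop, p):
            continue
        s = score(X, prop)
        if s > best:                          # greedy: keep only improvements
            parents, best = prop, s
    return parents, best
```

Note that adding an edge never decreases the maximized likelihood, so a raw-likelihood search drifts toward dense graphs; in practice a penalized score such as BIC is used instead.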
A more formal approach to optimizing \(\mathcal{G}\) uses Bayes’ theorem: \[\begin{aligned} P(\mathcal{G}\mid \mathbf{X}) = \frac{P(\mathbf{X}\mid \mathcal{G}) P(\mathcal{G})}{P(\mathbf{X})} \end{aligned}\]
- \(P(\mathcal{G})\) represents the prior distribution: even without seeing any data, not all graphs need to be equally likely a priori.
- \(P(\mathbf{X})\) represents the marginal likelihood: \(P(\mathbf{X}) = \sum_{\mathcal{G}'} P(\mathbf{X}\mid \mathcal{G}') P(\mathcal{G}')\). It does not depend on \(\mathcal{G}\) and can be ignored when comparing graphs.
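Since \(\log P(\mathcal{G}\mid \mathbf{X}) = \log P(\mathbf{X}\mid \mathcal{G}) + \log P(\mathcal{G}) + \text{const}\), graphs can be compared by log-likelihood plus log-prior. A minimal sketch, assuming a simple per-edge penalty prior (the penalty value is illustrative):

```python
# log P(G | X) up to the constant -log P(X): likelihood plus prior.
def log_posterior_score(X, parents, score, log_edge_prior=-2.0):
    """`score` computes log P(X | G), e.g. gaussian_log_likelihood above;
    the prior P(G) here penalizes each edge to favor sparse graphs."""
    n_edges = sum(len(v) for v in parents.values())
    return score(X, parents) + n_edges * log_edge_prior
```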
We can use \(P(\mathcal{G})\) to encode evidence for causal interactions obtained by integrating genomics and transcriptomics data; this is the main idea of Zhu et al. (2004).
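A speculative sketch of such an evidence-based prior, in the spirit of (but not taken from) Zhu et al. (2004): each possible edge gets an independent Bernoulli prior whose probability is raised when external genomics evidence supports it. The `edge_evidence` mapping and `base_prob` are hypothetical.

```python
import numpy as np

def log_graph_prior(parents, edge_evidence, base_prob=0.05):
    """log P(G) as a product of independent Bernoulli edge priors:
    edge i -> j is present a priori with probability edge_evidence[(i, j)],
    falling back to base_prob when no external evidence is available."""
    p = len(parents)
    log_p = 0.0
    for j in range(p):
        for i in range(p):
            if i == j:
                continue
            prob = edge_evidence.get((i, j), base_prob)
            log_p += np.log(prob if i in parents[j] else 1.0 - prob)
    return log_p
```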