Probabilistic and causal modelling of genome-scale data
Machine learning plays an important role in computational biology. See the Machine Learning in Computational and Systems Biology community or the Machine Learning in Computational Biology conference series.
These lecture notes focus on probabilistic machine learning methods for computational biology, where the experimental data are viewed as random samples from an underlying data-generating process.
The “probabilistic modelling” in the title refers to the use of abstract data-generating processes that are not based on any specific biological mechanism, but derived from generic models and methods. A typical example is clustering using Gaussian mixture models.
“Causal modelling” requires something more, namely a data-generating process based on qualitative prior knowledge or understanding of the true underlying biological process. A typical example is path analysis.
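As a toy illustration of path analysis (not taken from the course materials; the chain model and coefficients below are made up for the example), we can simulate a causal chain X → M → Y with standardized variables and verify Wright's path-tracing rule, which predicts that the correlation between X and Y equals the product of the path coefficients along the chain.

```python
import math
import random

def corr(u, v):
    """Pearson correlation of two equal-length sequences."""
    n = len(u)
    mu_u, mu_v = sum(u) / n, sum(v) / n
    su = math.sqrt(sum((a - mu_u) ** 2 for a in u) / n)
    sv = math.sqrt(sum((b - mu_v) ** 2 for b in v) / n)
    return sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v)) / (n * su * sv)

# Simulate the chain X -> M -> Y with standardized variables and
# (hypothetical) path coefficients a = 0.8 (X -> M) and b = 0.5 (M -> Y).
random.seed(1)
n = 20000
a, b = 0.8, 0.5
x = [random.gauss(0, 1) for _ in range(n)]
m = [a * xi + random.gauss(0, math.sqrt(1 - a * a)) for xi in x]
y = [b * mi + random.gauss(0, math.sqrt(1 - b * b)) for mi in m]

# Path tracing predicts r(X, Y) = a * b = 0.4.
r_xy = corr(x, y)
```

Because the noise variances are chosen so that each variable has unit variance, the simple correlations estimate the path coefficients directly, which is the essence of the method of path coefficients.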
The notes are divided into chapters, each focusing on a specific class of methods:
- Regularized regression
- Dimensionality reduction
- Causal inference
- Graphical models
- Spatio-temporal models
Each chapter follows the same structure:
- A “classic” biological or biomedical research paper is studied in which the algorithm (or class of algorithms) of interest was first used. A more recent follow-up or related paper is given as a reading assignment.
- The method used in the classic paper is presented in detail, along with additional methods for solving the same type of problem. The methods are put into practice in a programming assignment. Where possible, original data from the papers studied in the first part are used.
Four appendices contain the minimum required background knowledge on gene regulation, probability theory, linear algebra, and optimization.
The theoretical sections contain the basic information needed to understand a method. For more background, try the following textbooks (with free PDFs!), all used in preparing this course:
- Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning (second edition) (2009).
- Christopher Bishop. Pattern Recognition and Machine Learning (2006).
- Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. Mathematics for Machine Learning (2020).
The use of classic or path-breaking papers is motivated by the paper “Back to the future: education for systems-level biologists”. Since the field of genome-scale data analysis is relatively young, the choice of papers for study is still somewhat open and likely to evolve as the course matures.
These lectures are taught as part of the master’s programme in bioinformatics at UiB, making up about half of the BINF301 Genome-scale Algorithms course. As such, good background knowledge of basic bioinformatics and omics data is assumed.
A note on figures and copyright

One of the objectives of the course is to learn to read and understand scientific papers. Figures from papers selected for discussion are therefore reproduced in these notes, always with attribution to the original authors. Where possible, open-access papers are used, but some “classic” papers date from before the birth of open access. If the full-text version of a paper is available on EuropePMC, through Unpaywall, or otherwise, I’ve reused figures without seeking any further reprint permission. If anyone feels their copyright is violated, please let me know.