Machine learning plays an important role in computational biology. See the Machine Learning in Computational and Systems Biology community or the Machine Learning in Computational Biology conference series.
These lecture notes focus on probabilistic machine learning methods for computational biology, where the experimental data are viewed as random samples from an underlying data-generating process.
The “probabilistic modelling” in the title refers to the use of abstract data-generating processes, not based on any specific biological mechanisms, and derived from generic models and methods. A typical example will be clustering using Gaussian mixture models.
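To make the idea of an abstract data-generating process concrete, here is a minimal sketch of the Gaussian mixture example in one dimension. It first samples data from a two-component mixture, then recovers the parameters from the samples alone. Everything specific in it is an assumption made for illustration: the function names (`sample_mixture`, `fit_mixture_em`), the parameter values, the restriction to one dimension, and the choice of expectation-maximization as the fitting procedure. It is not the implementation used in the course.

```python
import math
import random


def sample_mixture(n, weights, means, sds, rng):
    """Draw n samples from a 1-D Gaussian mixture (the assumed data-generating process)."""
    data = []
    for _ in range(n):
        # Pick a latent cluster label, then sample from that cluster's Gaussian.
        k = rng.choices(range(len(weights)), weights=weights)[0]
        data.append(rng.gauss(means[k], sds[k]))
    return data


def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def fit_mixture_em(data, n_components=2, n_iter=100):
    """Fit a 1-D Gaussian mixture by expectation-maximization (illustrative only)."""
    lo, hi = min(data), max(data)
    # Crude initialization: means spread evenly over the observed range.
    means = [lo + (k + 0.5) * (hi - lo) / n_components for k in range(n_components)]
    sds = [(hi - lo) / (2 * n_components)] * n_components
    weights = [1.0 / n_components] * n_components
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each sample.
        resp = []
        for x in data:
            p = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weights, means, and standard deviations.
        for k in range(n_components):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            sds[k] = math.sqrt(max(var, 1e-6))  # guard against a collapsing component
    return weights, means, sds


# Hypothetical parameters: two well-separated clusters.
rng = random.Random(0)
data = sample_mixture(500, [0.4, 0.6], [-2.0, 3.0], [1.0, 1.0], rng)
weights, means, sds = fit_mixture_em(data)
print("estimated means:", sorted(means))
print("estimated weights:", weights)
```

The fitting step never sees the cluster labels, only the samples. In this sense the model is generic: it assumes nothing about the biology that produced the clusters.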
To speak of “causal modelling” will require something more, namely that the data-generating process is based on some qualitative prior knowledge or understanding of the true underlying biological process. A typical example will be path analysis.
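As a contrast with the generic case, the sketch below illustrates the basic idea behind path analysis on a simulated three-variable causal chain X → Y → Z. The chain structure plays the role of the qualitative prior knowledge. For standardized variables, the correlation between X and Z implied by this structure is the product of the two path coefficients. The function names, the coefficients `a` and `b`, and the sample size are assumptions chosen for illustration. They are not taken from the course material.

```python
import math
import random


def simulate_chain(n, a, b, rng):
    """Simulate standardized variables along the assumed causal chain X -> Y -> Z."""
    xs, ys, zs = [], [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        # Residual scales are chosen so that Y and Z keep unit variance.
        y = a * x + math.sqrt(1 - a * a) * rng.gauss(0.0, 1.0)
        z = b * y + math.sqrt(1 - b * b) * rng.gauss(0.0, 1.0)
        xs.append(x)
        ys.append(y)
        zs.append(z)
    return xs, ys, zs


def pearson(u, v):
    """Sample Pearson correlation coefficient."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((p - mean_u) * (q - mean_v) for p, q in zip(u, v))
    var_u = sum((p - mean_u) ** 2 for p in u)
    var_v = sum((q - mean_v) ** 2 for q in v)
    return cov / math.sqrt(var_u * var_v)


# Hypothetical path coefficients for X -> Y and Y -> Z.
a, b = 0.8, 0.5
rng = random.Random(1)
xs, ys, zs = simulate_chain(20000, a, b, rng)
r_xz = pearson(xs, zs)
print("observed corr(X, Z):", r_xz)
print("product of path coefficients:", a * b)
```

The two printed numbers should agree up to sampling noise. A different assumed structure, for example Y as a common cause of X and Z, would imply a different relationship between the correlations.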
The notes are divided into chapters, each focusing on a specific class of methods:
Each chapter follows the same structure:
Four appendices contain the minimum required background knowledge on gene regulation, probability theory, linear algebra, and optimization.
The theoretical sections contain the basic information needed to understand a method. For more background, try the following textbooks (with free PDFs!), all used in the preparation of this course:
The use of classic or path-breaking papers is motivated by "Back to the future: education for systems-level biologists". Because the field of genome-scale data analysis is relatively young, the choice of papers for study is not yet settled and is likely to evolve as the course matures.
These lectures are taught as part of the master's program in bioinformatics at UiB, where they make up about half of the BINF301 Genome-scale Algorithms course. Good background knowledge of basic bioinformatics and omics data is therefore assumed.
Cancer subtypes. Combinatorial clustering. Mixture distributions.
Statistical significance for genome-wide studies. False discovery rate estimation.
Drug sensitivity prediction. Ridge, lasso, and elastic net regression.
Single-cell genomics. Probabilistic PCA. T-SNE and UMAP.
Genetics of gene expression. The method of path coefficients. False discovery control.
Gene regulatory networks. Bayesian networks. Other network inference methods.
Spatial and temporal gene expression. Gaussian processes.
Gene regulation. Probability theory. Linear algebra. Optimization.