Cancer subtype classification

Gene expression profiling of tumour samples.

Gene expression profiling predicts clinical outcome of breast cancer

The study by van ’t Veer et al. was one of the first to use microarrays, a brand-new technology at the time, to profile gene expression on a genome-wide scale in surgically removed tumour samples, in this case breast tumours. Another paper from around the same time is Perou et al., Molecular portraits of human breast tumours. The credit for being the first to apply cluster analysis to gene expression data (from yeast) probably goes to Eisen et al., Cluster analysis and display of genome-wide expression patterns.

Van ’t Veer et al. clustered data from 98 tumours based on their similarities across approximately 5,000 differentially expressed genes, that is, genes that showed more variation than expected by chance in the dataset. The most striking finding is in their Figure 1: the tumours segregated into two distinct groups that correlated strongly with clinical features, namely:

Overall, tumours in the bottom group of the figure were clearly associated with measures that predict better patient outcome.
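
To make the clustering step concrete, here is a minimal sketch on synthetic data. It is not the authors' pipeline: the data, the variance-based gene filter, the choice of 1 minus Pearson correlation as distance, average linkage, and all function names (`most_variable_genes`, `cluster_into_two`) are assumptions made for illustration. The sketch keeps the most variable genes and then merges tumours bottom-up until two groups remain.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def most_variable_genes(data, n_keep):
    """Indices of the n_keep genes with the largest variance across tumours."""
    def variance(g):
        column = [tumour[g] for tumour in data]
        mean = sum(column) / len(column)
        return sum((v - mean) ** 2 for v in column) / (len(column) - 1)
    return sorted(range(len(data[0])), key=variance, reverse=True)[:n_keep]


def cluster_into_two(profiles):
    """Average-linkage agglomerative clustering with distance 1 - correlation,
    stopped when two clusters remain. Returns two sorted lists of indices."""
    n = len(profiles)
    dist = [[1.0 - pearson(profiles[i], profiles[j]) for j in range(n)]
            for i in range(n)]
    clusters = [[i] for i in range(n)]
    while len(clusters) > 2:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [dist[i][j] for i in clusters[a] for j in clusters[b]]
                d = sum(pairs) / len(pairs)  # average pairwise distance
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters[b]  # merge the closest pair of clusters
        del clusters[b]
    return [sorted(c) for c in clusters]


# Synthetic data (an assumption, not the real dataset): 20 tumours x 200 genes.
# Tumours 0-9 and 10-19 differ in the first 30 genes, in opposite directions
# for even- and odd-numbered genes.
rng = random.Random(0)
n_tumours, n_genes, n_informative = 20, 200, 30
data = []
for t in range(n_tumours):
    group = 1.0 if t < 10 else -1.0
    row = []
    for g in range(n_genes):
        value = rng.gauss(0.0, 1.0)
        if g < n_informative:
            direction = 1.0 if g % 2 == 0 else -1.0
            value += 2.0 * group * direction
        row.append(value)
    data.append(row)

selected = most_variable_genes(data, n_informative)
profiles = [[tumour[g] for g in selected] for tumour in data]
groups = cluster_into_two(profiles)
print(groups)
```

Note that no clinical labels enter the clustering. In the study, the association with clinical features was observed only afterwards, by comparing the resulting groups with the patient annotations.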

Having found that unsupervised clustering of gene expression profiles separates good- and poor-prognosis groups, the authors then searched for a minimal prognostic signature in their data. This resulted in an optimal set of 70 marker genes, which they validated in an independent set of tumour samples.
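
The idea of a supervised signature with independent validation can be sketched as follows. This is one simple way to do it and is not claimed to be the authors' exact procedure: the synthetic data, the ranking of genes by absolute correlation with outcome, the nearest-centroid classifier, and the names `select_signature`, `class_centroid` and `predict` are all assumptions for illustration. Genes are selected on a training set only, and accuracy is then measured on samples that played no role in the selection.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate(rng, n_samples, n_genes, n_informative):
    """Synthetic expression data; outcome 0 = good, 1 = poor prognosis.
    Only the first n_informative genes depend on the outcome."""
    samples, outcomes = [], []
    for s in range(n_samples):
        outcome = s % 2
        shift = 1.5 if outcome == 1 else -1.5
        row = []
        for g in range(n_genes):
            value = rng.gauss(0.0, 1.0)
            if g < n_informative:
                value += shift * (1.0 if g % 2 == 0 else -1.0)
            row.append(value)
        samples.append(row)
        outcomes.append(outcome)
    return samples, outcomes


def select_signature(samples, outcomes, size):
    """Rank genes by |correlation with outcome| and keep the top `size`."""
    def score(g):
        return abs(pearson([s[g] for s in samples], outcomes))
    return sorted(range(len(samples[0])), key=score, reverse=True)[:size]


def class_centroid(samples, outcomes, genes, label):
    """Mean signature profile of the training samples with the given outcome."""
    members = [s for s, o in zip(samples, outcomes) if o == label]
    return [sum(s[g] for s in members) / len(members) for g in genes]


def predict(sample, genes, good, poor):
    """Assign the class whose centroid correlates best with the sample."""
    profile = [sample[g] for g in genes]
    return 1 if pearson(profile, poor) > pearson(profile, good) else 0


rng = random.Random(1)
train_x, train_y = simulate(rng, 40, 100, 10)  # used to build the signature
test_x, test_y = simulate(rng, 20, 100, 10)    # independent validation set

signature = select_signature(train_x, train_y, size=10)
good = class_centroid(train_x, train_y, signature, label=0)
poor = class_centroid(train_x, train_y, signature, label=1)
predictions = [predict(s, signature, good, poor) for s in test_x]
accuracy = sum(p == y for p, y in zip(predictions, test_y)) / len(test_y)
print(sorted(signature), accuracy)
```

Keeping the validation samples out of the gene selection step is the important design choice here. If the same samples were used both to pick the genes and to assess the classifier, the reported accuracy would be optimistically biased.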

Clearly, with such a strong signature, the race to bring it to the clinic was on. That this is far from trivial can be seen by tracing the follow-up studies and clinical trials:

They got there eventually, and the gene expression signature is now commercially available under the name MammaPrint.

The Cancer Genome Atlas

Although the results of Van ’t Veer et al. were obtained from a small sample (by current standards!), they have been reproduced consistently in larger studies (see the assignment in the next cluster analysis lecture) and arguably spawned a search for similar signatures in other cancer types through large-scale projects such as The Cancer Genome Atlas (TCGA) Program.

The amount of data and the number of publications produced by TCGA are too large to survey here.

For the purposes of illustration, have a look at the Pan-Cancer Atlas, and then do the following assignment.

Assignment

Last modified May 30, 2023: update Gaussian processes (ff6a6c2)