Progressive multiple sequence alignment

Bioinformatics
Author

Tom Michoel

Published

September 15, 2025

Introduction

Multiple sequence alignment (MSA) is one of the most fundamental problems in bioinformatics. MSA algorithms depend on two things: a mechanism to score and compare MSAs, and an algorithm to find a high-scoring MSA. Because of the computational complexity of the problem, most algorithms use heuristic methods and are not guaranteed to find the globally optimal MSA.

The most popular approaches construct progressive alignments, which build up a MSA from successive pairwise alignments. The history of progressive alignment goes back to Feng and Doolittle’s 1987 article, with further development leading to the popular Clustal Omega software.

As always with computational problems for which a globally optimal solution cannot be computed in reasonable time, there is always room for further tweaks and improvements in the heuristics being used, leading to complex algorithms that are not at all easy to understand for a novice. Therefore, from an educational point of view it is important to go back to basics, and in particular Feng and Doolittle’s description of the problem:

Ordinarily, alignments […] are performed pairwise. The problem is that when the various paired alignments are grouped, they are seldom consistent with one to another. Thus, when sequence A is paired with sequence B, gaps may appear at various locations, but when either A or B is aligned with a third sequence, C, the arrangment of gaps may be entirely different.

To illustrate the nature of this problem, how (progressive) multiple sequence alignment tries to overcome it, and the pitfalls of using a heuristic algorithm, nothing beats a simple example.

Sequences and alignment scoring scheme

Throughout this example we will consider three nucleotide sequences:

s1: A A T
s2: A A C
s3: A A C T 

The following alignment scoring scheme will be used

Alignment Score
Match +1
Mismatch -1
Gap -2

Pairwise alignments

Since the gap penalty is larger than the mismatch penalty, it is easy to see that the optimal (highest-scoring) pairwise alignments between the sequences are:

pa12: A A T
      A A C
      -----
  S = 1+1-1 = 1    

pa13: A A - T
      A A C T
      -------
  S = 1+1-2+1 = 1

pa13: A A C -
      A A C T
      -------
  S = 1+1+1-2 = 1

Here we immediately see the problem that Feng & Doolittle described: when sequence 1 is combined with sequence 2, the optimal position for the T is in the third position, but when it is combined with the third sequence, the optimal position for the T shifts to the fourth position. In contrast the C in sequence 2 remains in third position in both alignments.

Multiple sequence alignment

To resolve this ambiguity, we need to align all three sequences using multiple sequence alignment. MSA typically uses the sum-of-pairs score: the total score contribution for each position is calculated as the sum of scores of each pairwise comparison at that position, using the original pairwise scoring scheme. One ingredient needs to be added to the scheme, namely when two gaps occur in the same position (which doesn’t happen in a pairwise alignment), we commonly count a contribution of zero to the score, to avoid double counting (or extra penalization) for the use of gaps.

From the pairwise alignments above, it follows that we only need to consider two MSAs, one with s1 aligned as in pa12 and one with s1 aligned as in pa13:

msa1: A A T -
      A A C -
      A A C T
      -------
 SP = 3+3-1-4 = 1
msa2: A A - T
      A A C -
      A A C T
      -------
 SP = 3+3-3-3 = 0

We see that the ambiguity from the pairwise alignments is resolved: under the sum-of-pairs score, msa1 is better than msa2.

Progressive multiple sequence alignment

Progressive multiple sequence alignment is an algorithm to perform multiple sequence alignment when the number and/or length of the sequences makes an exhaustive search for the highest-scoring MSA impractical. The algorithm works by progressively (hence the name) reducing the number of sequences to be aligned, by reducing the most similar sequence pair to their consensus sequence.

Consensus sequences keep track of ambiguous sites (see also the FASTA sequence representation format). When scoring an ambiguous site in a pairwise alignment, we take the average of the scores of each possibility. Gaps are ignored when representing ambiguous sites.

Progressive MSA proceeds in the following steps:

Step 1: Compute pairwise alignment scores

First the pairwise alignment scores between all sequences are computed and typically represented in an array

\[ S = \begin{pmatrix} - & 1 & 1\\ - & - & 1\\ - & - & - \end{pmatrix} \]

where the \((i,j)\)th entry is the score for the pairwise alignment between sequence \(i\) and \(j\).

Step 2: Reduce the most similar pair of sequences to their consensus

The most similar pair of sequences is identified by finding the highest-scoring entry in the score matrix \(S\). When there are multiple equally scoring pairs, one is selected at random. Here all three pairs have the same score, so let’s randomly select the pair \((1,2)\). Its consensus sequence is:

pa12: A A T
      A A C
      -----
cs12: A A [TC]

Step 3: Update the pairwise alignment score matrix

We replace sequences 1 and 2 by their consensus sequence cs12 and update the pairwise alignment score matrix by computing the pairwise alignment scores between cs12 and the remaining sequences, in this case only s3. There are two possibilities to consider:

    A A [TC] -
    A A C    T
    ----------------
S = 1+1+0.5*(-1+1)-2 = 0

and

    A A - [TC]
    A A C T
    ----------------
S = 1+1-2+0.5*(-1+1) = 0

In either case the score is zero and our updated \(S\) is

\[ S = \begin{pmatrix} - & 0\\ - & - \end{pmatrix} \]

Step 4: Repeat step 1–3 until only one consensus sequence remains

We already have only one pair left in \(S\). Because there were two pairwise alignments that gave the same score for the value in \(S\), we randomly pick one, say the first, and construct the consensus:

    A A [TC] -
    A A C    T
    ----------
    A A [TC] T

Step 5: Unfold the final consensus sequence in a multiple sequence alignment

Tracing back through the selected pairwise alignments at each step of the algorithm, we recover the MSA that corresponds to the final consensus sequence:

consensus: A A [TC] T

s1:        A A  T   -
s2:        A A  C   -
s3:        A A  C   T

Lo and behold, we have recovered the optimal msa1 with nothing but progressive pairwise alignments!

Caveats

It should be pretty clear from the previous steps that we could just as well not have recovered the optimal msa1.

Consider step 3. If the other, equal scoring alignment

    A A - [TC]
    A A C T

had been chosen to build the consensus sequence in step 4, then we would have obtained

consensus: A A C [TC]

s1:        A A -  T
s2:        A A -  C
s3:        A A C  T

with an SP score of 0.

Or consider step 1. If instead of starting with pa12, we had started with pa13, then we would have obtained

consensus: A A C T

s1:        A A - T
s2:        A A C -
s3:        A A C T

also with an SP score of 0.

Hence it should always be remembered that progressive multiple sequence alignment is a greedy algorithm, which makes locally optimal choices at each step that produce a good solution in a reasonable amount of time, but without any guarantee to return the globally optimal solution!