Have you wondered how scientists identify regions of similarity in three or more biological sequences? As described in my previous article, Sequence alignment is a method of arranging sequences of DNA, RNA, or protein to identify regions of similarity. In my latest article on bioinformatics, I have discussed about pairwise sequence alignment. Make sure to check them out as well. Multiple sequence alignment is quite similar to pairwise sequence alignment, but it uses three or more sequences instead of only two sequences.
In this article, I will be walking you through multiple sequence alignment. Furthermore, we will be trying out some examples with Clustal Omega and T-Coffee whicle checking out some coding examples with Biopython.
Figure 1: Results from T-Coffee
What is Multiple Sequence Alignment?
In multiple sequence alignment (MSA) we try to align three or more related sequences so as to achieve maximal matching between them. The goal of MSA is to arrange a set of sequences in such a way that as many characters from each sequence are matched according to some scoring function.
Figure 2: Image Source: https://openi.nlm.nih.gov/detailedresult.php?img=PMC2921379_1756-0500-3-199-1&req=4
The scoring process of MSA is based on the sum of the scores of all possible pairs of sequences in the multiple alignment according to some scoring matrix. You can refer my previous article to learn about the different scoring matrices and how to match them.
Score of multiple alignment = ∑ score(A, B)
where score(A, B) = pair-wise alignment score of A, B.
Sequence 1: G K N Sequence 2: T R N Sequence 3: S H E Sum of pairs: -1 + 1 + 6 = 6
Sum of 2nd Col = score (K, R) + score (R, H) + score (K, H) = 2+0+-1 = 1
The goal of MSA is to achieve the maximum Sum of Pairs.
Types of Multiple Sequence Alignment
Aligning three or more sequences can be difficult and are almost always time-consuming to align manually. Hence computational algorithms are used to produce and analyze these alignments. Most MSA algorithms use dynamic programming and heuristic methods.
Given below are MSA techniques which use heuristic methods.
- Progressive Alignment Construction
- Iterative Alignment Construction
- Block-base Alignment
These methods can find solutions among all possible solutions, but they do NOT guarantee that the best solution will be found. Hence they are considered as approximations but we can easily find a solution close to the actual one within a short time.
Progressive Alignment Construction
This method, also known as the hierarchical or tree method, was developed by Paulien Hogeweg and Ben Hesper in 1984. It builds up a final MSA by combining pairwise alignments beginning with the most similar pair and progressing to the most distantly related pair.
Two of the popular progressive alignment methods used at present are,
Iterative Alignment Construction
This method consists of a set of methods to produce MSAs while reducing the errors inherent in progressive methods. They work similarly to progressive methods, but repeatedly realign the initial sequences as well as add new sequences to the growing MSA.
This method divides the sequences into blocks and tries to identify blocks of un-gapped alignments shared by many sequences.
DIALIGN2 is a popular block-base alignment approach .
Time to Practice
The time you have been waiting for has arrived. It is time to try out some tools and examples. I will be using Clustal Omega and T-Coffee to show you a few examples of MSA. You can try out these tools online.
Figure 3: Clustal Omega
You will get a page to select the type of data (Protein, DNA or RNA), enter the sequences (or upload a file of a supported format) and set the output format.
I will be using the following 10 genome sequences (DNA) of Acanthaster planci (commonly known as Crown-of-thorns starfish) for this demonstration. You can download the complete genome for an organism from here, in .fsa format.
An FSA file (.fsa) is a fragment analysis data file created by DNA sequencers and analyzers.
|>gnl|GNOMON|17969.m Partial model predicted by Gnomon on Acanthaster planci unplaced genomic scaffold, OKI-Apl_1.0 oki_scaffold1752, whole genome shotgun sequence (NW_019093106.1)|
|>gnl|GNOMON|70594930.m Model predicted by Gnomon on Acanthaster planci unplaced genomic scaffold, OKI-Apl_1.0 oki_scaffold1731, whole genome shotgun sequence (NW_019093085.1)|
|>gnl|GNOMON|18113.m Partial model predicted by Gnomon on Acanthaster planci unplaced genomic scaffold, OKI-Apl_1.0 oki_scaffold1727, whole genome shotgun sequence (NW_019093081.1)|
|>gnl|GNOMON|130719778.m Model predicted by Gnomon on Acanthaster planci unplaced genomic scaffold, OKI-Apl_1.0 oki_scaffold1726, whole genome shotgun sequence (NW_019093080.1)|
|>gnl|GNOMON|65528162.m Model predicted by Gnomon on Acanthaster planci unplaced genomic scaffold, OKI-Apl_1.0 oki_scaffold1702, whole genome shotgun sequence (NW_019093056.1)|
|>gnl|GNOMON|85684402.m Model predicted by Gnomon on Acanthaster planci unplaced genomic scaffold, OKI-Apl_1.0 oki_scaffold1690, whole genome shotgun sequence (NW_019093044.1)|
|>gnl|GNOMON|65753714.m Model predicted by Gnomon on Acanthaster planci unplaced genomic scaffold, OKI-Apl_1.0 oki_scaffold1685, whole genome shotgun sequence (NW_019093039.1)|
|>gnl|GNOMON|22673.m Partial model predicted by Gnomon on Acanthaster planci unplaced genomic scaffold, OKI-Apl_1.0 oki_scaffold1684, whole genome shotgun sequence (NW_019093038.1)|
|>gnl|GNOMON|11409.m Partial model predicted by Gnomon on Acanthaster planci unplaced genomic scaffold, OKI-Apl_1.0 oki_scaffold1657, whole genome shotgun sequence (NW_019093011.1)|
|>gnl|GNOMON|22945.m Partial model predicted by Gnomon on Acanthaster planci unplaced genomic scaffold, OKI-Apl_1.0 oki_scaffold1640, whole genome shotgun sequence (NW_019092994.1)|
10 sequences from the genome of Acanthaster planci (commonly known as Crown-of-thorns starfish)
After entering the sequences and selecting the parameters, you can submit your job. The results of the job can be viewed as follows. You can check the results for this example from here.
Figure 4: Results of the job on Clustal Omega
You can download the results as an alignment file with the .aln extension by clicking on the Download Alignment File button.
Figure 5: T-Coffee
For this demonstration, I have selected Combine popular aligners (M-Coffee) under DNA section. Then you will get a page to enter the sequences (or upload a file of supported format).
Figure 6: M-Coffee input page
I will be using the same file I used to demonstrate Clustal Omega. After entering the sequences, you can submit your job. The results of the job can be viewed as follows. You can check the results for this example from here.
Figure 7: Results for the job on T-Coffee
Biopython Wrappers for Clustal Omega and T-Coffee
Biopython, which I had introduced in my previous article, consists of command line wrappers for Clustal Omega, T-Coffee and many other tools such as ClustalW and DIALIGN. You can check out all the wrappers and sample code from here. I will show how to use the Clustal Omega wrapper in the next example.
To run the Clustal Omega wrapper, first you should download its precompiled binaries. You can download them from here. A binary file will be downloaded. You can make it an executable using the command given below. The name of the binary file may vary according to the binary file version you have downloaded. In my machine it was downloaded as clustal-omega-1.2.3-macosx.
chmod 777 clustal-omega-1.2.3-macosx
I will be using the same .fsa file used before, for this example as well.
The following code uses the Clustal Omega wrapper to develop MSA for the given input .fsa file. It will be easy if you have the .fsa file, .py file and the executable binary in the same location.
|# Import Clustal Omega wrapper|
|from Bio.Align.Applications import ClustalOmegaCommandline|
|# Define input file|
|in_file = "/Users/vijinimallawaarachchi/Documents/Python/Acanthaster_planci_Gnomon.fsa"|
|# Define output file|
|out_file = "aligned.fasta"|
|# Get the command for Clustal Omega|
|clustalomega_cline = ClustalOmegaCommandline(infile=in_file, outfile=out_file, verbose=True, auto=True)|
|# Print the executable command|
After running this python code you will get a command as output. The path may change according to where you put your files.
clustalo -i /Users/vijinimallawaarachchi/Documents/Python/Acanthaster_planci_Gnomon.fsa -o aligned.fasta — auto -v
Figure 8: Output command
Now execute the downloaded binary file with the shown command as follows. Make sure to replace clustalo by ./clustal-omega-1.2.3-macosx
./clustal-omega-1.2.3-macosx -i /Users/vijinimallawaarachchi/Documents/Python/Acanthaster_planci_Gnomon.fsa -o aligned.fasta --auto -v
You will get the following results saying that progressive alignment has been done.
Figure 9: Final result
Now you can see that a file named as aligned.fasta has been formed in the same location as your files are. Once you open it, you can see the alignments in FASTA format.
FASTA format is a text-based format for representing either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes. The format also allows for sequence names and comments to precede the sequences. The format originates from the FASTA software package, but has now become a standard in the field of bioinformatics.
Hope you enjoyed reading this article and learned something useful and interesting.
Since I’m still very new to this field, I would like to hear your advice. 😊
Thanks for reading… 😃
Originally posted in Towards Data Science