Trillions of microbes live in the environment around us, and even inside our bodies. These microscopic communities form very diverse ecosystems, and studying their composition and behaviour can teach us a lot about them. If you have read my previous article, Metagenomics — Who is there and what are they doing?, then you know that binning is an important step in metagenomics analysis.
What is Metagenomics Binning?
Metagenomics binning is the process of clustering sequences into groups that correspond to taxonomic groups such as species, genera or higher levels. There are two main categories of metagenomics binning approaches: reference-based binning and reference-free binning. Reference-based binning methods align sequences to databases of reference genomes and determine the taxonomic group to which each sequence belongs. Reference-free binning methods use only the information in the sequences themselves, without any prior knowledge, and group sequences into unlabelled bins.
In this article, we will focus on reference-free binning methods. These methods can be divided into three categories:
- Composition-based binning
- Abundance-based binning
- Composition and abundance-based binning
Composition-based Binning Tools
These tools make use of the compositional information of the sequences, which is generally represented by oligonucleotide composition. An oligonucleotide is a contiguous string of a small number of nucleotides. In computational terms, we define oligonucleotides as k-mers (words of size k). The oligonucleotide composition is considered to be conserved within a microbial species and to vary between species. Sequences are represented as oligonucleotide frequency vectors, and different machine learning approaches can be applied to these vectors to group similar sequences together.
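To make the idea of an oligonucleotide frequency vector concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions, not the implementation used by any particular binning tool: the function names `kmer_frequency_vector` and `euclidean_distance`, the choice of k = 2 in the example and the toy sequences are all made up for this purpose, and real tools typically use longer k-mers (often tetramers), merge reverse complements and work on much longer sequences.

```python
from itertools import product


def kmer_frequency_vector(sequence, k=4):
    """Return normalised k-mer frequencies, ordered lexicographically by k-mer."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    sequence = sequence.upper()
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        if window in counts:  # skip windows with N or other ambiguous bases
            counts[window] += 1
    total = sum(counts.values())
    return [counts[km] / total if total else 0.0 for km in kmers]


def euclidean_distance(u, v):
    """Plain Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


# Toy sequences (hypothetical): two AT-rich fragments and one GC-rich fragment.
at_rich_1 = kmer_frequency_vector("ATATATTTAAATATAT", k=2)
at_rich_2 = kmer_frequency_vector("TATATAAATTTATATA", k=2)
gc_rich = kmer_frequency_vector("GCGCGGCCGCGCGGCC", k=2)

# The two AT-rich fragments end up closer to each other than to the GC-rich one.
print(euclidean_distance(at_rich_1, at_rich_2) < euclidean_distance(at_rich_1, gc_rich))
```

Once every sequence is a fixed-length vector like this, any standard clustering method can be applied to the vectors.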
Example tools include:
You can read the following articles to learn more about analyses I have carried out using composition-based binning techniques.
- Composition-based Clustering of Metagenomic Sequences
- How similar is COVID-19 to previously discovered Coronaviruses
Abundance-based Binning Tools
Different species can be present at different abundances in a metagenomics sample. Some species can have a low abundance and some can have a high abundance. The coverage of a sequence in a metagenomics sample reflects the abundance of the underlying species to which the sequence belongs. Abundance-based binning tools make use of this coverage information to identify sequences of similar abundance.
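The idea of grouping sequences by coverage can be sketched in a few lines of Python. This is a deliberately simple illustration under my own assumptions, not the algorithm of any published tool: coverage is approximated as mapped bases divided by contig length, and contigs are split into groups wherever the sorted log2 coverage values jump by more than a threshold. The names `estimate_coverage`, `group_by_coverage` and `max_log2_gap`, as well as the contig names and values, are hypothetical.

```python
import math


def estimate_coverage(num_mapped_reads, read_length, contig_length):
    """Approximate coverage as total mapped bases divided by contig length."""
    return num_mapped_reads * read_length / contig_length


def group_by_coverage(coverages, max_log2_gap=1.0):
    """Group contigs whose coverages are close on a log2 scale.

    coverages: dict mapping contig name -> coverage (must be > 0).
    A new group starts whenever the log2 coverage jumps by more than
    max_log2_gap relative to the previous contig in sorted order.
    """
    ordered = sorted(coverages, key=coverages.get)
    bins = []
    previous = None
    for name in ordered:
        log_cov = math.log2(coverages[name])
        if previous is None or log_cov - previous > max_log2_gap:
            bins.append([])
        bins[-1].append(name)
        previous = log_cov
    return bins


# Hypothetical coverages: two low-abundance and two high-abundance contigs.
coverages = {
    "contig_1": 9.5,
    "contig_2": 11.0,
    "contig_3": 98.0,
    "contig_4": 105.0,
}
print(group_by_coverage(coverages))
# [['contig_1', 'contig_2'], ['contig_3', 'contig_4']]
```

Working on a log scale means that a jump from 10x to 20x counts the same as a jump from 100x to 200x, which is a common but not universal choice.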
Example tools include:
Composition and Abundance-based Binning
Sometimes species have similar nucleotide composition, and hence sequences originating from those species cannot be distinguished well using composition-based binning tools. In such cases, the abundance of the underlying species can be used to separate the sequences. Hence, composition and abundance-based binning methods have been introduced.
Example tools include:
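One simple way to picture how the two signals can be combined is to append a coverage term to the composition vector and measure distances on the combined features. The Python sketch below is an assumption-laden illustration rather than the method of any specific tool: the function names, the `coverage_weight` parameter, the use of log10 coverage and the toy sequences are all hypothetical choices of mine.

```python
import math
from itertools import product


def composition(sequence, k=2):
    """Normalised k-mer frequency vector of a sequence."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    sequence = sequence.upper()
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        if window in counts:  # ignore windows with ambiguous bases
            counts[window] += 1
    total = sum(counts.values()) or 1
    return [counts[km] / total for km in kmers]


def combined_features(sequence, coverage, k=2, coverage_weight=1.0):
    """Composition vector with a weighted log10 coverage term appended."""
    return composition(sequence, k) + [coverage_weight * math.log10(coverage)]


def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Two hypothetical contigs with identical composition but different coverage.
seq_a, cov_a = "ATGCATGCATGC", 10.0
seq_b, cov_b = "ATGCATGCATGC", 100.0

# Composition alone cannot tell them apart ...
print(euclidean_distance(composition(seq_a), composition(seq_b)))  # 0.0
# ... but the combined features can.
print(euclidean_distance(combined_features(seq_a, cov_a),
                         combined_features(seq_b, cov_b)))
```

How strongly coverage should count relative to composition is a design decision; here it is exposed as `coverage_weight` so that it can be tuned.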
Apart from the above three categories, the research community has proposed new tools which make use of additional information. Some of them are:
- BMC3C: makes use of codon information
- COCACOLA: makes use of linkage information from paired-end reads
- d2S Bin: refines binning results by adjusting sequences based on their dissimilarity
- GraphBin: a tool I have authored, which refines binning results using the connection information of the contigs in the assembly graph
I hope you found this article useful, especially if you are a beginner in the bioinformatics of metagenomics. Feel free to try out these tools and see how they perform. I have provided the research articles relevant to them. Most of the articles link to their software, so you can download the tools and try them out.
Thank you for reading!
This article was originally published in The Computational Biology Magazine on Medium.
You can find the original article at https://medium.com/computational-biology/software-tools-for-reference-free-binning-of-metagenomes-f2d26b27eef2