Software Tools for Reference-free Binning of Metagenomes

We know that there are trillions of microbes in the environment surrounding us, even in our bodies. These microscopic communities have very diverse ecosystems and by studying their composition and behaviour we can learn a lot about them. If you have come across my previous article Metagenomics — Who is there and what are they doing? then you know that binning is an important step in metagenomics analysis.

What is Metagenomics Binning?

Metagenomics binning is the process of clustering sequences into groups that correspond to taxonomic groups such as species, genus or higher levels. There are two main categories of metagenomics binning approaches: reference-based binning and reference-free binning. Reference-based binning methods align sequences to databases of reference genomes and determine the taxonomic group to which each sequence belongs. Reference-free binning methods use only the information in the sequences themselves, without any prior knowledge, and group sequences into unlabelled bins.

In this article, we will focus on reference-free binning methods, which can be divided into three categories:

  1. Composition-based binning
  2. Abundance-based binning
  3. Composition and abundance-based binning

Composition-based Binning Tools

These tools make use of the compositional information of the sequences, which is generally represented by oligonucleotide composition. An oligonucleotide is a contiguous string of a small number of nucleotides. In computational terms, we define oligonucleotides as k-mers (words of size k). Oligonucleotide composition is considered to be conserved within a microbial species and to vary between species. Sequences are represented as oligonucleotide frequency vectors, and different machine learning approaches can be applied to these vectors to group similar sequences together.
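To make the idea of an oligonucleotide frequency vector concrete, here is a minimal Python sketch. The function name `kmer_frequency_vector` and the choice of k are my own assumptions for illustration, and the sketch ignores details that real binning tools often handle, such as merging reverse-complement k-mers.

```python
from collections import Counter
from itertools import product


def kmer_frequency_vector(sequence, k=4):
    """Return normalised k-mer frequencies for a DNA sequence."""
    sequence = sequence.upper()
    # Fixed ordering of all 4^k possible k-mers so that vectors are comparable.
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every window of length k along the sequence.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Windows containing other characters (such as N) are not in all_kmers,
    # so they are left out of the total.
    total = sum(counts[kmer] for kmer in all_kmers)
    if total == 0:
        return [0.0] * len(all_kmers)
    return [counts[kmer] / total for kmer in all_kmers]


# A toy sequence with k=2 gives a vector of 4^2 = 16 frequencies.
vector = kmer_frequency_vector("ACGTACGTAC", k=2)
```

Vectors built this way for every sequence can then be passed to a clustering method of your choice.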

Example tools include:

You can read the following articles to learn more about the analyses I have carried out using composition-based binning techniques:

  1. Composition-based Clustering of Metagenomic Sequences
  2. How similar is COVID-19 to previously discovered Coronaviruses

Abundance-based Binning

Different species can be present at different abundances in a metagenomics sample: some at low abundance and some at high abundance. The coverage of a sequence in a metagenomics sample reflects the abundance of the species to which the sequence belongs. Abundance-based binning tools make use of this coverage information to group sequences of similar abundance.
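As a rough illustration of how coverage can separate sequences, the sketch below estimates the mean coverage of a few contigs and starts a new group whenever the coverage jumps sharply. The function names, the `max_ratio` threshold and the toy read counts are all assumptions of mine, and the grouping rule is a simple heuristic rather than the statistical model any particular tool uses.

```python
def mean_coverage(mapped_read_lengths, contig_length):
    """Average number of reads covering each base of a contig."""
    return sum(mapped_read_lengths) / contig_length


def group_by_coverage(coverages, max_ratio=2.0):
    """Group contig names by coverage (toy heuristic, not a published method).

    Contigs are sorted by coverage, and a new group is started whenever
    the coverage is more than max_ratio times the previous contig's value.
    """
    ordered = sorted(coverages.items(), key=lambda item: item[1])
    bins = []
    previous = None
    for name, cov in ordered:
        if previous is None or cov > previous * max_ratio:
            bins.append([])  # large jump in coverage: start a new group
        bins[-1].append(name)
        previous = cov
    return bins


# Hypothetical contigs with 150 bp reads mapped to them.
coverages = {
    "contig_1": mean_coverage([150] * 20, 1000),
    "contig_2": mean_coverage([150] * 24, 1200),
    "contig_3": mean_coverage([150] * 200, 1000),
}
bins = group_by_coverage(coverages)
```

Here the first two contigs have the same coverage and end up together, while the third, which is ten times more abundant, is placed in its own group.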

Example tools include:

Composition and Abundance-based Binning

Sometimes different species have similar nucleotide composition, so sequences originating from those species cannot be distinguished well using composition-based binning tools. In such cases, the abundance of the underlying species can be used to separate the sequences. Hence, binning methods that combine composition and abundance have been introduced.
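The following Python sketch shows why adding abundance helps. Two hypothetical contigs have identical composition, so their composition vectors are indistinguishable, but appending a log-coverage term pulls them apart. The feature construction and the `coverage_weight` parameter are assumptions made for illustration only; actual tools combine the two signals with their own models.

```python
import math
from collections import Counter
from itertools import product


def composition(sequence, k=2):
    """Normalised k-mer frequency vector of a DNA sequence."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts[m] for m in kmers) or 1  # avoid division by zero
    return [counts[m] / total for m in kmers]


def combined_features(sequence, coverage, k=2, coverage_weight=1.0):
    """Composition vector with a weighted log-coverage term appended."""
    return composition(sequence, k) + [coverage_weight * math.log10(coverage + 1)]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Two hypothetical contigs: same composition, very different coverage.
seq_a, cov_a = "ACGTACGTACGT", 5.0
seq_b, cov_b = "ACGTACGTACGT", 50.0

composition_distance = euclidean(composition(seq_a), composition(seq_b))
combined_distance = euclidean(
    combined_features(seq_a, cov_a), combined_features(seq_b, cov_b)
)
```

With composition alone the distance between the two contigs is zero, whereas the combined features give a clearly non-zero distance that a clustering method can exploit.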

Example tools include:

Other Approaches

Apart from the above three categories, the research community has proposed new tools that make use of additional information. Some of them are:

  • BMC3C: makes use of codon information
  • COCACOLA: makes use of linkage information from paired-end reads
  • d2S Bin: refines binning results by adjusting sequences based on their dissimilarity
  • GraphBin (a tool I have authored): refines binning results using the connectivity information of the contigs in the assembly graph

I hope you found this article useful, especially if you are a beginner in the bioinformatics of metagenomics. Feel free to try out these tools and see how they perform. I have provided the research articles relevant to them, and most of these articles link to their software, so you can download the tools and try them out.

Thank you for reading!

Cheers.


This article was originally published in The Computational Biology Magazine on Medium.

Cover image by Markus Spiske from Pixabay

You can find the original article at https://medium.com/computational-biology/software-tools-for-reference-free-binning-of-metagenomes-f2d26b27eef2
