Microbial 'omics

# Binning without mapping

### a post by A. Murat Eren (Meren)

The anvi’o metagenomic workflow assumes that you have metagenomic short reads. But what if all you have is a bunch of contigs, or a draft genome, or a MAG without any short reads to map to?

This need was brought up by one of our early users, and there has been an open issue to address this at some point. It is now resolved, and the following functionality is available in the master branch.

The key is to create a blank anvi’o profile database to go with the contigs database, and this is what I will demonstrate here. But before I start, let me put this here so you know what version I am using:

## Preparing the FASTA file

For this example I will download a this Bacillus subtilis genome, which was sequenced by Mitsuhiro Itaya et al., and was made available to the community along with their publication with an awesome title:

Combining two genomes in one cell: Stable cloning of the Synechocystis PCC6803 genome in the Bacillus subtilis 168 genome

OK. The reason I will use this genome is simple: it is a chimeric genome, and it will be fun to see that visually.

Simple. As you may know, anvi’o requires simple deflines in FASTA files), and this FASTA file does not conform that requiremenet at all:

So I need to fix that first before I generate the contigs database. For this purpose I use anvi-script-remove-short-contigs-from-fasta with a minimum length of 0 and --simplify flag. This way the script will not remove anything from the FASTA file while converting deflines to simpler ones:

Now it looks much better:

So this is my FASTA file, and you have your’s. We are golden. The next step is to create an anvi’o contigs database.

## Creating the contigs database

Creating a contigs database is not much different for this workflow than any other. However, (1) I will set a shorter split size than default (since I want to see a more highly resolved depiction of the genome), and (2) I will skip mindful splitting, so anvi’o does not lose time with something that is not useful in this context:

Next, I populate the single-copy gene hit tables in this newly generated contigs database:

If you hapenned to read this post, you already know that at this point we can take a look at the distribuiton of bacterial single-copy genes in this contigs database and predict the number of genomes in it:

Which gives me back this PDF image:

This output confirms that there are two genomes in this contigs database. This was not a case of contamination, but let’s assume it was. So you did things, and figured out that your draft genome is likely contamianted.

Learning that your genome (or your draft genome, or your MAG, you name it) is contaminated is always very important (because we know what happens when you never learn it and continue analyzing what you have), but the realization of contamination gives you nothing but a dark afternoon if you don’t know what is the best way to deal with it. When there is mapping data available, anvi’o offers powerful ways to characterize contamination as we recently demonstarted in our publication “Identifying contamination with advanced visualization and analysis practices”.

But none of that works if you don’t short reads to map, and the next step will enable us to bypass that need.

## Creating a blank profile database

To visualize a contigs database, we need an anvi’o profile database. But what does one do if there is no mapping data to create a profile database? Well, they run anvi-profile with --blank-profile parameter!

Thanks to the resulting blank profile database, now I will be able to visualize what this contigs database holds in it by using anvi-interactive. Now I can do binning with real time completion / contamination estimates, store and load states, create collections, and summarize them.

## Visualizing

OK. Everything is ready to simply start the anvi’o interactive interface for a better look:

Which gives me this view in the opening window:

The inner tree shows the organization of each 5 kbp split in this genome based on their tetra-nucleotide frequency profiles. As you can see, there are two distinct branches in the tree. One of these branches probably represents the Synechocystis genome, and the other represents the Bacillus subtilis genome Mitsuhiro Itaya et al. masterfully merged together.

Although this is the simplest look at this data for the sake of clarity, you can indeed use --additional-view, or --additional-layers to add more data into the interface, and do pretty much anything you would do with any other anvi’o project.

As an example, here I will separate those genomes from each other by selecting splits represented in those distinct branches into different bins, and then store my selections into a ‘collection’ to summarize it later (click to load the animated gif):

As you can see, completion / redundancy estimates look much better when they are selected separately.

## Summary

Now I can summarize the collection I stored in the blank profile database:

The result of this summary is a static HTML output. I copied that directory here for your viewing pleasure:

http://anvio.org/data/Bacillus_subtilis_summary/

Mission accomplished!