Microbial 'omics

# How many bacterial genomes do you have in that assembly?

### a post by A. Murat Eren (Meren)

A citable version and a more formal description of this workflow are available in our recent paper “Identifying contamination with advanced visualization and analysis practices: metagenomic approaches for eukaryotic genome assemblies” (see the supplementary material).

Your ability to identify genomes appropriately from an assembly will depend on many factors, such as the number of samples you have to exploit the differential coverage patterns of genomes, or the algorithm you use to tease apart that information. But even before doing any of the real binning, you can get a rough answer to this question by just taking a quick look at your contigs:

How many bacterial genomes should I expect to find in these contigs I assembled from my shotgun data?

Anvi’o has been a very practical way for us to answer this question, and here you will find the workflow so you can try it on your own FASTA file of contigs.

## The workflow

I will describe the three anvi’o steps using Sharon et al.’s infant gut study as an example. The contigs.fa file I will use in this section is the co-assembly of all samples in Sharon et al.’s study.

As a reminder, Sharon et al. identified 12 bacterial draft genomes in this dataset, 8 of which were complete or near-complete. Our re-analysis of this dataset in the anvi’o methods paper yielded similar results (you can find more about our re-analysis here).

### Generating a contigs database

First, you will need to introduce anvi’o to your contigs by generating a contigs database, one of the essential files of the anvi’o workflow.

And this is how this goes on my screen:

### Looking for single-copy genes

Now that we have the contigs database, the next thing we will do is to look for bacterial single-copy genes. Each of these genes is expected to occur once per genome, so the number of times they turn up in an assembly hints at the number of genomes in it. If you have read our article, you already know that anvi’o installs with four single-copy gene collections from four different groups.

All you need to do is to run this command so that the HMM hits for those gene collections are added to the database you just created:

And this is how this step goes on my screen as anvi’o goes through each single-copy gene collection:

So far so good.
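
If you would rather script the two steps so far, here is a minimal Python sketch. The program names and flags (`anvi-gen-contigs-database -f ... -o ...` and `anvi-run-hmms -c ...`) are the ones I would expect from a recent anvi’o release; they are an assumption here, and older or newer versions may name things differently or require additional parameters, so check `--help` on your installation. The helper names `build_commands` and `run_workflow` are made up for this example. By default it only prints the commands rather than running them.

```python
import shutil
import subprocess


def build_commands(fasta="contigs.fa", db="contigs.db"):
    """Return the argv lists for the first two steps of the workflow.

    The program names and flags are assumptions about the installed
    anvi'o version; adjust them if your release differs.
    """
    return [
        # Step 1: create the contigs database from the FASTA file.
        ["anvi-gen-contigs-database", "-f", fasta, "-o", db],
        # Step 2: search the contigs for single-copy genes using HMMs.
        ["anvi-run-hmms", "-c", db],
    ]


def run_workflow(fasta="contigs.fa", db="contigs.db", dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in build_commands(fasta, db):
        print(" ".join(argv))
        if dry_run:
            continue
        if shutil.which(argv[0]) is None:
            raise RuntimeError(f"{argv[0]} not found on PATH; is anvi'o installed?")
        # check=True stops the workflow at the first failing step.
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    run_workflow()  # dry run: only prints the two commands
```

Calling `run_workflow(dry_run=False)` would actually execute both programs, stopping at the first one that fails.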

### Visualizing the results

Now we have a contigs database with everything we need. It is time to visualize the results. For now this is a two-step process. First we need to generate essential input files for the R program that will do the visualization:

This should generate two new files in the directory:

And the final step is to visualize the information reported in those files:

Which should generate a PDF file in the same directory if you have a proper R installation:

Here is the result:

Which suggests that there are 8 to 10 genomes in this assembly!
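
If you are curious how a number like this can fall out of single-copy gene hits, here is a minimal Python sketch of the idea. It assumes the estimate is simply the most common number of hits per gene (the mode), which is one reasonable choice and not necessarily what the anvi’o scripts compute. The gene labels and counts below are made up for illustration and do not come from the infant gut data.

```python
from collections import Counter


def estimate_genome_count(hits_per_gene):
    """Estimate the number of genomes as the mode of per-gene hit counts.

    hits_per_gene maps a single-copy gene name to the number of times
    it was found in the assembly. Ties are broken toward the smaller
    value to stay conservative (an arbitrary choice for this sketch).
    """
    if not hits_per_gene:
        raise ValueError("need at least one gene to estimate from")
    frequency = Counter(hits_per_gene.values())
    top = max(frequency.values())
    return min(n for n, seen in frequency.items() if seen == top)


# Hypothetical hit counts for six single-copy genes in one collection.
example_hits = {
    "gene_A": 9,
    "gene_B": 9,
    "gene_C": 8,
    "gene_D": 10,
    "gene_E": 9,
    "gene_F": 1,  # a gene missed in most genomes, or lost to fragmentation
}

print(estimate_genome_count(example_hits))
```

The mode is less sensitive than the mean to genes that are missed in fragmented contigs or detected more than once, which is why a few outliers like `gene_F` do not drag the estimate down.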

## Other examples

Here is the result of the workflow I described above on an assembly generated from the shotgun sequencing of a cultivar:

| Source | Number of expected bacterial genomes |
|---|---|
| Alneberg et al. | 1 |
| Creevey et al. | 1 |
| Campbell et al. | 1 |
| Dupont et al. | 1 |

In contrast, here is the result for an ocean sample we assembled recently:

| Source | Number of expected bacterial genomes |
|---|---|
| Alneberg et al. | 451 |
| Creevey et al. | 451 |
| Campbell et al. | 431 |
| Dupont et al. | 354 |

Clearly these estimations are mere approximations at best, and should be taken with a grain of salt. However, they do give a rough idea about the complexity of a given metagenome, and what you should expect to get from it.
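
To put a number on how much the four collections disagree, here is a small Python sketch that summarizes the ocean sample estimates from the table above. The function name `summarize` and the choice of "range divided by median" as a measure of spread are mine, just one simple way to look at it.

```python
from statistics import median

# Estimates for the ocean sample, one per single-copy gene collection.
ocean_estimates = {
    "Alneberg et al.": 451,
    "Creevey et al.": 451,
    "Campbell et al.": 431,
    "Dupont et al.": 354,
}


def summarize(estimates):
    """Return the lowest, highest, and median estimate, plus the
    range expressed as a fraction of the median."""
    values = list(estimates.values())
    low, high = min(values), max(values)
    mid = median(values)
    return {
        "low": low,
        "high": high,
        "median": mid,
        "relative_spread": (high - low) / mid,
    }


summary = summarize(ocean_estimates)
print(f"{summary['low']} to {summary['high']} genomes "
      f"(spread is {summary['relative_spread']:.0%} of the median)")
```

Reporting the range rather than a single number is a good reminder that the answer depends on which gene collection you ask.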