The latest version of anvi’o is
v4. See the release notes.
Some parts of this article may require anvi’o
v5 or later. You can learn which version you have on your computer by typing
anvi-profile --version in your terminal.
- A brief background:
- BUSCO: Benchmarking Universal Single-Copy Orthologs
- The BUSCO collection of single-copy core genes for Protists
- Assessing the completion/redundancy of eukaryotic MAGs from 3 different projects
This post describes a collection of 83 single-copy core genes I curated from BUSCO, so anvi’o can identify eukaryotic genome bins in metagenomic binning efforts (stay tuned, viruses, you are not forgotten, and your time will come soon), and estimate their level of completion.
A brief background:
While anvi’o comes with reference single-copy core gene collections to assess completion and redundancy of bacterial and archaeal bins, until
v5, it did not include any collection for viruses and eukaryotes. This is a limiting factor for anvi’o users who are interested in the genome-resolved metagenomics workflow to study domains of life beyond Bacteria and Archaea.
Fortunately, it is easy to incorporate new HMM models into anvi’o to search for particular gene families of interest.
Here I simply used this opportunity to curate a collection of eukaryotic single-copy core genes for anvi’o using the BUSCO database. This led to a promising (yet preliminary) collection of 83 single-copy core genes for eukaryotes “ready-to-be-used” from within anvi’o (which is currently available in the
master reposotory, and will be included in the
v5 version by default):
BUSCO_83_Protista with anvi’o
This collection will be available by default in anvi’o
v5, however, if you would like to use it with anvi’o
v4, you still can do it. For that, first download and unpack the archive:
# download wget http://merenlab.org/files/BUSCO_83_Protista.tar.gz # unpack tar -zxvf BUSCO_83_Protista.tar.gz
If you already have an anvi’o contigs database for your project, you can add this new set of HMMs the following way:
anvi-run-hmms -c CONTIGS.db -H /PATH/TO/BUSCO_83_Protista/
Once this is done, the results will be immediately available to you in real-time completion estimates in the interactive interface, and when you use anvi’o command line programs that estimate completion/redundancy of genome bins:
If you simply have a FASTA file, then you can follow these steps to first create a contigs database, and then estimate completion quickly:
anvi-gen-contigs-database -f FASTA.file -o CONTIGS.db anvi-run-hmms -c CONTIGS.db -H /PATH/TO/BUSCO_83_Protista/ anvi-compute-completeness -c CONTIGS.db --completeness-source BUSCO_83_Protista
BUSCO: Benchmarking Universal Single-Copy Orthologs
Single-copy core gene collections dedicated to different lineages of organisms have been created under the label BUSCO for Benchmarking Universal Single-Copy Orthologs (see the web page). Here are the related articles:
BUSCO applications from quality assessments to gene prediction and phylogenomics. doi:10.1093/molbev/msx319
BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. doi:10.1093/bioinformatics/btv351
The following sections describe the incorporation of a BUSCO collection of single-copy core genes dedicated to Protists into anvi’o and its subsequent testing using reference genomes and eukaryotic MAGs recovered from three distinct genome-resolved metagenomic projects.
Protists include all eukaryotic cells not affiliated with plants, animals or fungi.
The BUSCO collection of single-copy core genes for Protists
The BUSCO collection of single-copy core genes for Protists (named
protists_ensembl) is available from here. It contains HMM models for 215 single-copy core genes.
To test the effectiveness of this collection when used from within anvi’o
v4, I generated contigs databases for four complete eukaryotic genomes: Bathycoccus prasinos, Micromonas pusilla, Ostreococcus tauri and Pseudo-nitzschia multistriata.
I followed the relevant tutorial to create an anvi’o collection, and used
anvi-run-hmms program to test it on the four reference eukaryotic genomes, using different e-value cut-offs.
BUSCO uses optimized length and e-value cut-offs for each HMM model. This flexibility is not available in anvi’o
v4, for which a single e-value cut-off must be defined for all HMM models within a collection.
I detected some of the single-copy core genes dozens of times in all four reference eukaryotic genomes, even after varying the e-value cut-off from
E-100. Most others were detected once, as expected. To avoid increased number of false positives, I selected a fixed e-value of E-25, and removed genes that occurred multiple times or never in each of the four reference genomes, resulting in a total of 83 single-copy core genes. I combined them in a new collection of single-copy core genes dedicated to protists named BUSCO_83_Protista.
As expected, here are the anvi’o completion estimates for my reference genomes:
In the following section, I tested the efficacy of this collection to estimate the completion and redundancy of eukaryotic MAGs characterized from different genome-resolved metagenomic projects.
MAG stands for metagenome-assembled genome.
Assessing the completion/redundancy of eukaryotic MAGs from 3 different projects
I recently spent some time manually binning ~2.6 million scaffolds assembled from metagenomes of the TARA Oceans project that cover the surface of four oceans and two seas. Among the ~1,000 MAGs we characterized in this project, 45 belong to the domain Eukarya. Until now, their completion had not been tested beyond the use of inappropriate bacterial single-copy core genes.
14 TARA Oceans MAGs were affiliated to Micromonas, and we know from cultivation that Micromonas genomes are typically 20 Mbp long. This is a good case study to determine the accuracy of the new collection of single-copy core genes.
Here are the related statistics:
There is a strong correlation between length and completion estimates (R-scare > 0.9), and MAGs with a length close to 20 Mbp have high completion values (with the exception of TARA_ANW_MAG_00088 for which completion seems a little off - but this can be due to assembly or binning problems).
7 TARA Oceans MAGs were affiliated to Ostreococcus (reference genomes are ~13 Mbp), and here are the related statistics:
The correlation between the genomic length and completion estimates for Micromonas and Ostreococcus MAGs characterized from the surface ocean was quite promising:
Overall, average redundancy estimates for the 45 eukaryotic MAGs went from ~20% with the bacterial single-copy core gene collection to 1.4% with
BUSCO_83_Protista. This suggests the eukaryotic MAGs represent individual population genomes with no substantial amounts of contamination. As a result, it seems key information computed by anvi’o (especially differential coverage and sequence composition) is also effective for the characterization of (sometimes near-complete) eukaryotic genomes. Good.
Pseudo-nitzschia from the Southern Ocean
In another project led by Anton Post, we have characterized using genome-resolved metagenomics a 34 Mbp MAG corresponding to Pseudo-nitzschia. This lineage was prevailing in the Ross Sea polynya in the coast of Antarctica during the 2013-2014 austral summer. The story is not yet published, and until now I had never estimated the completion or redundancy of this MAG.
BUSCO_83_Protista estimated this MAG to be 86.8% complete and 2.4% redundant. This made my day…
Candida albicans in a gut metagenome
Sharon et al. (2013) characterized a MAG corresponding to Candida albicans (a fungi) from metagenomes of an infant gut, and we later recapitulated this finding using our own metagenomic binning effort of the same dataset. The Candida albicans MAG we recovered is 13.8 Mbp long. Based on
BUSCO_83_Protista, this MAG is 83.1% complete and 7.2% redundant. A reference Candida albicans genome is 14.7 Mbp long, suggesting that the estimate is not far from what one would expect.
Not too bad given that fungi are not protists. This suggests such collection could work to a certain extent beyond the realm of protists.
With this addition, anvi’o can now estimate the completion and redundancy of microbial genomes from three domains of life.
However, please keep in mind that single-copy core gene collections only provide a rough estimation of the completeness and contamination of MAGs. Visualizing a MAG in the context of recruited reads from metagenomes for instance (the environmental signal) remains essential when it comes to genome-resolved metagenomics, and this approach very often allows the detection and removal of suspicious contigs that contain no single-copy core genes and hence are invisible from the perceptive of completion / redundancy estimates. Thus, these estimates are only one of multiple parameters one can (and maybe should) use to characterize, manually refine, and assess the biological relevance of eukaryotic MAGs.
Of course, these problems are less relevant if one is working with cultivar genomes and single cell genomes .. Or are they? (smiley)
Estimating the completion of eukaryotic genomes reconstructed from metagenomes hasn't been quite straightforward.— A. Murat Eren (Meren) (@merenbey) May 10, 2018
But here is a blog post from @tomodelmont, extending #anvio's capabilities towards that direction 🎉https://t.co/lK43avyi05