The latest version of anvi’o is
2.1.0. See the release notes.
This pangenomic workflow is for
2.1.0 and later versions of anvi’o. You can learn which version you have on your computer by typing
anvi-profile --version in your terminal. If your anvi’o installation is
v2.0.2 or earlier, you can follow the pangenomic workflow described here (and not read this one to not be upset about what you are missing).
We make a lot of typos, sometimes parameters or versions slightly change, and we fail to keep tutorials up-to-date all the time. If you found a mistake on this page, or if you would like to change something in it, you can directly edit its source code by clicking “Edit this file” icon on the right top corner (which you will see if you have logged in to GitHub), and send us a ‘pull request’. We will be very thankful.
With the anvi’o pangenomic workflow you can,
- Identify protein clusters,
- Visualize the distribution of protein clusters across your genomes,
- Organize genomes based on shared protein clusters,
- Interactively (or programmatically) bin protein clusters into collections, and summarize your bins,
- Annotate your genes, and inspect amino acid alignments within your protein clusters,
- Extend your pangenome with contextual information about your genomes or protein clusters,
- Combine metagenome-assembled genomes from your anvi’o projects directly with cultivar genomes from other sources.
You can use anvi’o for pangenomic workflow even if you haven’t done any metagenomic work with anvi’o. All you need is an anvi’o installation, and a FASTA file for each of your genomes.
The anvi’o pangenomic workflow is composed of three main steps:
- Generating an anvi’o genomes storage using the program
- Generating an anvi’o pan database using the program
anvi-pan-genome(which requires a genomes storage)
- Displaying results using the program
anvi-display-pan(which requires a genomes storage and an anvi’o pan database).
You can then use the interactive interface to bin your protein clusters into collections, or use the program
anvi-import-collection to import bins for your protein clusters, and finally you can use the program
anvi-summarize to create a static HTML summary of your results. Easy peasy!
Following sections will detail each step, and finally an example run will follow, but let’s first make sure you have all the dependencies installed and your installation is good to go.
Even if you have a complete installation of anvi’o, the pangenomic workflow uses some additional software that may not be installed on your system.
- DIAMOND or NCBI’s blastp for search.
- MCL for clustering.
- muscle for alignment. Which is optional. If you don’t have it, amino acid sequences within your protein clusters will not be aligned, and will look ugly.
If your system is properly setup, this command should run without any errors:
Generating an anvi’o genomes storage
An anvi’o genomes storage is a special database that stores information about genomes. A genomes storage can be generated only from external genomes, only from internal genomes, or it can contain both types. Before we go any further, here are some definitions to clarify things:
An external genome is anything you have in a FASTA file format (i.e., a genome you downloaded from NCBI, or obtained through any other way).
An internal genome is any genome bin you stored in an anvi’o collection at the end of your metagenomic analysis (if you are not familiar with the anvi’o metagenomic workflow please take a look at this post).
Converting FASTA files into anvi’o contigs databases: Working with internal genomes is quite straightforward since you already have an anvi’o contigs and an anvi’o profile database for them. But if all you have is a bunch of FASTA files, this workflow will require you to convert each of them into an anvi’o contigs database. There is a lot of information about how to create an anvi’o contigs database here, but if you feel lazy, you can also use the script
anvi-script-FASTA-to-contigs-db, which takes a single parameter: the FASTA file path. Power users should consider taking a look at the code, and create their own batch script with improvements on those lines based on their needs (for instance, increasing the number of threads when running HMMs, etc). Also, you may want to run
anvi-run-ncbi-cogs on your contigs database to annotate your genes.
You can create a new anvi’o genomes storage using the program
anvi-gen-genomes-storage, which will require you to provide descriptions of genomes to be included in this storage. File formats for external genome and internal genome descriptions differ slightly. For instance, this is an example
and this is an example file for
For names in the first column please use only letters, digits, and the underscore character.
Thanks to these two files, genome bins in anvi’o collections and cultivar genomes could be combined and analyzed together seamlessly. The most essential need for the coherence of the genomes storage is to make sure each internal and external genome is generated identically with respect to how genes were called, how functions were assigned, etc. Anvi’o will check for some things, but it can’t stop you from doing terrible mistakes. For instance, if the version of Prodigal is not identical across all contigs databases, the genes described in genomes storage will not necessarily be comparable. If you are not sure about something, send us an e-mail, and we will be happy to try to clarify.
Running a pangenome analysis
Once you have your genomes storage ready, you can use the program
anvi-pan-genome to run the actual pangenomic analysis. This is the simplest form of this command:
When you run it,
Will use all genomes in the genomes storage. If you would like to focus on a subset, you can use the parameter
Will use only a single core by default. Depending on the number of genomes you are analyzing, this process can be very time consuming, hence you should consider increasing the number of threads to be used via the parameter
Will use DIAMOND (Buchnfink et al., 2015) in ‘fast’ mode (you can ask DIAMOND to be ‘sensitive’ by using the flag
--sensitive) by default to calculate similarities of each protein in every genome against every other protein (which clearly requires you to have DIAMOND installed). Alternatively you could use the flag
--use-ncbi-blastto use NCBI’s
blastpfor protein search.
A note from Meren: I strongly suggest you to do your analysis with the
--use-ncbi-blast flag. Yes, DIAMOND is very fast, and it may take 8-9 hours to analyze 20-30 genomes with
blastp. But dramatic increases in speed rarely comes without major trade-offs in sensitivity and accuracy, and some of my observations tell me that DIAMOND is not one of those rare instances. This clearly deserves a more elaborate discussion, and maybe I will have a chance to write it later, but for now take this as a friendly reminder.
Will use every gene call, whether they are complete or not. Although this is not a big concern for complete genomes, metagenome-assembled genomes (MAGs) will have many incomplete gene calls at the end and at the beginning of contigs. Our experiments so far suggest that they do not cause major issues, but if you want to exclude them, you can use the
Will use the maxbit heuristic that was originally implemented in ITEP (Benedict et al, 2014) to eliminate weak matches between two protein sequences. You see, the pangenomic workflow first identifies proteins that are somewhat similar by doing similarity searches, and then resolves clusters based on those similarities. In this scenario, weak similarities can connect protein clusters that should not be connected. Here is a definition of maxbit: If you have two protein sequences
B, the maxbit is defined as
BITSCORE(A, B) / MIN(BITSCORE(A, A), BITSCORE(B, B)). Therefore, the maxbit score of two sequences goes to
1if they are very similar over their entire length, and it goes to
0if they match over a very short stretch compared to their entire length (regardless of how similar they are at the aligned region), or their sequence identity is low. The default maxbit is
0.5, but you can change it using the parameter
Will use the MCL algorithm (van Dongen and Abreu-Goodger, 2012) to identify clusters in protein similarity search results. We use
2as the MCL inflation parameter by default. This parameter defines the sensitivity of the algorithm during the identification of the protein clusters. More sensitivity means more clusters, but of course more clusters does not mean better inference of evolutionary relationships. More information on this parameter and it’s effect on cluster granularity is here http://micans.org/mcl/man/mclfaq.html#faq7.2, but clearly, we, the metagenomics people will need to talk much more about this. So far in the Meren Lab we have been using
2if we are comparing many distantly related genomes (i.e., genomes classify into different families or farther), and
10if we are comparing very closely related genomes (i.e., ‘strains’ of the same ‘species’ (whatever they mean to you)). You can change it using the parameter
--mcl-inflation. Please experiment yourself, and consider reporting!
Will utilize every protein cluster, even if they occur in only one genome in your analyses. Of course, the importance of singletons or doubletons will depend on the number of genomes in your analysis, or the question you have in mind. However, if you would like to define a cut-off, you can use the parameter
--min-occurrence, which is 1, by default. Increasing this cut-off will improve the clustering speed and make the visualization much more manageable, but again, this parameter should be considered in the context of each study.
wardlinkage to organize protein clusters and genomes. You can change those using
Will try to utilize previous search results if there is already a directory. This way you can play with the
--min-occurrenceparameters without having to re-do the protein search. However, if you have changed something, either you need to remove the output directory, or use the
--overwrite-output-destinationsflag to redo the search.
You need another parameter? Well, of course you do! Let us know, and let’s have a discussion. We love parameters.
Once you are done, a new directory with your analysis results will appear. Among other files, it will contain an anvi’o pan database, and an anvi’o samples database. If you would like to update the samples database, you can edit the
*-samples-order.txt files and re-create an updated samples database using the program
anvi-gen-samples-database as explained here.
Displaying and exploring the pan genome
Once your analysis is done, you will use the program
anvi-display-pan to display your results.
This is the simplest form of this command:
anvi-display-pan is very similar to the program
anvi-interactive, and the interface that will welcome you is nothing but the standard anvi’o interactive interface with slight adjustments for pangenomic analyses. Of course
anvi-display-pan will allow you to set the IP address and port number to serve, add additional view data, additional layers, and/or additional trees, and more. Please familiarize yourself with it by running
anvi-display-pan -h in your terminal.
Here is an example to show how the pangenomic analysis of 31 Prochlorococcus genomes we just recently finished looks like when we run
anvi-display-pan the very first time on it:
Looks ugly. But do not despair. For instance, to improve things a little, you may like to organize your genomes based on protein clustering results by selecting the ‘PC frequencies’ tree from the Samples Tab > Sample Order menu:
This is what happens when you draw it again (note the tree that appears on the right):
Looks more meaningful .. but still ugly.
Well, this is exactly where you need to start using the interface more efficiently. For instance, this is the same pangenome after we are done with it:
No excuses for bad looking pangenomes.
Every protein cluster in your analysis will contain one or more amino acid sequences that originate from one or more genomes. While there will likely be a ‘core’ section, in which all protein cluster will appear in every genome, it is also common to find protein clusters that contain more than one gene call from a single genome (i.e., all multi-copy genes in a given genome will end up in the same protein cluster). Sooner or later you will start getting curious about some of the protein clusters, and want to learn more about them. Luckily you can right-click on to any protein cluster, and you would see this menu (or maybe even more depending on when you are reading this article):
For instance, if you click ‘Inspect PC’, you will see all the amino acid sequences from each genome that went into that protein cluster (with the same order and background colors of genomes as they are arranged in the main display):
It is not only fun but also very educational to go through protein clusters one by one. Fine. But what do you do if you want to make sense of large selections?
As you already know, the anvi’o interactive interface allows you to make selections from the tree. So you can select groups of protein clusters into bins (don’t mind the numbers on the left panel, there clearly is a bug, and will be fixed in your version):
You can create multiple bins with multiple selections, and even give them meaningful names if you fancy:
Then you can save your selection as a “collection”, and then run the
anvi-summarize to generate a summary. Which would be as simple as this:
If you open the
index.html file in the summary directory, you will see an output with some essential information about the analysis:
As well as a TAB-delimited file that describes every protein cluster:
The structure of this file will look like this, and will give you an opportunity to investigate your protein clusters in a more detailed manner (you may need to scroll towards right to see more of the table):
|1||PC_00001990||MIT9303||30298||COG1199||Rad3-related DNA helicase||L||Replication, recombination and repair||MLEARSHQQLKHLLLQNSSP(…)|
|91||PC_00001434||Core_HL||MIT9322||42504||COG3769||Predicted mannosyl-3-phosphoglycerate phosphatase||G||Carbohydrate transport and metabolism||MIENSSIWVVSDVDGTLMDH(…)|
|257||PC_00000645||Core_all||MIT9322||42217||COG1185||Polyribonucleotide nucleotidyltransferase||J||Translation, ribosomal structure and biogenesis||MEGQNKSITFDGREIRLTTG(…)|
|2019||PC_00001754||Core_LL||NATL2A||49129||COG0127||Inosine/xanthosine triphosphate pyrophosphatase||F||Nucleotide transport and metabolism||MDNVPLVIASGNKGKIGEFK(…)|
|5046||PC_00001653||PAC1||52600||COG1087||UDP-glucose 4-epimerase||M||Cell wall/membrane/envelope biogenesis||MRVLLTGGAGFIGSHIALLL(…)|
|5047||PC_00001653||LG||7488||COG1087||UDP-glucose 4-epimerase||M||Cell wall/membrane/envelope biogenesis||MNRILVTGGAGFIGSHTCIT(…)|
|5048||PC_00001653||SS35||56661||COG1087||UDP-glucose 4-epimerase||M||Cell wall/membrane/envelope biogenesis||MNRILVTGGAGFIGSHTCIT(…)|
|5049||PC_00001653||NATL2A||49604||COG1087||UDP-glucose 4-epimerase||M||Cell wall/membrane/envelope biogenesis||MRVLLTGGSGFIGSHVALLL(…)|
The summary directory is a self-contained, static HTML output you can send to your colleagues, or add to your publications as a supplementary dataset.
We are realizing that there is a lot to explore in this front, and excited to work with the community. Please let us know if you have any suggestions using our discussion group, the comments section below. You can also keep an eye on our public code repository for new releases.