Microbial 'omics

# An anvi'o workflow for microbial pangenomics

### Table of Contents

The latest version of anvi’o is v3. See the release notes.

This pangenomic workflow is for 2.1.0 and later versions of anvi’o. You can learn which version you have on your computer by typing anvi-profile --version in your terminal. If your anvi’o installation is v2.0.2 or earlier, you can follow the pangenomic workflow described here (and not read this one to not be upset about what you are missing).

With changed all instances of the term ‘protein cluster’ in our pangenomic workflow with ‘gene cluster’ in anvi’o v4. The reason behind this change is detailed here.

We make a lot of typos, sometimes parameters or versions slightly change, and we fail to keep tutorials up-to-date all the time. If you found a mistake on this page, or if you would like to change something in it, you can directly edit its source code by clicking “Edit this file” icon on the right top corner (which you will see if you have logged in to GitHub), and send us a ‘pull request’. We will be very thankful.

With the anvi’o pangenomic workflow you can,

• Identify gene clusters among your genomes,
• Visualize the distribution of gene clusters across your genomes,
• Estimate relationships between your genomes based on shared gene clusters,
• Interactively (or programmatically) bin gene clusters into collections, and summarize your bins,
• Annotate your genes, and inspect amino acid alignments within your gene clusters,
• Extend your pangenome with contextual information about your genomes or gene clusters,
• Combine metagenome-assembled genomes from your anvi’o projects directly with cultivar genomes from other sources.

You can use anvi’o for pangenomic workflow even if you haven’t done any metagenomic work with anvi’o. All you need is an anvi’o installation, and a FASTA file for each of your genomes.

## Introduction

The anvi’o pangenomic workflow is composed of three main steps:

• Generating an anvi’o genomes storage using the program anvi-gen-genomes-storage.
• Generating an anvi’o pan database using the program anvi-pan-genome (which requires a genomes storage)
• Displaying results using the program anvi-display-pan (which requires a genomes storage and an anvi’o pan database).

You can then use the interactive interface to bin your gene clusters into collections, or use the program anvi-import-collection to import bins for your gene clusters, and finally you can use the program anvi-summarize to create a static HTML summary of your results. Easy peasy!

Following sections will detail each step, and finally an example run will follow, but let’s first make sure you have all the dependencies installed and your installation is good to go.

### Dependencies

Even if you have a complete installation of anvi’o, the pangenomic workflow uses some additional software that may not be installed on your system.

• DIAMOND or NCBI’s blastp for search.
• MCL for clustering.
• muscle for alignment. Which is optional. If you don’t have it, amino acid sequences within your gene clusters will not be aligned, and will look ugly.

If your system is properly setup, this command should run without any errors:

$anvi-self-test --suite pangenomics  ## Generating an anvi’o genomes storage An anvi’o genomes storage is a special database that stores information about genomes. A genomes storage can be generated only from external genomes, only from internal genomes, or it can contain both types. Before we go any further, here are some definitions to clarify things: • An external genome is anything you have in a FASTA file format (i.e., a genome you downloaded from NCBI, or obtained through any other way). • An internal genome is any genome bin you stored in an anvi’o collection at the end of your metagenomic analysis (if you are not familiar with the anvi’o metagenomic workflow please take a look at this post). Converting FASTA files into anvi’o contigs databases: Working with internal genomes is quite straightforward since you already have an anvi’o contigs and an anvi’o profile database for them. But if all you have is a bunch of FASTA files, this workflow will require you to convert each of them into an anvi’o contigs database. There is a lot of information about how to create an anvi’o contigs database here, but if you feel lazy, you can also use the script anvi-script-FASTA-to-contigs-db, which takes a single parameter: the FASTA file path. Power users should consider taking a look at the code, and create their own batch script with improvements on those lines based on their needs (for instance, increasing the number of threads when running HMMs, etc). Also, you may want to run anvi-run-ncbi-cogs on your contigs database to annotate your genes. You can create a new anvi’o genomes storage using the program anvi-gen-genomes-storage, which will require you to provide descriptions of genomes to be included in this storage. File formats for external genome and internal genome descriptions differ slightly. For instance, this is an example --external-genomes file: name contigs_db_path Name_01 /path/to/contigs-01.db Name_02 /path/to/contigs-02.db Name_03 /path/to/contigs-03.db (…) (…) and this is an example file for --internal-genomes: name bin_id collection_id profile_db_path contigs_db_path Name_01 Bin_id_01 Collection_A /path/to/profile.db /path/to/contigs.db Name_02 Bin_id_02 Collection_A /path/to/profile.db /path/to/contigs.db Name_03 Bin_id_03 Collection_B /path/to/another_profile.db /path/to/another/contigs.db (…) (…) (…) (…) (…) For names in the first column please use only letters, digits, and the underscore character. Thanks to these two files, genome bins in anvi’o collections and cultivar genomes could be combined and analyzed together seamlessly. The most essential need for the coherence of the genomes storage is to make sure each internal and external genome is generated identically with respect to how genes were called, how functions were assigned, etc. Anvi’o will check for some things, but it can’t stop you from doing terrible mistakes. For instance, if the version of Prodigal is not identical across all contigs databases, the genes described in genomes storage will not necessarily be comparable. If you are not sure about something, send us an e-mail, and we will be happy to try to clarify. ## Running a pangenome analysis Once you have your genomes storage ready, you can use the program anvi-pan-genome to run the actual pangenomic analysis. This is the simplest form of this command: $ anvi-pan-genome -g MY-GENOMES.h5 -n PROJECT_NAME


When you run it, anvi-pan-genome,

• Will use all genomes in the genomes storage. If you would like to focus on a subset, you can use the parameter --genome-names.

• Will use only a single core by default. Depending on the number of genomes you are analyzing, this process can be very time consuming, hence you should consider increasing the number of threads to be used via the parameter --num-threads.

• Will use DIAMOND (Buchnfink et al., 2015) in ‘fast’ mode (you can ask DIAMOND to be ‘sensitive’ by using the flag --sensitive) by default to calculate similarities of each protein in every genome against every other protein (which clearly requires you to have DIAMOND installed). Alternatively you could use the flag --use-ncbi-blast to use NCBI’s blastp for protein search.

A note from Meren: I strongly suggest you to do your analysis with the --use-ncbi-blast flag. Yes, DIAMOND is very fast, and it may take 8-9 hours to analyze 20-30 genomes with blastp. But dramatic increases in speed rarely comes without major trade-offs in sensitivity and accuracy, and some of my observations tell me that DIAMOND is not one of those rare instances. This clearly deserves a more elaborate discussion, and maybe I will have a chance to write it later, but for now take this as a friendly reminder.

• Will use every gene call, whether they are complete or not. Although this is not a big concern for complete genomes, metagenome-assembled genomes (MAGs) will have many incomplete gene calls at the end and at the beginning of contigs. Our experiments so far suggest that they do not cause major issues, but if you want to exclude them, you can use the --exclude-partial-gene-calls flag.

• Will use the minbit heuristic that was originally implemented in ITEP (Benedict et al, 2014) to eliminate weak matches between two protein sequences. You see, the pangenomic workflow first identifies proteins that are somewhat similar by doing similarity searches, and then resolves clusters based on those similarities. In this scenario, weak similarities can connect gene clusters that should not be connected. Although the network partitioning algorithm can recover from these weak connections, it is always better to eliminate as many noise as possible at every step. So the minbit heuristic provides a mean to set a to eliminate weak matches between two protein sequences. We learned it from ITEP (Benedict MN et al, doi:10.1186/1471-2164-15-8), which is another comprehensive analysis workflow for pangenomes, and decided to use it because it makes a lot of sense. Briefly, If you have two protein sequences, A and B, the minbit is defined as BITSCORE(A, B) / MIN(BITSCORE(A, A), BITSCORE(B, B)). So the minbit score between two sequences goes to 1.0 if they are very similar over the entire length of the ‘shorter’ protein sequence, and goes to 0.0 if (1) they match over a very short stretch compared even to the length of the shorter protein sequence or (2) the match betwen sequence identity is low. The default minbit is 0.5, but you can change it using the parameter --minbit.

This parameter has wrongly named as maxbit in anvi’o v2.4.0 and earlier versions. Please take a look at the relevant issue if you are using an anvi’o version that is impacted by this typo.

• Will use the MCL algorithm (van Dongen and Abreu-Goodger, 2012) to identify clusters in protein similarity search results. We use 2 as the MCL inflation parameter by default. This parameter defines the sensitivity of the algorithm during the identification of the gene clusters. More sensitivity means more clusters, but of course more clusters does not mean better inference of evolutionary relationships. More information on this parameter and it’s effect on cluster granularity is here http://micans.org/mcl/man/mclfaq.html#faq7.2, but clearly, we, the metagenomics people will need to talk much more about this. So far in the Meren Lab we have been using 2 if we are comparing many distantly related genomes (i.e., genomes classify into different families or farther), and 10 if we are comparing very closely related genomes (i.e., ‘strains’ of the same ‘species’ (whatever they mean to you)). You can change it using the parameter --mcl-inflation. Please experiment yourself, and consider reporting!

• Will utilize every gene cluster, even if they occur in only one genome in your analyses. Of course, the importance of singletons or doubletons will depend on the number of genomes in your analysis, or the question you have in mind. However, if you would like to define a cut-off, you can use the parameter --min-occurrence, which is 1, by default. Increasing this cut-off will improve the clustering speed and make the visualization much more manageable, but again, this parameter should be considered in the context of each study.

• Will use euclidean distance and ward linkage to organize gene clusters and genomes. You can change those using --distance and --linkage parameters.

• Will try to utilize previous search results if there is already a directory. This way you can play with the --minbit, --mcl-inflation, or --min-occurrence parameters without having to re-do the protein search. However, if you have changed something, either you need to remove the output directory, or use the --overwrite-output-destinations flag to redo the search.

You need another parameter? Well, of course you do! Let us know, and let’s have a discussion. We love parameters.

Once you are done, a new directory with your analysis results will appear. Among other files, it will contain an anvi’o pan database, and an anvi’o samples database. If you would like to update the samples database, you can edit the *-samples-information.txt or *-samples-order.txt files and re-create an updated samples database using the program anvi-gen-samples-database as explained here.

## Displaying the pan genome

Once your analysis is done, you will use the program anvi-display-pan to display your results.

This is the simplest form of this command:

$anvi-display-pan -p PROJECT-PAN.db -s PROJECT-PAN-SAMPLES.db -g PROJECT-PAN-GENOMES.h5  The program anvi-display-pan is very similar to the program anvi-interactive, and the interface that will welcome you is nothing but the standard anvi’o interactive interface with slight adjustments for pangenomic analyses. Of course anvi-display-pan will allow you to set the IP address and port number to serve, add additional view data, additional layers, and/or additional trees, and more. Please familiarize yourself with it by running anvi-display-pan -h in your terminal. Here is an example to show how the pangenomic analysis of 31 Prochlorococcus genomes we just recently finished looks like when we run anvi-display-pan the very first time on it: Looks ugly. But do not despair. For instance, to improve things a little, you may like to organize your genomes based on gene clustering results by selecting the ‘gene_cluster frequencies’ tree from the Samples Tab > Sample Order menu: This is what happens when you draw it again (note the tree that appears on the right): Looks more meaningful .. but still ugly. Well, this is exactly where you need to start using the interface more efficiently. For instance, this is the same pangenome after we are done with it: No excuses for bad looking pangenomes. ## Inspecting gene clusters Every gene cluster in your analysis will contain one or more amino acid sequence that originate from one or more genomes. While there will likely be a ‘core’ section, in which all gene cluster will appear in every genome, it is also common to find gene clusters that contain more than one gene call from a single genome (i.e., all multi-copy genes in a given genome will end up in the same gene cluster). Sooner or later you will start getting curious about some of the gene clusters, and want to learn more about them. Luckily you can right-click on to any gene cluster, and you would see this menu (or maybe even more depending on when you are reading this article): For instance, if you click ‘Inspect gene cluster’, you will see all the amino acid sequences from each genome that went into that gene cluster (with the same order and background colors of genomes as they are arranged in the main display): It is not only fun but also very educational to go through gene clusters one by one. Fine. But what do you do if you want to make sense of large selections? As you already know, the anvi’o interactive interface allows you to make selections from the tree. So you can select groups of gene clusters into bins (don’t mind the numbers on the left panel, there clearly is a bug, and will be fixed in your version): You can create multiple bins with multiple selections, and even give them meaningful names if you fancy: You can store your selections as a collection in the pan database. ## Advanced access to gene clusters This is a new feature that is included for the first time in anvi’o v4, and this part of the tutorial will need to be improved with good examples, but for now, please take a look at the ADVANCED FILTERS section of the output: anvi-get-sequences-for-gene-clusters --help  Here is a bit more information on this. ## Summarizing an anvi’o pan genome When you store your selections of gene clusters as a collection, anvi’o will allow you to summarize these results. Even if you want to simply summarize everything in the pan gneome without making any selections in the interface, you still will need a collection in the pan database. But luckily, you can use the program anvi-script-add-default-collection to add a default collection that contains every gene cluster. The summary step gives you two important things: a static HTML web page you can compress and share with your colleagues or add into your publication as a supplementary data file, and a copmrehensive TAB-delimited file in the output directory that describes every gene cluster. You can summarize a collection using the program anvi-summarize, and a generic form of this command looks like this: $ anvi-summarize -p PROJECT-PAN.db \
-g PROJECT-PAN-GENOMES.h5 \
-C COLLECTION_NAME \
-o PROJECT-SUMMARY


If you open the index.html file in the summary directory, you will see an output with some essential information about the analysis:

As well as the TAB-delimited file for gene clusters:

The structure of this file will look like this, and will give you an opportunity to investigate your gene clusters in a more detailed manner (you may need to scroll towards right to see more of the table):

unique_id protein_cluster_id bin_name genome_name gene_callers_id COG_FUNCTION_ACC COG_FUNCTION COG_CATEGORY_ACC COG_CATEGORY aa_sequence
1 PC_00001990   MIT9303 30298 COG1199 Rad3-related DNA helicase L Replication, recombination and repair MLEARSHQQLKHLLLQNSSP(…)
(…) (…) (…) (…) (…) (…) (…) (…) (…) (…)
91 PC_00001434 Core_HL MIT9322 42504 COG3769 Predicted mannosyl-3-phosphoglycerate phosphatase G Carbohydrate transport and metabolism MIENSSIWVVSDVDGTLMDH(…)
(…) (…) (…) (…) (…) (…) (…) (…) (…) (…)
257 PC_00000645 Core_all MIT9322 42217 COG1185 Polyribonucleotide nucleotidyltransferase J Translation, ribosomal structure and biogenesis MEGQNKSITFDGREIRLTTG(…)
(…) (…) (…) (…) (…) (…) (…) (…) (…) (…)
2019 PC_00001754 Core_LL NATL2A 49129 COG0127 Inosine/xanthosine triphosphate pyrophosphatase F Nucleotide transport and metabolism MDNVPLVIASGNKGKIGEFK(…)
(…) (…) (…) (…) (…) (…) (…) (…) (…) (…)
5046 PC_00001653   PAC1 52600 COG1087 UDP-glucose 4-epimerase M Cell wall/membrane/envelope biogenesis MRVLLTGGAGFIGSHIALLL(…)
5047 PC_00001653   LG 7488 COG1087 UDP-glucose 4-epimerase M Cell wall/membrane/envelope biogenesis MNRILVTGGAGFIGSHTCIT(…)
5048 PC_00001653   SS35 56661 COG1087 UDP-glucose 4-epimerase M Cell wall/membrane/envelope biogenesis MNRILVTGGAGFIGSHTCIT(…)
5049 PC_00001653   NATL2A 49604 COG1087 UDP-glucose 4-epimerase M Cell wall/membrane/envelope biogenesis MRVLLTGGSGFIGSHVALLL(…)
(…) (…) (…) (…) (…) (…) (…) (…) (…) (…)

This file will link each gene from each genome with every selection you’ve made and named through the interface or through the program anvi-import-collection, and will also give you access to the amino acid seqeunce and function of each gene.

## Final words

We are realizing that there is a lot to explore in this front, and excited to work with the community. Please let us know if you have any suggestions using our discussion group, the comments section below. You can also keep an eye on our public code repository for new releases.