Microbial 'omics


a post by A. Murat Eren (Meren)

Alon Shaiber

Mahmoud Yousef

and Özcan C. Esen

Table of Contents

The latest version of anvi’o is v5.2. See the release notes.

This pangenomic workflow is for anvi’o version 2.1.0 and later versions of anvi’o. You can identify which version you have on your computer by typing anvi-profile --version in your terminal. If your anvi’o installation is v2.0.2 or earlier, you can follow the pangenomic workflow described here (so as not to read this one and be upset about what you are missing).

We have changed all instances of the term ‘protein cluster’ in our pangenomic workflow to ‘gene cluster’ in anvi’o v4. The reason behind this change is detailed here.

We make a lot of typos. Sometimes parameters change slightly across versions, and we often fail to keep tutorials up-to-date at all times. If you find a mistake on this page, or if you would like to change something within it, you can directly edit its source code by clicking the “Edit this file” icon at the top-right corner (which you will see if you have logged in to GitHub). We will be very thankful for your contribution 😇

With the anvi’o pangenomic workflow you can,

  • Identify gene clusters among your genomes,
  • Combine metagenome-assembled genomes from your anvi’o projects directly with cultivar genomes from other sources.
  • Visualize the distribution of gene clusters across your genomes,
  • Estimate relationships between your genomes based on gene clusters,
  • Interactively (or programmatically) bin gene clusters into collections, and summarize your bins,
  • Perform phylogenomic analyses on-the-fly given a set of gene clusters,
  • Annotate your genes, and inspect amino acid alignments within your gene clusters,
  • Extend your pangenome with contextual information about your genomes or gene clusters,
  • Quantify geometric and biochemical homogeneity of your gene clusters,
  • Perform functional enrichment analyses on groups of genomes in your pangenome,
  • Compute and visualize average nucleotide identity scores between you genomes.

Citation: If you use the anvi’o pangenomic workflow for your research, please consider citing this work (which details the pangeomic workflows in anvi’o) in addition to this one (which introduces the platform). Thank you for your consideration.

You can use anvi’o for pangenomic workflow even if you haven’t done any metagenomic work with anvi’o. All you need is an anvi’o installation, and a FASTA file for each of your genomes.

Introduction

The anvi’o pangenomic workflow is composed of three main steps:

  • Generate an anvi’o genomes storage using the program anvi-gen-genomes-storage.

  • Generate an anvi’o pan database using the program anvi-pan-genome (which requires a genomes storage)

  • Display results using the program anvi-display-pan (which requires a genomes storage and an anvi’o pan database).

You can then use the interactive interface to bin your gene clusters into collections, or use the program anvi-import-collection to import bins for your gene clusters, and finally you can use the program anvi-summarize to create a static HTML summary of your results. Easy peasy!

The following sections will detail each step, and culminating in an example run will follow, but let’s first make sure you have all the required dependencies installed and your installation is good to go.

Dependencies

Even if you have a complete installation of anvi’o, the pangenomic workflow uses some additional software that may not be installed on your system.

  • DIAMOND or NCBI’s blastp for search.
  • MCL for clustering.
  • muscle for alignment. Which is optional. If you don’t have it, anvi’o will not automatically align amino acid sequences within your gene clusters, and things will look ugly :(

If your system is properly setup, this command should run without any errors:

$ anvi-self-test --suite pangenomics

Generating an anvi’o genomes storage

An anvi’o genomes storage is a special database that stores information about genomes. A genomes storage can be generated only from external genomes, only from internal genomes, or it can contain both types. Before we go any further, here are some definitions to clarify things:

  • An external genome is anything you have in a FASTA file format (i.e., a genome you downloaded from NCBI, or obtained through any other way).

  • An internal genome is any genome bin you stored in an anvi’o collection at the end of your metagenomic analysis (if you are not familiar with the anvi’o metagenomic workflow please take a look at this post).

Converting FASTA files into anvi’o contigs databases: Working with internal genomes is quite straightforward since you already have an anvi’o contigs and an anvi’o profile database for them. But if all you have is a bunch of FASTA files, this workflow will require you to convert each of them into an anvi’o contigs database. There is a lot of information about how to create an anvi’o contigs database here, but if you feel lazy, you can also use the script anvi-script-FASTA-to-contigs-db, which requires a single parameter: the FASTA file path. Power users should consider taking a look at the code, and create their own batch script with improvements on those lines based on their needs (for instance, increasing the number of threads when running HMMs, etc). Also, you may want to run anvi-run-ncbi-cogs on your contigs database to annotate your genes.

You can create a new anvi’o genomes storage using the program anvi-gen-genomes-storage, which will require you to provide descriptions of genomes to be included in this storage. File formats for external genome and internal genome descriptions differ slightly. For instance, this is an example --external-genomes file:

name contigs_db_path
Name_01 /path/to/contigs-01.db
Name_02 /path/to/contigs-02.db
Name_03 /path/to/contigs-03.db
(…) (…)

and this is an example file for --internal-genomes:

name bin_id collection_id profile_db_path contigs_db_path
Name_01 Bin_id_01 Collection_A /path/to/profile.db /path/to/contigs.db
Name_02 Bin_id_02 Collection_A /path/to/profile.db /path/to/contigs.db
Name_03 Bin_id_03 Collection_B /path/to/another_profile.db /path/to/another/contigs.db
(…) (…) (…) (…) (…)

For names in the first column please use only letters, digits, and the underscore character.

Thanks to these two files, genome bins in anvi’o collections and cultivar genomes can be combined and analyzed together seamlessly. The most essential need for the coherence within the genomes storage is to make sure each internal and external genome is generated identically with respect to how genes were called, how functions were assigned, etc. Anvi’o will check for some things, but it can’t stop you from doing terrible mistakes. For instance, if the gene caller that identified open reading frames is not identical across all contigs databases, the genes described in genomes storage will not necessarily be comparable. If you are not sure about something, send us an e-mail, and we will be happy to try to clarify.

A real example for hardcore tutorial readers: A Prochlorococcus pangenome

For the sake of reproducibility, the rest of the tutorial will follow a real example.

We will simply create a pangenome of 31 Prochlorococcus isolate genomes that we used in this study.

If you wish to follow the tutorial on your computer, you can download the Prochlorococcus data pack (doi:10.6084/m9.figshare.6318833) which contains anvi’o contigs databases for these isolate genomes on your computer:

wget https://ndownloader.figshare.com/files/11857577 -O Prochlorococcus_31_genomes.tar.gz
tar -zxvf Prochlorococcus_31_genomes.tar.gz
cd Prochlorococcus_31_genomes
anvi-migrate-db *.db

The directory contains anvi’o contigs databases, an external genomes file, and a TAB-delimited data file that contains additional information for each genome (which is optional, but you will see later why it is very useful). You can generate a genomes storage as described in this section the following way:

anvi-gen-genomes-storage -e external-genomes.txt \
                         -o PROCHLORO-GENOMES.db

Running a pangenome analysis

Once you have your genomes storage ready, you can use the program anvi-pan-genome to run the actual pangenomic analysis. This is the simplest form of this command:

$ anvi-pan-genome -g MY-GENOMES.db -n PROJECT_NAME

A real example for hardcore tutorial readers: A Prochlorococcus pangenome [continued]

Let’s run the pangenome using the genomes storage we created using the 31 Prochlorococcus isolates:

anvi-pan-genome -g PROCHLORO-GENOMES.db \
                --project-name "Prochlorococcus_Pan" \
                --output-dir PROCHLORO \
                --num-threads 6 \
                --minbit 0.5 \
                --mcl-inflation 10 \
                --use-ncbi-blast

Each parameter after the --project-name is optional (yet aligns to the way we run the pangenome for our publication).

The directory you have downloaded also contains a file called “layer-additional-data.txt”, which summarizes teh clade to which each genome belongs. Once the pangenome is computed, we can add it into the pan database:

anvi-import-misc-data layer-additional-data.txt \
                      -p PROCHLORO/Prochlorococcus_Pan-PAN.db \
                      --target-data-table layers

New layers additional data...
===============================================
Data key "clade" .............................: Predicted type: str
Data key "light" .............................: Predicted type: str

New data added to the db for your layers .....: clade, light.

We all are looking for ways to enrich our pangenomic displays, and anvi’o’s additional data tables are excellent ways to do that. Please take a moment to learn more about them here.

When you run anvi-pan-genome, the program will,

  • Use all genomes in the genomes storage. If you would like to focus on a subset, you can use the parameter --genome-names.

  • Use only a single core by default. Depending on the number of genomes you are analyzing, this process can be very time consuming, hence you should consider increasing the number of threads to be used via the parameter --num-threads.

  • Use DIAMOND (Buchnfink et al., 2015) in ‘fast’ mode by default (or you can ask DIAMOND to be ‘sensitive’ by using the flag --sensitive) to calculate the similarity of each amino acid seqeunce in every genome against every other amino acid sequence across all genomes (which clearly requires you to have DIAMOND installed). Alternatively you could use the flag --use-ncbi-blast to use NCBI’s blastp for amino acid sequence similarlty search.

A note from Meren: I strongly suggest you to do your analysis with the --use-ncbi-blast flag. Yes, DIAMOND is very fast, and it may take 8-9 hours to analyze 20-30 genomes with blastp (using one core on a standard laptop). But dramatic increases in speed rarely comes without major trade-offs in sensitivity and accuracy, and some of my observations tell me that DIAMOND is not one of those rare instances. This clearly deserves a more elaborate discussion, and maybe I will have a chance to write it later, but for now take this as a friendly reminder.

  • Use every gene call, whether they are complete or not. Although this is not a big concern for complete genomes, metagenome-assembled genomes (MAGs) will have many incomplete gene calls at the end and at the beginning of contigs. Our experiments so far suggest that they do not cause major issues, but if you want to exclude them, you can use the --exclude-partial-gene-calls flag.

  • Use the minbit heuristic that was originally implemented in ITEP (Benedict et al, 2014) to eliminate weak matches between two amino acid sequences. You see, the pangenomic workflow first identifies amino acid seqeunces that are somewhat similar by doing similarity searches, and then resolves gene clusters based on those similarities. In this scenario, weak similarities can connect gene clusters that should not be connected. Although the network partitioning algorithm can recover from these weak connections, it is always better to eliminate as much noise as possible at every step. So the minbit heuristic provides a mean to set a to eliminate weak matches between two amino acid sequences. We learned it from ITEP (Benedict MN et al, doi:10.1186/1471-2164-15-8), which is another comprehensive analysis workflow for pangenomes, and decided to use it because it makes a lot of sense. Briefly, If you have two amino acid sequences, A and B, the minbit is defined as BITSCORE(A, B) / MIN(BITSCORE(A, A), BITSCORE(B, B)). So the minbit score between two sequences goes to 1.0 if they are very similar over the entire length of the ‘shorter’ amino acid sequence, and goes to 0.0 if (1) they match over a very short stretch compared even to the length of the shorter amino acid sequence or (2) the match between sequence identity is low. The default minbit is 0.5, but you can change it using the parameter --minbit.

This parameter has wrongly named as maxbit in anvi’o v2.4.0 and earlier versions. Please take a look at the relevant issue if you are using an anvi’o version that is impacted by this typo.

  • Use the MCL algorithm (van Dongen and Abreu-Goodger, 2012) to identify clusters in amino acid sequence similarity search results. We use 2 as the MCL inflation parameter by default. This parameter defines the sensitivity of the algorithm during the identification of the gene clusters. More sensitivity means more clusters, but of course more clusters does not mean better inference of evolutionary relationships. More information on this parameter and it’s effect on cluster granularity is here http://micans.org/mcl/man/mclfaq.html#faq7.2, but clearly, we, the metagenomics people will need to talk much more about this. So far in the Meren Lab we have been using 2 if we are comparing many distantly related genomes (i.e., genomes classify into different families or farther), and 10 if we are comparing very closely related genomes (i.e., ‘strains’ of the same ‘species’ (based whatever definition of these terms you fancy)). You can change it using the parameter --mcl-inflation. Please experiment yourself, and consider reporting!

  • Utilize every gene cluster, even if they occur in only one genome in your analyses. Of course, the importance of singletons or doubletons will depend on the number of genomes in your analysis, or the question you have in mind. However, if you would like to define a cut-off, you can use the parameter --min-occurrence, which is 1, by default. Increasing this cut-off will improve the clustering speed and make the visualization much more manageable, but again, this parameter should be considered in the context of each study.

  • Use euclidean distance and ward linkage to organize gene clusters and genomes. You can change those using --distance and --linkage parameters.

  • Try to utilize previous search results if there is already a directory. This way you can play with the --minbit, --mcl-inflation, or --min-occurrence parameters without having to re-do the amino acid seqeunce search. However, if you have changed something, either you need to remove the output directory, or use the --overwrite-output-destinations flag to redo the search.

You need another parameter? Well, of course you do! Let us know, and let’s have a discussion. We love parameters.

Once you are done, a new directory with your analysis results will appear. You can add or remove additional data items into your pan profile database using anvi’o additional data tables subsystem.

Displaying the pan genome

Once your analysis is done, you will use the program anvi-display-pan to display your results.

This is the simplest form of this command:

$ anvi-display-pan -p PROJECT-PAN.db -g PROJECT-PAN-GENOMES.db

The program anvi-display-pan is very similar to the program anvi-interactive, and the interface that will welcome you is nothing but the standard anvi’o interactive interface with slight adjustments for pangenomic analyses. Of course anvi-display-pan will allow you to set the IP address and port number to serve, add additional view data, additional layers, and/or additional trees, and more. Please familiarize yourself with it by running anvi-display-pan -h in your terminal.

Here is the pangenome for the 31 Prochlorococcus isolate genomes we have created in the previous sections of this tutorial:

anvi-display-pan -g PROCHLORO-GENOMES.db \
                 -p PROCHLORO/Prochlorococcus_Pan-PAN.db

31 Prochlorococcus raw

Looks ugly. But do not despair. For instance, to improve things a little, you may like to organize your genomes based on gene clustering results by selecting the ‘gene_cluster frequencies’ tree from the Samples Tab > Sample Order menu:

31 Prochlorococcus samples

This is what happens when you draw it again (note the tree that appears on the right):

31 Prochlorococcus ordered

Looks more meaningful .. but still ugly.

Well, this is exactly where you need to start using the interface more efficiently. For instance, this is the same pangenome after some changes using the additional settings items in the settings panel of the interactive interface:

31 Prochlorococcus final

The anvi’o state file for this display is also in your work directory if you have downloaded the Prochlorococcus data pack, and you can import it this way:

anvi-import-state -p PROCHLORO/Prochlorococcus_Pan-PAN.db \
                  --state pan-state.json \
                  --name default

No excuses for bad looking pangenomes.

Inspecting gene clusters

Every gene cluster in your analysis will contain one or more amino acid sequence that originate from one or more genomes. While there will likely be a ‘core’ section, in which all gene cluster will appear in every genome, it is also common to find gene clusters that contain more than one gene call from a single genome (i.e., all multi-copy genes in a given genome will end up in the same gene cluster). Sooner or later you will start getting curious about some of the gene clusters, and want to learn more about them. Luckily you can right-click on to any gene cluster, and you would see this menu (or maybe even more depending on when you are reading this article):

31 Prochlorococcus final

For instance, if you click ‘Inspect gene cluster’, you will see all the amino acid sequences from each genome that went into that gene cluster (with the same order and background colors of genomes as they are arranged in the main display):

31 Prochlorococcus final

You can learn more about the amino acid color coding algorithm here.

It is not only fun but also very educational to go through gene clusters one by one. Fine. But what do you do if you want to make sense of large selections?

As you already know, the anvi’o interactive interface allows you to make selections from the tree. So you can select groups of gene clusters into bins (don’t mind the numbers on the left panel, there clearly is a bug, and will be fixed in your version):

31 Prochlorococcus selection

You can create multiple bins with multiple selections, and even give them meaningful names if you fancy:

31 Prochlorococcus collection

While selecting gene clusters manually using the dendrogram is an option, it is also possible to identify them using the search interface that allows you to define very specific search criteria:

31 Prochlorococcus collection

You can highlight these selections in the interface, or you can add them to a collection to summarize them later.

In addition, you can search gene clusters also based on functions:

31 Prochlorococcus collection

Similarly, you can add these gene clusters into collections with whatever name you like, and summarize those collections later.

Advanced access to gene clusters is also possible through the command line through the program anvi-get-sequences-for-gene-clusters. For more information see this issue or this vignette.

Inferring the homogeneity of gene clusters

This functionality is available since anvi’o v5.2 (here is the original contribution from Mahmoud Yousef).

Gene clusters are good, but not all gene clusters are created equal. By simply inspecting them you can have see differing levels of disagreements between amino acid sequences across different genomes in each gene cluster.

Concept of homogeneity

A gene cluster may contain amino acid sequences from different genomes that are almost identical, which would be a highly homogeneous gene cluster. Another gene cluster may contain highly divergent amino acid sequnces from different genomes, which would then be a highly non-homogeneous one, and so on.

One could infer the nature of sequence homogeneity within a gene cluster by focusing on two primary attributes of sequence alignments: functional homogeneity (i.e., how conserved aligned amino acid residues across genes), and geometric homogeneity (i.e., how does the gap / residue distribution look like within a gene cluster regardless of the identity of amino acids. While it is rather straightforward to have an idea about the homogeneity of gene clusters through the manual inspection of the aligned sequences within them, it has not been possible to quantify this information automatically. But anvi’o now has you covered.

Indeed, understanding the within gene cluster homogeneity could yield detailed ecological or evolutionary insights regarding the forces that shape the genomic context across closely related taxa, or help scrutinize gene clusters further for downstream analyses manually or programmatically. The purpose of this section is to show you how we solved this problem in anvi’o and to demonstrate its efficacy.

Functional and geometric homogeneity estimates in anvi’o

Thanks to Mahmoud Yousef’s code, anvi’o pangenomes contains two layers that summarize homogeneity indices for each gene cluster.

If you are working with a pangenome that was generated prior to anvi’o v5.2, you can use the program anvi-compute-gene-cluster-homogeneity to add homogeneity estimates to the existing anvi’o pan database. Please see the help menu of this program, and let us know if you are lost.

Here is an example in the context of our Prochlorococcus pangenome (see the outermost two additional layers):

31 Prochlorococcus collection

Geometric homogeneity index

The geometric homogeneity index indicates the degree of geometric configuration between the genes of a gene cluster. Specifically, we look for the gap/residue patterns of the gene cluster that have been determined by the alignment. The alignment process by definition aligns similar sections of genes to each other, and denotes missing residues (or different structural configurations of gene contexts) through gap characters. If gap/residue distribution patterns are mostly uniform throughout the gene cluster, then this gene cluster will have a high geometric homogeneity, and the maximum value of 1.0 indicates that are no gaps in the alignment.

Anvi’o computes the geometric homogeneity index by combining analysis of the gene cluster content in two levels: the site-level analyses (i.e., vertically aligned positions) and the gene-level analyses (i.e., horizontally aligned positions because they are in the same gene). We convert the information in a gene cluster into a binary matrix, where gaps and residues are simply represented by 1s and 0s, and we utilize the logical operator xor to identify and enumerate all comparisons that have a different pattern. In site-level homogeneity checks, the algorithm sweeps from left to right and identifies these differences across aligned columns. In gene-level homogeneity checks, the algorithm sweeps from one gene to the next and identifies all differences in a gene where the pattern is not the same, effectively checking for the spread of gaps across the gene cluster. It is possible to skip the gene-level geometric homogeneity caluclations with the flag --quick-homogeneity. While this will not be as accurate or as comprehensive as the default approach, it may save you some time, depending on the number of genomes with which you are working.

Functional homogeneity index

In contrast, the functional homogeneity index considers aligned residues (by ignoring gaps), and attempts to quantify differences across residues in a site by considering the biochemical properties of differing residues (which could affect the functional conservation of genes at the protein-level). To do this, we divided amino acids into seven different “conserved groups” based on their biochemical properties. These groups are: nonpolar, aromatic, polar uncharged, both polar and nonpolar characteristics (these amino acids also lie in groups besides this one), acidic, basic, and mostly nonpolar (but contain some polar characteristics) (for more information please see this article). Then, the algorithm goes through the entire gene cluster and assigns a similarity score between 0 and 3 to every pair of amino acids at the same position across genes based on how close the biochemical properties of the amino acid residues are to each other. The sum of all assigned similarity scores is indicative of the functional homogeneity index of the gene cluster, and will reach to its maximum value of 1.0 if all residues are identical.

Both indices are on a scale of 0 to 1, where 1 is completely homogeneous and 0 is completely heterogeneous. If the algorithm is interrupted by a runtime error (due to unexpected issues such as not all genes being the same length for whatever reason, etc.), it will default to error values of -1 for the indices. So if you see a -1 in your summary outputs, it means we failed to make sense of the alignments in that gene cluster for some reason :/

In reality, the complex processes that influence protein folding and the intricate chemical interactions between amino acid residues should remind us that these assessments of similarity are only mere numerical suggestions, and do not necessarily reflect accurate biochemical insights, which is not the goal of these homogeneity indices.

Using homogeneity indices in pangenomes

Given all of that, let’s look at our pangenome again. What are they good for? Well, there is a lot one could do with these homogeneity indices, and it would be unfair to expect this tutorial to cover all of them. Nevertheless, here is an attempt to highlight some aspects of it.

To get a quick idea about some of the least homogeneous gene clusters, one could order gene clusetrs based on increasing or decreasing homogeneity. Let’s say we want to order it by increasing functional homogeneity (as you likely know already, this can be done through the “Items order” combo box in the main settings panel):

Prochlorococcus pan

If you open the “Mouse” panel on the right side of your display, you can actually hover over the layer and see the exact value of the index for all gene clusters. The gene cluster shown underneath the cursor in the figure above has a relatively low functional homogeneity estimate, but a higher geometric homogeneity. We can inspect the gene cluster to take a look for ourselves:

Prochlorococcus pan

You can see why the relatively higher geometric homogeneity score makes sense. Three of these genes have the same gap/residue pattern, and the gaps at the end of the other gene throw things off slightly, making the geometric score close to 0.75. On the other hand, the color coding of the aligned amino acids also gives us a hint regarding the lack of functional homogeneity across them. We can do the same for the geometric homogeneity index. Try it yourself: order the pangenome based on the geometric homogeneity index, and inspect a gene cluster with a relatively low score.

So, what can we do for more in depth analyses of our gene clusters?

Anvi’o offers quite a powerful way to filter gene clusters both through the command line program anvi-get-sequences-for-gene-clusters, and through the interface for interactive exploration of the data:

Prochlorococcus pan

Exploratory analyses

Let’s assume you wish to find a gene cluster that represents a single-copy core gene with very high discrepancy between its geometric homogeneity versus its functional homogeneity. As in, you want something that is highly conserved across all genomes, it is under structural constraints that keeps its alignment homogeneous, but it has lots of room to diversify in a way that impacts its functional heterogeneity. You want a lot. But could anvi’o deliver?

Well, for this very specific set of constraints you could first order all gene clusters based on decreasing geometric homogeneity index, then enter the following values to set up a filter before applying it and highlighting the matching gene clusters:

Prochlorococcus pan

The first gene cluster in counter-clockwise order is the one that best matches to the listed criteria. Upon closer inspection,

Prochlorococcus pan

One could see that there is tremendous amount of functional variability among aligned residues across genomes, despite the geometric homogeneity of the sequences in the gene cluster (only a portion of the alignment is shown here):

Prochlorococcus pan

Those of you who read our study on the topic, would perhaps not be surprised to learn that the COG functional annotation of this particular gene cluster resolves to an enzyme that is related to cell wall/membrane/envelop biogenesis. It is always good when things check out.

Scrutinizing phylogenomics

Here is another example usage of homogeneitcy indices. We often use single-copy core gene clusters for phylogenomic analyses to estimate evolutionary relationships between our genomes. Identifying single-copy core gene clusters is easy either through advanced filters, or through manual binning of gene clusters:

Prochlorococcus pan

But single-copy core gene clusters will contain many poorly aligned gene clusters that you may not want to use for a phylogenomic analysis so as to minimize the influence of noise that stems from bioinformatics decisions regarding where to place gaps. On the other hand, there will be many gene clusters that are near-identical in this collection, which would be rather useless to infer phylogenomic relationships. Luckily, you could use homogeneity indices and advanced search options to identify those that are geometrically perfect, but functionally diverse:

Prochlorococcus pan

You could easily append those that match to these criteria to a separate bin:

Prochlorococcus pan

And perform a quick-and dirty analysis on-the-fly to see how they would organize your genomes directly on the interface:

Prochlorococcus pan

Or you could get the FASTA file with aligned and concatenated amino acid sequences for these gene clusters to do a more appropriate phylogenomic analysis:

$ anvi-get-sequences-for-gene-clusters -p PROCHLORO/Prochlorococcus_Pan-PAN.db \
                                       -g PROCHLORO-GENOMES.db \
                                       -C default -b Better_core \
                                       --concatenate-gene-clusters \
                                       -o better_core.fa

Needless to say, estimates for homogeneity indices per gene cluster will also appear in your summary files from anvi-summarize to satisfy the statistician within you.

Making sense of functions in your pangenome

Once we have our pangenome, one of the critical things that we usually want to do is look at the functions associated with our gene clusters. This is a crucial yet complicated challenge to which we can appraoch in multiple ways. Here, we will describe how you can identify functions that are enriched for some of the clades or sub clades that are included in your pangenome. In addition, we will discuss how you can find the functional core of the pangenome. This is done with our new program anvi-get-enriched-functions-per-pan-group.

This program utilizes information in the layers additional data table of your pan database to identify ‘groups’ within your genomes, and find functions that are enriched in those groups, i.e. functions that are characteristic of these genomes, and predominantly absent from genomes from outside this group. To use this feature you must have at least one categorical additional layer information (which can easily be done via anvi-import-misc-data), and at least one functional annotation source for your genomes storage (which will automatically be the case if every contigs database that were used when you run anvi-gen-genomes-storage was annotated with the same functional source).

Alon’s short ‘behind the scenes’ lecture on anvi’o functional enrichment analysis for the curious

For those of you who like to dive into the details, here is some information about what goes on behind the scenes.

In order for this analysis to be compatible with everything else relating to the pangenome, we decided that it should be focused on gene clusters, since they are the heart of our pangenomes. This means that the first step of the functional analysis is to try to associate every gene cluster with a function. Because gene clusters do not have functional annotation by default as functional annotation is done in anvi’o at the level of individual genes.

In an ideal world, all genes in a single gene cluster would be annotated with the same identical function. While in reality this is usually the case, it is not always true. In cases in which there are multiple functions associated with a gene cluster, we chose the most frequent function, i.e. the one that the largest number of genes in the gene cluster matches (if there is a tie of multiple functions, then we simply choose one arbitrarily). If none of the genes in the gene cluster were annotated, then the gene cluster has no function associated to it.

Naturally, when we associate each gene cluster with a single function, we could end up with multiple gene clusters in a pangenome with the same function. From our experience, most functions are associated with a single gene cluster, but there are still plenty of functions that associate with multiple gene clusters, and this will be more common in pangenomes that contain distantly related genomes. In these cases, in order to find the occurrence of a given function among genomes, we simply ‘merge’ the occurrences of all gene clusters associated with that same function (for you computational readers, we simply take the “or” product of the presence/absence vectors of the gene clusters across genomes).

The careful readers will notice that we distinguish between ‘functional annotation’ and ‘functional association’ in the following text. When we mention ‘functional annotation’, we refer to the annotation of a single gene with a function by the functional annotation source (i.e. COGs, EggNOG, etc.), whereas ‘functional association’ of a gene cluster is the association of gene clusters with a single function as described above.

Ok, so now we have a presence/absence table of functions in genomes, and we can calculate different scores for each function, and also visualize it. See the details below!

Let’s use the Prochlorococcus example to learn what we can do with this.

First we will compare the low-light vs. the high-light genomes in order to see if there are any functions that are unique to either group using the ‘light’ categorical additional layer data (if this doesn’t make sense to you please go back to one of the pangenome figures above and see the layer additional data ‘light’):

anvi-get-enriched-functions-per-pan-group -p PROCHLORO/Prochlorococcus_Pan-PAN.db \
                                          -g PROCHLORO-GENOMES.db \
                                          --category light \
                                          --annotation-source COG_FUNCTION \
                                          -o PROCHLORO-PAN-enriched-functions-light.txt

Here is the structure of the output file PROCHLORO-PAN-enriched-functions-light.txt (there are more columns, scroll towards right to see them):

category COG_FUNCTION enrichment_score p_value portion_occurrence_in_group portion_occurrence_outside_of_group occurrence_in_group occurrence_outside_of_group gene_clusters_ids core_in_group core corrected_p_value
LL Proteasome lid subunit RPN8/RPN11, contains Jab1/MPN domain metalloenzyme (JAMM) motif 4.78 0 1 0 11 0 GC_00002219, GC_00003850, GC_00004483 TRUE FALSE 0
LL 1,6-Anhydro-N-acetylmuramate kinase 4.78 0 1 0 11 0 GC_00001728 TRUE FALSE 0
LL L-amino acid N-acyltransferase YncA 4.78 0 0.91 0 10 0 GC_00001902 FALSE FALSE 0
(…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…)
LL Secreted protein containing bacterial Ig-like domain and vWFA domain 1.19 0.23 0.18 0 2 0 GC_00004324 FALSE FALSE 0.68
LL Exopolysaccharide biosynthesis protein related to N-acetylglucosamine-1-phosphodiester alpha-N-acety… 1.19 0.23 0.18 0 2 0 GC_00004835 FALSE FALSE 0.68
(…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…)
LL Hydroxymethylpyrimidine/phosphomethylpyrimidine kinase -4.78 0 0 1 0 20 GC_00001251 FALSE FALSE 0
LL Uncharacterized conserved protein, DUF697 family -4.78 0 0 1 0 20 GC_00001393 FALSE FALSE 0
(…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…)
HL Chromatin remodeling complex protein RSC6, contains SWIB domain 4.78 0 1 0 20 0 GC_00001035 TRUE FALSE 0
HL Metallophosphoesterase superfamily enzyme 4.78 0 0.95 0 19 0 GC_00001533 FALSE FALSE 0
(…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…)
HL N-acetylglucosamine-6-phosphate deacetylase -4.78 0 0 1 0 11 GC_00001770 FALSE FALSE 0
HL Exonuclease VII, large subunit -4.78 0 0 1 0 11 GC_00001726 FALSE FALSE 0

The following describes each column:

  1. category is the column you chose from your layers additional data table, in order to identify groups within your genomes. In the Prochlorococcus case we have two light groups: low light (LL), and high light (HL). The output table is ordered according to this column first, and within each group, the table is ordered according to the enrichment score.

  2. COG_FUNCTION this column has the name of the specific function for which enrichment was calculated. In this example we chose to use COG_FUNCTION for functional annotation, and hence the column title is COG_FUNCTION. You can specify whichever functional annotation source you have in your PAN database using the --annotation-source, and then the analysis would be done according to that annotation source. Even if you have multiple functional annotation sources in your genome storage, only one source could be used for a single run of this program. If you wish, you could run it multiple times and each time use a different annotation source. If you don’t remember which annotation sources are available in your genomes storage, you can use --list-annotation-sources.

  3. enrichment_score is a score to measure how much is this function unique to the genomes that belong to a specific group vs. all other genomes in your pangenome. This score was developed by Amy Willis. For more details about how we generate this score, see the note from Amy below. Notice that in the comparison above, each genome belongs to one of the two groups (HL, LL), but if the column you chose from your layers additional data table has more than two groups, then when comparing a function for each group, the occurrence is compared between the group members, and the rest of the genomes, i.e. the comparison is not pair-wise between groups (you can see the example below of comparison between clades of Prochlorococcus for more details). When the occurrence of a function in the group is lower than outside the group, then we get a negative enrichment score.

  4. portion_occurrence_in_group is the number of genomes in the group that were associated with the function, divided by the total number of genomes in the group

  5. portion_occurrence_outside_of_group is the number of genomes not in the group that were associated with the function, divided by the total number of genomes not in the group.

  6. occurrence_in_group is the number of genomes in the group that were associated with the function.

  7. occurrence_outside_of_group is the number of genomes outside the group that were associated with the function.

  8. gene_clusters_ids are the gene clusters that were associated with this function. Notice that each gene cluster would be associated with a single function, but a function could be associated with multiple gene clusters.

  9. core_in_group is “true” if the function occurs in all members of the group.

  10. core is “true” if the function occurs in all genomes in the pangenome.

  11. p_value is the p value for the enrichment score.

  12. corrected_p_value a correction for multiple tests using the Benjamini-Hochberg false discovery rate method. This is done using statsmodels.stats.multitest.multipletests.

A note from Amy Willis regarding the functional enrichment score The enrichment score proposed here is the test statistic for a two sample Z-test for proportions. It takes the proportion of times the function is observed in the group, subtracts the proportion of times the function is observed outside the group, and re-scales this difference to reflect the number of samples observed in each group. The adjustment for group size means that larger scores are given when groups are larger – essentially, a difference between groups can be considered more robust when there are more representatives of each group. While this scoring system is motivated by a formal statistical hypothesis test, by default anvi’o applies this to many functions, and so the test statistics cannot be used to perform a real hypothesis test. (By real hypothesis test, I mean one that has a 5% probability of returning a p value less than 5% when there is no underlying difference between the groups.) Therefore, it should be considered a method to help you order and sort through your data, and not as a formal statistical test for differences between groups.

Now let’s search for the top function in the table “Ser/Thr protein kinase RdoA involved in Cpx stress response, MazF antagonist”, which is enriched for the members of the LL group, and we can see in the table that it matches four gene clusters.

layers

In fact, if we look carefully, then we find that this function matches a gene cluster that is unique for each one of the four low light clades, and is a single-copy core gene for the low light genomes in our pangenome. Cool!

Let’s look at another function from the table: Exonuclease VII, large subunit (third line). When we search this function, we should be careful since the name of the function contains a comma, and the function search option in the interactive interface treats the comma as if it is separating multiple functions that are to be searched at the same time. Hence I just searched for Exonuclease VII, and here are the results:

layers

We can see that the search matched hits for both Exonuclease VII, large and small subunits, with a total of 22 hits.

layers

The large subunit matches a single gene cluster which is in the CORE LL, and the small subunit matches a gene cluster in each one of the clade specific cores (similar to what we saw above for the Ser/Thr protein kinase). Both of these genes are also part of the single copy core unique to low light members and absent from high light members.

Creating a quick pangenome with functions

Next, we will explore whether there are any functions enriched for any of the sub clades. In addition, we will introduce another feature --functional-occurrence-table-output. This optional output is a TAB-delimited file with the presence/absence information for functions in genomes.

anvi-get-enriched-functions-per-pan-group -p PROCHLORO/Prochlorococcus_Pan-PAN.db \
                                          -g PROCHLORO-GENOMES.db \
                                          --category clade\
                                          --annotation-source COG_FUNCTION \
                                          -o PROCHLORO-PAN-enriched-functions-clade.txt \
                                          --functional-occurrence-table-output PROCHLORO-functions-occurrence.txt

Let’s look at some results. This is how PROCHLORO-PAN-enriched-functions-clade.txt looks like:

category COG_FUNCTION enrichment_score p_value portion_occurrence_in_group portion_occurrence_outside_of_group occurrence_in_group occurrence_outside_of_group gene_clusters_ids core_in_group core corrected_p_value
LL_IV Cephalosporin hydroxylase 2.59 0.01 0 0.03 0 1 GC_00007377 FALSE FALSE 0.01
LL_IV Serine/threonine protein phosphatase PrpC 2.59 0.01 0.5 0 1 0 GC_00005695 FALSE FALSE 0.01
(…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…)

And this is how PROCHLORO-functions-occurrence.txt looks like:

  AS9601 CCMP1375 EQPAC1 GP2 LG MED4 MIT9107 MIT9116 MIT9123 MIT9201 MIT9202 MIT9211 MIT9215 MIT9301 MIT9302 MIT9303 MIT9311 MIT9312 MIT9313 MIT9314 MIT9321 MIT9322 MIT9401 MIT9515 NATL1A NATL2A PAC1 SB SS2 SS35 SS51
3-deoxy-D-manno-octulosonate 8-phosphate phosphatase KdsC and related HAD superfamily phosphatases 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
Creatinine amidohydrolase/Fe(II)-dependent formamide hydrolase involved in riboflavin and F420 biosynthesis 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
RNA recognition motif (RRM) domain 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
(…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…)

Let’s use the functional occurrence table for visualization. First, we fix the names of functions to get rid of things like commas etc. Basically, we only keep alphanumeric characters, and replace any sequence of non-alphanumeric characters by a single _.

sed "s/[^[:alnum:]	_]/_/g" PROCHLORO-functions-occurrence.txt | \
	tr -s \_ _ | \
	sed 's/^	/name	/' \
	> PROCHLORO-functions-occurrence-fixed-a-little.txt

Notice that if you try to copy and paste the line above then it probably wouldn’t work properly, because it includes tabs. In order to fix it you should manually add tabs after ^[:alnum:], s/^, and /name. If you are using a terminal on a mac, then you can manually add a tab by doing ctrl-V and then tab (as explained here).

Unfortunately there could be some functions with very similar names, let’s check if there are any duplicate names now:

cut -f 1 PROCHLORO-functions-occurrence-fixed-a-little.txt | sort | uniq -d

We can see that Fatty_acid_desaturase, and Protein_tyrosine_phosphatase are duplicated now. This is because, for example, there are two COG functions Protein-tyrosine phosphatase (accession COG2453) and Protein-tyrosine-phosphatase (accession COG0394), which match different gene clusters in our pangenome. Because we cannot create a tree with duplicated nodes, and since we can’t really say that these are different functions, we will use a script to merge these duplicated occurrences (i.e. merge their occurrences with a logical or). Your working directory includes the script fix_functional_occurrence_table.py:

./fix_functional_occurrence_table.py PROCHLORO-functions-occurrence-fixed-a-little.txt PROCHLORO-functions-occurrence-fixed.txt

Then we create trees for the interactive interface:

anvi-matrix-to-newick PROCHLORO-functions-occurrence-fixed.txt \
                      -o PROCHLORO-functions-tree.txt

anvi-matrix-to-newick PROCHLORO-functions-occurrence-fixed.txt \
                      -o PROCHLORO-functions-layers-tree.txt \
                      --transpose

We run a quick dry run to create a manual mode profile database.

anvi-interactive -p PROCHLORO-functions-manual-profile.db \
                 --tree PROCHLORO-functions-tree.txt \
                 -d PROCHLORO-functions-occurrence-fixed.txt \
                 --manual \
                 --dry-run

We import the tree for the layers order:

echo -e "item_name\tdata_type\tdata_value" > PROCHLORO-functions-layers-order.txt
echo -e "PROCHLORO_functions_tree\tnewick\t`cat PROCHLORO-functions-layers-tree.txt`" \
                             >> PROCHLORO-functions-layers-order.txt

anvi-import-misc-data PROCHLORO-functions-layers-order.txt \
                      -p PROCHLORO-functions-manual-profile.db \
                      -t layer_orders \
                      --just-do-it

We can get some information about the genomes from the PAN database:

anvi-export-misc-data -p PROCHLORO/Prochlorococcus_Pan-PAN.db \
                      -t layers \
                      -o PROCHLORO-layer-additional-data.txt

And then import that into our manual mode profile database:

anvi-import-misc-data PROCHLORO-layer-additional-data.txt \
                      -p PROCHLORO-functions-manual-profile.db \
                      -t layers

The work directory you downloaded includes a nice state that we created for this profile database along with a collection, let’s import these:

anvi-import-state -p PROCHLORO-functions-manual-profile.db \
                  -s PROCHLORO-manual-default-state.json \
                  -n default
    
anvi-import-collection -p PROCHLORO-functions-manual-profile.db \
                       -C default \
                       PROCHLORO-functions-collection.txt

And now we can take a look:

anvi-interactive -p PROCHLORO-functions-manual-profile.db \
                 -t PROCHLORO-functions-tree.txt \
                 -d PROCHLORO-functions-occurrence-fixed.txt \
                 --title "Prochlorococcus Pan - functional occurrence" \
                 --manual

layers

A collection of 869 core functions emmerges.

We can also see that the occurrence of functions is recapitulating all four LL clades. In contrast, the two HL clades seem to be mixed together.

A little BASH one-liner can give us all gene clusters associated with core functions:

grep LL PROCHLORO-PAN-enriched-functions-light.txt |\
  awk  -F $'\t' '$11 == "True" { print $9 }' |\
    tr ',' '\n' |\
      sed 's/ //g' > core_functions_gene_clusters.txt

And we get a file with 2613 gene clusters. Now we can compare this to the collection of gene clusters we have above. We can find, for example, how many of the gene clusters that are in the CORE_LL are in the functional core of all 31 Prochlorococcus genomes:

for gc in `grep CORE_LL PROCHLORO-PAN-default-collection.txt`
do
    grep $gc core_functions_gene_clusters.txt
done > CORE_LL_included_in_functional_core.txt

We find that 103 of the 144 gene clusters are part of the functional core. And for HL it is 294 of 499 that are found to be part of the functional core. An important thing to remember is that gene clusters that have no function associated with them are not included in this analysis.

We can find all the gene clusters that are associated with a function:

grep LL PROCHLORO-PAN-enriched-functions-light.txt |
  awk  -F $'\t' '{ print $9 }' |
    tr ',' '\n' |
      sed 's/ //g' > all_gene_clusters_with_functions.txt

There are 3629 gene clusters with functions. How many gene clusters that belong to the CORE_HL have functions?

for gc in `grep CORE_HL PROCHLORO-PAN-default-collection.txt`
do
   grep $gc all_gene_clusters_with_functions.txt;
done > HL_CORE_GC_with_functions.txt

wc -l all_gene_clusters_with_functions.txt
321

There are 321 (of a total of 499) gene clusters in CORE_HL that have functions. Hence many of the gene clusters in CORE_HL that were not found to be part of the functional core, simply don’t have any functional annotation.

We hope you find great uses for the functional enrichment analysis framework in anvi’o pangenomic workflow! Feel free to send us any questions you may have.

Computing the average nucleotide identity for genomes

Anvi’o also contains a program, anvi-compute-ani, which uses PyANI to compute average nucleotide identity across your genomes. It expects you to provide an external genomes file to declare which genomes you are interested in for an ANI analysis, but it also optionally accepts a pan database to add its findings into it as additional layer data.

Here is an example with our Prochlorococcus Pan genome:

anvi-compute-ani --external-genomes external-genomes.txt \
                 --output-dir ANI \
                 --num-threads 6 \
                 --pan-db PROCHLORO/Prochlorococcus_Pan-PAN.db

Once it is complete, we can visualize the pan genome again to see what is new:

anvi-display-pan -g PROCHLORO-GENOMES.db \
                 -p PROCHLORO/Prochlorococcus_Pan-PAN.db

When you first look at it you will not see anything out of ordinary. But if you go to the ‘Layers’ tab, you will see the following addition:

layers

If you click the checkbox for, say, ANI_percentage_identity, you will see a new set of entries will be added into the list of layer data entries. Then, if you click the button ‘Draw’ or ‘Redraw layer data’, you should see the ANI added to your display:

prochlorococcus-ani

Yes. Magic.

You may need to change the min value from the interface for a better representation of ANI across your genomes in your own pangenome.

A note from Meren: We have anvi-compute-ani, because someone asked for it on GitHub. We thank our proactive users, like Mike Lee, on behalf of the community.

A little note from Meren on how to order genomes

It has always been a question mark for me how to order genomes with respect to gene clusters. Should one use the gene cluster presence/absence patterns to organize genomes (where one ignores paralogs and works with a binary table to cluster genomes)? Or should one rely on gene cluster frequency data to order things (where one does consider paralogs, so their table is not binary anymore, but contains the frequencies for number of genes from each genome in each gene cluster)?

Thanks to Özcan’s new addition to the codebase, I’ve made an unexpected observation while I was working on this section of the tutorial regarding this question of ‘how to order genomes’ in a pangenome.

This is how ANI matrix looks like when I order Prochlorococcus genomes based on gene cluster frequency data:

ani

In contrast, this is how ANI matrix looks like when I order genomes based on gene cluster presence/absence data:

ani

There you have it.

If you are working with isolate genomes, maybe you should consider ordering them based on gene cluster frequencies since you would expect a proper organization of genomes based on gene clusters should often match with their overall similarity. Here are some small notes:

  • In this case it is easy to see that organization by gene cluster presence/absence data is doing a poor jobby mixing up high-light group II (purple) and high-light group I (poopcolor) genomes. We can identify the problem clealry because we know the clades to which these genomes belong, in cases where you have no idea about these relationships, it seems gene cluster frequencies can organize your genomes more accurately.

  • It is sad to know that MAGs will often lack multi-copy genes, and they will have less paralogs (since assembly of short reads create very complex de Bruijn graphs which obliterate all redundancy in a given genome), hence their proper organization will not truly benefit from gene cluster frequency data, at least not as much as isolate genomes, and the errors we see with gene cluster presence/absence data-based organization will have a larger impact on our pangenomes.

  • See the emergence of those sub-clades in high-light group II? If you are interested, please take a look at our paper on Prochlorococcus metapangenome (especially the section “The metapangenome reveals closely related isolates with different levels of fitness”), in which we divided HL-II into multiple subclades and showed that despite small differences between these genomes, those sub-clades displayed remarkably different environmental distribution patterns. The sub-clades that we could define in that study based on gene cluster frequencies and environmental distribution patterns match quite well to the groups ANI reveals. This goes back to some of the most fundamental questions in microbiology regarding how to define ecologically relevant units of microbial life. I don’t know if we are any close to making any definition, but I can tell you that those ‘phylogenetic markers’ we are using … I am not sure if they are really working that well (smiley).

Summarizing an anvi’o pan genome

When you store your selections of gene clusters as a collection, anvi’o will allow you to summarize these results.

Even if you want to simply summarize everything in the pan genome without making any selections in the interface, you will still need a collection in the pan database. But luckily, you can use the program anvi-script-add-default-collection to add a default collection that contains every gene cluster.

The summary step gives you two important things: a static HTML web page you can compress and share with your colleagues or add into your publication as a supplementary data file, and a comprehensive TAB-delimited file in the output directory that describes every gene cluster.

You can summarize a collection using the program anvi-summarize, and a generic form of this command looks like this:

$ anvi-summarize -p PROJECT-PAN.db \
                 -g PROJECT-PAN-GENOMES.db \
                 -C COLLECTION_NAME \
                 -o PROJECT-SUMMARY

If you open the index.html file in the summary directory, you will see an output with some essential information about the analysis:

31 summary

As well as the TAB-delimited file for gene clusters:

31 summary

The structure of this file will look like this, and will give you an opportunity to investigate your gene clusters in a more detailed manner (you may need to scroll towards right to see more of the table):

unique_id gene_cluster_id bin_name genome_name gene_callers_id COG_FUNCTION_ACC COG_FUNCTION COG_CATEGORY_ACC COG_CATEGORY aa_sequence
1 PC_00001990   MIT9303 30298 COG1199 Rad3-related DNA helicase L Replication, recombination and repair MLEARSHQQLKHLLLQNSSP(…)
(…) (…) (…) (…) (…) (…) (…) (…) (…) (…)
91 PC_00001434 Core_HL MIT9322 42504 COG3769 Predicted mannosyl-3-phosphoglycerate phosphatase G Carbohydrate transport and metabolism MIENSSIWVVSDVDGTLMDH(…)
(…) (…) (…) (…) (…) (…) (…) (…) (…) (…)
257 PC_00000645 Core_all MIT9322 42217 COG1185 Polyribonucleotide nucleotidyltransferase J Translation, ribosomal structure and biogenesis MEGQNKSITFDGREIRLTTG(…)
(…) (…) (…) (…) (…) (…) (…) (…) (…) (…)
2019 PC_00001754 Core_LL NATL2A 49129 COG0127 Inosine/xanthosine triphosphate pyrophosphatase F Nucleotide transport and metabolism MDNVPLVIASGNKGKIGEFK(…)
(…) (…) (…) (…) (…) (…) (…) (…) (…) (…)
5046 PC_00001653   PAC1 52600 COG1087 UDP-glucose 4-epimerase M Cell wall/membrane/envelope biogenesis MRVLLTGGAGFIGSHIALLL(…)
5047 PC_00001653   LG 7488 COG1087 UDP-glucose 4-epimerase M Cell wall/membrane/envelope biogenesis MNRILVTGGAGFIGSHTCIT(…)
5048 PC_00001653   SS35 56661 COG1087 UDP-glucose 4-epimerase M Cell wall/membrane/envelope biogenesis MNRILVTGGAGFIGSHTCIT(…)
5049 PC_00001653   NATL2A 49604 COG1087 UDP-glucose 4-epimerase M Cell wall/membrane/envelope biogenesis MRVLLTGGSGFIGSHVALLL(…)
(…) (…) (…) (…) (…) (…) (…) (…) (…) (…)

This file will link each gene from each genome with every selection you’ve made and named through the interface or through the program anvi-import-collection, and will also give you access to the amino acid sequence and function of each gene.

Final words

We are realizing that there is a lot to explore in this front, and excited to work with the community. Please let us know if you have any suggestions using our discussion group, the comments section below. You can also keep an eye on our public code repository for new releases.