Microbial 'omics

Brought to you by

### Table of Contents

This tutorial is tailored for anvi’o v6 or later. You can learn the version of your installation by typing anvi-interactive -v. If you have an older version, some things will not work the way they should.

The goal of this tutorial is to explore some of the most fundamental aspects of anvi’o and its application to a real-world dataset. We organized it in multiple interconnected chapters, which all use the same dataset:

At the end of this tutorial, you should be able to

• Characterize high-quality genomes from metagenomes and manually curate them,
• Incorporate automatic binning results into your metagenomes,
• Perform phylogenomic analyses to learn about their approximate location in the tree of life,
• Determine core and accessory genes they contain compared to closely related genomes through pangenomics,
• Explore within population genomic diversity through single-nuleotide variants,
• Put a given metagenome-assembled, single-cell or isolate genome into the context of public metagenomes to investigate their core and accessory genes, and
• Investigate the genomic heterogeneity in the context of predicted protein structures.

All of these sections use the same publicly available metagenomic dataset generated by Sharon et al. (2013), which contains 11 feacal samples from a premature infant collected between Day 15 and Day 24 of her life in a time-series manner. After downloading raw sequencing reads, we co-assembled all samples and processed the resulting contigs longer than 1K using anvi’o following the anvi’o metagenomic workflow. The main reason we chose this dataset was its simplicity: more than 95% of metagenomic reads map back to the contigs, a great indication of the assembled contigs describe a very large fraction of this infant’s gut microbiome.

We hope you find the tutorial useful, and generously share your opinions or criticism should you have any.

## Downloading the pre-packaged Infant Gut Dataset

To download the infant gut data-pack copy-paste the following commands into your terminal:

wget https://ndownloader.figshare.com/files/18046139 -O INFANT-GUT-TUTORIAL.tar.gz
tar -zxvf INFANT-GUT-TUTORIAL.tar.gz && cd INFANT-GUT-TUTORIAL


Alternatively (i.e., if you don’t have wget installed on your system), you can download the same file using this direct download link.

When you click the link, it will start downloading a 210 Mb compressed file automatically. Once it is done, go to the relevant directory that contains this file using your terminal, and unpack it the following way:

tar -zxvf INFANTGUTTUTORIAL.tar.gz && cd INFANT-GUT-TUTORIAL


If you were sent here somewhere from down below, now you can go back. If you have no idea what this means, ignore this notice, and continue reading. You’re OK :)

Some details on the contents of the data-pack for the curious

If you type ls in the dataset directory you will see that the data-pack contains an anvi’o contigs database, an anvi’o merged profile database (that describes 11 metagenomes), and other additional data that are required by various sections in this tutorial. Here are some simple descriptions for some of these files, and how we generated them.

The contigs and profile databases. We generated an anvi’o contigs database using the program anvi-gen-contigs-database. This special anvi’o database keeps all the information related to your contigs: positions of open reading frames, k-mer frequencies for each contig, functional and taxonomic annotation of genes, etc. The contigs database is an essential component of everything related to anvi’o metagenomic workflow. We also generated a merged anvi’o profile databases using the program anvi-profile. In contrast to the contigs database, anvi’o profile databases store sample-specific information about contigs. Profiling a BAM file with anvi’o creates a single profile that reports properties for each contig in a single sample based on mapping results. Each profile database automatically links to a contigs database, and anvi’o can merge single profiles that link to the same contigs database into an anvi’o merged profile (which is what you will work with during this tutorial), using the program anvi-merge. Here are some direct links describing these steps:

Identifying single-copy core genes among contigs. We used the program anvi-run-hmms to identify single-copy core genes for Bacteria, Archaea, Eukarya, as well as sequences for ribosomal RNAs among the IGD contigs. All of these results are also stored in the contigs database. This information allows us to learn the completion and redundancy estimates of newly identified bins in the interactive interface, on the fly. Note that if all single copy-core genes for a given domain are detected once in the selected bin, then the completion will be 100% and the redundancy 0%. If a few genes are detected multiple times, the redundancy value will increase. If a few genes are missing, then it is the completion value that will drop.

Assigning functions to genes. We also run anvi-run-ncbi-cogs on the contigs database before we packaged it for you, which stores functional hits for genes results in the contigs database. At the end of the binning process, functions occurring in each bin will be made available for downstream analyses.

## Chapter I: Genome-resolved Metagenomics

The purpose of this tutorial is to have a conversation about genome-resolved metagenomics (with a focus on manual binning) using the Infant Gut Dataset (IGD), which was generated, analyzed, and published by Sharon et al. (2013), and was re-analyzed in the anvi’o methods paper.

By the end of the tutorial, you will be able to:

• Familiarize with the interactive interface for binning (~25 minutes)
• Inspect contigs in the context of their metagenomic signal (~5 minutes)
• Characterize bins by manual binning (~20 minutes)
• Summarize manual binning results for downstream analyses (~20 minutes)
• Manually curate individual bins for quality control (~10 minutes)
• Import and visualize external binning results (~20 minutes)
• Combine manual and automatic binning (~10 minutes)

By the end of this chapter, you should (1) have a comprehensive understanding of genome-resolved metagenomics, and (2) be able to characterize and manually curate genomes from your own assembly outputs moving forwards.

A typical anvi’o genome-resolved metagenomic workflow starts with one or more BAM files and a FASTA file of your contigs. There are many ways to get your contigs and BAM files for your metagenomes and we have started implementing a tutorial that describes the workflows we often use. But in this tutorial we will start from a point in the workflow where you have used your BAM and FASTA files to generate anvi’o contigs and profile databases.

While this tutorial will take you through a simple analysis of a real dataset, there also is available a more comprehensive (but more abstract) tutorial on anvi’o metagenomic workflow.

Using the files in the data-pack directory, let’s take a first look at the merge profile database for the infant gut dataset metagenome. If you copy-paste this to your terminal:

anvi-interactive -p PROFILE.db -c CONTIGS.db


The anvi’o interactive interface should welcome you with this display (after you click “draw”):

When it is time to type other commands, you can close the window, go back to the terminal and press CTRL + C to kill the server.

Show/hide Tom's description of the metagenomic binning-related features of the anvi'o interactive interface

The interactive interface of anvi’o can be quite overwhelming. This particular box, in addition to the interactive interface tutorial, attempts to give you insights into the features of the interactive interface relevant to metagenomic binning.

First of all, each leaf in the cerntral dendrogram describes an individual contig. Contigs that were fragmented into multiple splits due to their extensive length can be identified in the Parent layer. The grey layers after the GC-Content display mean coverage values (i.e., the environmental signal) of each contig/split across the 11 metagenomes. Finally, you can click on the MOUSE box on the right side of the interface to explore numerical and alphabetic values in more details across the display.

Once the interactive interface is up and running, you can start binning:

Make contig selections by hovering your mouse over the tree in the center of the anvi’o figure. To add the highlighted selection to your current bin, left click. To remove the highlighted selection from your current bin, right click. To create a new bin, click “New bin” under the Bins tab in Settings. To change to a different bin, click the blue dot next to the bin name you’re interested in.

Manipulating the inner dendrogram. By default, anvi’o uses three different clutering approaches to organize contigs in the center dendrogram. Your ability to perform manual binning will be in part determined on your understanding of these clustering strategies. Here is a brief description of each:

• Differential coverage: clustering based on the differential coverage of contigs across metagenomes. The logic behind this metric is that fragments originating from the same genome should have the same distribution patterns, often different from other genomes.

• Sequence composition: clustering based on the sequence composition of contigs (by default their tetra-nucleotide frequency). This might seem strange (and in fact, it is to many of us), but fragments originating from the same genome have the tendency to exhibit a similar sequence composition, often different from genomes corresponding to distant lineages.

• Differential coverage and sequence composition: clustering using the two metrics for optimal binning resolution.

And here is where to change the settings in the Main tab:

Anvi’o by default trusts the assembly; therefore splits from the same contig will remain together (but you can breakup contigs through the interface).

Your selections will not be lost when switching from one organization to another. This particular ability has been very useful to visualize differences between clustering strategies and access the biological relevance of identified bins.

Inspecting individual contigs in the context of their metagenomic signal. In addition to selecting bins based on different clustering strategies, you can explore in more details the coverage of individual contigs across metagenomes: this is the inspection mode. You simply need to right click on any metagenomic layer at the coordinates of a contig of interest, and select “Inspect split”. A new window should pop-up, like in this example:

Note that gene coordinates are displayed at the bottom and their inferred function can be accessed in a simple click. When detected, single nucleotide variants are also described in this display (vertical bars). You can close the inspection mode window when your curiosity has been satisfied.

### Importing taxonomy for genes

Anvi’o can work with gene-level taxonomic annotations, but gene-level taxonomy is not useful for anything beyond occasional help with manual binning. Once gene-level taxonomy is added into the contigs database, anvi’o will determine the taxonomy of each contig based on the taxonomic affiliation of genes they describe, and display them in the interface whenever possible.

Centrifuge (code, pre-print) is one of the options to import taxonomic annotations into an anvi’o contigs database. Centrifuge files for the IGD are already in the directory additional-files/centrifuge-files.

If you import these files into the contigs database the following way,

anvi-import-taxonomy-for-genes -c CONTIGS.db \
-p centrifuge \
-i additional-files/centrifuge-files/centrifuge_report.tsv \
additional-files/centrifuge-files/centrifuge_hits.tsv


And run the interactive interface again,

anvi-interactive -p PROFILE.db -c CONTIGS.db


You will see an additional layer with taxonomy:

In the Layers tab find the Taxonomy layer, set its height to 200, then drag the layer in between DAY24 and hmms_Ribosomal_RNAs, and click Draw again. Then click Save State button, and overwrite the default state. This will make sure anvi’o remembers to make the height of that layer 200px the next time you run the interactive interface!

### Inferring taxonomy for metagenomes

So at this point we don’t have any idea about what genomes do we have in this dataset, but anvi’o can make sense of the taxonomic make up of a given metagenome by characterizing taxonomic affiliations of single-copy core genes. The details of this scg-taxonomy workflow is described here in greater detail.

Special database setup

Due to a glitch in anvi’o v6, you may run into some SCG database issues here, so let’s solve it here once and for all. This is something you will do once. First copy-paste this line to remove all previously generated databases from your disk (this cryptic one-liner will remove all databases for SCG taxonomy):

python -c 'import anvio; \
import glob; \
import os; \
[os.remove(x) for x in glob.glob(os.path.join(os.path.dirname(anvio.__file__), \
"data/misc/SCG_TAXONOMY/GTDB/SCG_SEARCH_DATABASES/") + "*.dmnd") \
if os.path.exists(x)]'


Then recreate them using four threads:

anvi-setup-scg-databases -T 4


And re-run taxonomy on our contigs database:

anvi-run-scg-taxonomy -c CONTIGS.db \
--num-parallel-processes 3 \
--num-threads 2


Aaaand we are done.

You can take a very quick look at the taxonomic composition of the metagenome through the command line first:

anvi-estimate-genome-taxonomy -c CONTIGS.db \
--metagenome-mode


which should give us this output for the IGD:

╒════════════════════╤════════════════════╤══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╕
│                    │   percent_identity │ taxonomy                                                                                                                         │
╞════════════════════╪════════════════════╪══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ Ribosomal_S6_11655 │               98.9 │ Bacteria / Firmicutes / Bacilli / Staphylococcales / Staphylococcaceae / Staphylococcus / Staphylococcus epidermidis             │
├────────────────────┼────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Ribosomal_S6_12163 │               98.9 │ Bacteria / Firmicutes / Bacilli / Staphylococcales / Staphylococcaceae / Staphylococcus / Staphylococcus hominis                 │
├────────────────────┼────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Ribosomal_S6_15200 │               98.9 │ Bacteria / Firmicutes / Bacilli / Lactobacillales / Lactobacillaceae / Leuconostoc / Leuconostoc citreum                         │
├────────────────────┼────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Ribosomal_S6_25880 │               98.9 │ Bacteria / Firmicutes / Clostridia / Tissierellales / Helcococcaceae / Finegoldia / Finegoldia magna                             │
├────────────────────┼────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Ribosomal_S6_2915  │               97.9 │ Bacteria / Firmicutes / Bacilli / Lactobacillales / Enterococcaceae / Enterococcus / Enterococcus faecalis                       │
├────────────────────┼────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Ribosomal_S6_29818 │               98.9 │ Bacteria / Firmicutes / Bacilli / Lactobacillales / Streptococcaceae / Streptococcus / Streptococcus oralis                      │
├────────────────────┼────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Ribosomal_S6_30904 │               92.5 │ Bacteria / Firmicutes / Clostridia / Tissierellales / Helcococcaceae / Anaerococcus / Anaerococcus sp002359915                   │
├────────────────────┼────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Ribosomal_S6_4484  │               98.9 │ Bacteria / Firmicutes / Clostridia / Tissierellales / Peptoniphilaceae / Peptoniphilus / Peptoniphilus rhinitidis                │
├────────────────────┼────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Ribosomal_S6_6421  │                100 │ Bacteria / Actinobacteriota / Actinobacteria / Propionibacteriales / Propionibacteriaceae / Cutibacterium / Cutibacterium avidum │
├────────────────────┼────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Ribosomal_S6_7660  │               98.9 │ Bacteria / Firmicutes / Bacilli / Staphylococcales / Staphylococcaceae / Staphylococcus /                                        │
╘════════════════════╧════════════════════╧══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╛


Good, but could have been better. Why? Pro tip: we have a profile database, what does it mean and how it could improve the information we see here?

Anvi’o can add these metagenome-level taxonomic insights into a given profile database, let’s do that,

anvi-estimate-genome-taxonomy -c CONTIGS.db \
-p PROFILE.db \
--metagenome-mode \
--compute-scg-coverages \
--update-profile-db-with-taxonomy


and find the new layer additional data and take a look at it:

anvi-interactive -c CONTIGS.db \
-p PROFILE.db


### Manual identification of genomes in the Infant Gut Dataset

Let’s first talk about the signal that leads to this specific organization of contigs we see in our display, and its biological implications.

Next, we can perform a round of manual binning. This should take about 10 minutes.

A few tips for binning:

• You do not have to bin all contigs. Instead, try to identify bins corresponding to an actual genome. Those will have relatively high completion values. The low-completion or no-completion bins in metagenomes might represent viruses, plasmids, or other interesting genetic elements, but this tutorial will ignore them.
• Please try to avoid bins with redundancy >10%. Those likely contain contaminants.
• You can increase the inner tree radius (e.g., 5,000) for a better binning experience in the Main tab (additional settings)
• You can select the option show grid in the Main tab (additional settings) for a better demarcation of identified bins

Here is an example of 16 bins we identified for comparison AFTER you performed your own binning:

Please save your bins as a collection. You can give your collection any name, but if you all it default, anvi’o will treat it differently.

In the anvi’o lingo, a ‘collection’ is something that describes one or more bins, each of which describe one or more contigs.

If you identified near-complete genomes, then congratulations, you have characterized genomic contents of microbial populations de novo.

### Summarizing the binning results

Why do we do binning? Because we are interested in making sense of our metagenomes in the context of genomes we have recovered through binning. Understanding the distribution patterns of the genomes we have in a collection in a quantitative fashion, or getting back a table of function names found in each one of them, or even summarizing our bins as distinct FASTA files are critical for next steps of every binning analysis. After all, binning is a boring detail before you start doing your science.

To ensure that you have everything you need to continue working with the outcomes of your binning effort outside of anvi’o, we have a program called anvi-summarize. It is possible to summarize any collection stored in an anvi’o profile database through this program. The result is a static HTML page that can be viewed on any computer.

For the lazy

If you don’t want to do your own binning and still be able to continue with the commands below, you can import some binning results this way:

anvi-import-collection additional-files/collections/merens.txt \
--bins-info additional-files/collections/merens-info.txt
-p PROFILE.db \
-c CONTIGS.db \
-C default


Let’s summarize the collection you have just created:

anvi-summarize -p PROFILE.db \
-c CONTIGS.db \
-C default \
-o SUMMARY


Once summary is finished, take a minute to look at its contents.

### Renaming bins in your collection (from chaos to order)

As you can see from the summary file, at this point bin names are random, and we often find it useful to put some order on this front. This becomes an extremely useful strategy especially when the intention is to merge multiple binning efforts later. For this task we use the program anvi-rename-bins:

anvi-rename-bins -p PROFILE.db \
-c CONTIGS.db  \
--collection-to-read default  \
--collection-to-write MAGs  \
--call-MAGs \
--prefix IGD \
--report-file rename-bins-report.txt


With those settings, a new collection MAG will be created in which (1) bins with a completion >70% are identified as MAGs (stands for Metagenome-Assembled Genome), (2) and bins and MAGs are attached the prefix IGD and renamed based on the difference between completion and redundancy.

Now we can use the program anvi-summarize to summarize the new collection:

anvi-summarize -p PROFILE.db \
-c CONTIGS.db \
-C MAGs \
-o SUMMARY_AFTER_RENAME


You can now visualize the results by double-clicking on the index.html file present in the newly created folder SUMMARY_MAGs.

Here are the MAGs we got:

Better.

### Refining individual MAGs: the curation step

Why do we need to curate bins? Please see this paper to see how poorly refined bins can influence ecological and evolutionary insights we gain from MAGs. There is also a step-by-step workflow to describe steps of manual curation here.

Why not relying only on single-copy core genes to estimate purity of a genome bin? Let’s think about this altogether. But once we are done with this, please see the relevant section in this study.

To straighten the quality of the MAGs collection, it is possible to visualize individual bins and if needed, refine them. For this we use the program anvi-refine. For instance, if you were to be interested in refining one of the bins in our current collection, you could run this command:

anvi-refine -p PROFILE.db \
-c CONTIGS.db \
-C MAGs \
-b IGD_MAG_00001


Now the interactive interface only displays contigs from a single bin. During this curation step, one can try different clustering strategies (i.e. by only relying on coverage, or only relying on sequence composition) to identify outliers and investigate carefully whether they may be contaminants. You can select everything, and remove those contigs you don’t want to keep in the bin before using the Bins panel to store your updated set of contigs in the database.

Here is an example of MAG we had to curate (we removed three contigs):

Storing the refined new bin in the database will modify the collection, but you will need to run anvi-summarize if you want the summary output to also be updated.

This is indeed a simple example. But in some cases refining a given MAG can take hours. But this is an extremely critical step of genome-resolved metagenomics studies, especially when important claims are made based on MAGs.

## Chapter II: Automatic Binning

Even if you prefer manual binning over automatic binning for the sake of accuracy and control over your data, automatic binning is an unavoidable need due to performance limitations associated with manual binning. The actual purpose of this chapter is to talk about advantages and disadvantages of automatic binning, by comparing multiple binning approaches on the simplest real-world gut metagenome there is: the infant gut dataset. However, while we are going through this, you will also learn about how to incorporate automatic binning results into anvi’o.

The directory additional-files/external-binning-results contains a number of files that describe the binning of contigs in the IGD based on various automatic and manual approaches. These files include (1) outputs from some of the well-known binning algorithms (i.e., GROOPM.txt, MAXBIN.txt, METABAT.txt, BINSANITY_REFINE.txt, MYCC.txt, and CONCOCT.txt), (2) the original binning of this dataset (SHARON_et_al.txt), and the manual binning we performed in the anvi’o paper (MEREN_et_al.txt).

External binning results:

The first five files are courtesy of Elaina Graham, who used GroopM (v0.3.5), MetaBat (v0.26.3), MaxBin (v2.1.1), MyCC, and BinSanity (v0.2.1) to bin the IGD. For future references, here are the parameters Elaina used for each approach:

# GroopM v0.3.5 (followed the general workflow on their manual)
groopm parse groopm.db contigs.fa [list of bam files]
groopm core groopm.db -c 1000 -s 10 -b 1000000
groopm recruit groopm.db -c 500 -s 200
groopm extract groopm.db contigs.fa

# MetaBat v0.26.3 (used jgi_summarize_bam_contig_depths to get a depth file from BAM files).
metabat -i contigs.fa -a depth.txt -o bin

# MaxBin v2.1.1
run_MaxBin.pl -contig contigs.fa -out maxbin_IGM -abund_list [list of all coverage files in associated format]

# MyCC (ran via the docker image available here: https://sourceforge.net/projects/sb2nhri/files/MyCC/)
(used jgi_summarize_bam_contig_depths to get depth file from BAM files per the authors suggestion)
MyCC.py contigs.fa 4mer -a depth.txt

# BinSanity + refinement v0.2.1
Binsanity -f . -l contigs.fa -p -10 -c igm.coverage.lognorm

## After bin inspection using CheckM & Anvi'o, INFANT-GUT-ASSEMBLY-bin_18
## was refined using the following parameters
Binsanity-refine -f . -l -p -150 -c igm.coverage.lognorm


CONCOCT results come from the CONCOCT module that was embedded within anvi’o until v6.

Eren et al. results come directly from the collection generated during the study.

Finally, a file corresponding to Sharon et al. results was created by BLAST-searching sequences in bins identified by the authors of the study (see http://ggkbase.berkeley.edu/carrol) to our contigs to have matching names for our assembly.

Now you have the background information about where these files are coming from. Moving on. But just before we continue to move on, let’s remove the default collection from our profile database, if there is one, to avoid any confusion.

You can use this command to see all collections in your anvi’o profile database:

anvi-show-collections-and-bins -p PROFILE.db


And use this one to remove one that’s called default:

anvi-delete-collection -p PROFILE.db \
-C default


### Importing an external binning result

You can create a collection by using the interactive interface (e.g., the default and MAGs collections you just created), or you can import external binning results into your profile database as a collection and see how that collection groups contigs. For instance, let’s import the CONCOCT collection:

anvi-import-collection additional-files/external-binning-results/CONCOCT.txt \
-c CONTIGS.db \
-p PROFILE.db \
-C CONCOCT \
--contigs-mode


you can immediately see what collections are available in a given profile database the following way (which in this case should show us the collections CONCOCT, and MAGs):

anvi-show-collections-and-bins -p PROFILE.db


You can get a quick idea regarding the estimated completion of bins in a given collection:

anvi-estimate-genome-completeness -p PROFILE.db \
-c CONTIGS.db \
-C CONCOCT


You can also get an idea about their taxonomy:

anvi-estimate-genome-taxonomy -p PROFILE.db \
-c CONTIGS.db \
-C CONCOCT


OK. Let’s run the interactive interface again with the CONCOCT collection:

anvi-interactive -p PROFILE.db \
-c CONTIGS.db \
--collection-autoload CONCOCT


Alternatively you could load the interface without the --collection-autoload  flag, and click Bins > Load bin collection > CONCOCT > Load to load the CONCOCT collection.

To turn off text annotation, go to Main > Display > Additional Settings > Selections and then uncheck Show names. You will then see something like this:

So this is how you load and display an external collection. So far so good.

### Comparing multiple binning approaches

Since we have all these results from different binning approaches, it clearly would have been interesting to compare them to each other (because benchmarking stuff is often very insightful). But how to do it? The simplest way to do it is to assume a ‘true organization of contigs’, and then investigate every other approach with respect to that.

Here we have multiple independent sources of information we could use. Including (1) the organization of contigs based on hierarchical clustering analysis, (2) per-contig taxonomy estimated from the gene-level taxonomic annotations by Centrifuge, and (3) results from the original publication from Sharon et al., in which authors did a very careful job to identify every genome in the dataset (even resolving the Staphylococcus pangenome, which is extremely hard for automatic binning approaches that work with a single co-assembly). So these are the things we can build upon for a modest comparison.

To include binning results in this framework, we could import each collection into the profile database the way we imported CONCOCT. But unfortunately at any given time there could only be one collection that can be displayed in the interface. Luckily there are other things we can do. For instance, as a workaround, we can merge all binning results into a single file, and use that file as an ‘additional data file’ to visualize them in the interactive interface.

Anvi’o has a script to merge multiple files for external binning results into a single merged file (don’t ask why):

anvi-script-merge-collections -c CONTIGS.db \
-i additional-files/external-binning-results/*.txt \
-o collections.tsv


If you take a look at this file, you will realize that it has a very simple format:

head collections.tsv | column -t
contig                           BINSANITY_REFINE                CONCOCT  GROOPM     MAXBIN      METABAT               MYCC              SHARON_et_al
Day17a_QCcontig1000_split_00001  INFANT-GUT-ASSEMBLY-bin_19.fna  Bin_4    db_bin_11  maxbin.008  metabat_igm.unbinned  Cluster.5.fasta   Finegoldia_magna
Day17a_QCcontig1001_split_00001  INFANT-GUT-ASSEMBLY-bin_6.fna   Bin_7    db_bin_46  maxbin.006  metabat_igm.unbinned  Cluster.3.fasta   Staphylococcus_epidermidis_virus_014
Day17a_QCcontig1002_split_00001  INFANT-GUT-ASSEMBLY-bin_19.fna  Bin_4    db_bin_11  maxbin.007  metabat_igm.unbinned  Cluster.5.fasta   Finegoldia_magna
Day17a_QCcontig1003_split_00001  INFANT-GUT-ASSEMBLY-bin_14.fna  Bin_2    db_bin_1   maxbin.009  metabat_igm.7         Cluster.12.fasta
Day17a_QCcontig1004_split_00001  INFANT-GUT-ASSEMBLY-bin_16.fna  Bin_3    db_bin_8   maxbin.008  metabat_igm.10        Cluster.14.fasta
Day17a_QCcontig1005_split_00001  INFANT-GUT-ASSEMBLY-bin_13.fna  Bin_5    db_bin_47  maxbin.007  metabat_igm.unbinned  Cluster.3.fasta   Staphylococcus_epidermidis_viruses
Day17a_QCcontig1006_split_00001  INFANT-GUT-ASSEMBLY-bin_16.fna  Bin_3    db_bin_8   maxbin.008  metabat_igm.10        Cluster.14.fasta  Leuconostoc_citreum
Day17a_QCcontig1007_split_00001  INFANT-GUT-ASSEMBLY-bin_16.fna  Bin_3    db_bin_8   maxbin.008  metabat_igm.10        Cluster.14.fasta
Day17a_QCcontig1008_split_00001  INFANT-GUT-ASSEMBLY-bin_14.fna  Bin_2    db_bin_1   maxbin.009  metabat_igm.7         Cluster.8.fasta   Candida_albcans


Good. Now you can run the interactive interface to display all collections of bins stored in collections.tsv as ‘additional layers’:

anvi-interactive -p PROFILE.db \
-c CONTIGS.db \
-A collections.tsv


Dealing with additional data tables like a pro

As you can see, -A parameter allows us to add anything to the interface as additional layers as far as the first column of that data matches to our item names. We could alternatively import this additional information into our profile database, and the way to do it is through the use of additional data tables subsystem of anvi’o. There is much more information on how to deal with additional data of all sorts here, but basically we can run the following command to import this data into our database:

anvi-import-misc-data collections.tsv \
-p PROFILE.db \
-t items


Now you can run the interactive interface without the collections.tsv, and you would get the exact same display, since the additional data now would be read from the additional data tables:

anvi-interactive -p PROFILE.db -c CONTIGS.db


But for now, we will continue without these tables, so let’s delete them from the profile database:

anvi-delete-misc-data -p PROFILE.db \
-t items \
--just-do-it


Fine.

At this point you should be seeing a display similar to this (after setting the height of each additional layer to 200px):

The legends for each of the bin collections are available in the Legends tab of Settings. To visually emphasize relationships between bins, you can change the color of each bin manually by clicking on the colored boxes in the legends. Or, if you’re not a masochist you can import an anvi’o state where we did that for you:

anvi-import-state --state additional-files/state-files/state-merged.json \
--name default \
-p PROFILE.db


and run the interactive interface again with the same command line,

anvi-interactive -p PROFILE.db \
-c CONTIGS.db \
-A collections.tsv


this time you should get this display:

So far so good?

Now we can discuss about different approaches of automatic binning.

Please note that the algorithms we have used here may have been improved since the time we did these analyses, therefore please don’t make any decisions about their performance or efficacy based on what you are seeing here.

Just a reminder, once you have the interactive interface in front of you, you can in fact investigate the taxonomy of contigs by BLASTing them against various NCBI collections using the right-click menu to have a second opinion about what do public databases think they are:

We recently have added an option to quickly run them on BIGSI. Sometimes it takes a split second, sometimes (especially when you are in France) it takes minutes. So, no promises, but try it for sure! It is an excellent algorithm.

### Manually curating automatic binning outputs

OK. Let’s assume, we didn’t see the interactive interface, and we have no idea about the dataset. We didn’t do any of the things we did up to this point. We just had profiled and merged the IGD, and we did binning of this dataset using MaxBin. Let’s start by importing MaxBin results into the profile database as a collection:

anvi-import-collection additional-files/external-binning-results/MAXBIN.txt \
-c CONTIGS.db \
-p PROFILE.db \
-C MAXBIN \
--contigs-mode


From here, there are two things we can do very quickly. First, we can create a summary of our new collection, or we can take a quick look at the completion / redundancy estimates of bins described by this collection from the command line:

anvi-estimate-genome-completeness -p PROFILE.db \
-c CONTIGS.db \
-C MAXBIN


Alternatively, we can take a quick look at the binning results by initiating the interactive interface in collection mode:

anvi-interactive -p PROFILE.db \
-c CONTIGS.db \
-C MAXBIN


This command should give you a display similar to this:

All previous interactive displays were at the contig-level (each leaf in the center tree was a contig). However, this display is at the bin-level. Instead of contigs, this display shows us the distribution of bins MaxBin identified. We also have completion and redundancy estimates for each bin, which helps us make some early sense of what is going on.

Please read this post to learn more about completion and redundancy estimates: Assessing completion and contamination of metagenome-assembled genomes

It is clear that some bins are not as well-resolved as others. For instance, bins maxbin_007 and maxbin_008 have redundancy estimates of 22% and 91%, respectively, which suggests each of them describe multiple distinct populations. Well, clearly we would have preferred those bins to behave.

If you order bins based on their detection across metagenomes (by changing the ‘Items order’ to ‘detection’ from the menu in the Main tab), you can also see that bins maxbin_007 and maxbin_008 are right next to each other. This suggests that it may be a good idea to simply merge these bins first, and then refine to avoid issues of over-splitting populations of interest. Let’s merge them into a single bin first:

anvi-merge-bins -p PROFILE.db \
--collection-name MAXBIN \
--bin-names-list "maxbin_007, maxbin_008" \
--new-bin-name maxbin_007_and_008


If you take a quick look at the collection again, you can see that the new bin maxbin_007_and_008 is the bin that needs immediate action based on single-copy core genes:

anvi-estimate-genome-completeness -p PROFILE.db \
-c CONTIGS.db \
-C MAXBIN

Bins in collection "MAXBIN"
===============================================
╒════════════════════╤══════════╤══════════════╤════════════════╤════════════════╤══════════════╤════════════════╕
│ bin name           │ domain   │   confidence │   % completion │   % redundancy │   num_splits │   total length │
╞════════════════════╪══════════╪══════════════╪════════════════╪════════════════╪══════════════╪════════════════╡
│ maxbin_001         │ BACTERIA │            1 │            100 │           4.23 │          148 │        2969341 │
├────────────────────┼──────────┼──────────────┼────────────────┼────────────────┼──────────────┼────────────────┤
│ maxbin_002         │ BACTERIA │          0.9 │          94.37 │              0 │           88 │        1801068 │
├────────────────────┼──────────┼──────────────┼────────────────┼────────────────┼──────────────┼────────────────┤
│ maxbin_003         │ BACTERIA │          0.7 │          81.69 │           2.82 │          144 │        2764617 │
├────────────────────┼──────────┼──────────────┼────────────────┼────────────────┼──────────────┼────────────────┤
│ maxbin_004         │ BACTERIA │          0.9 │          97.18 │           1.41 │          188 │        2571878 │
├────────────────────┼──────────┼──────────────┼────────────────┼────────────────┼──────────────┼────────────────┤
│ maxbin_005         │ BACTERIA │            1 │          98.59 │              0 │          151 │        2555414 │
├────────────────────┼──────────┼──────────────┼────────────────┼────────────────┼──────────────┼────────────────┤
│ maxbin_006         │ BACTERIA │          0.6 │          78.87 │           5.63 │          305 │        2901149 │
├────────────────────┼──────────┼──────────────┼────────────────┼────────────────┼──────────────┼────────────────┤
│ maxbin_009         │ EUKARYA  │          0.7 │          83.13 │           7.23 │         1379 │       13915165 │
├────────────────────┼──────────┼──────────────┼────────────────┼────────────────┼──────────────┼────────────────┤
│ maxbin_007_and_008 │ BACTERIA │            1 │          95.77 │         154.93 │         2381 │        6287535 │
╘════════════════════╧══════════╧══════════════╧════════════════╧════════════════╧══════════════╧════════════════╛


Fine. Let’s refine that bin:

anvi-refine -p PROFILE.db \
-c CONTIGS.db \
-C MAXBIN \
-b maxbin_007_and_008


Which should give us this display, on which we see the distribution of contigs that were originally binned into maxbin_007 and maxbin_008 across samples. The hierarchical clustering picked up some trends, and you can see clusters one can identify quickly:

We can now make the following selections to split these two bins into six, and update our database by storing these refined bins from the Bins panel in the interface:

If you take another look after this, you can see that maxbin_007_and_008 is replaced with smaller bins, and the total redundancy in the collection is much lower:

anvi-estimate-genome-completeness -p PROFILE.db \
-c CONTIGS.db \
-C MAXBIN

Bins in collection "MAXBIN"
===============================================
╒══════════════════════╤══════════╤══════════════╤════════════════╤════════════════╤══════════════╤════════════════╕
│ bin name             │ domain   │   confidence │   % completion │   % redundancy │   num_splits │   total length │
╞══════════════════════╪══════════╪══════════════╪════════════════╪════════════════╪══════════════╪════════════════╡
│ maxbin_001           │ BACTERIA │            1 │            100 │           4.23 │          148 │        2969341 │
├──────────────────────┼──────────┼──────────────┼────────────────┼────────────────┼──────────────┼────────────────┤
│ maxbin_002           │ BACTERIA │          0.9 │          94.37 │              0 │           88 │        1801068 │
├──────────────────────┼──────────┼──────────────┼────────────────┼────────────────┼──────────────┼────────────────┤
│ maxbin_003           │ BACTERIA │          0.7 │          81.69 │           2.82 │          144 │        2764617 │
├──────────────────────┼──────────┼──────────────┼────────────────┼────────────────┼──────────────┼────────────────┤
│ maxbin_004           │ BACTERIA │          0.9 │          97.18 │           1.41 │          188 │        2571878 │
├──────────────────────┼──────────┼──────────────┼────────────────┼────────────────┼──────────────┼────────────────┤
│ maxbin_005           │ BACTERIA │            1 │          98.59 │              0 │          151 │        2555414 │
├──────────────────────┼──────────┼──────────────┼────────────────┼────────────────┼──────────────┼────────────────┤
│ maxbin_006           │ BACTERIA │          0.6 │          78.87 │           5.63 │          305 │        2901149 │
├──────────────────────┼──────────┼──────────────┼────────────────┼────────────────┼──────────────┼────────────────┤
│ maxbin_009           │ EUKARYA  │          0.7 │          83.13 │           7.23 │         1379 │       13915165 │
├──────────────────────┼──────────┼──────────────┼────────────────┼────────────────┼──────────────┼────────────────┤
│ maxbin_007_and_008_1 │ BACTERIA │          0.4 │          38.03 │           2.82 │          417 │         696607 │
├──────────────────────┼──────────┼──────────────┼────────────────┼────────────────┼──────────────┼────────────────┤
│ maxbin_007_and_008_2 │ BACTERIA │          0.6 │          47.89 │              0 │          369 │         753160 │
├──────────────────────┼──────────┼──────────────┼────────────────┼────────────────┼──────────────┼────────────────┤
│ maxbin_007_and_008_3 │ BLANK    │            1 │              0 │              0 │          206 │         290126 │
├──────────────────────┼──────────┼──────────────┼────────────────┼────────────────┼──────────────┼────────────────┤
│ maxbin_007_and_008_4 │ BACTERIA │          0.3 │          32.39 │           5.63 │          488 │         810476 │
├──────────────────────┼──────────┼──────────────┼────────────────┼────────────────┼──────────────┼────────────────┤
│ maxbin_007_and_008_5 │ BACTERIA │          0.3 │          38.03 │           2.82 │          247 │        2333489 │
├──────────────────────┼──────────┼──────────────┼────────────────┼────────────────┼──────────────┼────────────────┤
│ maxbin_007_and_008_6 │ BACTERIA │          0.8 │          71.83 │           2.82 │          556 │        1237568 │
╘══════════════════════╧══════════╧══════════════╧════════════════╧════════════════╧══════════════╧════════════════╛


The take home message here is that even when automatic binning approaches yield poorly identified bins, it is possible to improve the final results through a manual refinement step. Clearly these extra steps require a lot of expertise, intuition, attention, and decision making. And fortunately you are all familiar with each one of them because science.

Thank you for following the tutorial this far!

Show/hide More on refinement

You can read more about anvi-refine here. Also you may want to look at Tom’s refining of the Loki archaea: Inspecting the genomic link between Archaea and Eukaryota.

If you are feeling lazy, you can just take a quick look at this videos from the post above.

First a closer look at Lokiarchaeum sp. GC14_75

And then curating it:

You should always double-check your metagenome-assembled genomes.

Show/hide Meren's two cents on binning

Binning is inherently a very challenging task.

In most cases it is absolutely doable, especially when there is a decent assembly, but it is very challenging.

The IGD is one of the most friendly metagenomic datasets available to play with (since an astonishing fraction of nucleotides map back to the assembly), and it comes from a well-implemented experimental design (because that’s what Banfield group does). Yet, you now have seen the extent of disagreement between multiple binning approaches even for this dataset.

You should reming yourself that each of these approaches are implemented by people who are well-trained scientists working with groups of people who are experts in their fields. These tools are benchmarked against others and showed improvements. So each one of them provides the best result compared to all others in at least one metagenomic dataset. I think understanding what this means is important. There is no silver bullet in the common bioinformatics toolkit that will take care of every dataset when you fire it. In fact, depending on the dataset, even the best tools we have may be as efficient as sticks and stones against the Death Star. Computational people are working very hard to improve things, but they would be the first ones to suggest that their tools should never make users feel free from the fact that it is their own responsibility to make sure the results are meaningful and appropriate.

So which one to choose? How to get out of this situation easily and move on? I know how much desire there is to outsource everything we do to fully automated computational solutions. I also acknowledge that the ability to do that is important to perform large-scale and reproducible analyses without going through too much pain. But we are not at a stage yet with metagenomics where you can rely on any of the available automated binning tools, and expect your MAGs to be safe and sound.

For instance, I think CONCOCT is doing a pretty awesome job identifying MAGs in the IGD, even with the low-abundance organisms. However, it is not perfect, either. In fact if you look carefully, you can see that it creates two bins for one Candida albicans genome. Hierarchical clustering will always get you closest to the best organization of contigs with simple distance metrics and linkage algorithms. But there are major challenges associated with that approach, including the fact that it is simply an exploratory method and can’t give you “bins” out-of-the-box. Even more importantly, it has tremendous limitations come from its computational complexity (~O(m2 log m), where m is the number of data points). So in most cases it is not even a remote possibility to organize contigs using a hierarchical clustering approach in an assembly in reasonable amount of time (and there is no way to visualize that even if you were to get a dendrogram for 200,000 contigs (you can create simple 2D ordinations with that number of items, but you really shouldn’t, but that’s another discussion)). Except assemblies with rather smaller number of contigs like the IGD, we are always going to use automated ways to identify bins, at least initially, knowing that resulting bins may be, and in most cases will be, crappy. That’s why in anvi’o we implemented ways to quickly look into automatically identified bins (i.e., the collection mode of anvi-interactive), and even refine those with poor redundancy scores to improve final results (i.e., anvi-refine).

So we can fix crappy bins to an extent since we know more or less how things should look like, and we have tools to do that. That being said, there is one more consideration that is very easy to miss. Although it is somewhat possible to recover from conflation error (i.e., more than one genome ends up in one bin), it is much harder to recover from the fragmentation error (i.e., one genome is split into multiple bins). You can see an example for fragmentation error if you take a careful look from this figure (i.e., CONCOCT bins between 9:30 and 12:00 o’clock, or MaxBin bins between 5:00 and 7:00 o’clock):

This is a problem that likely happens quite often, and very hard to deal with once the bins are identified. But we can recover from that.

#### From fragmentation to conflation error: A Meren Lab Heuristic to fight back

One of the heuristics we recently started using in our lab to avoid fragmentation error is to confine CONCOCT’s clustering space to a much smaller number of clusters than the expected number of bacterial genomes in a given dataset, and then curate resulting contaminated bins manually. Let’s say we expect to find n bacterial genomes, so we run CONCOCT with a maximum number of clusters of about n/2 (no judging! I told you it was a heuristic!).

Well, how do you even know how many bacterial genomes you should expect to find in a metagenome?
You

Thanks for the great question. Although this may sound like a challenging problem to some, we have a very simple way to resolve it (which I described in this blog post). If you still have access to the IGD, you can run this simple command:

anvi-display-contigs-stats CONTIGS.db


If you take a look at the resulting interactive graph, you can see that one should expect to find about 10 near-complete genomes in this dataset:

We have a citable version, and a more formal description of this workflow in our recent paper “Identifying contamination with advanced visualization and analysis practices: metagenomic approaches for eukaryotic genome assemblies” (see the supplementary material).

Fine. Using anvi-cluster-with-concoct program, we ask CONCOCT to naively identify 5 clusters in this dataset, and store the results in the profile database as a collection:

anvi-cluster-with-concoct -p PROFILE.db \
-c CONTIGS.db \
--num-clusters 5 \
-C CONCOCT_C5


Now you can run the interface again,

anvi-interactive -p PROFILE.db \
-c CONTIGS.db \
--title 'Infant Gut Time Series by Sharon et al. [w/ 5 CONCOCT clusters]' \
--collection-autoload CONCOCT_C5


and you would see this:

Well, there aren’t any fragmentation errors anymore, and in fact CONCOCT did an amazing job to identify general patterns in the dataset. Now refining these bins to fix all the conflation errors would be much more easier. If you would like to try, here is an example:

anvi-refine -p PROFILE.db \
-c CONTIGS.db \
-C CONCOCT_C5 \
-b Bin_1


There are more ways to improve bins and binning results. But although we have seen major improvements in our research by exploring these directions, there are also many other cases nothing is quite enough.

Then it is time to increase the depth of sequencing, implement a different assembly strategy, rethink the sampling strategy, or change the experimental approach to do what seems to be undoable. Here is an example from Tom Delmont et al. to that last point with soil metagenomics: doi:10.3389/fmicb.2015.00358.

We all just have to continue working, and enjoy this revolution.

## Chapter III: Phylogenomics

This is more of a practical tutorial to do phylogenomic analyses on metagenome-assembled genomes described in anvi’o collections. For a more abstract tutorial on phylogenomics, please consider first reading ‘An anvi’o workflow for phylogenomics’.

To see a practical application of phylogenomics see this workflow’.

If you haven’t followed the previous sections of the tutorial, you will need the anvi’o merged profile database and the anvi’o contigs database for the IGD available to you. Before you continue, please click here, do everything mentioned there, and come back right here to continue following the tutorial from the next line when you read the directive go back.

Please run the following command in the IGD dir, so you have everything you need. We will simply import our previously generated collection of bins in the IGD dataset as the default collection:

anvi-import-collection additional-files/collections/merens.txt \
--bins-info additional-files/collections/merens-info.txt \
-p PROFILE.db \
-c CONTIGS.db \
-C default


At this point, you have in your anvi’o profile database a collection with multiple bins:

anvi-show-collections-and-bins -p PROFILE.db

Collection: "default"
===============================================
Collection ID ................................: default
Number of bins ...............................: 13
Number of splits described ...................: 4,451
Bin names ....................................: Aneorococcus_sp, C_albicans, E_facealis, F_magna, L_citreum, P_acnes, P_avidum, P_rhinitidis, S_aureus, S_epidermidis, S_hominis, S_lugdunensis, Streptococcus


Putting genomes in a phylogenomic context is one of the common ways to compare them to each other. The common practice is to concatenate aligned sequences of single-copy core genes for each genome of interest, and generate a phylogenomic tree by analyzing the resulting alignment.

Let’s assume we want to run a phylogenomic analysis on all genome bins we have in the collection merens in the IGD (you may have your own collections somewhere, that is fine too).

In order to do the phylogenomic analysis, we will need a FASTA file of concatenated genes. And to get that FASTA file out of our anvi’o databases, we will primarily use the program anvi-get-sequences-for-hmm-hits.

### Selecting genes from an HMM Profile

We first need to identify an HMM profile, and then select some gene names from this profile to play with.

Going back to the IGD, let’s start by looking at what HMM profiles available to us:

anvi-get-sequences-for-hmm-hits -c CONTIGS.db \
-p PROFILE.db \
-o seqs-for-phylogenomics.fa \
--list-hmm-sources

* Bacteria_71 [type: singlecopy] [num genes: 71]
* Archaea_76 [type: singlecopy] [num genes: 76]
* Protista_83 [type: singlecopy] [num genes: 83]
* Ribosomal_RNAs [type: Ribosomal_RNAs] [num genes: 12]


As you know, you can use anvi-run-hmms program with custom made HMM profiles to add your own HMMs into the contigs database.

Alright. We have two. Let’s see what genes do we have in Bacteria_71:

anvi-get-sequences-for-hmm-hits -c CONTIGS.db \
-p PROFILE.db \
-o seqs-for-phylogenomics.fa \
--hmm-source Bacteria_71 \
--list-available-gene-names

* Bacteria_71 [type: singlecopy]: ADK, AICARFT_IMPCHas, ATP-synt, ATP-synt_A,
Chorismate_synt, EF_TS, Exonuc_VII_L, GrpE, Ham1p_like, IPPT, OSCP, PGK,
Pept_tRNA_hydro, RBFA, RNA_pol_L, RNA_pol_Rpb6, RRF, RecO_C, Ribonuclease_P,
Ribosom_S12_S23, Ribosomal_L1, Ribosomal_L13, Ribosomal_L14, Ribosomal_L16,
Ribosomal_L17, Ribosomal_L18p, Ribosomal_L19, Ribosomal_L2, Ribosomal_L20,
Ribosomal_L21p, Ribosomal_L22, Ribosomal_L23, Ribosomal_L27, Ribosomal_L27A,
Ribosomal_L28, Ribosomal_L29, Ribosomal_L3, Ribosomal_L32p, Ribosomal_L35p,
Ribosomal_L4, Ribosomal_L5, Ribosomal_L6, Ribosomal_L9_C, Ribosomal_S10,
Ribosomal_S11, Ribosomal_S13, Ribosomal_S15, Ribosomal_S16, Ribosomal_S17,
Ribosomal_S19, Ribosomal_S2, Ribosomal_S20p, Ribosomal_S3_C, Ribosomal_S6,
Ribosomal_S7, Ribosomal_S8, Ribosomal_S9, RsfS, RuvX, SecE, SecG, SecY, SmpB,
TsaE, UPF0054, YajC, eIF-1a, ribosomal_L24, tRNA-synt_1d, tRNA_m1G_MT,
Adenylsucc_synt


OK. A lot. Good for you, Bacteria_71.

Bacteria_71 is a collection that anvi’o developers curated by taking Mike Lee’s bacterial single-copy core gene collection first released in GToTree, which is an easy-to-use phylogenomics workflow.

For the sake of this simple example, let’s assume we want to use a bunch of ribosomal genes for our phylogenomic analysis: Ribosomal_L1, Ribosomal_L2, Ribosomal_L3, Ribosomal_L4, Ribosomal_L5, Ribosomal_L6.

OK. The following command will give us all these genes from all bins described in the collection default:

anvi-get-sequences-for-hmm-hits -c CONTIGS.db \
-p PROFILE.db \
-o seqs-for-phylogenomics.fa \
--hmm-source Bacteria_71 \
-C default \
--gene-names Ribosomal_L1,Ribosomal_L2,Ribosomal_L3,Ribosomal_L4,Ribosomal_L5,Ribosomal_L6

Init .........................................: 4451 splits in 13 bin(s)
Hits .........................................: 1358 hits for 1 source(s)
Filtered hits ................................: 65 hits remain after filtering for 6 gene(s)
Mode .........................................: DNA seqeunces
Genes are concatenated .......................: False
Output .......................................: seqs-for-phylogenomics.fa


If you look at the resulting FASTA file, you will realize that it doesn’t look like an alignment (I trimmed the output, you will see the full output when you look at the file yourself):

less seqs-for-phylogenomics.fa
>Ribosomal_L1___Bacteria_71___9a3cc bin_id:E_facealis|source:Bacteria_71|e_value:8.4e-59|contig:Day17a_QCcontig1|gene_callers_id:116|start:106378|stop:107068|length:690
ATGGCTAAAAAGAGCAAAAAAATGCAAGAAGCATTGAAAAAAGTTGATGCTACAAAAGCTTACTCAGTTGAAGAAGCAGTAGCTTTAGCAAAAGATACAAACATCGCGAAATTTGACGCA
(...)
>Ribosomal_L6___Bacteria_71___05162 bin_id:S_epidermidis|source:Bacteria_71|e_value:1.1e-43|contig:Day17a_QCcontig7|gene_callers_id:2325|start:13243|stop:13780|length:537
ATGAGTCGTGTTGGTAAGAAAATTATTGACATTCCTAGTGACGTAACAGTAACTTTTGACGGAAGTCATGTCACTGTAAAAGGTCCAAAAGGTGAATTAGAAAGAACTTTAAATGAAAGA
(...)
>Ribosomal_L5___Bacteria_71___ec17f bin_id:S_epidermidis|source:Bacteria_71|e_value:1.8e-31|contig:Day17a_QCcontig7|gene_callers_id:2328|start:14441|stop:14981|length:540
TTGAACCGTTTAAAAGAAAAATTTAATACAGAAGTTACTGAAAACTTAGTGAAAAAATTCAATTATAGTTCAGTGATGGAAGTACCAAAAATTGAGAAAATCGTTGTGAATATGGGTGTA
(...)


To exit less mode, press q.

Every sequence for every HMM hit is for itself :/ Hmm.

Concatenated we stand divided we fall.

### Concatenating genes

Although you can see how and why the previous output could be very useful for many other purposes, it is kinda useless for a phylogenomic analysis since we need a single concatenated alignment of gene sequences per genome.

If you look at the help menu of anvi-get-sequences-for-hmm-hits, you will see that there is a flag, --concatenate-genes, to get your genes of interest to be concatenated. Let’s do that:

anvi-get-sequences-for-hmm-hits -c CONTIGS.db \
-p PROFILE.db \
-o seqs-for-phylogenomics.fa \
--hmm-source Bacteria_71 \
-C default \
--gene-names Ribosomal_L1,Ribosomal_L2,Ribosomal_L3,Ribosomal_L4,Ribosomal_L5,Ribosomal_L6 \
--concatenate-genes

Config Error: If you want your genes to be concatenated into a multi-alignment file, you must
also ask for the best hit (using the --return-best-hit) flag to avoid issues
if there are more than one hit for a gene in a given genome. Anvi'o could have
set this flag on your behalf, but it just is not that kind of a platform :/


Well. That didn’t go well.

The reason why it didn’t go so well is because even in the most complete genomes, there may be multiple HMM hits for a given ‘single-copy gene’. Here is an evidence for that coming from gold standard genomes for skeptics. As a solution to this problem, anvi’o asks you to use the --return-best-hit flag, which will return the most significant HMM hit if there are more than one gene that matches to the HMM of a given gene in a given genome. Fine. Let’s do that, then:

anvi-get-sequences-for-hmm-hits -c CONTIGS.db \
-p PROFILE.db \
-o seqs-for-phylogenomics.fa \
--hmm-source Bacteria_71 \
-C default \
--gene-names Ribosomal_L1,Ribosomal_L2,Ribosomal_L3,Ribosomal_L4,Ribosomal_L5,Ribosomal_L6 \
--concatenate-genes \
--return-best-hit


If you take a look at the output file, you can see that we are getting somewhere.

But the output is in DNA alphabet, which may not be the best option for phylogenomic analyses, especially if the genomes you have are coming from distant clades (which happens to be the case for IGD). Fortunately, you can easily switch to AA alphabet with an additional flag --get-aa-sequences (and no, there is no end to anvi’o flags, and the earlier you start getting used to the idea of reading those help menus, the sooner you will master your anvi’o game).

anvi-get-sequences-for-hmm-hits -c CONTIGS.db \
-p PROFILE.db \
-o seqs-for-phylogenomics.fa \
--hmm-source Bacteria_71 \
-C default \
--gene-names Ribosomal_L1,Ribosomal_L2,Ribosomal_L3,Ribosomal_L4,Ribosomal_L5,Ribosomal_L6 \
--concatenate-genes \
--return-best-hit \
--get-aa-sequences


If you look at the resulting file again, you will see how everything looks just so lovely. Congratulations. You did it.

### Computing the phylogenomic tree

Once you have your concatenated genes, which you now have them in seqs-for-phylogenomics.fa if you followed the previous section, it is time to perform the phylogenomic analysis.

There are multiple ways to do this. Here we will use the program anvi-gen-phylogenomic-tree, which accepts a FASTA file and uses one of the programs it knows about to compute the tree. Currently the only option is FastTree, which infers approximately-maximum-likelihood phylogenetic trees from FASTA files that look like yours. Send us your favorite, and we will happily consider expanding the collection of available tools for this analysis.

Computing a phylogenomic tree from our FASTA file is as simple as this:

anvi-gen-phylogenomic-tree -f seqs-for-phylogenomics.fa \
-o phylogenomic-tree.txt


If anvi’o complains that you don’t have FastTree, you need to do two things. First, feel ashamed. Second, get FastTree: click me. (If you are using a mac and have brew, quietly type brew install fasttree and pretend it never happened).

The resulting file phylogenomic-tree.txt is a proper newick tree. If you want, you can visualize it immediately with anvi-interactive in manual mode:

anvi-interactive --tree phylogenomic-tree.txt \
-p temp-profile.db \
--title "Pylogenomics of IGD Bins" \
--manual


Which should give you this (after clicking Draw):

You can replace the colors with the bin names by selecting Text from Main > Layers > bin_name and re-clicking Draw:

We can do much more with this phylogenomic tree of our bins than visualizing it in manual mode.

For instance, we could use it immediately to organize our bins in our collection while showing their distribution across samples.

anvi-interactive -p PROFILE.db \
-c CONTIGS.db \
-C default \
--tree phylogenomic-tree.txt


Which would give you the following display after selecting the ‘Phylogenomic tree’ from the ‘orders’ combo box in the ‘Settings’ tab.

The tree in the middle shows the phylogenomic organization of bins we identified in the IGD.

Now you know how to organize distantly related genomes using universally conserved genes.

## Chapter IV: Pangenomics

Both phylogenomics and pangenomics are strategies under the umbrella of comparative genomics, and they are inherently very similar despite their key differences. In this chapter we will discuss pangenomics and use anvi’o to have a small pangenomic analysis using our famous E. faecalis bin we recovered from the infant gut dataset and a bunch of others from the interwebs.

You can find a comprehensive tutorial on the anvi’o pangenomic workflow here.

If you haven’t followed the previous sections of the tutorial, you will need the anvi’o merged profile database and the anvi’o contigs database for the IGD available to you. Before you continue, please click here, do everything mentioned there, and come back right here to continue following the tutorial from the next line when you read the directive go back.

Please run following commands in the IGD dir. They will set the stage for us to take a look at the E. faecalis bin:

anvi-import-collection additional-files/collections/e-faecalis.txt \
--bins-info additional-files/collections/e-faecalis-info.txt \
-p PROFILE.db \
-c CONTIGS.db \
-C E_faecalis


### Generating a genome storage

For this example I downloaded 6 E. faecalis, and 5 E. faecium genomes to analyze them together with our E. faecalis bin. For each of these 11 external genomes, I generated anvi’o contigs databases. You can find all of them in the additional files directory:

ls additional-files/pangenomics/external-genomes/*db
additional-files/pangenomics/external-genomes/Enterococcus_faecalis_6240.db
additional-files/pangenomics/external-genomes/Enterococcus_faecalis_6250.db
additional-files/pangenomics/external-genomes/Enterococcus_faecalis_6255.db
additional-files/pangenomics/external-genomes/Enterococcus_faecalis_6512.db
additional-files/pangenomics/external-genomes/Enterococcus_faecalis_6557.db
additional-files/pangenomics/external-genomes/Enterococcus_faecalis_6563.db
additional-files/pangenomics/external-genomes/Enterococcus_faecium_6589.db
additional-files/pangenomics/external-genomes/Enterococcus_faecium_6590.db
additional-files/pangenomics/external-genomes/Enterococcus_faecium_6601.db
additional-files/pangenomics/external-genomes/Enterococcus_faecium_6778.db
additional-files/pangenomics/external-genomes/Enterococcus_faecium_6798.db


The post Accessing and including NCBI genomes in ‘omics analyses in anvi’o explains how to download sets of genomes you are interested in from the NCBI and turn them into anvi’o contigs databases.

There also are two files in the additional-files/pangenomics directory to describe how to access to the external genomes:

name contigs_db_path
E_faecalis_6240 external-genomes/Enterococcus_faecalis_6240.db
E_faecalis_6250 external-genomes/Enterococcus_faecalis_6250.db
E_faecalis_6255 external-genomes/Enterococcus_faecalis_6255.db
E_faecalis_6512 external-genomes/Enterococcus_faecalis_6512.db
E_faecalis_6557 external-genomes/Enterococcus_faecalis_6557.db
E_faecalis_6563 external-genomes/Enterococcus_faecalis_6563.db
E_faecium_6589 external-genomes/Enterococcus_faecium_6589.db
E_faecium_6590 external-genomes/Enterococcus_faecium_6590.db
E_faecium_6601 external-genomes/Enterococcus_faecium_6601.db
E_faecium_6778 external-genomes/Enterococcus_faecium_6778.db
E_faecium_6798 external-genomes/Enterococcus_faecium_6798.db

and the internal one:

name bin_id collection_id profile_db_path contigs_db_path
E_faecalis_SHARON E_faecalis E_faecalis ../../PROFILE.db ../../CONTIGS.db

It is this simple to combine MAGs and isolates.

So everything is ready for an analysis, and the first step in the pangenomic workflow is to generate an anvi’o genomes storage.

anvi-gen-genomes-storage -i additional-files/pangenomics/internal-genomes.txt \
-e additional-files/pangenomics/external-genomes.txt \
-o Enterococcus-GENOMES.db


### Computing and visualizing the pangenome

Now we have the genomes storage, we can characterize the pangenome:

anvi-pan-genome -g Enterococcus-GENOMES.db \
-n Enterococcus \
-o PAN \
--num-threads 10


Now the pangenome is ready to display. This is how you can run it:

anvi-display-pan -g Enterococcus-GENOMES.db \
-p PAN/Enterococcus-PAN.db \
--title "Enterococccus Pan"


To get this ugly looking display:

Never settle for bad-looking figures

Well, you can do whatever you want, but just take a look at this anvi’o figure:

This is a pangenomic analysis visualized with anvi’o by Zhemin Zhou and his colleagues. Here is more information about this excellent study: An 800-years-old Salmonella genome, and a lovely anvi’o figure.

I am not arguing that every figure should look like that one, but I would like you to consider the fact that if you would like to do better, there are ways. Here is a post about working with SVG files anvi’o generate. Finally, please visit this reproducible workflow to see a step-by-step for a more humanly-possible prettification of a pangenome:

OK. I made the previous display a bit prettier for you. If you kill the server, and import the state file the following way, and re-run the server,

anvi-import-state -p PAN/Enterococcus-PAN.db \
--state additional-files/state-files/state-pan.json \
--name default

anvi-display-pan -g Enterococcus-GENOMES.db \
-p PAN/Enterococcus-PAN.db \
--title "Enterococccus Pan"


It will look much more reasonable:

Now not only can we see how our E. faecalis genome looks like compared to available genomes, we can also see that it is not missing or carrying a great number of proteins compared to other genomes. The clustering of genomes based on gene clusters indicate that it is most similar to the genome Enterococcus faecalis 6512, which, according to the 00_INFO_ABOUT_EXTERNAL_GENOMES.txt under additional-files/pangenomics/external-genomes directory, corresponds to the assembly ID ASM17257v2 if you were to be interested in exploring further.

### Adding average nucleotide identity

The pangenome tells us about the similarities and dissimilarities between those genomes given the amino acid sequences of open reading frames we identified within each one of them. We could also compare them to each other by computing the average nucleotide identity between them. Anvi’o comes with a program to do that, which uses PyANI by Pritchard et al. to pairwise align DNA sequences between genomes to estimate similarities between them. See this tutorial for details.

We can compute average nucleotide identity among our genomes, and add them to the pan database the following way:

anvi-compute-genome-similarity -e additional-files/pangenomics/external-genomes.txt \
-i additional-files/pangenomics/internal-genomes.txt \
--program pyANI \
-o ANI \
-T 6 \
--pan-db PAN/Enterococcus-PAN.db


For people in a rush

Computing ANI can take a long time especially if you don’t have many threads to share. Your data-pack contains the ANI percent identity file that emerges from the previous command line that you can import into your pan database if you don’t want to run the command and move on:

anvi-import-misc-data additional-files/pangenomics/ANI_percentage_identity.txt \
-p PAN/Enterococcus-PAN.db \
--target-data-table layers \
--target-data-group ANI_percentage_identity


If you visualize your pangenome again once you have your ANI results in your database,

anvi-display-pan -g Enterococcus-GENOMES.db \
-p PAN/Enterococcus-PAN.db \
--title "Enterococccus Pan"


And take a careful look at your ‘Layers’ tab, you can see visualize your pangenome with the new ANI results:

### Organizing gene clusters based on a forced synteny

By default, anvi’o will organize gene clusters in a given pangenome based on their distribution patterns across genomes. Which, by definition, will break gene synteny, i.e., two gene clusters that are next to each other in the display will not necessarily carry genes that are next to each other in respective genomes. If you have a relatively complete genome, you can ask anvi’o to organize gene clusters based on the organization of genes in that genome.

Here, for instance, we can force the organization of gene clusters based on the synteny of genes in our E. faecalis genome:

And here we go:

Please take a look at this study and its reproducible workflow if you would like to see a useful practical application of this strategy by Julie Reveillaud.

### Functional enrichment analyses

One of the things we often are interested in is this question: which functions are associated with a particular organization of genomes due to their phylogenomic or pangenomic characteristics.

For instance, genomes in this pangenome are organized into two distinct groups based on differential distribution of gene clusters. That is of course not surprising, since these genomes are classified into two distinct ‘species’ within the genus Enterococcus. So you can certainly imagine more appropriate or interesting examples where you may be wondering about functional enrichment across groups of genomes that do not have such clear distinctions at the level of taxonomy. The actual tutorial to make sense of functions in pangenomes is here. Here we will make a quick pass to demonstrate its relevance.

First, you need to define how you would like to group your genomes in the pangenome so anvi’o can find out which functions are characteristic of each group. Here is our TAB-delimited file to divide these genomes into two groups:

|name|clade|
|:--|:--:|
|E_faecalis_6240|faecalis|
|E_faecalis_6250|faecalis|
|E_faecalis_6255|faecalis|
|E_faecalis_6512|faecalis|
|E_faecalis_6557|faecalis|
|E_faecalis_6563|faecalis|
|E_faecalis_SHARON|faecalis|
|E_faecium_6589|faecium|
|E_faecium_6590|faecium|
|E_faecium_6601|faecium|
|E_faecium_6778|faecium|
|E_faecium_6798|faecium|


Which is also in your data-pack. Let’s import it into the pan database as an additional layer data:

anvi-import-misc-data -p PAN/Enterococcus-PAN.db \
--target-data-table layers \
additional-files/pangenomics/additional-layers-data.txt


Once it is imported, we can re-run the pangenome,

anvi-display-pan -g Enterococcus-GENOMES.db \
-p PAN/Enterococcus-PAN.db \
--title "Enterococccus Pan"


and see the new layer there that correspond to our clades:

Now we can ask anvi’o to identify and report functions that are enriched in either of these clades along with the gene clusters they are associated with:

anvi-get-enriched-functions-per-pan-group -p PAN/Enterococcus-PAN.db \
-g Enterococcus-GENOMES.db \
--category clade \
--annotation-source COG_FUNCTION \
-o functional-enrichment.txt


Which would generate a new file, functional-enrichment.txt, in our work directory that is just filled with stuff like this:

COG_FUNCTION enrichment_score unadjusted_p_value adjusted_q_value associated_groups function_accession gene_clusters_ids p_faecium p_faecalis N_faecium N_faecalis
Alpha-galactosidase 12.00000001 5.32E-04 0.003981046 faecium COG3345 GC_00002469 1 0 5 7
Zinc transporter ZupT 12.00000001 5.32E-04 0.003981046 faecium COG0428 GC_00002604 1 0 5 7
Phenolic acid decarboxylase 12.00000001 5.32E-04 0.003981046 faecium COG3479 GC_00002605 1 0 5 7
(…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…)
Molybdenum cofactor biosynthesis enzyme 12.00000001 5.32E-04 0.003981046 faecalis COG0315 GC_00002162 0 1 5 7
Propanediol utilization protein 12.00000001 5.32E-04 0.003981046 faecalis COG4869 GC_00001620 0 1 5 7
Killer toxin insensitivity 12.00000001 5.32E-04 0.003981046 faecalis COG0251 GC_00001637 0 1 5 7
(…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…)
Rhodanese-related sulfurtransferase -7.70E-42 1 1 NA COG0607 GC_00000313 1 1 5 7
Serine/threonine protein phosphatase PrpC -7.70E-42 1 1 NA COG0631 GC_00000572 1 1 5 7
Tryptophan-rich sensory protein (mitochondrial benzodiazepine receptor homolog) -7.70E-42 1 1 NA COG3476 GC_00000886 1 1 5 7
(…) (…) (…) (…) (…) (…) (…) (…) (…) (…) (…)

So it turns out killer toxin insensitivity protein is only encoded by E. faecalis… I am not surprised. One should expect anything from these microbes :(

### Binning gene clusters

There are multiple ways to identify gene clusters that match to a given set of criteria. If you like, you can use a combinations of filters that are available through the interface:

The command line program anvi-get-sequences-for-gene-clusters can also give you access to these filters and more to get very precise reports. Another option is the good’ol interactive interface, and using the dendrogram it produces to organize gene clusters based on their distribution across genomes. From this display you can make manual selections of gene clusters. I already made some selections and stored them in a file for your convenience. If you import them the following way,

anvi-import-collection additional-files/pangenomics/pan-collection.txt \
--bins-info additional-files/pangenomics/pan-collection-info.txt \
-p PAN/Enterococcus-PAN.db \
-C default


and re-run the interactive interface,

anvi-display-pan -g Enterococcus-GENOMES.db \
-p PAN/Enterococcus-PAN.db \
--title "Enterococccus Pan"


you will see the following selections:

We used collections to store bins of contigs in the first section (and that’s how we identified that E. faecalis population from the Sharon et al. metagenomes anyway), and now the same concept serves us as a way to store bins of gene clusters.

If you right-click on any of the gene clusters, you will see a menu,

which, among other things, will give you access to the inspection page:

Inspecting gene clusters can be a lot of fun and very depressing at the same time. But of course it is hard to get a comprehensive understanding by going through things one by one. But can we summarize the pangenome?

Yes we can!

### Summarizing a pangenome

For instance, in my tentative selection above, there is a bin called CORE ALL, which describes all gene clusters that seems to be in all genomes in this analysis. You can in fact summarize the collection default to access all the information about each gene described in each gene cluster selected as CORE ALL.

You can summarize the pangenome using the collection we have the following way:

anvi-summarize -p PAN/Enterococcus-PAN.db \
-g Enterococcus-GENOMES.db \
-C default \
-o PAN_SUMMARY


If you run the the following command on a Mac system (or simply open the index.html file in your browser),

open PAN_SUMMARY/index.html


You will see a web page with some information about your pangenome:

The most important part of this output is this one:

You can unzip this file,

gzip -d PAN_SUMMARY/Enterococcus_protein_clusters_summary.txt.gz


And play with it to see how it will solve all your problems. You can import it into R, or open it in EXCEL to have a quick look at its contents. But here is a quick look at the first 10 lines of this file that contains 35,175 gene entries:

unique_id protein_cluster_id bin_name genome_name gene_callers_id COG_CATEGORY_ACC COG_CATEGORY COG_FUNCTION_ACC COG_FUNCTION aa_sequence
1 PC_00000001   E_faecalis_6240 26 X X COG2826 Transposase and inactivated derivatives IS30 family MTYKHLTIDELTMIESYYLQHNKPVEIANRMGRAIQTIYNVVNKFKQGKTALDYWHQYKENKKKCGRKVIQLPAHEVDYIKEKVTLGWTPDVIIGRKERPVSCGMRTLYRLFSKGIFDIDTLPMKGKRKPNGHQEKRGKQQYQRSIHDRPDNYPDFNSEFGHLEGDTIVGIHHKSAVITLVERLSKVIITIKPNGRKALDIETALNQWFSRFPKNFFKSITFDCGKEFSNWKAISNQHDIDIYFADPGTPSQRPLNENSNGILRRNGLPKSMDFREVNQTFISSVSNQRNHIPRKSLNYRTPIEIFLSYVQEAFYSNLI
2 PC_00000001   E_faecalis_6240 615 X X COG2826 Transposase and inactivated derivatives IS30 family MTYKHLTIDELTMIESYYLQHNKPVEIANRMGRAIQTIYNVVNKFKQGKTALDYWHQYKENKKKCGRKVIQLPAHEVDYIKEKVTLGWTPDVIIGRKERPVSCGMRTLYRLFSKGIFDIDTLPMKGKRKPNGHQEKRGKQQYQRSIHDRPDNYPDFNSEFGHLEGDTIVGIHHKSAVITLVERLSKVIITIKPNGRKALDIETALNQWFSRFPKNFFKSITFDCGKEFSNWKAISNQHDIDIYFADPGTPSQRPLNENSNGILRRNGLPKSMDFREVNQTFISSVSNQRNHIPRKSLNYRTPIEIFLSYVQEAFYSNLI
3 PC_00000001   E_faecalis_6240 2032 X X COG2826 Transposase and inactivated derivatives IS30 family MTYKHLTIDELTMIESYYLQHNKPVEIANRMGRAIQTIYNVVNKFKQGKTALDYWHQYKENKKKCGRKVIQLPAHEVDYIKEKVTLGWTPDVIIGRKERPVSCGMRTLYRLFSKGIFDIDTLPMKGKRKPNGHQEKRGKQQYQRSIHDRPDNYPDFNSEFGHLEGDTIVGIHHKSAVITLVERLSKVIITIKPNGRKALDIETALNQWFSRFPKNFFKSITFDCGKEFSNWKAISNQHDIDIYFADPGTPSQRPLNENSNGILRRNGLPKSMDFREVNQTFISSVSNQRNHIPRKSLNYRTPIEIFLSYVQEAFYSNLI
4 PC_00000001   E_faecalis_6240 2200 X X COG2826 Transposase and inactivated derivatives IS30 family MTYKHLTIDELTMIESYYLQHNKPVEIANRMGRAIQTIYNVVNKFKQGKTALDYWHQYKENKKKCGRKVIQLPAHEVDYIKEKVTLGWTPDVIIGRKERPVSCGMRTLYRLFSKGIFDIDTLPMKGKRKPNGHQEKRGKQQYQRSIHDRPDNYPDFNSEFGHLEGDTIVGIHHKSAVITLVERLSKVIITIKPNGRKALDIETALNQWFSRFPKNFFKSITFDCGKEFSNWKAISNQHDIDIYFADPGTPSQRPLNENSNGILRRNGLPKSMDFREVNQTFISSVSNQRNHIPRKSLNYRTPIEIFLSYVQEAFYSNLI
5 PC_00000001   E_faecalis_6240 2747 X X COG2826 Transposase and inactivated derivatives IS30 family MTYKHLTIDELTMIESYYLQHNKPVEIANRMGRAIQTIYNVVNKFKQGKTALDYWHQYKENKKKCGRKVIQLPAHEVDYIKEKVTLGWTPDVIIGRKERPVSCGMRTLYRLFSKGIFDIDTLPMKGKRKPNGHQEKRGKQQYQRSIHDRPDNYPDFNSEFGHLEGDTIVGIHHKSAVITLVERLSKVIITIKPNGRKALDIETALNQWFSRFPKNFFKSITFDCGKEFSNWKAISNQHDIDIYFADPGTPSQRPLNENSNGILRRNGLPKSMDFREVNQTFISSVSNQRNHIPRKSLNYRTPIEIFLSYVQEAFYSNLI
6 PC_00000001   E_faecalis_6240 2925 X X COG2826 Transposase and inactivated derivatives IS30 family MTYKHLTIDELTMIESYYLQHNKPVEIANRMGRAIQTIYNVVNKFKQGKTALDYWHQYKENKKKCGRKVIQLPAHEVDYIKEKVTLGWTPDVIIGRKERPVSCGMRTLYRLFSKGIFDIDTLPMKGKRKPNGHQEKRGKQQYQRSIHDRPDNYPDFNSEFGHLEGDTIVGIHHKSAVITLVERLSKVIITIKPNGRKALDIETALNQWFSRFPKNFFKSITFDCGKEFSNWKAISNQHDIDIYFADPGTLSQRPLNENSNGILRHNGLPKSMDFREVNQTFISSVSNQRNHIPRKSLNYRTPIEIFLSYVQEAFYSNLI
7 PC_00000001   E_faecalis_6240 2902 X X COG2826 Transposase and inactivated derivatives IS30 family MTYKHLTIDELTMIESYYLQHNKPVEIANRMGRAIQTIYNVVNKFKQGKTALDYWHQYKENKKKCGRKVIQLPAHEVDYIKEKVTLGWTPDVIIGRKERPVSCGMRTLYRLFSKGIFDIDTLPMKGKRKPNGHQEKRGKQQYQRSIHDRPDNYPDFNSEFGHLGGDTIVGIHHKSAVITLVERLSKVIITIKPNGRKALDIETALNQWFSRFPKNFFKSITFDCGKEFSNWKAISNQHDIDIYFADPGTLSQRPLNENSNGILRHNGLPKSMDFREVNQTFISSVSNQRNHIPRKSLNYRTPIEIFLSYVQEAFYSNLI
8 PC_00000001   E_faecalis_6240 2674 X X COG2826 Transposase and inactivated derivatives IS30 family MTYKHLTIDELTMIESYYLQHNKPVEIANRMGRAIQTIYNVVNKFKQGKTALDYWHQYKENKKKCDRKVIQLPAHEVDYIKEKVTLGWTPDVIIGRKERPVSCGMRTLYRLFSKGIFDIDTLPMKGKRKPNGHQEKRGKQQYQRSIHDRPDNYPDFNSEFGHLGGDTIVGIHHKSAVITLVERLSKVIITIKPNGRKALDIETALNQWFSRFPKNFFKSITFDCGKEFSNWKAISNQHDIDIYFADPGTLSQRPLNENSNGILRHNGLPKSMDFREVNQTFISSVSNQRNHIPRKSLNYRTPIEIFLSYVQEAFYSNLI
9 PC_00000001   E_faecalis_6240 25 X X COG2826 Transposase and inactivated derivatives IS30 family MTYTHLTSNELAMIEAYYNNHQSVAKTAVLLNRSRQTIHKVYQFFKTGHNALDYFNQYKKNKTRCGRRPIVLSDEQTEYIQKRVVQGWTPDVIVGRAEFSISCSMRTLYRMFKQGVFEVTHLPMKGKRKANGHKETRGKQSFRRSLRDRGNDYSKFNQEFGHLEGDTIVGKKHKSAVITLVERLSKVIITLQPEGRRAIDIENRLNQWMQSVPKHLFKSMTFDCGKEFSNWKSISNINDIDIYFADPGTPSQRGLNENSNGLLRKDGLPKQMDFNEVDESFIQSIASKRNNIPRKSLNYKTPIEVFLSHICKEELSNLI

I’m sure you need no help to know what to do with this file.

## Chapter V: Microbial Population Genetics

Here we will profile the single-nucleotide variations (SNVs) in the E. faecalis bin found in Sharon et al.’s Infant Gut Dataset (IGD).

This is more of a practical tutorial for hands on experience to recover and make sense of SNVs. For a more theoretical one on the same topic, please consider first reading the tutorial Analyzing sequence variants with anvi’o.

If you haven’t followed the previous sections of the tutorial, you will need the anvi’o merged profile database and the anvi’o contigs database for the IGD available to you. Before you continue, please click here, do everything mentioned there and come back right here to continue following the tutorial from the next line when you read the directive go back.

First of all, if you haven’t already, run this command to load the collection containing the E. faecalis bin (no harm done running it twice):

anvi-import-collection additional-files/collections/merens.txt \
--bins-info additional-files/collections/merens-info.txt \
-p PROFILE.db \
-c CONTIGS.db \
-C default


Please run the following commands in the IGD dir. They will set the stage for us to take a look at the E. faecalis bin:

# importing taxonomy for gene calls
anvi-import-taxonomy-for-genes -c CONTIGS.db \
-i additional-files/centrifuge-files/centrifuge_report.tsv \
additional-files/centrifuge-files/centrifuge_hits.tsv \
-p centrifuge

# importing the state file so things look pretty
anvi-import-state --state additional-files/state-files/state-merged.json \
--name default \
-p PROFILE.db


OK. Let’s first remind ourselves where this bin is.

You remember the E. faecalis bin from previous sections of this tutorial:

The red selection in the most outer layer represents the E. faecalis bin, which is very abundant in every sample, and it is stored in the collection merens under the bin name E_facealis (yes, the name has a typo, we know, it is all Tom’s fault). In this section we will focus on this bin, and in fact we don’t really need the rest of this metagenome.

Since we have identified the genome bin of focus, we can set aside the rest of the metagenome, and ‘split’ that bin out of this profile database to create a self-contained anvi’o profile that only describes this single bin. This is also a great way to share your research with others. Here is how we can get that bin out of this metagenome:

anvi-split -p PROFILE.db \
-c CONTIGS.db \
-C default \
-b E_facealis \
-o MAGs


Once it is done, this is what you will see in the resulting directory:

ls MAGs/
E_facealis
ls MAGs/E_facealis/
AUXILIARY-DATA.h5 CONTIGS.db        CONTIGS.h5        PROFILE.db


Look familiar? This generated a new anvi’o profile database and anvi’o contigs database that pertains only to the E_facealis bin. While the program anvi-split does its magic, it also adds into the resulting profile database a collection (called DEFAULT), with a single bin in it (called ALL_SPLITS) that contains all the contigs in this MAG. This way we can access to its contents from all anvi’o programs. Here is an example:

anvi-estimate-genome-completeness -p MAGs/E_facealis/PROFILE.db \
-c MAGs/E_facealis/CONTIGS.db \
-C DEFAULT

╒════════════╤══════════╤══════════════╤════════════════╤════════════════╤══════════════╤════════════════╕
│ bin name   │ domain   │   confidence │   % completion │   % redundancy │   num_splits │   total length │
╞════════════╪══════════╪══════════════╪════════════════╪════════════════╪══════════════╪════════════════╡
│ ALL_SPLITS │ BACTERIA │            1 │            100 │           4.23 │          140 │        2865861 │
╘════════════╧══════════╧══════════════╧════════════════╧════════════════╧══════════════╧════════════════╛


OK. If you were to visualize this bin,

anvi-interactive -p MAGs/E_facealis/PROFILE.db \
-c MAGs/E_facealis/CONTIGS.db


this is what you should get:

Expert question. Do you see that one split that has a different taxonomy? Who does not think it is not contamination? Why?

Why anvi-split is great for open science? In a recent large-scale study we discovered very abundant, yet previously unrecognized microbial populations in the surface ocean that have the capacity to ‘fix’ nitrogen. Here is the study, and here is the behind the paper blog post.

We did the scariest of all in the name of open science by making sure anyone could ‘see’ what is going on in our genome bins with as little effort as possible.

For instance, if you would you like to see the distribution and nucleotide-level coverage patterns of HBD-06, one of our nitrogen-fixing populations in our study across Tara Oceans metagenomes, you can simply copy-paste the following commands to your terminal:

wget https://ndownloader.figshare.com/files/8248433 \
-O TARA_ANW_MAG_00006.tar.gz

tar -zxvf TARA_ANW_MAG_00006.tar.gz

# we had compressed the auxiliary data to save space,
# you need to unzip it:
gzip -d TARA_ANW_MAG_00006/AUXILIARY-DATA.h5.gz

# if you are not using anvi'o v2.3.0, you need to migrate
# anvi'o databases to your version:
ANVIO_SAMPLES_DB=SKIP anvi-migrate-db TARA_ANW_MAG_00006/*db

anvi-interactive -p TARA_ANW_MAG_00006/PROFILE.db \
-c TARA_ANW_MAG_00006/CONTIGS.db


Anyone who is running these commands on a computer with any version of anvi’o (well, v2.3.0 or later) will see an interactive interface in their browser that shows the coverage of each contig across all Tara Oceans Project surface ocean samples:

Yep. Because it 2018. That’s why.

Alright.

Here is a different representation of the coverage of this bin across samples (which is a screenshot from our paper):

See, it really is abundant (every dot here is the coverage of a nucleotide position that was reported as a variable position).

OK. Clearly, we have no way of knowing the extent of variation within this bin through this perspective. But anvi-gen-variability-profile is exactly for that purpose, and that’s what we will do here using two different methods to visualize its report (using R, and using anvi’o).

Let’s first generate the SNV profile output file, details of which were described extensively here. Here it is:

anvi-gen-variability-profile -c MAGs/E_facealis/CONTIGS.db \
-p MAGs/E_facealis/PROFILE.db \
-C DEFAULT \
-b ALL_SPLITS \
--samples-of-interest additional-files/samples.txt \
--min-coverage-in-each-sample 20 \
--min-occurrence 3 \
--include-split-names \
--quince-mode \
-o E-faecalis-SNVs.txt


This command simply requests anvi-gen-variability-profile to select all E. faecalis nucleotide positions that were identified as ‘variable’ in at least one sample, and (1) covered by more than 20X in all 8 samples of interest (2) and display a minimum scattering power of 3 (minimum scattering power is a very simple attribute of a nucleotide position, and described here). Instead of --min-scatter 3 you could use --min-occurrence 3, however, in that case the program would have returned every SNV position that was reported in more than 3 samples, which would have included nucleotide positions that were variable in every sample. Finally we also used the --samples-of-interest parameter with the following file:

cat additional-files/samples.txt
DAY_15B
DAY_16
DAY_17B
DAY_18
DAY_19
DAY_22A
DAY_23
DAY_24


which contains only the sample names with larger number of reads for every duplicate day to simplify things.

OK. Our beloved anvi-gen-variability-profile will kindly store results from 466 nucleotide positions in the output file, E-faecalis-SNVs.txt. Because we used the flag --quince-mode, there will be a total of 3,728 entries in the file (=466 x 8).

At this point we now know that the E. faecalis is not a monoclonal bin, and does maintain some level of heterogeneity. What we don’t know is whether it is just noise, or if there is any signal that may tell us something about the nature of this variability.

Now we will visualize the information two different ways.

### Visualizing SNV profiles using R

First, we will use R to recapitulate what we did in our paper. Of course, please feel absolutely free to look at the Figure 3 behind that link if you are fine with RUINING THE SURPRISE :(

For this step we will need this R script Meren had written before. You can download it the following way:

wget https://raw.githubusercontent.com/meren/anvio-methods-paper-analyses/a57b0cee07e9dd6fc59892114f2ad5bb9df78215/SHARON_et_al/VARIABILITY_REPORTS/02_GEN_FIGURE_SUMMARY.R \
-O visualize-SNVs.R
chmod +x visualize-SNVs.R


Here is an extra step, and why we need it: E-faecalis-SNVs.txt contains filtered SNVs, but we also want to know about each sample’s variation density, the number of nucleotide positions reported as a variable position for each 1,000 nts. The R script can already do it, but we need another file that reports ALL SNVs for every sample, without any filters to get the raw counts. Let’s first create that the following way:

anvi-gen-variability-profile -c MAGs/E_facealis/CONTIGS.db \
-p MAGs/E_facealis/PROFILE.db \
-C DEFAULT \
-b ALL_SPLITS \
--samples-of-interest additional-files/samples.txt \
-o E-faecalis-SNV-density.txt


Now we can use the R script to visualize this information the following way,

This R script will require some libraries to be installed. You can install all of them by typing R in your terminal, and then entering this command: install.packages(c('ggplot2', 'reshape2', 'reshape', 'gridExtra', 'grid', 'plyr', 'gtools')). After everything installs, exit R by typing q(). My R version that worked with this script was v3.2.2.

./visualize-SNVs.R E-faecalis-SNVs.txt E-faecalis-SNV-density.txt 150 2870000


where 150 is the number of random positions we want to show in the heatmap, and 2780000 is the genome size to calculate the SNV density per sample. If you have all necessary R libraries installed, you should see a new PDF file in the directory that looks like this:

The figure displays for the E. faecalis bin in each sample (from top to bottom), (1) average coverage values for all splits, (2) variation density (number of variable positions reported during the profiling step per kilo base pairs), (3) heatmap of variable nucleotide positions, (4) ratio of variable nucleotide identities, and finally (5) the ratio of transitions (mutations that occur from A to G, or T to C, and vice versa) versus transversions. In the heatmap, each row represents a unique variable nucleotide position, where the color of each tile represents the nucleotide identity, and the shade of each tile represents the square root-normalized ratio of the most frequent two bases at that position (i.e., the more variation in a nucleotide position, the less pale the tile is).

With some rearrangement and polishing using Inkscape, you can see how this output fit into our Figure 3:

The figure shows that the variation density changes quite dramatically from one day to another, despite the rather stable coverage. Plus, the emergence of this pattern is not random: the heatmap shows that the nucleotide positions that show variation re-occur, and competing base identities remains the same.

Investigating what causes this, is of course when things start to get exciting. However, we will not go there. Instead, we would like to leave you with this thought: by using patterns of variability, we can start characterizing changes in microbial population structures across environments, and generate better-informed hypotheses to investigate mechanisms that drive these shifts.

### Visualizing SNV profiles using anvi’o

R visualization is useful, but the heatmap in that figure can’t effectively visualize more than a couple hundred positions. That’s why there is a random subsampling step. But we can use the anvi’o interactive interface to display up to 25,000 nucleotide positions easily.

For this, we have a little program called anvi-script-snvs-to-interactive which reads an anvi’o variability profile, and creates an anvi’o interactive interface-compatible output from it.

So when you run this script the following way,

anvi-script-snvs-to-interactive E-faecalis-SNVs.txt -o e_faecalis_snvs


it will do its magic, and create an output directory with material that can directly be used with anvi-interactive with the --manual flag.

A little note for people who are interested in programming: Feel free to take a look at the relevant line of the source code of this script to see how easy it is to generate an anvi’o-compatible visualizable output from any TAB-delimited matrix file.

If you run the interactive interface on these results the following way,

anvi-interactive --profile e_faecalis_snvs/profile.db \
--tree e_faecalis_snvs/tree.txt \
--view-data e_faecalis_snvs/view.txt \
--title "SNV Profile for the E. faecalis bin" \
--manual


You will get this view:

This view can definitely be improved. I prepared a state file to match colors of competing nucleotides to the R results. If you import that state file and run the interactive interface the following way,

anvi-import-state -p e_faecalis_snvs/profile.db \
--state additional-files/state-files/state-snvs.json \
--name default

anvi-interactive -d e_faecalis_snvs/view.txt \
-t e_faecalis_snvs/tree.txt \
-p e_faecalis_snvs/profile.db \
--title "SNV Profile for the E. faecalis bin" \
--manual


This time you will get this display:

As we’ve seen before, occurrence of SNVs follow a bi-daily fashion. Not that it needs any further convincing, but just to show off here, if you were to click Samples > Sample order > view_data > Draw, you can see that even days and odd days nicely separate from each other:

For instance while I was on this page, I selected two SNV positions that showed different patterns:

Position 80 (the selection towards the right) shows variability only in odd days, position 686 shows variability only in even days. This will probably at some point will become a right-click menu function, but even now we can do some tricks to explore the context of these SNVs. Because that is what anvi’o is all about. Exploring, sometimes to disturbing depths. Well, here are two tiny AWK one-liner to get some specific information about these positions from our files. This is for the one on the left:

awk '{if(NR == 1 || $2 == 686) print$2 " " $3 " "$4 " " $6 " "$14 " " $15 " "$25}' E-faecalis-SNVs.txt | column -t
unique_pos  sample_id  pos   gene_call  departure_from_ref  competing_nts  split_name
686         DAY_18     6333  1826       0.0617647058824     CT             Day17a_QCcontig4_split_00022
686         DAY_22A    6333  1826       0.0697674418605     CT             Day17a_QCcontig4_split_00022
686         DAY_16     6333  1826       0.0805369127517     CT             Day17a_QCcontig4_split_00022
686         DAY_24     6333  1826       0.0863636363636     CT             Day17a_QCcontig4_split_00022
686         DAY_17B    6333  1826       0                   CC             Day17a_QCcontig4_split_00022
686         DAY_19     6333  1826       0                   CC             Day17a_QCcontig4_split_00022
686         DAY_15B    6333  1826       0                   CC             Day17a_QCcontig4_split_00022
686         DAY_23     6333  1826       0                   CC             Day17a_QCcontig4_split_00022


And this is for the one on the right:

awk '{if(NR == 1 || $2 == 80) print$2 " " $3 " "$4 " " $6 " "$14 " " $15 " "$25}' E-faecalis-SNVs.txt | column -t
unique_pos  sample_id  pos   gene_call  departure_from_ref  competing_nts  split_name
80          DAY_17B    7955  233        0.122591943958      GT             Day17a_QCcontig1_split_00012
80          DAY_19     7955  233        0.0752688172043     GT             Day17a_QCcontig1_split_00012
80          DAY_15B    7955  233        0.109271523179      GT             Day17a_QCcontig1_split_00012
80          DAY_23     7955  233        0.11377245509       GT             Day17a_QCcontig1_split_00012
80          DAY_18     7955  233        0                   TT             Day17a_QCcontig1_split_00012
80          DAY_22A    7955  233        0                   TT             Day17a_QCcontig1_split_00012
80          DAY_16     7955  233        0                   TT             Day17a_QCcontig1_split_00012
80          DAY_24     7955  233        0                   TT             Day17a_QCcontig1_split_00012


Good, everything checks out. Now since we know the split names and positions in splits, we can in fact see where they actually are using the interactive interface to visualize the merged profile database again, and look at the wider context using the ‘inspect’ option. Which I have already done for you:

There are many directions you can go once you have the gene caller IDs associated with a question you have. Just take a look at this post and see some of the hidden talents of anvi’o: Musings over a Nitrospira genome that can do complete nitrification.

Here I will stop, but still we have a lot to talk about!

### Visualizing SNV profiles as a network

Finally, you can generate an XML description of the SNV profiles you have generated using anvi-gen-variability-profile program, using the program anvi-gen-variability-network:

anvi-gen-variability-network -i E-faecalis-SNVs.txt \
-o E-faecalis-SNVs.gexf


You can use Gephi to play with the resulting file to visualize or to analyze the network properties of nucleotide positions further.

Here is a screenshot from Gephi for SNV profiles in the E. faecalis genome bin:

OK.

You just read about three different ways to visualize the single-nucleotide variability profiles. We are done here!

## Chapter VI: A genome across metagenomes

In the previous sections of this tutorial we recovered an E. faecalis genome from an infant gut metagenome.

Increasing availability of shotgun metagenomes offer quite powerful ways to characterize MAGs and isolate genomes across different systems. Harnessing the power of available metagenomes, in fact we could ask the question “how does our E. faecalis MAG look like in other infant gut metagenomes”. To provide a preliminary answer to this question, and how similar questions can be answered using in anvi’o, here we will use another Banfield group study by Brooks et al.

In your data-pack you have a merged anvi’o profile that describes the read recruitment statistics of our E. faecalis MAG across 20 gut metagenomes from the Brooks et al. study.

Snakemake workflows to the rescue Your data-pack already contains an anvi’o contigs database and a merged profile database that puts the E. faecalis genome in the context of 20 gut metagenomes, but here is some extra info for those of you who are curious to know how did we generated these data.

To generate these mapping results and to profile them using anvi’o, we used the anvi’o snakemake workflow for metagenomics.

Please read this article for an extensive tutorial on anvi’o snakemake workflows.

Basically, this is how our samples.txt file looked like:

sample r1 r2
SRR5405923 SRR5405923_R1.fastq SRR5405923_R1.fastq
SRR5405692 SRR5405692_R1.fastq SRR5405692_R1.fastq
SRR5405915 SRR5405915_R1.fastq SRR5405915_R1.fastq
SRR5405913 SRR5405913_R1.fastq SRR5405913_R1.fastq
SRR5405912 SRR5405912_R1.fastq SRR5405912_R1.fastq
SRR5405829 SRR5405829_R1.fastq SRR5405829_R1.fastq
SRR5405918 SRR5405918_R1.fastq SRR5405918_R1.fastq
SRR5405707 SRR5405707_R1.fastq SRR5405707_R1.fastq
SRR5405710 SRR5405710_R1.fastq SRR5405710_R1.fastq
SRR5405922 SRR5405922_R1.fastq SRR5405922_R1.fastq
SRR5405920 SRR5405920_R1.fastq SRR5405920_R1.fastq
SRR5405783 SRR5405783_R1.fastq SRR5405783_R1.fastq
SRR5405711 SRR5405711_R1.fastq SRR5405711_R1.fastq
SRR5405693 SRR5405693_R1.fastq SRR5405693_R1.fastq
SRR5405708 SRR5405708_R1.fastq SRR5405708_R1.fastq
SRR5405784 SRR5405784_R1.fastq SRR5405784_R1.fastq
SRR5405958 SRR5405958_R1.fastq SRR5405958_R1.fastq
SRR5405957 SRR5405957_R1.fastq SRR5405957_R1.fastq
SRR5405770 SRR5405770_R1.fastq SRR5405770_R1.fastq
SRR5405712 SRR5405712_R1.fastq SRR5405712_R1.fastq

Which simply lists names and paths for 20 metagenomes we downloaded from the study above.

Next, the fasta.txt file, which gives the path for our FASTA file for the MAG:

reference path
E_facealis E_facealis_MAG.fa

And this is our config.json, which describes how to do the read recruitment:

{
"samples_txt": "samples.txt",
"fasta_txt": "fasta.txt",
"references_mode": true,
"import_percent_of_reads_mapped": {
"run": false
},
"bowtie": {
"additional_params": "--no-unal",
"threads": 4
},
"anvi_profile": {
"--overwrite-output-destinations": true,
"--min-contig-length": 0,
"--profile-SCVs": true,
"--skip-SNV-profiling": false,
"threads": 4
},
"anvi_script_reformat_fasta": {
"run": true
},
"anvi_run_ncbi_cogs": {
"run": true,
"threads": 15
},
"output_dirs": {
"FASTA_DIR": "02_FASTA",
"LOGS_DIR": "00_LOGS",
"QC_DIR": "01_QC",
"PROFILE_DIR": "05_ANVIO_PROFILE",
"MERGE_DIR": "06_MERGED",
"MAPPING_DIR": "04_MAPPING",
"CONTIGS_DIR": "03_CONTIGS"
},
"anvi_run_hmms": {
"run": false
},
"iu_filter_quality_minoche": {
"run": false
},
"anvi_merge": {
"--skip-concoct-binning": true
}
}


And we run the following command, which gave us the anvi’o output in your data-pack:

anvi-run-workflow -w metagenomics \
-c config.json


Magic.

You can visualize the merged profile in your data-pack by typing the following command,

anvi-interactive -p additional-files/e_faeealis_across_hmp/PROFILE.db \
-c additional-files/e_faeealis_across_hmp/CONTIGS.db


This output simply puts our contigs in the E. faecalis genome we recovered from Sharon et al.’s infant gut in the context of 20 other infant gut metagenomes.

### The gene mode: studying distribution patterns at the gene-level

Furthermore, one could want to do it per gene basis for a more higly resolved understanding of distribution patterns. Luckily, if you have a collection and a bin in your profile database, anvi’o allows that.

If you don’t have a collection, anvi-script-add-default-collection will help you add one.

The ‘gene mode’ can be invoked using the following command:

anvi-interactive -p additional-files/e_faeealis_across_hmp/PROFILE.db \
-c additional-files/e_faeealis_across_hmp/CONTIGS.db \
-C DEFAULT \
-b EVERYTHING \
--gene-mode


which should produce the following output, where the coverage of each gene is independently visualized in the context of other metagenomes:

If you right click any of the genes, you can see that this menu has more options,

Including the inspection of a given gene in its context with upstream and downstream coverage information:

## Chapter VII: Single-amino acid variants to protein structures

This section of the tutorial uses the data described in the above section Putting the E. faecalis genome in the context of HMP gut metagenomes. You already have this data, but if you’re wondering “where did this come from?”, you should read that section and then come back.

This is a brief demonstration to showcase protein structure visualization that can be carried out by anvi’o. For a more detailed and in depth tutorial dedicated to this topic, please refer to this post

In the previous sections of this tutorial we used single-nucleotide variants (SNVs) to explore sequence heterogeneity of an E. faecalis population. In this section of the tutorial we explore in what ways these variants alter the encoded protein synthesized downstream, and how such variation is shaped by principles of protein evolution. When analyzing sequence variation in the context of proteins, it makes a lot more sense to characterize sequence variation with single codon variants (SCVs) or single amino acid variants (SAAVS) rather than SNVs. If you are unfamiliar with these concepts, or want to learn more, click me.

Show/hide Evan's note

It’s worth noting that this E. faecalis MAG recruits reads from populations with relatively low amounts of intra-population diversity. In more diverse microbial environments such as marine or soil ecosystems, visual patterns of variants on protein structures become much more striking. While the infant gut tutorial is a good medium to get you familiarized with various features of anvi’o, you should check out this paper if you want to see more striking patterns of structural sequence variation. Here is a figure from that study:

Evan.

### The structure database

To analyze and visualize where nucleotide-level variation occurs in the three-dimensional structure of the encoded protein, you need a protein structure. For this, anvi’o has a dedicated structure database that stores all structural information requested by the user. Users can create their own structure database by providing a contigs database, and a list of gene IDs in the contigs database they want to predict structures for.

Very briefly, structures are predicted using MODELLER, which attempts to find homologous proteins with experimentally-solved structures. If any exist, they are used as templates to model your proteins of interest.

You already have a pre-made structure database in the data-pack, and you can make sure it exists with the following command:

ls additional-files/e_faeealis_across_hmp/STRUCTURE.db


Make your own structure database For your information, you can re-create this structure database with the following command, although you will have to install some dependencies (anvi’o will tell you what to do if you try but don’t have the dependencies):

anvi-gen-structure-database -c additional-files/e_faeealis_across_hmp/CONTIGS.db \
--max-number-templates 1 \
--gene-caller-ids 58,620,2105,2371,2602 \
-o additional-files/e_faeealis_across_hmp/STRUCTURE_NEW.db


### Visualizing proteins with anvi-display-structure

With a structure database in hand, you are ready to visualize SAAVs and SCVs on protein structures.

anvi-display-structure -s additional-files/e_faeealis_across_hmp/STRUCTURE.db \
-p additional-files/e_faeealis_across_hmp/PROFILE.db \
-c additional-files/e_faeealis_across_hmp/CONTIGS.db \
--min-departure-from-consensus 0


After running the above command, you should see the following view.

This is the predicted structure for gene 58. It’s a 4-methyl-5-(beta-hydroxyethyl)thiazole kinase, and the template protein used to model the structure was crystallized from an E. faecalis isolate by the Joint Center for Structural Genomics. The template shares 96.5% sequence identity with our gene, and so we are very confident in the accuracy of the predicted structure. Each red sphere indicates the position of a SAAV that occurred in at least one of the metagenomes. You can learn more about each SAAV by hovering the mouse above it.

Under ‘Reference info’ you can see that the MAG had an asparagine at this position, however under ‘Variant frequencies’, 88% of reads were threonine within the infants sampled. If you inspect the amino acid frequencies of more of these SAAVs, you will notice that most of them have insignificant levels of variation. These may be true biological variation in the environment, but they are at such low frequencies they are indistinguishable from sequencing error. To avoid fooling ourselves, we can filter out these reads by going to the Filters tab, and conservatively setting the minimum departure from reference value to 0.05:

To get a better idea of the characteristics of each SAAV, you can set the color and size for each according to various metrics of interest in the Perspectives tab. In the animation below, we color each SAAV according to its entropy to quantify the degree of variation at each site (click here and search for ‘entropy’ for an explanation):

The 4 white-ish spheres are sites in which the metagenomes disagree with the MAG, but where there is little to no variation within the environment. In contrast, the 3 red-ish spheres indicate sites in which the environment has a mixed population of 2 or more amino acids. Verify this for yourself by looking at the variant frequencies and comparing their identities to the reference residue.

### SAAVs and SCVs

So far we’ve focused exclusively on SAAVs, however we can also study variation with single codon variants (SCVs). You can visualize SCVs by clicking CDN (read: codon) in the Main tab:

In this view, the first thing you will notice is that there are many more SCVs than SAAVs. That’s because unlike SAAVs, SCVs also report on synonymous variation. For example, this residue is completely synonymous, but impressively houses 4 of the 6 arginine codons:

In the situation below, there exists both synonymous and non-synonymous transitions. Note that the synonymity correspondingly deviates from 1.

If you inspect enough of these genes, you’ll start to notice that SCVs at solvent inaccessible residues (relative solvent accessibility = 0) are almost always entirely synonymous (synonymity = 1). Do you have any explanation for why?

### Case study: sequence variation within a kinase binding pocket

Since gene 58 is a kinase, it may not surprise you it has a binding pocket for ADP. Unfortunately there is currently no way to visualize ligands, but we can get a sense for the binding pocket by adding a translucent surface to the protein in the Perspectives tab. The binding pocket is pointed out in the animation below, and if you are up for the challenge, you can try and find this pocket in your own interface.

Since binding pockets are meticulously constrained to maintain optimal binding kinetics, they are under extreme purifying selection, which is why we don’t see any SAAVs in the binding pocket:

However, when we switch to a view of SCVs, the pocket lights up like a Christmas tree.

Although visually convincing that these SCVs form part of the binding pocket, to prove we are not fooling ourselves, here are the residues determined crystallographically from the template protein to be in direct contact with ADP:

Click here to investigate for yourself. Of these residues, G190, V191, and G120 are SCVs in the infant guts, which confirms that mutations are occuring in the region of DNA encoding the binding pocket. Hovering the mouse over of each of these 3 SCVs, you can verify that these residues have synonymity values of 1, which is to say they are completely synonymous sites. This is a fine example of how non-synonymous sequence evolution is usually more highly constrained than synonymous sequence evolution due to fitness-decreasing alterations in the biophysical properties of the encoded protein. Since these synonymous SCVs do not influence the binding site, they have no fitness effect on the protein’s ability to function. It’s impossible to know without conducting fitness experiments, but most likely they are neutral mutations, and have entered the population through drift and/or draft.

### Grouping metagenomes

You now know the basics. Let’s talk about how to get the most out of this interface. Sometimes it is desirable to partition metagenomes into groups based on the experimental design. For example, these 20 metagenomes are from 7 infants, and it would be useful to group which metagenomes belong to which infant. Unless the user provides custom categories for groupings, only two are by default available: merged and samples. Currently you are using the merged category, which is why you see all variant data merged onto a single protein view. Go to the Main tab and select samples from within the Category dropdown menu. You should see this:

Now you can visualize each metagenome’s variants individually. Let’s import some metadata so we can create our own custom groupings. Return to your terminal and import the additional data table into the profile database:

anvi-import-misc-data additional-files/e_faeealis_across_hmp/additional_layers_table.txt \
-p additional-files/e_faeealis_across_hmp/PROFILE.db \
-t layers


This is just a TAB-delimited table that is compatible with anvi’o misc data table inputs:

samples dol cohort location infant cohort_infant birth_date_matched_to_study_day gestational_age_weeks sex birth_mode birth_weight_grams hospital day_of_life_range
SRR5405958 7 NIHY2 Gut 61 NIHY2_61 303.0 28.0 M Vaginal 1184.0 Magee-Womens 00-10
SRR5405957 8 NIHY2 Gut 61 NIHY2_61 303.0 28.0 M Vaginal 1184.0 Magee-Womens 00-10
SRR5405923 15 NIHY2 Gut 66 NIHY2_66 330.0 28.0 F Vaginal 1028.0 Magee-Womens 10-20
SRR5405922 17 NIHY2 Gut 66 NIHY2_66 330.0 28.0 F Vaginal 1028.0 Magee-Womens 10-20
SRR5405920 21 NIHY2 Gut 66 NIHY2_66 330.0 28.0 F Vaginal 1028.0 Magee-Womens 20-30
SRR5405918 31 NIHY2 Gut 66 NIHY2_66 330.0 28.0 F Vaginal 1028.0 Magee-Womens 30-40
SRR5405915 48 NIHY2 Gut 66 NIHY2_66 330.0 28.0 F Vaginal 1028.0 Magee-Womens 40-50
SRR5405913 60 NIHY2 Gut 66 NIHY2_66 330.0 28.0 F Vaginal 1028.0 Magee-Womens 60-70
SRR5405912 64 NIHY2 Gut 66 NIHY2_66 330.0 28.0 F Vaginal 1028.0 Magee-Womens 60-70
SRR5405829 6 Sloan2 Gut 4 Sloan2_4 805.0 30.0 M C-section 1138.0 Magee-Womens 00-10
SRR5405784 11 Sloan2 Gut 8 Sloan2_8 891.0 27.0 M Vaginal 1023.0 Magee-Womens 10-20
SRR5405783 12 Sloan2 Gut 8 Sloan2_8 891.0 27.0 M Vaginal 1023.0 Magee-Womens 10-20
SRR5405770 10 Sloan2 Gut 9 Sloan2_9 912.0 25.0 F Vaginal 930.0 Magee-Womens 10-20
SRR5405712 10 Sloan2 Gut 12 Sloan2_12 986.0 29.0 F C-section 1210.0 Magee-Womens 10-20
SRR5405711 11 Sloan2 Gut 12 Sloan2_12 986.0 29.0 F C-section 1210.0 Magee-Womens 10-20
SRR5405710 12 Sloan2 Gut 12 Sloan2_12 986.0 29.0 F C-section 1210.0 Magee-Womens 10-20
SRR5405708 14 Sloan2 Gut 12 Sloan2_12 986.0 29.0 F C-section 1210.0 Magee-Womens 10-20
SRR5405707 15 Sloan2 Gut 12 Sloan2_12 986.0 29.0 F C-section 1210.0 Magee-Womens 10-20
SRR5405693 9 Sloan2 Gut 14 Sloan2_14 1009.0 30.0 M C-section 1200.0 Magee-Womens 00-10
SRR5405692 10 Sloan2 Gut 14 Sloan2_14 1009.0 30.0 M C-section 1200.0 Magee-Womens 10-20

Afterwards, we can rerun the interactive display:

anvi-display-structure -s additional-files/e_faeealis_across_hmp/STRUCTURE.db \
-p additional-files/e_faeealis_across_hmp/PROFILE.db \
-c additional-files/e_faeealis_across_hmp/CONTIGS.db \
--min-departure-from-consensus 0


Now, there are many more grouping categories available to us.

cohort_infant categorizes metagenomes into the infants from which they were sampled. Select it to reveal one protein view for each of the 7 infants. Now the display can inform us about subtle differences in SAAVs between each infant.

We hope this tutorial give you enough motivation to explore your own data with this framework! Feel free to click on the other 4 genes in the Main tab and further explore, or better yet, do it with your own metagenomes.

Let’s face it: it is a lot easier to live in a world of A, C, T, and G. But in this restricted space we can’t explore the biochemical consequences of our sequences, and we therefore can’t understand why DNA sequences are the way they are. This branch of anvi’o has been designed to help make this undertaking more accessible, but it is only in its infancy. If you have suggestions, please let us know.

## Final words

Do you want more examples? Do you have questions? Please don’t hesitate to get in touch with us!