- Creating an anvi’o contigs database
- Profiling BAM files
- Working with anvi’o profiles
- Final words
The latest version of anvi’o is v7.1. See the release notes.
The goal of this tutorial is to provide a brief overview of the anvi’o workflow for the analysis of assembly-based shotgun metagenomic data. Throughout this tutorial you will primarily learn about the following topics:
Process your contigs,
Profile your metagenomic samples and merge them,
Visualize your data, identify and/or refine genome bins interactively, and create summaries of your results.
If we are missing things, or parts of the tutorial are not clear, please let us know, and we will do our best to improve it.
Metagenomics is extremely rich, and this tutorial alone will not prepare you to unlock the full potential of your data. It will only give you the initial steps into the anvi’o software ecosystem and its philosophy of dealing with data. Please feel free to come to the anvi’o Discord, tell us about your project, and ask for best practices given your needs.
If you are here, you must have already installed the platform (hopefully without much trouble).
It is always a good idea to stick with stable versions of the platform, as the snapshots from the codebase can be very unstable and/or broken. However, we also need people who like to live on the edge, and who would follow the development, test new features, join discussions, and push us to do better.
You will probably run into issues while using anvi’o. We apologize for that in advance. But when that happens, please consider taking a look at this post on how to get help. Regardless of how you reach us, please don’t forget to copy-paste the anvi-interactive -v output, the operating system you are using, and any other details that may be relevant to the problem.
To run the anvi’o metagenomic workflow, you will need these files:
A FASTA file of your contigs. We shall call it contigs.fa throughout this manual. We will assume that contigs.fa contains contigs from a co-assembly. However, it may also be a reference genome from NCBI, a metagenome-assembled genome (MAG), or a bunch of genes you are interested in profiling. Regardless of what it contains, the following steps will not change too much.
BAM files for your samples. In fact you can use most of what anvi’o offers, including its binning capabilities, even if you don’t have any BAM files, but since you would be an outlier if that’s the case, let’s continue with the more conventional scenario, where you have your contigs and one or more BAM files.
This tutorial starts with BAM files and a FASTA file for your contigs. The reason for that is simple: there are many ways to get your contigs and BAM files for your metagenomes. But we have started implementing a tutorial that describes the workflow we use to generate these files regularly: “A tutorial on assembly-based metagenomics”. Please feel free to take a look at that one, as well. But this tutorial assumes that you have your BAM files and the FASTA file for your contigs ready.
To make things easier to follow, we will use three mock samples throughout this tutorial: SAMPLE-01, SAMPLE-02, and SAMPLE-03 (in fact, these are subsampled from a human gut metagenome time series). By clicking the following links, you can download the contigs.fa and the three BAM files we generated by mapping short reads from each sample to contigs.fa: contigs.fa, SAMPLE-01-RAW.bam, SAMPLE-02-RAW.bam, SAMPLE-03-RAW.bam. Save them into a directory, and run every command in that directory throughout the tutorial.
For the contigs and BAM files for your real data, there is one more thing you have to make sure you have: simple deflines. Keep reading.
Take a look at your FASTA file
Your FASTA file must have simple deflines, and if it doesn’t, you must fix it prior to mapping. This is necessary because the names in contigs.fa must match the names in your BAM files. Unfortunately, different mapping software behave differently when they find a space character, or say a | character in your FASTA file, and they proceed to change those characters in arbitrary ways. Therefore it is essential to keep the sequence IDs in your FASTA file as simple as possible before mapping. To avoid any problems later, take a look at your deflines now, prior to mapping, and remove anything that is not a digit, an ASCII letter, an underscore, or a dash character. Here are some bad deflines:
>Contig-123 length:4567
>Another defline 42
>gi|478446819|gb|JN117275.2|
And here are some OK ones:
>Contig-123
>Another_defline_42
>gi_478446819_gb_JN117275_2
If you have bad deflines, you need to reformat your FASTA file, and do the mapping again (if you have done your mapping already, you can convert your BAM files into SAM files, edit the SAM file to correct deflines, and re-generate your BAM files with proper names, but these kinds of error-prone hacks require a lot of attention to make sure you did not introduce a bug early on to your precious data).
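If you would like a quick check before mapping, one option is to list every header line that contains anything other than ASCII letters, digits, underscores, or dashes. The sketch below is only an illustration with standard grep: the throwaway FASTA content and the variable names are made up for the example, and you would point it at your own contigs.fa instead.

```shell
# Write a tiny throwaway FASTA (hypothetical content) to a temporary file.
fasta=$(mktemp)
printf '%s\n' \
  '>Contig-123 length:4567' 'ACGTACGT' \
  '>Another_defline_42' 'ACGTACGT' \
  '>gi|478446819|gb|JN117275.2|' 'ACGTACGT' > "$fasta"

# Keep only header lines, then drop those made purely of letters, digits, '_' or '-'.
# LC_ALL=C keeps the character ranges strictly ASCII.
bad_deflines=$(LC_ALL=C grep '^>' "$fasta" | LC_ALL=C grep -v -E '^>[A-Za-z0-9_-]+$' || true)

# Count the offending deflines (0 if there are none).
bad_count=$(printf '%s\n' "$bad_deflines" | grep -c '^>' || true)

echo "$bad_deflines"
echo "deflines that need fixing: $bad_count"
rm -f "$fasta"
```

If the count is not zero, the file needs to be reformatted before you map your reads.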
Re-formatting your input FASTA
You can use the following anvi’o script to fix your deflines:
$ anvi-script-reformat-fasta contigs.fa -o contigs-fixed.fa -l 0 --simplify-names
This script will give you your FASTA file with simplified deflines. If you use the flag --report-file, it will also create a TAB-delimited file for you to keep track of which defline in the new file corresponds to which defline in the original file. While you are here, it may also be a good idea to remove some very short contigs from your contigs file for a clean start. If you like that idea, you can run the same command this way to also remove sequences that are shorter than 1,000 nts:
$ anvi-script-reformat-fasta contigs.fa -o contigs-fixed.fa -l 1000 --simplify-names
Let’s just overwrite contigs.fa with contigs-fixed.fa here to make things simpler:
$ mv contigs-fixed.fa contigs.fa
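Before settling on a length cutoff, it may help to know how many sequences it would remove. The following is a minimal sketch with awk, not part of anvi’o: the FASTA content, the variable names, and the tiny cutoff of 10 are made up so the example runs on its own. For real data you would use your own file and a cutoff such as 1000.

```shell
# Write a tiny throwaway FASTA (hypothetical content); sequences may span several lines.
fasta=$(mktemp)
printf '%s\n' '>c1' 'ACGTACGTAC' 'ACGT' '>c2' 'ACG' '>c3' 'ACGTACGTACGT' > "$fasta"

min_len=10   # illustrative cutoff only

# Sum the sequence length per record and count the records shorter than the cutoff.
short_count=$(awk -v min="$min_len" '
  /^>/ { if (seen && len < min) n++; seen = 1; len = 0; next }
       { len += length($0) }
  END  { if (seen && len < min) n++; print n + 0 }
' "$fasta")

echo "sequences shorter than $min_len nts: $short_count"
rm -f "$fasta"
```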
Creating an anvi’o contigs database
An anvi’o contigs database will keep all the information related to your contigs: positions of open reading frames, k-mer frequencies for each contig, where splits start and end, functional and taxonomic annotation of genes, etc. The contigs database is an essential component of everything related to the anvi’o metagenomic workflow.
$ anvi-gen-contigs-database -f contigs.fa -o contigs.db -n 'An example contigs database'
When you run this command, anvi-gen-contigs-database will:
Compute k-mer frequencies for each contig (the default is 4, but you can change it using the --kmer-size parameter if you feel adventurous).
Soft-split contigs longer than 20,000 bp into smaller ones (you can change the split size using the --split-length parameter). When the gene calling step is not skipped, the process of splitting contigs will consider where genes are and avoid cutting genes in the middle. For very very large assemblies this process can take a while, and you can skip it with
Identify open reading frames using Prodigal, the bacterial and archaeal gene finding program developed at Oak Ridge National Laboratory and the University of Tennessee. If you don’t want gene calling to be done, you can use the flag --skip-gene-calling to skip it. If you have your own gene calls, you can provide them to be used to identify where genes are in your contigs. All you need to do is to use the parameter --external-gene-calls (the format for external gene calls file is explained here).
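If you prefer to have the k-mer size and split length written down explicitly in your notes, you could assemble the command as text first and review it before running. This is only a sketch: the variable names are made up, the values 4 and 20000 are simply the defaults mentioned above, and nothing is executed here.

```shell
# Values spelled out explicitly; 4 and 20000 are the defaults mentioned in the text.
contigs_fasta="contigs.fa"
contigs_db="contigs.db"
project_name="An example contigs database"
kmer_size=4
split_length=20000

# Assemble the command line as text so it can be reviewed before running it.
cmd="anvi-gen-contigs-database -f $contigs_fasta -o $contigs_db -n '$project_name' --kmer-size $kmer_size --split-length $split_length"

# Print it; paste the printed line into your terminal to actually run it.
echo "$cmd"
```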
Almost every anvi’o program comes with a help menu that explains available parameters in detail. Don’t forget to check them once in a while. If something is not clearly explained, please let us know so we can fix that:
$ anvi-gen-contigs-database --help
Once you have your contigs database, you can start importing things into it, or directly go to the profiling step.
Although this is absolutely optional, you shouldn’t skip this step: run the program anvi-run-hmms on any contigs-db you generate. Anvi’o can do a lot with hidden Markov models (HMMs provide statistical means to model complex data in probabilistic terms that can be used to search for patterns, which works beautifully in bioinformatics where we create models from known sequences, and then search for those patterns rapidly in a pool of unknown sequences to recover hits). To decorate your contigs database with hits from HMM models that ship with the platform (which, at this point, constitute multiple published bacterial single-copy gene collections), run this command:
$ anvi-run-hmms -c contigs.db
When you run this command (without any other parameters), it will utilize multiple default bacterial single-copy core gene collections and identify hits among your genes to those collections using HMMER. If you have already run this once, and now would like to add an HMM profile of your own, that is easy. You can use the --hmm-profile-dir parameter to declare where anvi’o should look for it. Or you can use the --installed-hmm-profile parameter to only run a specific default HMM profile on your contigs database.
Note that the program will use only one CPU by default. Especially if you have multiple CPUs available to you, you should use the --num-threads parameter. It significantly improves the runtime, since HMMER is truly an awesome piece of software.
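If you are not sure how many threads to ask for, one possibility is to query the number of online CPUs and leave one free. This is a sketch under assumptions: getconf _NPROCESSORS_ONLN is available on most Linux and macOS systems but not everywhere, leaving one CPU free is a matter of taste, and the command is printed rather than executed.

```shell
# Ask the system how many CPUs are online; fall back to 1 if that is not available.
threads=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Guard against empty or non-numeric answers.
case "$threads" in
  ''|*[!0-9]*) threads=1 ;;
esac

# Leave one CPU free for everything else when more than one is available.
if [ "$threads" -gt 1 ]; then
  threads=$((threads - 1))
fi

# Print the command for review; it is not executed here.
echo "anvi-run-hmms -c contigs.db --num-threads $threads"
```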
Once you have your contigs database ready, and optionally your HMMs are run, you can take a quick look at it using the program anvi-display-contigs-stats:
$ anvi-display-contigs-stats contigs.db
This program shows you simple stats of your contigs database that may help you not only assess your assembly output, but also estimate the number of bacterial and archaeal genomes to recover.
This program accepts multiple anvi’o contigs databases to compare to each other. You can find some screenshots of what it looks like in release notes for anvi’o
Yet another optional step is to run the program anvi-run-ncbi-cogs to annotate genes in your contigs-db with functions from the NCBI’s Clusters of Orthologous Groups. You will be glad later that you did it!
Do not forget to use --num-threads to specify how many cores you wish to use for that. See the help menu for other available options.
If you are running COGs for the first time, you will need to set them up on your computer using anvi-setup-ncbi-cogs. Simply run it on a machine with an internet connection, and let it do its magic.
Anvi’o also can make good use of functional annotations you already have for your genes using the program anvi-import-functions.
Annotating genes with taxonomy can make things downstream more meaningful, and in some cases may improve the human guided binning and refinement steps. Please see this post to find out different ways to achieve that:
However, gene-level taxonomy is not reliable for making sense of the taxonomy of the resulting metagenome-assembled genomes. But you will also have anvi-estimate-scg-taxonomy under your belt to assign extremely quick taxonomy to your genomes and metagenomes. See this article for more information:
Profiling BAM files
If you are here, you must be done with your contigs database, and have your BAM files ready. Good! It is time to initialize your BAM file, and create an anvi’o single-profile database for your sample.
Anvi’o requires raw BAM files to be turned into a sorted and indexed bam-file. In most cases the BAM file you get back from your mapping software will not be sorted and indexed. This is why we named the BAM file for our mock samples as SAMPLE-01-RAW.bam, instead of SAMPLE-01.bam.
If your BAM files are already sorted and indexed (i.e., for each .bam file you have, there also is a .bam.bai file in the same directory), you can skip this step. Otherwise, you need to initialize your BAM files using the program anvi-init-bam:
$ anvi-init-bam SAMPLE-01-RAW.bam -o SAMPLE-01.bam
But of course it is not fun to do every BAM file you have one by one. So what to do?
A slightly better way would be to do it in a for loop. First, create a file called, say, SAMPLE_IDs. For the mock samples used here, it will look like this:
$ cat SAMPLE_IDs
SAMPLE-01
SAMPLE-02
SAMPLE-03
Then, you can run anvi-init-bam on all of them by typing this:
for sample in `cat SAMPLE_IDs`; do anvi-init-bam $sample-RAW.bam -o $sample.bam; done
Of course, if you have a way to cluster your runs, you already know what to do.
One last note: anvi-init-bam uses samtools in the background to do sorting and indexing. While clearly a big thanks and all the credit go to samtools developers, this is also a reminder that you can get your BAM files sorted and indexed using the samtools command line client without using anvi-init-bam.
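To illustrate the samtools route, here is a minimal sketch that walks through a SAMPLE_IDs file, skips samples that already have a .bam.bai next to their .bam, and prints the standard samtools sort and samtools index calls for the rest. It works on empty placeholder files in a temporary directory, so the file layout and variable names are assumptions made for the example, and the samtools commands are only printed, not run.

```shell
# Work in a throwaway directory with empty placeholder files (hypothetical data).
workdir=$(mktemp -d)
printf '%s\n' SAMPLE-01 SAMPLE-02 SAMPLE-03 > "$workdir/SAMPLE_IDs"
touch "$workdir/SAMPLE-01.bam" "$workdir/SAMPLE-01.bam.bai"   # pretend this one is done
touch "$workdir/SAMPLE-02-RAW.bam" "$workdir/SAMPLE-03-RAW.bam"

todo=""
while read -r sample; do
  if [ -f "$workdir/${sample}.bam" ] && [ -f "$workdir/${sample}.bam.bai" ]; then
    echo "${sample}: index found, nothing to do"
  else
    # Printed for review only: sort the raw BAM, then index the sorted one.
    echo "samtools sort -o ${sample}.bam ${sample}-RAW.bam && samtools index ${sample}.bam"
    todo="$todo $sample"
  fi
done < "$workdir/SAMPLE_IDs"

echo "samples that still need sorting and indexing:$todo"
rm -rf "$workdir"
```

Note that the presence of a .bam.bai file is used here only as a proxy for "sorted and indexed", as in the text above.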
In contrast to the contigs-db, an anvi’o single-profile-db stores sample-specific information about contigs. Profiling a BAM file with anvi’o using anvi-profile creates a single profile that reports properties for each contig in a single sample based on mapping results. Each single-profile-db links to a single contigs-db, and anvi’o can merge single profiles that link to the same contigs database into a merged profile database (which will be covered later).
In other words, the profiling step makes sense of each BAM file separately by utilizing the information stored in the contigs database. It is one of the most critical (and also most complex and computationally demanding) steps of the metagenomic workflow.
The simplest form of the command that starts the profiling looks like this:
$ anvi-profile -i SAMPLE-01.bam -c contigs.db
When you run anvi-profile, it will:
Process each contig that is longer than 2,500 nts by default. You can change this value by using the --min-contig-length flag. But you should remember that the minimum contig length should be long enough for tetra-nucleotide frequencies to have enough meaningful signal. There is no way to define a golden number for minimum length that would be applicable to genomes found in all environments. We empirically chose the default to be 2,500, and have been happy with it. You are welcome to experiment, but we advise you to never go below 1,000. You also should remember that the lower you go, the more time it will take to analyze all contigs. You can use the --list-contigs parameter to have an idea how many contigs would be discarded for a given --min-contig-length parameter. If you have an arbitrary list of contigs you want to profile, you can use the flag --contigs-of-interest to ignore the rest.
Make up some crappy output directory, and sample names for you. We encourage you to use the --output-dir parameter to tell anvi’o where to store your output files, and the --sample-name parameter to give a meaningful, preferably not-so-long sample name to be stored in the profile database. This name will appear almost everywhere, and changing it later will be a pain.
Processing of contigs will include,
The recovery of mean coverage, standard deviation of coverage, and the average coverage for the inner quartiles (Q1 and Q3) for a given contig. Profiling will also create an HD5 file where the coverage value for each nucleotide position will be kept for each contig for later use. While the profiling recovers all the coverage information, it can discard some contigs with very low coverage declared by the --min-mean-coverage parameter (the default is 0, so everything is kept).
The characterization of single-nucleotide variants (SNVs) for every nucleotide position, unless you use the --skip-SNV-profiling flag to skip it altogether (you will definitely gain a lot of time if you do that, but then, you know, maybe you shouldn’t). By default, the profiler will not pay attention to any nucleotide position with less than 10X coverage. You can change this behavior via the --min-coverage-for-variability flag. Anvi’o uses a conservative heuristic to not report every position with variation: i.e., if you have 200X coverage in a position, and only one of the bases disagrees with the reference or consensus nucleotide, it is very likely that this is due to a mapping or sequencing error, and anvi’o tries to avoid those positions. If you want anvi’o to report everything, you can use the --report-variability-full flag. We encourage you to experiment with it, maybe with a small set of contigs, but in general you should refrain from reporting everything (it will make your databases grow larger and larger, and everything will take longer for no good reason 99% of the time).
Finally, because single profiles are rarely used for genome binning or visualization, and since the clustering step increases the profiling runtime for no good reason, the default behavior of profiling is to not cluster contigs automatically. However, if you are planning to work with single profiles, and if you would like to visualize them using the interactive interface without any merging, you can use the --cluster-contigs flag to initiate clustering of contigs. In this case anvi’o would use default clustering configurations for single profiles, and store resulting trees in the profile database. You do not need to use this flag if you are planning to merge multiple profiles (i.e., if you have more than one BAM file to work with, which will be the case for most people).
Every anvi’o profile that will be merged later must be generated with the same exact parameters and against the same contigs database. Otherwise, anvi’o will complain about it later, and likely nothing will get merged. Just saying.
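One way to keep the parameters identical across samples is to generate every anvi-profile command from the same template. The sketch below only prints the commands so you can review them. The variable names are made up, the output directories mirror the SAMPLE-01/PROFILE.db layout used for merging later, and replacing dashes with underscores in the sample name is an assumption made here to keep names simple, not something this tutorial prescribes.

```shell
# Print one anvi-profile command per sample so that every profile uses identical parameters.
min_contig_length=2500   # the default mentioned above, spelled out explicitly

profile_cmds=$(
  for sample in SAMPLE-01 SAMPLE-02 SAMPLE-03; do
    # Assumption: dashes are replaced with underscores to keep the sample name simple.
    sample_name=$(printf '%s' "$sample" | tr '-' '_')
    echo "anvi-profile -i ${sample}.bam -c contigs.db --output-dir ${sample} --sample-name ${sample_name} --min-contig-length ${min_contig_length}"
  done
)

echo "$profile_cmds"
```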
Working with anvi’o profiles
You have all your BAM files profiled! Did it take forever? Well, sorry about that. But now you are golden.
The next step in the workflow is to merge all anvi’o profiles using the program anvi-merge.
This is the simplest form of the anvi-merge command:
$ anvi-merge SAMPLE-01/PROFILE.db SAMPLE-02/PROFILE.db SAMPLE-03/PROFILE.db -o SAMPLES-MERGED -c contigs.db
Or alternatively you can run it like this (if your work directory contains multiple samples to be merged):
$ anvi-merge */PROFILE.db -o SAMPLES-MERGED -c contigs.db
Please don’t forget to give a short, simple, and descriptive sample name using the --sample-name parameter, because it will appear in a lot of places later.
When you run anvi-merge,
It will merge everything and create a merged profile (yes, thanks, captain obvious),
It will attempt to create multiple clusterings of your splits using the default clustering configurations. Please take a quick look at the default clustering configurations for merged profiles; they are pretty easy to understand. By default, anvi’o will use euclidean distance and ward linkage algorithm to organize contigs; however, you can change those default values with the --linkage parameters (if you give a wrong option to either of these parameters, the error message you will get will include all the available options). Hierarchical clustering results are necessary for comprehensive visualization and human guided binning; therefore, by default, anvi’o attempts to cluster your contigs using default configurations. You can skip this step by using the --skip-hierarchical-clustering flag. But even if you don’t skip it, anvi’o will skip it for you if you have more than 20,000 splits, since the computational complexity of this process will get less and less feasible with increasing number of splits. That’s OK, though. There are many ways to recover from this. On the other hand, if you want to teach everyone who is the boss, you can force anvi’o to try to cluster your splits regardless of how many of them are there by using the --enforce-hierarchical-clustering flag. You have the power.
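If you use the */PROFILE.db form, it may be worth checking what the glob actually matches before you merge. Here is a minimal sketch that counts the matches in a throwaway directory of empty placeholder files and prints the merge command only when there are at least two profiles. The directory layout and variable names are assumptions for the example, and nothing is merged.

```shell
# Throwaway directory mimicking three profiled samples (empty placeholder files).
workdir=$(mktemp -d)
for sample in SAMPLE-01 SAMPLE-02 SAMPLE-03; do
  mkdir -p "$workdir/$sample"
  touch "$workdir/$sample/PROFILE.db"
done

# Count what the */PROFILE.db glob would match from inside the work directory.
# An unmatched glob stays literal, so the -e test turns that case into 0.
profile_count=$(cd "$workdir" && set -- */PROFILE.db && [ -e "$1" ] && echo "$#" || echo 0)

if [ "$profile_count" -ge 2 ]; then
  # Printed for review only.
  echo "anvi-merge */PROFILE.db -o SAMPLES-MERGED -c contigs.db"
else
  echo "found $profile_count profile(s); there is nothing to merge yet" >&2
fi
rm -rf "$workdir"
```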
As of version 6+, anvi’o no longer runs a default binning program with anvi-merge. Binning within anvi’o is now handled with anvi-cluster-contigs, and/or external binning results can be imported as described in the next section.
$ anvi-import-collection binning_results.txt -p SAMPLES-MERGED/PROFILE.db -c contigs.db --source "SOURCE_NAME"
A collection is a very special and powerful concept in anvi’o, and you should read more about it by following the collection link.
The file format for binning_results.txt is very simple. This is supposed to be a TAB-delimited file that contains information about which contig belongs to what bin. So each line of this TAB-delimited file should contain a contig name (or split name, see below), and the bin name it belongs to. If you would like to see some example files, you can find them here. They will help you see the difference between input files for splits and contigs after reading the following bullet points, and demonstrate the structure of the optional “bins information” file.
It is common that we use anvi-export-splits-and-coverages to export coverage and sequence composition information to bin our contigs with software that can work with coverage and sequence composition information. In this case, our binning_results.txt contains split names. But if you have contig names, you can import them using anvi-import-collection with the flag
You can also use an information file with the --bins-info parameter to describe the source of your bins (and even assign them some colors to have some specific visual identifiers for any type of visualization downstream).
You can use anvi-script-get-collection-info to see completion and redundancy estimates for all bins in a given anvi’o collection.
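As an illustration of the two-column layout described above for binning_results.txt, the sketch below writes a small TAB-delimited file and runs a sanity check on it with awk and cut. The split and bin names are made up for the example, and the check itself is just one plausible way to catch malformed lines before importing, not something anvi’o requires.

```shell
# Build a small example file: split (or contig) name, TAB, bin name (made-up names).
results=$(mktemp)
printf '%s\t%s\n' \
  'contig_0001_split_00001' 'Bin_1' \
  'contig_0001_split_00002' 'Bin_1' \
  'contig_0002_split_00001' 'Bin_2' > "$results"

# Count lines that do not have exactly two non-empty TAB-separated fields.
bad_lines=$(awk -F '\t' 'NF != 2 || $1 == "" || $2 == "" { n++ } END { print n + 0 }' "$results")

# Count distinct bin names in the second column.
bin_count=$(cut -f 2 "$results" | sort -u | wc -l | tr -d ' ')

echo "malformed lines: $bad_lines, bins: $bin_count"
rm -f "$results"
```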
Anvi’o interactive interface is one of the most sophisticated parts of anvi’o. In the context of the metagenomic workflow, the interactive interface allows you to browse your data in an intuitive way as it shows multiple aspects of your data, visualize the results of unsupervised binning, perform supervised binning, or refine existing bins.
The interactive interface of anvi’o is written from scratch, and can do much more than what is mentioned above. In fact, you don’t even need anvi’o profiles to visualize your data using the interactive interface. But since this is a tutorial for the metagenomic workflow, we will save you from these details. If you are interested in learning more, we have other resources that provide detailed descriptions of the anvi’o interactive interface and data formats it works with.
Most things you did so far (creating a contigs database, profiling your BAM files, merging them, etc.) may have required you to work on a server. But anvi-interactive will be most useful if you download the merged directory and your contigs database to your own computer, because anvi-interactive uses a browser to interact with you. If you don’t want to download anything, you can use an SSH tunnel to use your server to run anvi-interactive, and the browser on your computer to interact with it. See the post on visualizing from a server.
This is the simplest way to run the interactive interface on your merged profile-db:
$ anvi-interactive -p SAMPLES-MERGED/PROFILE.db -c contigs.db
This will work perfectly if your merged profile has its own trees (i.e., the hierarchical clustering mentioned in the anvi-merge section was done).
If there are no clusterings available in your profile database, anvi-interactive will complain about the fact that it can’t visualize your profile. But if you have an anvi’o collection stored in your profile database, you can run the interactive interface in collection mode. If you are not sure whether you have a collection or not, you can see all available collections using the program anvi-script-get-collection-info:
$ anvi-script-get-collection-info -p SAMPLES-MERGED/PROFILE.db -c contigs.db --list-collections
Once you know the collection you want to work with, you can use this notation to visualize it:
$ anvi-interactive -p SAMPLES-MERGED/PROFILE.db -c contigs.db -C CONCOCT
When you run anvi-interactive with a collection name, it will compute various characteristics of each bin on-the-fly, i.e., their mean coverage, variability, completion and redundancy estimates, and generate anvi’o views for them to display their distribution across your samples in the interactive interface. Briefly, each leaf of the anvi’o central dendrogram will represent a “bin” in your collection, instead of a “contig” in your metagenomic assembly. A dendrogram for bins will be generated for each view using euclidean distance and ward linkage automatically. When running the interactive interface in collection mode, you can change those defaults using the --linkage parameters. If you have run anvi-merge with the --skip-hierarchical-clustering parameter due to the large number of contigs you had, but you have binning results available to you from an external resource, you can import those bins as described in the previous section, and run the interactive interface with that collection id to immediately see the distribution of bins across your samples.
In this mode each leaf of the tree will be a bin, along with the distribution of each bin across samples with their completion and redundancy estimates in the outermost layers. In this mode, the interface runs with reduced functionality, and selections will not have completion and contamination estimates. If you are interested in visualizing a specific bin with, say, high redundancy, then you can use the program anvi-refine with that bin.
Here is some additional information about the interactive interface (please see a full list of other options by typing
Storing visual settings. The interactive interface allows you to tweak your presentation of data remarkably, and store these settings in profile databases as “states”. If there is a state stored with the name default, or if you specify a state when you are running the program via the --state parameter, the interactive interface will load it, and proceed to visualize the data automatically (without waiting for you to click the Draw button). States are simply JSON formatted data, and you can export them from or import them into an anvi’o profile database using the anvi-import-state programs. This way you can programmatically edit state files if necessary, and/or use the same state file for multiple projects.
Using additional data files. When you need to display more than what is stored in anvi’o databases for a project, you can pass additional data files to the interactive interface. If you have a newick tree as an alternative organization of your contigs, you can pass it using the --tree parameter, and it would be added to the relevant combo box. If you have extra layers to show around the tree (see Figure 1 in this publication as an example), you can use the --additional-layers parameter. Similarly, you can pass an extra view using the --additional-view parameter. Files for both additional layers and additional view are expected to be TAB-delimited text files and information is to be provided at the split-level (if you hate the way we do this please let us know, and we will be like “alright alright” and finally fix it). Please see the help menu for more information about the expected format for these files.
Setting a taxonomic level. When information about taxonomy is available in an anvi’o contigs database, anvi’o interactive interface will start utilizing it automatically and you will see a taxonomy layer in the interface for each of your contigs. By default the genus names will be used, however, you can change that behavior using the
Anvi’o profile databases allow you to add or remove additional data for your items or layers through a program anvi-import-misc-data and its sister programs. This is a very important functionality for better data exploration and communication. Please see this post for more information and to familiarize yourself with it.
This functionality will not be available to you if you are using anvi’o v3 or earlier. Please make sure you are using the latest stable version of anvi’o, which is v7.1.
A collection represents one or more bins with one or more contigs. Collections are stored in anvi’o databases, and can be imported from the results of external binning software, or saved through the anvi’o interactive interface after a human-guided binning effort. Once you have a collection, you can summarize it using the program anvi-summarize.
If you don’t know what collections and bins are available in a profile-db, you can use the program anvi-show-collections-and-bins. If you would like to get a very quick list of completion estimates for your bins in a collection, you can use the program anvi-estimate-genome-completeness, and if you would like to learn more about their taxonomy, you can use the program anvi-estimate-scg-taxonomy.
The result of anvi-summarize is a summary, which essentially is a static HTML output that you can visualize in your browser, send to your colleagues, put on your web page, or attach to your submission as supplementary data for review, since studying this summary does not require an anvi’o installation. When you run anvi-summarize,
All your splits will be merged back to contigs and stored as FASTA files,
completion estimates for each bin in a collection will be computed and stored in the output,
TAB-delimited matrix files will be generated for genome bins across your samples with respect to their mean coverage, variability, etc.
You can run this to summarize the bins saved under the collection id ‘CONCOCT’ into the SAMPLES-SUMMARY directory:
$ anvi-summarize -p SAMPLES-MERGED/PROFILE.db -c contigs.db -o SAMPLES-SUMMARY -C CONCOCT
If you are not sure which collections are available to you, you can always see a list of them by running this command:
$ anvi-summarize -p SAMPLES-MERGED/PROFILE.db -c contigs.db --list-collections
The summary process can take a lot of time. If you want to take a quick look to identify which bins need to be refined, you can run anvi-summarize with the --quick-summary flag to generate a quick summary.
After running anvi-summarize, you may realize that you are not happy with one or more of your bins. This often is the case when you are working with very large datasets and when you are forced to skip the human guided binning step. The program anvi-refine gives you the ability to make finer adjustments to a bin that may be contaminated.
After you refine bins that need attention, you can re-generate your summary.
Speaking of which, please take a look at this post where Meren talks about assessing completion and contamination of metagenome-assembled genomes.
Please read this article for a comprehensive introduction to the refinement capacity of anvi’o. Plus, there are the following articles from Tom Delmont and Veronika Kivenson that you may consider reading:
This is JUST a beginning to getting yourself familiarized with the anvi’o software ecosystem, and what this tutorial covers is by no means comprehensive or complete. If you would like to get some inspiration regarding all the things you can do with anvi’o, please browse the learning material and tutorials listed at https://anvio.org. Feel free to find us on Discord if you run into issues, have questions regarding ‘omics analyses, or wish to understand if anvi’o is the right choice for your needs.