- Creating an anvi’o contigs database
- Profiling BAM files
- Working with anvi’o profiles
The latest version of anvi’o is
v2.3.2. See the release notes.
This tutorial is tailored for anvi’o version 2.2.2, or later. You can learn which version you have on your computer by typing
anvi-profile --version in your terminal. The tutorial for older anvi’o releases from v1 family is here (but v1 is so 2015, and you should never use it ever again).
The goal of this tutorial is to provide a brief overview of the anvi’o workflow for the analysis of assembly-based shotgun metagenomic data. Throughout this tutorial you will primarily learn about the following topics:
Process your contigs,
Profile your metagenomic samples and merge them,
Visualize your data, identify and/or refine genome bins interactively, and create summaries of your results.
If we are missing things, or parts of the tutorial is not clear, please let us know, and we will do our best to improve it.
You should also feel free to contact us if you are not sure whether anvi’o is the platform to go after a certain question.
It is always a good idea to stick with stable versions of the platform, as the snapshots from the codebase can be very unstable and/or broken. However we also need people who like to live at the edge, and who would follow the development, test new features, join discussions, and push us to do better.
You probably will run into issues while using anvi’o. We apologize for that in advance. But when that happens, please consider contacting us to give us a chance to address the problem for you. You can post a comment down below, join the anvi’o discussion group, or open an issue. Regardless of the method to connect, please don’t forget to copy-paste the
anvi-interactive -v output, the operating system you are using, or any other details that may be relevant to the problem.
To run the anvi’o metagenomic workflow, you will need these files:
A FASTA file of your contigs. We shall call it
contigs.fathroughout this manual. We will assume that
contigs.facontains contigs from a co-assembly. However, it may also have been a reference genome from NCBI, a metagenome-assembled genome (MAG), or a bunch of genes you are interested in profiling. Regardless of what it contains, following steps will not change too much.
BAM files for your samples. In fact you can use most of what anvi’o offers, including its binning capabilities, even if you don’t have any BAM files, but since you would be an outlier if its the case, let’s continue with the more conventional scenario. Where you have your contigs, and one or more BAM files.
This tutorial starts with BAM files and a FASTA file for your contigs. The reason for that is simple: there are many ways to get your contigs and BAM files for your metagenomes. But we have started implementing a tutorial that describes the workflow we use to generate these files regularly: “A tutorial on assembly-based metagenomics”. Please feel free to take a look at that one, as well. But this tutorial assumes that you have your BAM files and the FASTA file for your contigs ready.
To make things easier to follow, we will use three mock samples throughout this tutorial:
SAMPLE-03 (in fact are subsampled from a human gut metagenome time series). By clicking the following links, you can download the
contigs.fa and the three BAM files we generated by mapping short reads from each sample to
contigs.fa: contigs.fa, SAMPLE-01-RAW.bam, SAMPLE-02-RAW.bam, SAMPLE-03-RAW.bam. Save them into a directory, and run every command in that directory throughout the tutorial.
For the contigs and BAM files for your real data, there is one more thing you have to make sure you have: simple deflines. Keep reading.
Take a look at your FASTA file
Your FASTA file must have simple deflines, and if it doesn’t have simple deflines, you must fix your FASTA file prior to mapping. This is necessary, because the names in
contigs.fa must match the names in your BAM files. Unfortunately, different mapping software behave differently when they find a space character, or say a
| character in your FASTA file, and they proceed to change those characters in arbitrary ways. Therefore it is essential to keep the sequence IDs in your FASTA file as simple as possible before mapping. To avoid any problems later, take a look at your deflines prior to mapping now, and remove anything that is not a digit, an ASCII letter, an underscore, or a dash character. Here are some bad deflines:
And here are some OK ones:
If you have bad deflines, you need to reformat your FASTA file, and do the mapping again (if you have done you mapping already, you can convert your BAM files into SAM files, edit the SAM file to correct deflines, and re-generate your BAM files with proper names, but these kind of error-prone hacks require a lot of attention to make sure you did not introduce a bug early on to your precious data).
Re-formatting your input FASTA
You can use the following anvi’o script to fix your deflines:
This script will give you your FASTA file with simplified deflines. If you use the flag
--report-file, it will also create a TAB-delimited file for you to keep track of which defline in the new file corresponds to which defline in the original file. While you are here, it may also be a good idea to remove some very short contigs from your contigs file for a clean start. If you like that idea, you can run the same command this way to also remove sequences that are shorter than 1,000 nts:
Let’s just overwrite the
contigs-fixed.fa here to make things simpler:
Creating an anvi’o contigs database
An anvi’o contigs database will keep all the information related to your contigs: positions of open reading frames, k-mer frequencies for each contigs, where splits start and end, functional and taxonomic annotation of genes, etc. The contigs database is an essential component of everything related to anvi’o metagenomic workflow.
The following is the simplest way of creating a contigs database:
When you run this command,
Compute k-mer frequencies for each contig (the default is
4, but you can change it using
--kmer-sizeparameter if you feel adventurous).
Soft-split contigs longer than 20,000 bp into smaller ones (you can change the split size using the
--split-length). When gene calling step is not skipped, the process splitting contigs will consider where genes are and avoid cutting genes in the middle. For very very large assemblies this process can take a while, and you can skip it with
Identify open reading frames using Prodigal, the bacterial and archaeal gene finding program developed at Oak Ridge National Laboratory and the University of Tennessee. If you don’t want gene calling to be done, you can use the flag
--skip-gene-callingto skip it. If you have your own gene calls, you can provide them to be used to identify where genes are in your contigs. All you need to do is to use the parameter
--external-gene-calls(see down below for the format).
Almost every anvi’o program comes with a help menu that explains available parameters in detail. Don’t forget to check them once in a while. If something is not clearly explained, please let us know so we can fix that:
Once you have your contigs database, you can start importing things into it, or directly go to the profiling step.
The TAB-delimited file for
--external-gene-calls should follow this format:
Please note: For a complete gene call where the value of
0, the number of nucleotides, i.e. the value of
stop - start, must be divided by three without any remainder. Any other gene call is not a complete gene call, and must be identified as such by setting the value of
1 so anvi’o does not try to translate their sequence. Otherwise anvi’o will be upset, and then frustrate you.
Please note: anvi’o follows the convention of string indexing and splicing that is identical the way one does it in Python or C.
The statement above means that the index of the first nucleotide in any contig should be
0. In other words, we start counting from
0, not from
stop positions in the input file should comply with this. Here is an example gene in a contig:
stop positions in the input file for this gene should be
26, respectively. If you think we should change this behavior, please let us know (here is a relevant issue). Thanks for your patience!
Although this is absolutely optional, you shouldn’t skip this step. Anvi’o can do a lot with hidden Markov models (HMMs provide statistical means to model complex data in probabilistic terms that can be used to search for patterns, which works beautifully in bioinformatics where we create models from known sequences, and then search for those patterns rapidly in a pool of unknown sequences to recover hits). To decorate your contigs database with hits from HMM models that ship with the platform (which, at this point, constitute multiple published bacterial single-copy gene collections), run this command:
When you run this command (without any other parameters),
It will utilize multiple default bacterial single-copy core gene collections and identify hits among your genes to those collections using HMMER. If you have already run this once, and now would like to add an HMM profile of your own, that is easy. You can use
--hmm-profile-dirparameter to declare where should anvi’o look for it.
Note that the program will use only one CPU by default, especially if you have multiple of them available to you, you should use the
--num-threadsparameter. It significantly improves the runtime, since HMMER is truly an awesome software.
Annotating genes with taxonomy makes things downstream much more meaningful, and improves the human guided binning and refinement steps later on. Please see this post to find out different ways to achieve that:
Anvi’o can make good use of functional annotations of genes. Please see the following post to see how to import functions into anvi’o:
Profiling BAM files
If you are here, you must be done with your contigs database, and have your BAM files ready. Good! It is time to initialize your BAM file, and create an anvi’o profile for your sample.
Anvi’o requires BAM files to be sorted and indexed. In most cases the BAM file you get back from your mapping software will not be sorted and indexed. This is why we named the BAM file for our mock samples as
SAMPLE-01-RAW.bam, instead of
If your BAM files already sorted and indexed (i.e., for each
.bam file you have there also is a
.bam.bai file in the same directory), you can skip this step. Otherwise, you need to initialize your BAM files:
But of course it is not fun to do every BAM file you have one by one. So what to do?
A slightly better way to do would require you to do it in a
for loop. First create a file called, say,
SAMPLE_IDs. For your samples (
Y) it will look like this:
Then, you can run
anvi-init-bam on all of them by typing this:
Of course if you have a way to cluster your runs, you already know what to do.
One last note,
anvi-init-bam uses samtools in the background to do sorting and indexing. While clearly a big thanks and all the credit go to samtools developers, this is also a reminder that you can get your BAM files sorted and indexed using the samtools command line client without using
In contrast to the contigs database, an anvi’o profile database stores sample-specific information about contigs. Profiling a BAM file with anvi’o using
anvi-profile creates a single profile that reports properties for each contig in a single sample based on mapping results. Each profile database links to a contigs database, and anvi’o can merge single profiles that link to the same contigs database into merged profiles (which will be covered later).
In other words, the profiling step makes sense of each BAM file separately by utilizing the information stored in the contigs database. It is one of the most critical (and also most complex and computationally demanding) steps of the metagenomic workflow.
The simplest form of the command that starts the profiling looks like this:
When you run
anvi-profile it will,
Process each contig that is longer than
2,500 ntsby default. You can change this value by using
--min-contig-lengthflag. But you should remember that the minimum contig length should be long enough for tetra-nucleotide frequencies to have enough meaningful signal. There is no way to define a golden number for minimum length that would be applicable to genomes found in all environments. We empirically chose the default to be 2,500, and have been happy with it. You are welcome to experiment, but we advise you to never go below 1,000. You also should remember that the lower you go, the more time it will take to analyze all contigs. You can use –list-contigs parameter to have an idea how many contigs would be discarded for a given
--min-contig-lengthparameter. If you have an arbitrary list of contigs you want to profile, you can use the flag
--contigs-of-interestto ignore the rest.
Make up some crappy output directory, and sample names for you. We encourage you to use
--output-dirparameter to tell anvi’o where to store your output files, and
--sample-nameparameter to give a meaningful, preferably not-so-long sample name to be stored in the profile database. This name will appear almost everywhere, and changing it later will be a pain.
Processing of contigs will include,
The recovery of mean coverage, standard deviation of coverage, and the average coverage for the inner quartiles (Q1 and Q3) for a given contig. Profiling will also create an HD5 file where the coverage value for each nucleotide position will be kept for each contig for later use. While the profiling recovers all the coverage information, it can discard some contigs with very low coverage declared by
--min-mean-coverageparameter (the default is 0, so everything is kept).
The characterization of single-nucleotide variants (SNVs) for every nucleotide position, unless you use
--skip-SNV-profilingflag to skip it altogether (you will definitely gain a lot of time if you do that, but then, you know, maybe you shouldn’t). By default, the profiler will not pay attention to any nucleotide position with less than
10Xcoverage. You can change this behavior via
--min-coverage-for-variabilityflag. Anvi’o uses a conservative heuristic to not report every position with variation: i.e., if you have 200X coverage in a position, and only one of the bases disagree with the reference or consensus nucleotide, it is very likely that this is due to a mapping or sequencing error, and anvi’o tries to avoid those positions. If you want anvi’o to report everything, you can use
--report-variability-fullflag. We encourage you to experiment with it, maybe with a small set of contigs, but in general you should refrain reporting everything (it will make your databases grow larger and larger, and everything will take longer for -99% of the time- no good reason).
Finally, because single profiles are rarely used for genome binning or visualization, and since clustering step increases the profiling runtime for no good reason, the default behavior of profiling is to not cluster contigs automatically. However, if you are planning to work with single profiles, and if you would like to visualize them using the interactive interface without any merging, you can use
--cluster-contigsflag to initiate clustering of contigs. In this case anvi’o would use default clustering configurations for single profiles, and store resulting trees in the profile database. You do not need to use this flag if you are planning to merge multiple profiles (i.e., if you have more than one BAM files to work with, which will be the case for most people).
Every anvi’o profile that will be merged later, must be generated with the same exact parameters and against the same contigs database. Otherwise anvi’o will complain about it later, and likely nothing will get merged. Just saying.
Working with anvi’o profiles
You have all your BAM files profiled! Did it take forever? Well, sorry about that. But now you are golden.
The next step in the workflow is to to merge all anvi’o profiles.
This is the simplest form of the
Or alternatively you can run it like this (if your work directory contains multiple samples to be merged):
Please don’t forget to give a short, simple, and descriptive sample name using
--sample-name parameter, because it will appear in a lot of places later.
When you run
It will merge everything and create a merged profile (yes, thanks, captain obvious),
It will attempt to create multiple clusterings of your splits using the default clustering configurations. Please take a quick look at the default clustering configurations for merged profiles –they are pretty easy to understand. By default, anvi’o will use euclidean distance and ward linkage algorithm to organize contigs, however, you can change those default values with
--linkageparameters (available options for distance metrics and linkage algorithms are listed in this release note). Hierarchical clustering results are necessary for comprehensive visualization, and human guided binning, therefore, by default, anvi’o attempts to cluster your contigs using default configurations. You can skip this step by using
--skip-hierarchical-clusteringflag. But even if you don’t skip it, anvi’o will skip it for you if you have more than 20,000 splits, since the computational complexity of this process will get less and less feasible with increasing number of splits. That’s OK, though. There are many ways to recover from this. On the other hand, if you want to teach everyone who is the boss, you can force anvi’o try to cluster your splits regardless of how many of them are there by using
--enforce-hierarchical-clusteringflag. You have the power.
It will attempt to run CONCOCT to bin your splits automatically. CONCOCT can deal with hundreds of thousands of splits. Which means, regardless of the number of splits you have, and even if you skip the hierarchical clustering step, there will be a collection in the merged profile database (which will be called
CONCOCT) with genome bins identified by CONCOCT in an automatic manner. From which you can generate a summary, or run the interactive interface with
--collection-name CONCOCTparameter (more later on these). But if you would like to skip default CONCOCT clustering, you can use
If you have your own binning of your contigs, you can easily import those results into the merged profile database as a collection:
The file format for
binning_results.txt is very simple. This is supposed to be a TAB-delimited file that contains information about which contig belongs to what bin. So each line of this TAB-delimited file should contain a contig name (or split name, see below), and the bin name they belong to. If you would like to see some example files, you can find them here. They will help you see the difference between input files for splits and contigs after reading the following bullet points, and demonstrate the structure of the optional “bins information” file.
It is common that we use
anvi-export-splits-and-coveragesto export coverage and sequence composition information to bin our contigs with software that can work with coverage and sequence composition information. In this case, our
binning_results.txtcontains split names. But if you have contig names, you can import them using
anvi-import-collectionwith the flag
You can also use an information file with the
--bins-infoparameter to describe the source of your bins (and even assign them some colors to have some specific visual identifiers for any type of visualization downstream).
You can use
anvi-export-collection to export collection information and import into other profiles. It becomes very handy when you are doing benchmarking between different approaches.
Anvi’o interactive interface is one of the most sophisticated parts of anvi’o. In the context of the metagenomic workflow, the interactive interface allows you to browse your data in an intuitive way as it shows multiple aspects of your data, visualize the results of unsupervised binning, perform supervised binning, or refine existing bins.
The interactive interface of anvi’o is written from scratch, and can do much more than what is mentioned above. In fact, you don’t even need anvi’o profiles to visualize your data using the interactive interface. But since this is a tutorial for the metagenomic workflow, we will save you from these details. If you are intersted to learn more, we have other resources that provide detailed descriptions of the anvi’o interactive interface and data formats it works with.
Most things you did so far (creating a contigs database, profiling your BAM files, merging them, etc) required you to work on a server. But
anvi-interactive will require you to download the merged directory, and your contigs databases to your own computer, because
anvi-interactive uses a browser to interact with you. If you don’t want to download anything, you can use dig an SSH tunnel to use your server to run
anvi-interactive, and the browser on your computer to interact with it. See the post on visualizing from a server.
This is the simplest way to run the interactive interface on your merged anvi’o profile:
This will work perfectly if your merged profile has its own trees (i.e., the hierarchical clustering mentioned in the
anvi-merge section was done).
If there are no clusterings available in your profile database
anvi-interactive will complain about the fact that it can’t visualize our profile. But if you have an anvi’o collection stored in your profile database you can run the interactive interface in collection mode. If you are not sure whether you have a collection or not, you can see all available collections using this command:
Once you know the collection you want to work with, you can use this notation to visualize it:
When you run
anvi-interactive with a collection name, it will compute various characteristics of each bin on-the-fly, i.e., their mean coverage, variability, completion and redundancy estimates, and generate anvi’o views for them to display their distribution across your samples in the interactive interface. Briefly, each leaf of the anvi’o central dendrogram will represent a “bin” in your collection, instead of a “contig” in your metagenomic assembly. A dendrogram for bins will be generated for each view using euclidean distance and ward linkage automatically. When running the interactive interface in collection mode, you can change those defaults using the
--linkage parameters. If you have run
--skip-hierarchical-clustering parameter due to the large number of contigs you had, but if you have binning results available to you from an external resource, you can import those bins as described in the previous section, and run the interactive interface with that collection id to immediately see the distribution of bins across your samples.
In this mode each leaf of the tree will be a bin, along with the distribution of each bin across samples with their completion and redundancy estimates in the most outer layers. In this mode, the interface runs in reduced functionality, and selections will not have completion and contamination estimates. If you are interested in visualizing a specific bin with, say, high redundancy, then you can use the program
anvi-refine with that bin.
Here is some additional information about the interactive interface (please see a full list of other options by typing
Storing visual settings. The interactive interface allows you to tweak your presentation of data remarkably, and store these settings in profile databases as “states”. If there is a state stored with the name
default, or if you specify a state when you are running the program via
--stateparameter, the interactive interface will load it, and proceeds to visualize the data automatically (without waiting for you to click the Draw button). States are simply JSON formatted data, and you can export them from or import them into an anvi’o profile database using
anvi-import-stateprograms. This way you can programmatically edit state files if necessary, and/or use the same state file for multiple projects.
Using additional data files. When you need to display more than what is stored in anvi’o databases for a project, you can pass additional data files to the interactive interface. If you have a newick tree as an alternative organization of your contigs, you can pass it using the
--treeparameter, and it would be added to the relevant combo box. If you have extra layers to show around the tree (see Figure 1 in this publication as an example), you can use the
--additional-layersparameter. Similarly, you can pass an extra view using the
--additional-viewparameter. Files for both additional layers and additional view is expected to be TAB-delimited text files and information to be provided at the split-level (if you hate this the way we do please let us know, and we will be like “alright alright” and finally fix it). Please see the help menu for more information about the expected format for these files.
Setting a taxonomic level. When information about taxonomy is available in an anvi’o contigs database, anvi’o interactive interface will start utilizing it automatically and you will see a taxonomy layer in the interface for each of your contigs. By default the genus names will be used, however, you can change that behavior using the
This is an optional database purely to improve visualization. Please see this article for more information:
A collection represents one or more bins with one or more contigs. Collections are stored in anvi’o databases can be imported from the results of external binning software, or saved through the anvi-interactive interface after a human-guided binning effort. Once you have a collection, you can summarize it using
The result of
anvi-summarize is a static HTML output you can browse in your browser, send to your colleagues, put it on your web page (an example from one of our papers is here), or attach it to your submission as a supplementary data for review. When you run
All your splits will be merged back to contigs and stored as FASTA files,
Completion and redundancy estimates for each bin in a collection will be computed and stored in the output,
TAB-delimited matrix files will be generated for genome bins across your samples with respect to their mean coverage, variability, etc.
You can run this to summarize the bins saved under the collection id ‘CONCOCT’ into MY_SUMMARY directory:
If you are not sure which collections are available to you, you can always see a list of them by running this command:
The summary process can take a lot of time. If you want to take a quick look to identify which bins need to be refined, you can run
anvi-summarize, you may realize that you are not happy with one or more of your bins. This often is the case when you are working with very large datasets and when you are forced to skip the human guided binning step.
anvi-summarize gives you the ability to make finer adjustments to a bin that may be contaminated.
After you refine bins that needs attention, you can re-generate your static summary.
Speaking of which, please take a look at this post where Meren talks about assessing completion and contamination of metagenome-assembled genomes.
Please read this article for a comprehensive introduction to refinement capacity of anvi’o.
I run into an issue, whose fault is it?
It is Tom’s. But you can always enter a bug report if you are certain that it needs to be fixed in the anvi’o code base. If you are not sure, please come to the anvi’o discussion group, and let’s talk about it!
There is something in this tutorial I want to fix
We make a lot of typos, sometimes parameters or versions slightly change, and we fail to keep tutorials up-to-date all the time. If you found a mistake on this page, or if you would like to change something in it, you can directly edit its source code by clicking “Edit this file” icon on the right top corner (which you will see if you have logged in to GitHub), and send us a ‘pull request’. We will be very thankful.