Export sequences from sequence sources and compute a similarity metric (e.g. ANI). If a Pan Database is given, anvi'o will write the computed output to the misc data tables of the Pan Database.
This program uses the user’s similarity metric of choice to calculate the similarity between the input genomes.
The currently available programs for calculating similarity metrics, which can be chosen with --program, include:
- pyANI to calculate the average nucleotide identity (ANI) (i.e. what portion of orthologous gene pairs align)
- fastANI also to calculate the ANI, but at a faster speed (at the cost of a slight reduction in accuracy)
- sourmash to calculate the mash distance between genomes. Although we provide this option, we don’t recommend using sourmash for genome comparisons (it excels at other tasks); it remains as a legacy option.
The expected input is any combination of external-genomes, internal-genomes, and text files that contain paths to the fasta files describing each of your genomes. Such a text file is tab-delimited with two columns (including the path to each fasta file, each of which is assumed to be a single genome).
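If you want to generate such a text file programmatically, a minimal sketch could look like the following. The header names `name` and `path`, the genome names, and the file paths are all assumptions made for illustration; check the program's help output for the exact column names it expects.

```python
import csv
import io


def write_fasta_txt(genomes, handle):
    """Write a two-column, tab-delimited table with one genome per row."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["name", "path"])  # assumed header names
    for name, path in genomes:
        writer.writerow([name, path])


# Hypothetical genomes; each fasta file is assumed to hold a single genome.
genomes = [
    ("genome_A", "fasta/genome_A.fa"),
    ("genome_B", "fasta/genome_B.fa"),
]

buffer = io.StringIO()
write_fasta_txt(genomes, buffer)
fasta_txt = buffer.getvalue()
print(fasta_txt)
```

To write to disk instead, pass an open file handle (opened with `newline=""`) in place of the `StringIO` buffer.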
The program outputs a directory with genome-similarity data. The specific contents will depend on how similarity scores are computed (specified with --program), but generally contain tab-separated files of similarity scores between genomes and related metrics.
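As a rough sketch of how one of these tab-separated files could be loaded for downstream work, the snippet below parses a square matrix into nested dictionaries. The layout (a header row of genome names and a first column of row labels) and the numbers themselves are assumptions for illustration, not actual program output.

```python
import csv
import io


def read_similarity_matrix(handle):
    """Parse a tab-separated square matrix into {row_name: {col_name: score}}."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    columns = header[1:]  # first cell is assumed to label the row-name column
    matrix = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        matrix[row[0]] = {col: float(v) for col, v in zip(columns, row[1:])}
    return matrix


# Made-up example content standing in for a file from the output directory.
example = (
    "key\tgenome_A\tgenome_B\n"
    "genome_A\t1.0\t0.97\n"
    "genome_B\t0.97\t1.0\n"
)

matrix = read_similarity_matrix(io.StringIO(example))
print(matrix["genome_A"]["genome_B"])
```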
You also have the option to provide a pan-db, in which case the output data will additionally be stored in the database as misc-data-layers and misc-data-layer-orders data. This was done in the pangenomic tutorial.
Here is an example run with pyANI from an external-genomes without any parameter changes:
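A minimal sketch of such a run, assembled from Python, might look like this. It assumes the program is invoked as `anvi-compute-genome-similarity` and that the `--external-genomes` and `--output-dir` flags are spelled as shown; the file and directory names are placeholders. Please confirm the flags against the program's `--help` output before relying on them.

```python
import shlex


def build_command(external_genomes, output_dir, program="pyANI"):
    """Assemble the argument list for a run with default parameters."""
    return [
        "anvi-compute-genome-similarity",       # assumed program name
        "--external-genomes", external_genomes,  # assumed flag spelling
        "--program", program,
        "--output-dir", output_dir,              # assumed flag spelling
    ]


# Placeholder input file and output directory.
command = build_command("external-genomes.txt", "path/for/output")
command_line = shlex.join(command)
print(command_line)
```

Once anvi'o is installed, the list in `command` can be handed to `subprocess.run(command, check=True)`, or the printed line can be pasted into a terminal.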
Genome similarity metrics: parameters
Parameters have been divided up based on which --program you use.
You have the option to change any of the following pyANI parameters:
- The method used for alignment. The options are:
- The minimum alignment fraction (percent identity scores for genome pairs whose alignment fraction is lower than this will be set to 0). The default is 0.
- If you want to keep alignments that are long, despite them not passing the minimum alignment fraction filter, you can supply a
- Similarly, you can discard all results less than some full percent identity (percent identity of aligned segments * aligned fraction).
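To make the interaction between these filters concrete, here is a small sketch of the logic as described above. It is not anvi'o's implementation: the function and parameter names are hypothetical, and it assumes that a discarded result is reported as 0.

```python
def filter_score(percent_identity, alignment_fraction, alignment_length=0,
                 min_alignment_fraction=0.0, significant_alignment_length=None,
                 min_full_percent_identity=0.0):
    """Return the percent identity to report, or 0.0 if a filter rejects it."""
    # A sufficiently long alignment is kept even if its fraction is too low.
    is_long = (significant_alignment_length is not None
               and alignment_length >= significant_alignment_length)
    if alignment_fraction < min_alignment_fraction and not is_long:
        return 0.0

    # Full percent identity = percent identity of aligned segments * aligned fraction.
    full_percent_identity = percent_identity * alignment_fraction
    if full_percent_identity < min_full_percent_identity:
        return 0.0  # assumption: discarded results are reported as 0

    return percent_identity


# High identity over a tiny aligned fraction is zeroed out by the filter...
print(filter_score(0.98, 0.1, min_alignment_fraction=0.25))
# ...unless the alignment itself is long enough to be considered significant.
print(filter_score(0.98, 0.1, alignment_length=500000,
                   min_alignment_fraction=0.25,
                   significant_alignment_length=100000))
```

With every threshold left at its default of 0, no score is changed, which matches the default behavior described in the list.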
You can change any of the following fastANI parameters:
- The kmer size. The default is 16.
- The fragment length. The default is 30.
- The minimum number of fragments for a result to count. The default is 50.
For sourmash, you have the option to change the kmer-size. This value should depend on the relationship between your samples. The default is 31 (as recommended by sourmash for genus-level distances), but we found that 13 most closely parallels the results from an ANI alignment.
You can also set the compression ratio for your fasta files. Decreasing this from the default (1000) will decrease sensitivity.
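To build some intuition for why the k-mer size should depend on how related your genomes are, the sketch below compares two toy sequences using a plain Jaccard index over exact k-mer sets. This is a deliberate simplification and not how sourmash computes its sketches; the sequences and function names are made up for illustration.

```python
def kmers(sequence, k):
    """Return the set of all k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def kmer_jaccard(seq_a, seq_b, k):
    """Jaccard similarity between the k-mer sets of two sequences."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    return len(a & b) / len(a | b)


# Two toy sequences that differ by a single substitution.
seq_a = "AAAACCCCGGGGTTTT"
seq_b = "AAAACCCCGGGATTTT"

for k in (2, 8):
    print(k, round(kmer_jaccard(seq_a, seq_b, k), 3))
```

A single substitution disrupts every k-mer that overlaps it, so longer k-mers lose similarity faster as sequences diverge. Shorter k-mers tolerate more divergence but are also more likely to match by chance.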
Once calculated, the similarity matrix is used to create dendrograms via hierarchical clustering, which are stored in the output directory (and in the pan-db, if provided). You can choose to change the distance metric or linkage algorithm used for this clustering.
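The following is a simplified, pure-Python sketch of what hierarchical clustering of a similarity matrix involves. It uses Euclidean distance between matrix rows and average linkage purely for illustration; it is not anvi'o's implementation, these choices are not claimed to be the program's defaults, and the similarity values are made up.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cluster_to_newick(names, matrix):
    """Average-linkage agglomerative clustering of matrix rows, as Newick."""
    n = len(names)
    dist = [[euclidean(matrix[i], matrix[j]) for j in range(n)] for i in range(n)]
    # Each cluster is (newick_label, member_row_indices).
    clusters = [(name, [i]) for i, name in enumerate(names)]

    def linkage(a, b):
        # Average of all pairwise distances between members of a and b.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the closest pair of clusters and merge them.
        pairs = [(linkage(clusters[x][1], clusters[y][1]), x, y)
                 for x in range(len(clusters))
                 for y in range(x + 1, len(clusters))]
        _, x, y = min(pairs)
        merged = ("(%s,%s)" % (clusters[x][0], clusters[y][0]),
                  clusters[x][1] + clusters[y][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)

    return clusters[0][0] + ";"


# Made-up similarity scores: A and B are close, C is more distant.
names = ["A", "B", "C"]
similarity = [
    [1.00, 0.98, 0.80],
    [0.98, 1.00, 0.81],
    [0.80, 0.81, 1.00],
]

newick = cluster_to_newick(names, similarity)
print(newick)
```

Changing the distance function or the `linkage` rule changes which clusters merge first, which is why the choice of distance metric and linkage algorithm can alter the resulting dendrogram.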
If you’re getting a lot of debug/output messages, you can turn them off with --just-do-it or helpfully store them into a file with